Sneak peek: generalizability in the social sciences

One of my current research papers looks at how social scientists think about the idea of generalizability.  It’s not quite ready for public consumption, but in the meantime I wanted to share some of the interesting papers which have influenced my thinking on the topic.

Mary Ann Bates & Rachel Glennerster.  2017.  “The Generalizability Puzzle.”  Stanford Social Innovation Review.

At J-PAL we adopt a generalizability framework for integrating different types of evidence, including results from the increasing number of randomized evaluations of social programs, to help make evidence-based policy decisions. We suggest the use of a four-step generalizability framework that seeks to answer a crucial question at each step:

Step 1: What is the disaggregated theory behind the program?
Step 2: Do the local conditions hold for that theory to apply?
Step 3: How strong is the evidence for the required general behavioral change?
Step 4: What is the evidence that the implementation process can be carried out well?

Mark Rosenzweig & Chris Udry.  2019.  “External Validity in a Stochastic World.”  Review of Economic Studies.

We examine empirically the generalizability of internally valid micro estimates of causal effects in a fixed population over time when that population is subject to aggregate shocks. Using panel data we show that the returns to investments in agriculture in India and Ghana, small and medium non-farm enterprises in Sri Lanka, and schooling in Indonesia fluctuate significantly across time periods. We show how the returns to these investments interact with specific, measurable and economically-relevant aggregate shocks, focusing on rainfall and price fluctuations. We also obtain lower-bound estimates of confidence intervals of the returns based on estimates of the parameters of the distributions of rainfall shocks in our two agricultural samples. We find that even these lower-bound confidence intervals are substantially wider than those based solely on sampling error that are commonly provided in studies, most of which are based on single-year samples. We also find that cross-sectional variation in rainfall cannot be confidently used to replicate within-population rainfall variability. Based on our findings, we discuss methods for incorporating information on external shocks into evaluations of the returns to policy.

Karen Levy & Varna Sri Raman.  2018.  “Why (and When) We Test at Scale: No Lean Season and the Quest for Impact.”  Evidence Action blog.

No Lean Season, a late-stage program in the Beta incubation portfolio, provides small loans to poor, rural households for seasonal labor migration. Based on multiple rounds of rigorous research showing positive effects on migration and household consumption and income, the program was delivered and tested at scale for the first time in 2017. Performance monitoring revealed mixed results: program operations expanded substantially, but we observed some implementation challenges and take-up rates were lower than expected. An RCT-at-scale found that the program did not have the desired impact on inducing migration, and consequently did not increase income or consumption. We believe that implementation-related issues – namely, delivery constraints and mistargeting – were the primary causes of these results. We have since adjusted the program design to reduce delivery constraints and improve targeting.

Tom Pepinsky.  2018.  “The Return of the Single Country Case Study.”  SSRN.

This essay reviews the changing status of single country research in comparative politics, a field defined by the concept of comparison. An analysis of articles published in top general and comparative politics field journals reveals that single country research has evolved from an emphasis on description and theory generation to an emphasis on hypothesis testing and research design. This change is a result of shifting preferences for internal versus external validity combined with the quantitative and causal inference revolutions in the social sciences. A consequence of this shift is a change in substantive focus from macropolitical phenomena to micro-level processes, with consequences for the ability of comparative politics to address many substantive political phenomena that have long been at the center of the field.

Evan Lieberman.  2016.  “Can the Biomedical Research Cycle be a Model for Political Science?”  Perspectives on Politics.

In sciences such as biomedicine, researchers and journal editors are well aware that progress in answering difficult questions generally requires movement through a research cycle: Research on a topic or problem progresses from pure description, through correlational analyses and natural experiments, to phased randomized controlled trials (RCTs). In biomedical research all of these research activities are valued and find publication outlets in major journals. In political science, however, a growing emphasis on valid causal inference has led to the suppression of work early in the research cycle. The result of a potentially myopic emphasis on just one aspect of the cycle reduces incentives for discovery of new types of political phenomena, and more careful, efficient, transparent, and ethical research practices. Political science should recognize the significance of the research cycle and develop distinct criteria to evaluate work at each of its stages.

IPA’s theory of action for evidence uptake

Image: a graph captioned "Create stronger evidence," a handshake captioned "Share evidence strategically," and a set of gears

(Image source: IPA)

Innovations for Poverty Action recently released their 2025 Strategic Ambition.  One thing that really stood out to me in this document is a much stronger focus on ensuring that research results can actually be accessed, understood, and put into practice by policymakers.  Interestingly, they focus not only on building strong and ongoing relationships with policymakers, but also on encouraging donors to provide funding for the implementation of research-based policies.

I think this is a really important step towards acknowledging that policymakers face lots of constraints in using research results, and we need to move beyond ideas like “hold more dissemination conferences” to overcome them.  Check out the whole list of recommendations below.

Image: IPA's full list of recommendations

Successfully scaling up cash transfer programs in Burkina Faso

A hand holding about 15 fanned out CFA notes, each worth 10,000 francs

CFA notes, via Young Diplomats

Apolitical recently published a profile of Burkina Faso’s national cash transfer program, which grew out of a pilot funded by the World Bank.  It’s an interesting contribution to the ongoing discussion at places like Vox and Evidence Action about scaling up successful interventions.

One of the main points is that expanding a pilot already run by the government may be more feasible than having the government adopt a program previously run by NGOs.

But the World Bank evaluation did make an important difference to the design of the national policy. One valuable factor was the way the trial involved the government from the beginning, creating expertise among local officials before the national program was launched.

That’s quite unusual, de Walque said. “What you find often is it’s done by some local or international NGO,” he explained, which means the government is less familiar with the program it’s trying to implement.

In Burkina Faso, the cash transfer trial was organised by a senior government official. “The scaling up is more likely to be successful if people from the government use the pilot as a training ground,” de Walque suggested.

As well as involving senior figures from an early stage, the trial created a pool of qualified employees for the early stages of the national program. Local workers who were hired and trained to implement the pilot were top candidates to help launch the policy at scale.

Another takeaway is that it’s likely a pilot program will need to be simplified to be implemented at scale — but understanding how to simplify it is crucial.

Creating this kind of [government] ownership and involvement is valuable because of the way governments inevitably leave out some details from a pilot. “Obviously when you go to a larger scale governments, and probably rightly so, at least in the first attempt, choose more simple programs,” de Walque said.

If the officials in charge have direct experience from the trial stage, they’re more likely to know which simplifications are feasible and which could seriously undermine the program.

Reflections on bringing a promising pilot of an anti-poverty program to scale in Bangladesh


Given my background in development economics and political science, it’s no surprise that I’m excited by the work that Evidence Action does to translate rigorous economic research into policy implementation.  Karen Levy and Varna Sri Raman recently published a remarkably frank blog post discussing the challenges they faced when scaling up an anti-poverty program in Bangladesh after a successful pilot.  The post stood out to me not only for its honesty about the difficulties of implementing at scale, but also for the amount of thought that EA and its implementing partner put into diagnosing and correcting the problems at hand.

The intervention at hand was the “No Lean Season” program.  In a pilot project, Gharad Bryan, Shyamal Chowdhury, and Mushfiq Mobarak gave rural residents small subsidies to temporarily migrate to cities to look for work during the hungry season before annual harvests.  They found that this substantially increased consumption in the sending households.  It’s a clever response to the shortage of non-farm employment opportunities in rural areas, and also demonstrates how even small costs can prevent people from accessing better-paid opportunities elsewhere.

EA’s Beta Incubator subsequently worked with a Bangladeshi NGO to expand the subsidy program from about 5,000 households per year up to 40,000, switching it from a pure subsidy to a loan in the process.  However, they found that the NGO employees who were supposed to deliver the loans handed out fewer than expected.  In addition, the loans didn’t seem to have the same effect: recipients weren’t much more likely to migrate than a comparison group which didn’t receive any money.

The section of the EA post that’s really worth reading is the analysis of why the scaling didn’t go according to plan.  It stood out to me for its use of both qualitative and quantitative methods to better understand the newly scaled-up context in which the program operated, and the internal operations decisions of its partner NGO.  Among the salient points, they found that the program had been expanded into new districts which had much higher baseline rates of migration than the district in which it was piloted.  A miscommunication with the NGO also meant that employees’ performance targets for the number of loans to disburse were set lower than the program actually required.

This is arguably the best example I’ve ever seen of why questions about the external validity of social policy RCTs are beside the point.  Any program has to be adapted to its local context — and that context can vary significantly even at different scales of implementation, or between different districts in the same country.

Seeing like the Somali state

Crumbling colonial-era forts sit on the edge of a bright blue bay filled with small blue fishing boats

Image of Mogadishu via NPR

Jesse Driscoll recently wrote a fantastic post for Political Violence at a Glance reflecting on survey work as statebuilding in Somalia.  This was drawn from his experience doing one of the country’s first representative surveys in decades for this paper.  One important point is that survey work is never positionally neutral — and this lack of neutrality is amplified in a conflict zone:

Our discouraging conclusion, after a 5-year study, was that practically any kind of intervention that touched the lives of Somalia’s most vulnerable would invite skepticism about researcher motives—and perhaps rightly so. To the extent we were neutral observers we could be accused of engaging in virtual poverty tourism. To the extent we were something other than neutral observers, however, we were aspirational partisans. One of our Somali enumerators once asked, point blank, if we were being funded by the US military to put together a predator drone list. We weren’t, of course, but his concern was valid. Some of the most productive research programs in political science over the last decade produce knowledge that is explicitly (and unapologetically) seek-and-destroy.

Census knowledge in particular is not a public good, in the economic sense of the term:

An inaugural survey of a landed population after a civil war is not a pure public good, but more akin to club goods for politically powerful social groups (who stand to benefit most from counting and will, predictably, design survey/census categories to benefit them). Residents inclined towards distrust of political centralization may wish to remain invisible.

(The title here, if you didn’t catch the reference, is a play on statebuilding scholar James Scott’s most famous work.)


How do Indonesian policymakers seek out research?

Ajoy Datta had a good post at Research to Action recently about how Indonesian policymakers interact with research evidence.  Here are some of his key points.  First, policymakers are interested in evidence, but they tend to look for data rather than papers initially:

Our results show that when mid-level Indonesian policymakers in both large ‘spending ministries’ and smaller ‘influencing ministries’ are tasked with, say, developing or revising a regulation or law, their first priority is to acquire not research, but statistical data. Seen as objective, policymakers feel data will, for instance, identify current trends, recognise issues that need to be addressed, assign targets, and/or demonstrate impact.

However, the reality is that some policymakers find it difficult to access high-quality data, while others struggle to make sense of the huge volume of data that exists. Data on its own fails to show the causes of trends and does not point to potential solutions. This is where research can help.

Second, if policymakers want more context for the data they find, they’re fond of inviting experts in for discussions:

Most importantly, however, when policymakers did seek out research, rather than commission or read comprehensive research papers, they are more likely to invite experts they already knew to provide advice through social processes (which some policymakers consider as research). These processes usually feature formal and informal meetings or phone conversations, focus group discussions (FGDs), or seminars.

Part of this is because of constraints on the ability to either rapidly access existing research, or commission new papers on specific topics:

Procedures to procure research from internal research and development units, where they exist, is lengthy and cumbersome. This usually discourages them from making a request at all. In any case, these internal units often lack the capacity to produce high-quality research. Meanwhile, other procedures constrained policymakers from hiring top-end researchers from outside government to undertake research.

The main takeaway is that the social process of building trust between researchers and policymakers matters a great deal.  This certainly poses a challenge for academics, as creating these relationships takes time, and unfortunately doesn’t count towards one’s tenure packet.