Evaluation Strategies


Strategies for supporting causal claims in program evaluation (January 2011)

There is a lot of focus on getting good measures of outcomes of interest—health, income, governance, and so on. However, no matter how good your measures are, by themselves they don’t establish impact. To get at impact you need some strategy for causal inference.  What strategies are available?

There has been a big focus recently on the role of randomized controlled trials (sometimes called field experiments, randomized interventions, or RCTs) as a key strategy in program evaluation. Randomized controlled trials are often given pride of place because the random assignment of groups to treatment and control means that the only systematic difference between treated groups and control groups is the fact of treatment; looking at control outcomes then tells you something about what things would have looked like in treatment areas had they not been treated.
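
The logic of random assignment can be sketched with a small simulation. This is only an illustration under made-up assumptions: the variable `capacity` stands in for a hypothetical confound (how well an area functions), and the effect size, sample size, and seed are all invented for the example.

```python
import random
import statistics

def simulate(assign_randomly, n=20000, true_effect=2.0, seed=1):
    """Return the treated-minus-control difference in mean outcomes.

    All numbers are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        # "capacity" is an assumed confound: how well an area functions.
        capacity = rng.gauss(0, 1)
        if assign_randomly:
            gets_program = rng.random() < 0.5   # a lottery decides
        else:
            gets_program = capacity > 0         # program picks functioning areas
        effect = true_effect if gets_program else 0.0
        outcome = 3.0 * capacity + effect + rng.gauss(0, 1)
        (treated if gets_program else control).append(outcome)
    return statistics.mean(treated) - statistics.mean(control)

targeted = simulate(assign_randomly=False)
randomized = simulate(assign_randomly=True)
print("true effect: 2.0")
print(f"estimate when the program targets functioning areas: {targeted:.2f}")
print(f"estimate when a lottery assigns the program: {randomized:.2f}")
```

In this sketch the targeted comparison overstates the effect because treated areas were better off to begin with, while the lottery comparison lands close to the effect that was built into the simulation.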

But there are other approaches. The key aim is to be able to make a valid comparison between areas that got a program (treatment) and some set of comparison areas (controls). In the statistics literature there are a number of “quasi-experimental” approaches that are used to try to do this even when randomization is not possible. They all come with different strengths and weaknesses. Here is a list of seven major approaches (randomization and six others) with some brief reflections on each (a very good guide to a number of these approaches, which seems to be free online, is here: http://www.eurospanbookstore.com/display.asp?K=9780821380284&m=27&ds=dev…):

  1. Randomized controlled trials (RCTs). The basic idea of an RCT is that you use some form of lottery to determine who, among some group, will or won’t get access to a program (or perhaps who will get it first and who will get it later, or who will get one version and who will get another). The elegance of the approach is that it uses randomness to work out what the systematic effects of a program are. The randomness reduces the chance that observed relations between treatment and outcomes are due to “confounds”, that is, other things that are different between groups. For example, one might be worried that things look better in treatment areas precisely because programs choose to work in functioning areas, but knowing that the selection was random removes this concern. The approach cannot be used always and everywhere, however, for both ethical and practical reasons. But it can be used in many more situations than people think! Jeremy Weinstein and I have a general discussion of some of the strengths and weaknesses of the approach here (http://www.columbia.edu/~mh2245/papers1/HW_ARPS09.pdf)
  2. Natural experiments. Another approach is to look for something that happened naturally but that worked in a similar way to randomization. For example, say that seats in a school are allocated by lottery. Then you might be able to analyze the effects of school attendance as if it were a randomized controlled trial. One clever study of the effects of conflict on children (by Annan and Blattman http://www.chrisblattman.com/documents/research/2010.Consequences.RESTAT…) used the fact that the LRA in Uganda abducted children in a fairly random fashion. Another clever study on DDR programs (by Gilligan, Mvukiyehe and Samii http://www.columbia.edu/~cds81/docs/bdi09_reintegration100701.pdf) used the fact that an NGO’s operations were interrupted because of a contract dispute, which resulted in a “natural” control group of ex-combatants who did not receive DDR programs.
  3. Before / after comparisons. This is often the first thing that people turn to in order to work out causal effects. Here you use the past as a control for the present. The approach is not that reliable in changing environments, however, because things get better or worse for many reasons unrelated to the programs. In fact, because of all the other things that are changing, it is possible that things get worse in a program area even if programs have positive effects (so they get worse but are still not as bad as they would have been without the program!). A more sophisticated approach than a simple before / after comparison is called “difference in differences”: basically you compare the before-after difference in treatment areas with that in control areas. This is a good approach but you still need to be sure that you have good control groups. See a discussion here (http://ec.europa.eu/regional_policy/sources/docgener/evaluation/evalsed/…)
  4. Controlling / Matching – this is perhaps the most common approach used in applied statistical work. The idea is to use whatever information you have about why treatment and control areas are not readily comparable and to adjust for these differences statistically. It works well to the extent that you can figure out and measure the confounds, but it is not good if you don’t know what the confounds are. In general we just don’t know what all the confounds are, and that exposes this approach to all kinds of biases.
  5. Instrumental variables – this is a tricky one: the idea is to try to find some feature that explains why a given area got the program but which is otherwise unrelated to the outcome of interest; such a feature is called an instrument. For example, say you are interested in the effect of a livelihoods program on employment, and say it turned out that most people who got access to the livelihoods program did so because they were a relative of a particular program officer. Then, if there were no other way that being a relative of this person could be related to job prospects, you can work out the effect of the program from the effect of being a relative of this individual on job prospects, scaled by how much more likely relatives were to get into the program. This has been a fairly popular approach but some of the enthusiasm for it has died down a bit, basically because it is hard to find a good instrument. One smart application, looking at the effects of poverty on conflict, used rainfall in Africa as an instrument for income/growth. While there are worries that the correlation between conflict and poverty may be due to the fact that conflict might cause poverty, it does not seem plausible that conflict causes rainfall! So using rainfall as an instrument here gave a lot more confidence that there really is a causal, and not just correlational, relationship between poverty and conflict (http://www.econ.berkeley.edu/~emiguel/pdfs/miguel_conflict.pdf).
  6. Regression discontinuity approach. This is the most underused approach but it has a lot of potential. It works as follows. Say that some program is going to be made available to a set of individuals. Ex ante we identify a pool of “potential beneficiaries” that is twice as large as the targeted number of beneficiaries. These potential beneficiaries are all ranked on a set of relevant criteria, such as prior education levels, employment status, and so on. These criteria can be quantitative, but they can also include assessments from interviews or other qualitative information. The individual criteria are then aggregated into a single score and a threshold is identified. Candidates scoring above this threshold are admitted to the program, while those below are not. “Project” and “comparison” groups are then identified by selecting applicants that are close to this threshold on either side. Using this method we can be sure that treated and control units are similar, at least around the threshold. Moreover, we have a direct measure of the main feature on which they differ (their score on the selection criteria). This information provides the key to estimating a program effect from comparing outcomes between these two groups. The advantage of this approach is that all that is needed is that the implementing agency uses a clear set of criteria (which can be turned into a score) on which it bases treatment assignment decisions. The disadvantage is that really reliable estimates of impact can only be made for units right around the threshold. For two interesting applications see here (Manacorda et al on Uruguay: http://www.econ.berkeley.edu/~emiguel/pdfs/miguel_uruguay.pdf) and here (Samii on Burundi: http://www.columbia.edu/~cds81/docs/burundi/samii10_bdi_ethnicity_army10…).
  7. Process tracing – this approach tries to establish causality by looking not just at whether being in a program is associated with better outcomes but also (a) looking for steps in the process along the way that would tell you whether a program had the effects you think it had and (b) looking for evidence of other outcomes that should be seen if (or perhaps: if and only if) the program was effective. For example, you would look not just at whether people in a livelihoods program got a job but at whether they got trained in something useful, got help from people in the program to find an employer in that area, and so on. If all these steps are there, that gives confidence that the relationship is causal and not spurious. The problem, though, is that you might not know the right set of steps between a program and an outcome; a program may have positive (or negative) effects through lots of processes that you don’t know anything about. Moreover, all of these small steps and attributions present causal inference challenges of their own. The process tracing approach can of course be combined with any and all of the other six approaches.
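
The arithmetic behind three of these approaches (difference in differences, instrumental variables, and regression discontinuity) can be sketched in a few lines of Python. This is a minimal illustration, not a full analysis: the function names and all of the numbers are hypothetical, the instrument is assumed to be binary, and units scoring exactly at the threshold are assumed to be admitted.

```python
from statistics import mean

def diff_in_diff(treat_before, treat_after, control_before, control_after):
    """Before-after change in treatment areas minus the change in control areas."""
    treat_change = mean(treat_after) - mean(treat_before)
    control_change = mean(control_after) - mean(control_before)
    return treat_change - control_change

def wald_iv(outcome, treated, instrument):
    """Simple ratio (Wald) estimate for a binary instrument: the instrument's
    effect on the outcome divided by its effect on program take-up."""
    def gap(values):
        on = [v for v, z in zip(values, instrument) if z == 1]
        off = [v for v, z in zip(values, instrument) if z == 0]
        return mean(on) - mean(off)
    return gap(outcome) / gap(treated)

def rd_gap(scores, outcomes, threshold, bandwidth):
    """Mean outcome just above the threshold minus mean outcome just below it."""
    above = [y for s, y in zip(scores, outcomes)
             if threshold <= s < threshold + bandwidth]
    below = [y for s, y in zip(scores, outcomes)
             if threshold - bandwidth <= s < threshold]
    return mean(above) - mean(below)

# Hypothetical data: outcomes fall everywhere, but fall less in program areas.
did = diff_in_diff(treat_before=[10, 12], treat_after=[9, 11],
                   control_before=[10, 12], control_after=[6, 8])

# Hypothetical data: instrument = 1 for relatives of the program officer.
instrument = [1, 1, 1, 1, 0, 0, 0, 0]
in_program = [1, 1, 1, 0, 0, 0, 0, 1]
income = [14, 12, 13, 9, 10, 9, 11, 14]
iv = wald_iv(income, in_program, instrument)

# Hypothetical data: applicants scoring 50 or more are admitted.
scores = [42, 46, 48, 49, 50, 51, 53, 58]
outcomes = [3, 4, 5, 6, 8, 9, 10, 12]
rd = rd_gap(scores, outcomes, threshold=50, bandwidth=5)

print("difference in differences:", did)
print("instrumental variables (Wald):", iv)
print("regression discontinuity gap:", rd)
```

In the difference-in-differences example, outcomes in the program areas fall by 1 while outcomes in the control areas fall by 4, so the estimate is +3 even though things got worse in the program areas. The regression discontinuity function is deliberately crude: a fuller analysis would typically also model how outcomes vary with the score on each side of the threshold.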

So there is a bigger menu of options than people often think. It’s probably worth noting also that while these are mostly used in quantitative work, really they are different strategies that allow you to make meaningful comparisons, and each strategy can be used with qualitative approaches or with a mixture of quantitative and qualitative approaches. A key point here is that your measurement strategy is quite distinct from your inference strategy; people focus a lot on the former but without the latter you can’t start making statements about causality.