I’ve got the power! Calculating statistical power for matching models by simulation

Jonathan Lain ICT4D, Real Geek

Calculating statistical power – or working out how many people to interview before a survey – can be a challenge, especially if we want to use matching models to estimate projects’ impacts. In this blog post, we discuss how computer simulations might make power calculations easier and explain how we’ve tried this out in the evaluation team at Oxfam.

How many people should we interview? That’s a question I get asked a lot, and ask myself even more. Choosing the right sample size presents impact evaluators with a potential pitfall even before questionnaires have been printed (or plugged into mobiles), enumerators have been trained, and that ‘pre-survey’ fear that everything might go wrong has set in – I assume it’s not just me that feels like this!

We want to make sure we have a large enough sample to pick up the true effects of the project, but doing surveys takes time and costs money, neither of which comes in infinite supply. Statistical ‘power analysis’ presents an appealing solution to this problem. By ‘power’, we mean the probability that we correctly reject the null hypothesis (typically the hypothesis that the project has no effect) when it is, in fact, false. In the jargon, power is the probability of not committing a Type II error – the mistake that occurs when our data suggest that an effective project had no impact. Stata has plenty of commands that can do these calculations for us (see <sampsi>, <power>, and so on). All we need to do is feed in our best guesses for the effect of the project/programme, the variance of the outcome variable, and the sample size, and, hey presto, we know what statistical power we would achieve.

Of course, things aren’t quite so simple. Typical power analysis is generally designed for randomised controlled trials (RCTs) where the intervention is randomly assigned to certain individuals, households, or communities. This makes it possible to find closed-form equations (i.e. things we can actually write down) that link power to the effect size, variance, and sample size, so the calculations that Stata needs to do for us are relatively straightforward.
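To make the ‘closed-form’ idea concrete, here is a minimal sketch in Python (not Stata, and not what <sampsi> or <power> do internally) of the textbook normal-approximation formula for comparing two group means. The function names, the equal group sizes, the common standard deviation of 40, and the $10 effect are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means.

    Assumes equal group sizes, a common standard deviation and a normal
    approximation; these are simplifying assumptions for illustration.
    """
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)   # standard error of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    shift = effect / se                 # true effect in standard-error units
    # Probability of landing in either rejection region given the true effect
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def n_for_power(effect, sd, target=0.8, alpha=0.05):
    """Smallest per-group sample size whose approximate power reaches `target`."""
    n = 2
    while power_two_means(effect, sd, n, alpha) < target:
        n += 1
    return n


# Hypothetical inputs: a $10 effect and a standard deviation of $40
print(round(power_two_means(effect=10, sd=40, n_per_group=200), 3))
print(n_for_power(effect=10, sd=40))
```

Everything here follows from one equation linking power, effect size, variance, and sample size, which is exactly what we lose once the estimator is no longer a simple difference in means.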

However, in the evaluation team at Oxfam, we often calculate treatment effects – the estimated impact of the project – using propensity score matching (PSM), a non-linear estimator. Suddenly, the nice closed-form expressions we had for calculating statistical power don’t look so nice (in fact, I think they might not even exist!).

The question is: can we do power analysis without them? In fact, is there any way we can conduct power analysis for PSM?

This problem is not new, nor is it specific to Oxfam’s work. For example, this post on the World Bank Development Impact blog deals with a number of possible solutions, discussing, in particular, the importance of assessing the comparability of the intervention and comparison (control) groups, and exploring how best to use this information to adjust traditional sample size calculations. At Oxfam, however, we have been wondering if there are alternatives…

We could imagine one very time-consuming way of conducting power analysis, as follows. Suppose we could go to the field in a setting where the null hypothesis is in fact false and the project does have an effect – for example, perhaps income is $10 per month higher amongst the intervention group compared to the comparison group. If we went to the field 100 times, and collected a sample of the same size from the same population each time, and analysed our data in exactly the same way, we could determine the statistical power of our method. It would simply be the proportion of times we were able to detect a statistically significant difference between the intervention and comparison groups. If we wanted to go to the field for a 101st time, we would know, ex ante, the probability of not committing a Type II error.

Clearly, calculating power in this way is impossible and totally pointless – if we can afford to go to the field 100 times, then we can probably afford big samples, randomisation, panel data, and all sorts of other things that make statistical analysis clean and simple. However, what if we could use a computer to simulate going to the field 100 times? Rather than actually going to collect the dataset, we can use simulations based on some reasonable assumptions about the data-generating process to create it. This is exactly the methodology that we have been experimenting with. Technical details of this approach can be found here.
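As a rough illustration of the idea (a Python sketch, not the Stata do-file), the snippet below ‘goes to the field’ many times on a computer for the simplest possible case: a plain comparison of means between two groups. The normally distributed incomes, the baseline of 100, the standard deviation of 40, and all function names are assumptions made up for this example.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def one_survey(rng, n_per_group, effect, sd, baseline=100.0, alpha=0.05):
    """Simulate one 'trip to the field'; return True if the effect is detected."""
    comparison = [rng.gauss(baseline, sd) for _ in range(n_per_group)]
    intervention = [rng.gauss(baseline + effect, sd) for _ in range(n_per_group)]
    diff = mean(intervention) - mean(comparison)
    se = sqrt(variance(intervention) / n_per_group
              + variance(comparison) / n_per_group)
    # Two-sided p-value from a normal approximation
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return p_value < alpha


def simulated_power(n_per_group, effect, sd, reps=1000, seed=1):
    """Share of simulated surveys in which the difference is significant."""
    rng = random.Random(seed)
    hits = sum(one_survey(rng, n_per_group, effect, sd) for _ in range(reps))
    return hits / reps


# Hypothetical scenario: a true effect of $10 with 200 households per group
print(simulated_power(n_per_group=200, effect=10, sd=40))
```

In a simple case like this, the simulated figure should land close to what a closed-form calculation gives, which is a useful sanity check before moving to estimators where no formula is available.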

There’s just one problem. Remember those ‘best guesses’ for the effect size and the variance of the outcome variables mentioned above? We are going to need even more of them now, because each simulation relies on a set of parameters linking the propensity score, treatment status, and outcome variables, with a pre-determined level of noise. This is a non-trivial problem, because the accuracy of power analysis by simulation is going to depend a lot on how well we can recreate the true data-generating process for the population we plan to evaluate.
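To show what such a set of parameters might look like, here is a minimal Python sketch of power-by-simulation for a matching estimator. It is not the approach in our Stata do-file: the data-generating process (one covariate, a logistic participation model with slope 0.8, an outcome with noise of 40), the one-to-one nearest-neighbour matching with replacement, and the naive test on matched differences are all assumptions chosen to keep the example short. In particular, the naive test ignores the fact that comparison households are reused and that the propensity score is estimated, so it will tend to overstate significance.

```python
import random
from bisect import bisect_left
from math import exp, sqrt
from statistics import NormalDist, mean, stdev


def fit_logit(x, t, iters=10):
    """Fit P(t=1|x) = 1/(1+exp(-(a+b*x))) by Newton-Raphson; returns (a, b)."""
    a = b = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ti in zip(x, t):
            p = 1.0 / (1.0 + exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += ti - p            # gradient, intercept
            g1 += (ti - p) * xi     # gradient, slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def psm_detects_effect(rng, n, effect, alpha=0.05):
    """Simulate one survey of n households and test the matched estimate."""
    # Hypothetical data-generating process: one covariate drives both
    # participation and the outcome, so a raw comparison would be biased.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    t = [1 if rng.random() < 1.0 / (1.0 + exp(-0.8 * xi)) else 0 for xi in x]
    y = [100.0 + effect * ti + 15.0 * xi + rng.gauss(0.0, 40.0)
         for xi, ti in zip(x, t)]
    if sum(t) < 2 or sum(t) > n - 1:
        return False  # too few intervention or comparison households to match

    a, b = fit_logit(x, t)
    score = [1.0 / (1.0 + exp(-(a + b * xi))) for xi in x]

    # Sort comparison households by estimated score for fast nearest-neighbour lookup
    controls = sorted((s, yi) for s, yi, ti in zip(score, y, t) if ti == 0)
    c_scores = [s for s, _ in controls]
    diffs = []
    for s, yi, ti in zip(score, y, t):
        if ti == 1:
            j = bisect_left(c_scores, s)
            k = min((k for k in (j - 1, j) if 0 <= k < len(controls)),
                    key=lambda k: abs(c_scores[k] - s))
            diffs.append(yi - controls[k][1])

    att = mean(diffs)                      # estimated effect on participants
    se = stdev(diffs) / sqrt(len(diffs))   # naive standard error (an assumption)
    return abs(att / se) > NormalDist().inv_cdf(1 - alpha / 2)


def psm_power(n, effect, reps=200, seed=1):
    """Share of simulated surveys in which the matched estimate is significant."""
    rng = random.Random(seed)
    return sum(psm_detects_effect(rng, n, effect) for _ in range(reps)) / reps


print(psm_power(n=400, effect=10))
```

Every number hard-coded in `psm_detects_effect` is one of the ‘best guesses’ discussed above, so changing any of them (the strength of selection, the noise, the effect size) changes the simulated power.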

Fortunately, Oxfam has a number of previous surveys on which it can draw, to give a decent idea of what these parameters might be. Of course, contexts change substantially over time, and it is very rare that Oxfam returns to exactly the same location to do an impact evaluation. Nevertheless, we believe there may, at least, be some value in doing power analysis by simulation. This ‘do file’ – which draws heavily on the paper by A. H. Feiveson in the Stata Journal – shows how we have tried to implement this in Stata.

Given the widespread use of PSM (and other non-linear estimators) in evaluation work, I’m sure there are others grappling with this kind of question too. It’d be great to get some comments on this approach, to see if it works and find out how we can improve it.

