Getting standard errors right is important for anyone trying to do quantitative impact evaluation. We want to know that any impacts of the project that we observe are real, rather than just the result of random variation in the data. In this blog post, Jonathan Lain focuses on one particular aspect of calculating standard errors that has proved a real headache for the evaluation team at Oxfam: clustering.
Issues around clustering arise because the error terms from our regressions or matching models – the stuff we don’t observe, but which determines our outcome variables – may be correlated within particular groups. These groups could be villages, communities, districts, and so on; ordinarily, selecting them is the first step in our sampling strategy. The underlying point is that there are certain things that we can never hope to capture in our questionnaires, which make people who live near each other alike. This could be about social norms, network effects, or certain types of people sorting into the communities that they like.
Lots of statistical inference is based on the assumption that the error terms are independently and identically distributed (iid). When ‘intra-group’ correlations occur, this fundamental assumption may not hold up. We run the risk of underestimating our standard errors and, in turn, we might end up mistakenly rejecting the null hypothesis that the project had no impact. In other words, we might wrongly conclude that projects have certain effects which are just down to random variation in our data.
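To make that risk concrete, here is a minimal Python sketch on simulated data (it is not part of our Stata workflow). Everything in it is an assumption for illustration: 200 hypothetical villages of 20 households each, treatment assigned by village, a true effect of zero, and a shared village-level shock that we never observe. It compares the conventional iid standard error with a cluster-robust "sandwich" standard error, using one common finite-sample correction.

```python
import numpy as np

def ols_with_ses(y, X, clusters):
    """Return OLS coefficients, conventional (iid) SEs and cluster-robust SEs."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Conventional variance sigma^2 (X'X)^-1, valid only under iid errors.
    sigma2 = resid @ resid / (n - k)
    se_iid = np.sqrt(np.diag(sigma2 * XtX_inv))

    # Cluster-robust "sandwich": sum over clusters of (X_g' u_g)(X_g' u_g)'.
    groups = np.unique(clusters)
    meat = np.zeros((k, k))
    for g in groups:
        idx = clusters == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)

    # One common finite-sample correction (an assumption of this sketch).
    G = len(groups)
    correction = G / (G - 1) * (n - 1) / (n - k)
    V = correction * XtX_inv @ meat @ XtX_inv
    se_cluster = np.sqrt(np.diag(V))
    return beta, se_iid, se_cluster

# Hypothetical data: 200 villages, 20 households each, no true effect.
rng = np.random.default_rng(0)
n_villages, per_village = 200, 20
clusters = np.repeat(np.arange(n_villages), per_village)
treat = np.repeat(np.arange(n_villages) % 2, per_village).astype(float)
village_shock = np.repeat(rng.normal(size=n_villages), per_village)
y = village_shock + rng.normal(size=n_villages * per_village)
X = np.column_stack([np.ones(len(y)), treat])

beta, se_iid, se_cluster = ols_with_ses(y, X, clusters)
print("iid SE on treatment:    ", se_iid[1])
print("cluster SE on treatment:", se_cluster[1])
```

Under these assumptions, the cluster-robust standard error on the treatment dummy should come out noticeably larger than the iid one, which is the sense in which ignoring clustering understates our uncertainty.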
At first glance, you might wonder what we’re worried about. In Stata, we can use the <cluster> option, which is available across all sorts of commands, so we should just be able to add that to the options for our estimator and the problem goes away. Even if we want to use more complicated clustering strategies, such as two-way clustering (e.g. over space and time), commands like <ivreg2> have this functionality.
The first problem we face is that we often have too few clusters. Sometimes, we only go to 15 or 20 communities in our impact evaluations, across the treatment (intervention) and control (comparison) groups. Cameron and Miller’s guidance on cluster-robust inference suggests that, with this number of clusters, normal corrections for clustering might actually lead estimates for our standard errors to be biased downwards even further than if we just ignored clustering altogether. This would lead us to over-reject true null hypotheses, such that we may classify some treatments as having had a statistically significant effect on the outcome variable when this is not the case.
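The over-rejection can be illustrated with a small Monte Carlo sketch in Python. The set-up is entirely hypothetical: villages of 20 households, treatment assigned to every other village, a true effect of zero, and a test that rejects whenever the cluster-robust t-statistic exceeds 1.96 in absolute value. It only looks at how often the clustered test rejects a true null, for a small and a larger number of clusters.

```python
import numpy as np

def cluster_robust_t(y, treat, clusters):
    """t-statistic on the treatment dummy using cluster-robust standard errors."""
    X = np.column_stack([np.ones(len(y)), treat])
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta

    # Sum the scores X_i * u_i within each cluster, one column at a time.
    groups, group_index = np.unique(clusters, return_inverse=True)
    G = len(groups)
    weighted = X * resid[:, None]
    scores = np.column_stack([
        np.bincount(group_index, weights=weighted[:, c], minlength=G)
        for c in range(k)
    ])
    meat = scores.T @ scores

    # One common finite-sample correction (an assumption of this sketch).
    correction = G / (G - 1) * (n - 1) / (n - k)
    V = correction * XtX_inv @ meat @ XtX_inv
    return beta[1] / np.sqrt(V[1, 1])

def rejection_rate(n_clusters, per_cluster=20, sims=2000, seed=0):
    """Share of simulated datasets in which a true null is rejected at 5%."""
    rng = np.random.default_rng(seed)
    clusters = np.repeat(np.arange(n_clusters), per_cluster)
    treat = np.repeat(np.arange(n_clusters) % 2, per_cluster).astype(float)
    rejections = 0
    for _ in range(sims):
        shock = np.repeat(rng.normal(size=n_clusters), per_cluster)
        y = shock + rng.normal(size=n_clusters * per_cluster)  # true effect is zero
        if abs(cluster_robust_t(y, treat, clusters)) > 1.96:
            rejections += 1
    return rejections / sims

rate_few = rejection_rate(n_clusters=10)
rate_many = rejection_rate(n_clusters=100)
print("rejection rate with 10 clusters: ", rate_few)
print("rejection rate with 100 clusters:", rate_many)
```

If the simulation behaves as the theory suggests, the rejection rate with 10 clusters should sit clearly above the nominal 5%, and move much closer to 5% with 100 clusters.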
For some estimators, there are ways out of this problem. Cameron, Gelbach, and Miller’s wild cluster bootstrapping – courtesy of <cgmwildboot> and <cgmreg> in Stata – might offer a solution for analyses using linear regressions. But in Oxfam’s impact evaluations, we have mainly been estimating treatment effects using propensity score matching (PSM). Since wild cluster bootstrapping relies on being able to calculate the residuals from a regression, this way out seems closed off to us.
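To show why residuals are central to the method, here is a rough Python sketch of one common variant of the wild cluster bootstrap: the null hypothesis is imposed, and each cluster's residuals are multiplied by a single random +1 or -1 (Rademacher) draw. It is not a reimplementation of <cgmwildboot>; the function names, the simulated villages and the number of replications are all assumptions made for illustration.

```python
import numpy as np

def cluster_t_stat(y, X, clusters, j):
    """t-statistic for coefficient j with cluster-robust standard errors.

    No finite-sample correction is applied: it would be the same constant
    in every bootstrap replication, so it cancels in the comparison below.
    """
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        idx = clusters == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)
    V = XtX_inv @ meat @ XtX_inv
    return beta[j] / np.sqrt(V[j, j])

def wild_cluster_bootstrap_p(y, X, clusters, j, reps=999, seed=0):
    """Bootstrap p-value for H0: beta_j = 0, with the null imposed."""
    rng = np.random.default_rng(seed)
    t_obs = cluster_t_stat(y, X, clusters, j)

    # Restricted model: drop regressor j so the null holds by construction.
    X_r = np.delete(X, j, axis=1)
    beta_r = np.linalg.lstsq(X_r, y, rcond=None)[0]
    fitted_r = X_r @ beta_r
    resid_r = y - fitted_r  # the step that has no obvious analogue in matching

    groups, group_index = np.unique(clusters, return_inverse=True)
    t_boot = np.empty(reps)
    for b in range(reps):
        # One Rademacher draw per cluster, shared by all of its members.
        weights = rng.choice([-1.0, 1.0], size=len(groups))
        y_star = fitted_r + weights[group_index] * resid_r
        t_boot[b] = cluster_t_stat(y_star, X, clusters, j)
    return float(np.mean(np.abs(t_boot) >= np.abs(t_obs)))

# Hypothetical data: 12 villages, 15 households each, no true effect.
rng = np.random.default_rng(1)
n_villages, per_village = 12, 15
clusters = np.repeat(np.arange(n_villages), per_village)
treat = np.repeat(np.arange(n_villages) % 2, per_village).astype(float)
shock = np.repeat(rng.normal(size=n_villages), per_village)
y = shock + rng.normal(size=n_villages * per_village)
X = np.column_stack([np.ones(len(y)), treat])

p_value = wild_cluster_bootstrap_p(y, X, clusters, j=1)
print("wild cluster bootstrap p-value:", p_value)
```

Every replication rebuilds the outcome from fitted values plus sign-flipped residuals, so an estimator that produces no regression residuals, such as a matching estimator, has nothing to plug into that step.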
Our solution has been pragmatic. For PSM, we’ve typically used some form of bootstrapping (which repeatedly resamples from the original data) to estimate our standard errors anyway. The question is then whether we should resample using clusters or not. Because of the ‘too few clusters’ problem, in recent impact evaluations we’ve tended to bootstrap by resampling at the level of each data point (normally the household) rather than by resampling the clusters (cluster bootstrapping). In our evaluations, non-cluster bootstrapping has tended to produce larger estimates of the standard errors than cluster bootstrapping. So we’re less likely to conclude that Oxfam projects had real effects when the differences were really just down to random variation in the data – we’re erring on the side of being conservative about impact.
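The difference between the two resampling schemes is easiest to see in code. The Python sketch below is only an illustration under loud assumptions: it uses a simple difference in means as a stand-in for the PSM estimator (a real application would presumably re-estimate the propensity score and re-match within each resample), and the 16 villages of 25 households are simulated rather than drawn from any Oxfam dataset.

```python
import numpy as np

def diff_in_means(y, treat):
    """Stand-in estimator (an assumption): treated mean minus comparison mean."""
    return y[treat == 1].mean() - y[treat == 0].mean()

def bootstrap_se(y, treat, clusters, by_cluster, reps=500, seed=0):
    """Bootstrap standard error, resampling households or whole clusters."""
    rng = np.random.default_rng(seed)
    n = len(y)
    groups = np.unique(clusters)
    members = [np.flatnonzero(clusters == g) for g in groups]
    estimates = []
    for _ in range(reps):
        if by_cluster:
            # Cluster bootstrap: draw villages with replacement and keep
            # every household in each drawn village.
            drawn = rng.integers(0, len(groups), size=len(groups))
            idx = np.concatenate([members[d] for d in drawn])
        else:
            # Non-cluster bootstrap: draw households with replacement,
            # ignoring which village they belong to.
            idx = rng.integers(0, n, size=n)
        t = treat[idx]
        if t.min() == t.max():
            continue  # skip resamples with no treated or no comparison units
        estimates.append(diff_in_means(y[idx], t))
    return float(np.std(estimates, ddof=1))

# Hypothetical data: 16 villages, 25 households each, no true effect.
rng = np.random.default_rng(2)
n_villages, per_village = 16, 25
clusters = np.repeat(np.arange(n_villages), per_village)
treat = np.repeat(np.arange(n_villages) % 2, per_village).astype(float)
shock = np.repeat(rng.normal(size=n_villages), per_village)
y = shock + rng.normal(size=n_villages * per_village)

se_household = bootstrap_se(y, treat, clusters, by_cluster=False)
se_cluster = bootstrap_se(y, treat, clusters, by_cluster=True)
print("household-level bootstrap SE:", se_household)
print("cluster-level bootstrap SE:  ", se_cluster)
```

Which of the two standard errors is larger depends on the data and the estimator, so the numbers from this simulation say nothing about what we see in our own evaluations; the sketch is only meant to show where the two procedures part ways.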
Our pragmatic approach to the question of clustering is, of course, not perfect. But even if our standard errors aren’t estimated with total precision – given the nature of the data we collect – we believe this isn’t biasing our statistical inference too much. However, if anyone has better ideas about how to correctly estimate standard errors on data with few clusters using matching estimators, it’d be great to hear from you…