Causal Identification Strategies which Rely on Control Variables
Against claiming your results are causal only once you add control variables
A common feature of many well-known applied econometrics and causal inference papers is an identification strategy that uses a standard causal inference tool (e.g., Regression Discontinuity Design, Difference-in-Differences, or Instrumental Variables) but achieves identification only with the inclusion of one or more control variables. I think this approach is flawed both in an informal, ‘this doesn’t feel quite right’ sense, and in a more formalisable way which I’ll try to explain in this post.
To give you an example, consider this highly influential paper from Nunn and Wantchekon (2011)1 which was published in the American Economic Review. It studies the long-term effect of the slave trade on levels of trust across different ethnic groups in Africa. To instrument for historic exposure of ethnic groups to the slave trade, it uses the historical distance of the ethnic groups from the coast of Africa during the period of slavery. The idea behind the instrument is that the distance of ethnic groups to the coast clearly affects the chances of them being impacted by the slave trade, while this geographic factor is ‘plausibly’ uncorrelated with factors, other than the slave trade, that may have affected how trusting the ethnic group is today.
The second assumption, the exclusion restriction, is where things become complicated. The authors acknowledge a number of potential violations of the exclusion restriction but argue that they can include control variables to remedy this.
See the paragraph below for the key excerpt:
Despite this fact, there remain a number of other reasons why the exclusion restriction may not be satisfied. First, distance from the coast may be correlated with other forms of European contact, like colonial rule, which followed the slave trade. For this reason, we only report IV estimates after controlling for our full set of ethnicity-level colonial control variables. Second, locations closer to the coast were more likely to rely on fishing as a form of subsistence. Although it is not obvious how this may affect future trust, to be as thorough as possible we control for ethnicities’ historical reliance on fishing. Third, for some parts of Africa, proximity to the coast implies greater distance from the ancient trade networks across the Sahara Desert. Because long-term trust may have been affected by a group’s involvement in this inland trade, we also control for the average distance to the closest city in the Saharan trade, as well as the average distance to the closest route of the Saharan trade.
Nunn and Wantchekon (2011), (p. 329)
Essentially, the paper includes a huge number of control variables to get round potential violations of the exclusion restriction. I have a lot of problems with this.
Firstly, it’s not really in line with the spirit of applied causal inference. In my view, a defining feature of the causal inference toolkit, the thing that makes it different from simply running regressions on endogenous variables, is that you don’t need control variables for causality. This is important because adding control variables is an enormously costly thing to do. If you argue that a control variable is essential to identification, you have to also argue that the selection of control variables you have included is just right. Despite the theoretically infinite set of possible necessary control variables, you need to argue that you have included exactly the correct ones—not too many nor too few. And that’s not to mention the problems of bad controls, overfitting and p-hacking which also need to be accounted for when including control variables in a regression.
This is why economists always look for exogenous variation and use causal inference techniques like instrumental variables, instead of just running regressions on endogenous variables and controlling away all of the endogeneity. If you can find a truly exogenous instrument, then you should be able to present baseline IV estimates with no control variables and get round all of these problems. But if your instrument is only exogenous after including 20 control variables, then you are suffering from the exact same problem as the person running the endogenous regression.
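A tiny simulation makes this point concrete. The setup below is entirely hypothetical: a treatment x is confounded by an unobserved u, but an instrument z shifts x and is independent of u. In that case the simple Wald/IV ratio recovers the true effect with no controls at all, while the naive OLS slope is biased.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 2.0  # true causal effect of x on y

z = rng.normal(size=n)                  # exogenous instrument
u = rng.normal(size=n)                  # unobserved confounder
x = 0.5 * z + u + rng.normal(size=n)    # treatment, endogenous through u
y = beta * x + u + rng.normal(size=n)   # outcome

# Naive OLS slope: biased upwards because u moves both x and y
ols = np.cov(x, y)[0, 1] / np.var(x)

# Wald/IV estimator cov(z, y) / cov(z, x): consistent with no controls,
# because z is independent of u by construction
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(round(ols, 2), round(iv, 2))
```

Of course, the whole point of the post is that in the simulation the instrument’s exogeneity holds by construction, whereas in a real paper it is an assumption that controls cannot cheaply rescue.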
A more formal way of thinking about this is that, the more control variables you claim are essential for identification, the more likely it becomes that2:
There is an additional control variable that you have missed
You have included a bad control/collider
You have overfitted your model
Your results are a consequence of p-hacking
If you claim that your results are causal from your instrument and just one or two obvious control variables, then most of these concerns go away. But if, like Nunn and Wantchekon argue, you need to include 20 control variables just for identification, then it’s pretty hard to argue that there isn’t a 21st essential variable that you didn’t think of or didn’t have data on. Of course, you may wish to argue that you have been conservative and included many more controls than you really need, but in that case I think it’s important to present the specification which includes the fewest possible control variables while still being identified.
That thought brings me on to the problem of p-hacking. If you say that you need 20 control variables to achieve identification, it’s hard to believe that those are the only 20 control variables you ever tested and that you only ever ran one regression on those variables. When there are lots of control variables in your regression, the number of possible combinations of regression specifications increases fast. It becomes pretty much guaranteed that you tried a few other combinations of control variables and decided not to report the results. In contrast, if there is just a single source of exogenous variation and no control variables, the possibilities for p-hacking are diminished.
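To put a number on “increases fast”: if each of k candidate controls can be either included or excluded, there are 2^k possible specifications. A two-line sketch:

```python
# Each of k candidate controls is either in or out of the regression,
# giving 2**k distinct specifications to (potentially) search over.
def n_specifications(k: int) -> int:
    return 2 ** k

print(n_specifications(3))   # 8
print(n_specifications(20))  # 1048576
```

With 20 candidate controls there are over a million specifications, which is why it strains belief that only one was ever run.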
As I said above, if there are many control variables, it also becomes possible that some of them are actually bad controls that introduce bias rather than remove it. If there’s some chance that each of your controls is bad, then the chance that at least one of your controls is a bad control increases as the number of controls you include increases.
In other words, control variables are a really tricky business and more controls means more tricky business. Using the causal inference toolkit to avoid running into these problems is wonderful. But if you use causal inference techniques and still need to use control variables then you haven’t actually overcome the main problem you were trying to solve, and are still left with the almost impossible task of removing all endogeneity from your regression via a perfectly chosen set of control variables.
Conclusion:
If the published papers of the last 20 years are anything to go by, economists love causal inference techniques and don’t like endogenous regressions with control variables. The point I want to make in this post, however, is that the lines between the two are often blurred. Many causal inference papers rely not on exogenous variation, but on variation which becomes exogenous only after controlling for additional variables chosen by the researcher. These are fundamentally different approaches and yet they are often grouped together.
In that sense, I think economics, quite ironically, may have suffered from a case of Goodhart’s law. Causal inference has become the standard of modern papers. OLS regressions are very rarely published in top journals. But satisfying the assumptions of causal inference techniques is hard, and so economists came up with ways of calling their results causal while using effectively the same techniques with the same drawbacks as OLS regressions. They look at the causal identification assumptions, realise that they don’t hold, quietly add a set of control variables and then call their results causal throughout the paper while downplaying how control variables are essential for identification.
I appreciate that this is quite a pessimistic view, but I think it does highlight a genuine concern in academic economics. It’s important to say, however, that I don’t think it crushes the entire field of applied micro-econometrics3. The papers I’m talking about, including Nunn and Wantchekon (2011), are much more than just a table of IV estimates. That particular paper independently presents a huge number of compelling arguments for its conclusions. In that context, the IV estimates are believable.
Learning things in economics is about combining your qualitative prior knowledge of the world with formalised mathematical theories and a whole corpus of empirical findings with varied specifications and robustness tests4. There is certainly some usefulness to having identification strategies which include control variables somewhere in that corpus, but they should never be viewed as definitive in the absence of other forms of evidence.
This is a beautiful paper and I’m hesitant to criticise it so much in this blog post. It just happens to perfectly illustrate my point—a point which in fact applies to an enormous number of other well-regarded economics papers.
For what it’s worth, while I remain sceptical of the empirical validity of the IV estimates, I think these results and the overall conclusion of the paper are directionally correct. The paper includes a large body of empirical tests, alternative specifications and historical evidence, which should be taken together to form a perspective on its conclusion.
Well, I hope it doesn’t, since this is the field of economics that I want to be a part of.
I’m currently formalising exactly what I mean by this. Hopefully I’ll publish it one day.