On big data, machine learning, and econometrics

According to an analysis by The Economist of keywords in working-paper abstracts from the National Bureau of Economic Research, big data and machine learning have become the latest fad among economists.

As with previous fads in the field, the initial hype carries an inherent risk: practitioners first implement the new techniques and only later ask whether it actually makes sense to use them. The field’s quantitative toolbox grows, but so does the risk of malpractice and of mediocre empirical work. But big data techniques have useful and important implications for economic research. And it is hard to make the case that nobody knows what they are doing when you have several prominent economists (such as Hal Varian, Matt Taddy, and 2017’s Jean-Jacques Laffont Prize winner Susan Athey) focusing their research on securing economists a seat at the big data table.

Big data, put simply, is the term used for considerably large datasets (by one rule of thumb, about two terabytes and over). Machine learning, also simply put, is the practice of giving computers the ability to learn as they go. Of course, for a machine to learn it needs a lot of practice, or, in other words, a lot of data, sometimes big data. Machine learning started off as a subfield of computer science, and it manifests itself, along with big data, in economics mainly through the field of econometrics. It is not a surprise, for example, that Varian is the Chief Economist at Google, Taddy is a Principal Researcher at Microsoft Research, and Athey is a Consultant Economist for Microsoft. Some background in mathematics, statistics, and modeling, as well as coding skills, is necessary, which implies that we, as students of economics, may face a high entry cost to properly master this new fad.

Machine learning and econometrics are complements, not substitutes. The former focuses on prediction; the latter focuses on finding causality. Economists are good at small data inference, while machine learning can be very good at big data prediction. This gap is what economists like those mentioned above are trying to narrow.

The list of “new tricks” for econometrics ranges from nonlinear estimation using regression trees or neural nets to the more-or-less familiar bootstrapping method. Here, however, we will only talk about one in particular: variable or model selection. Machine learning model selection is a data-driven method, where the data tell you which variables in your list of covariates are important. Rather than running your usual regression, you run a regularized regression, in which the parameter estimates of the “not important” variables are shrunk to zero. It is worth mentioning that model selection requires (sometimes very) large datasets with many dimensions. Nowadays, for example, some empirical questions use datasets with more variables than observations, such as determining which genes influence the development of a given disease.

So where do you fit in all of this? For this I borrow one example Susan Athey used in a recent podcast with EconTalk. Suppose you want to test whether a policy intervention has an effect on a given outcome. You have at your disposal a considerable dataset that includes a large list of possible covariates that you can use as controls. You are only interested in the treatment variable, but you also want to control for spurious correlation. Rather than trying to fit as many covariates as possible and running the risk of overfitting the model, you first run a LASSO regression (the most popular method for model selection) to determine which control variables are relevant, and then include those in your model.

Athey warns that these variables are not up for interpretation. Your treatment variable can be given a causal interpretation because there is supposed to be a theoretical framework behind it. Your control variables, however, are not up for that type of interpretation. In the end you have a model based on the data itself rather than on factors picked by the practitioner.
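
To make the two-step idea in the example above a bit more concrete, here is a minimal sketch in plain Python, and it is only that: a toy illustration, not Athey’s own procedure or a recommended estimator. Everything in it is an assumption made for the sake of the example: the data are simulated (ten candidate controls, of which only the first two matter, and a true treatment effect of 2.0), the penalty `lam = 0.3` is picked by hand where in practice it would usually be chosen by cross-validation, and the helper names (`standardize`, `soft_threshold`, `lasso`, `ols`) are hypothetical. The LASSO step is a basic coordinate-descent routine that minimizes (1/2n)·||y − Xb||² + lam·||b||₁.

```python
import random

random.seed(0)

# Synthetic data (an assumption for illustration): 10 candidate controls,
# of which only the first two actually affect the outcome.
n, p = 1000, 10
controls = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
treat = [1.0 if random.random() < 0.5 else 0.0 for _ in range(n)]
y = [2.0 * treat[i] + 1.5 * controls[0][i] - 1.0 * controls[1][i]
     + random.gauss(0, 1) for i in range(n)]


def standardize(col):
    """Rescale a column to mean 0 and variance 1."""
    mean = sum(col) / len(col)
    sd = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5
    return [(v - mean) / sd for v in col]


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso(columns, target, lam, n_iter=100):
    """Coordinate descent for (1/2n)*||target - X b||^2 + lam*||b||_1.

    Assumes every column is standardized and the target is centered.
    """
    m = len(target)
    beta = [0.0] * len(columns)
    resid = list(target)
    for _ in range(n_iter):
        for j, col in enumerate(columns):
            # Correlation of column j with the partial residual.
            rho = sum(c * r for c, r in zip(col, resid)) / m + beta[j]
            new = soft_threshold(rho, lam)
            if new != beta[j]:
                step = new - beta[j]
                resid = [r - c * step for c, r in zip(col, resid)]
                beta[j] = new
    return beta


def ols(columns, target):
    """Least squares via the normal equations (Gauss-Jordan elimination)."""
    k = len(columns)
    a = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(u * v for u, v in zip(columns[i], target))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [x - f * w for x, w in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Step 1: LASSO of the outcome on the candidate controls.
y_mean = sum(y) / n
y_centered = [v - y_mean for v in y]
lam = 0.3  # hand-picked penalty (assumption); normally cross-validated
beta = lasso([standardize(c) for c in controls], y_centered, lam)
selected = [j for j, b in enumerate(beta) if b != 0.0]

# Step 2: ordinary regression on intercept, treatment, selected controls.
design = [[1.0] * n, treat] + [controls[j] for j in selected]
coefs = ols(design, y)
effect = coefs[1]  # coefficient on the treatment variable

print("selected controls:", selected)
print("estimated treatment effect:", round(effect, 3))
```

In line with Athey’s warning, the only number meant to be read here is `effect`, the coefficient on the treatment; the entries of `selected` and their coefficients are there to soak up variation, not to be interpreted.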

By Jose Alvarez