What’s Scary about Big Data?

Here is a link to Hal Varian’s latest paper, which I haven’t read. Here is Kling:

When confronted with a prediction problem of this sort an economist would think immediately of a linear or logistic regression. However, there may be better choices, particularly if a lot of data is available. These include nonlinear methods such as 1) neural nets, 2) support vector machines, 3) classification and regression trees, 4) random forests, and 5) penalized regression such as lasso, lars, and elastic nets.
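For context on what those methods look like in practice, here’s a toy sketch of my own, on synthetic data with scikit-learn, and emphatically not Varian’s actual exercise, lining a logistic regression up against two of the nonlinear methods Kling lists:

```python
# Toy comparison of logistic regression against two of the nonlinear methods
# Kling lists, on synthetic data. My own illustration, not Varian's analysis.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "classification tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    # 5-fold cross-validated accuracy: the single number people point to when
    # they claim one method "works better" than another
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Each of these spits out a cross-validated score, and that one number is usually the entire basis for declaring a winner.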

In one of his examples, Varian redoes the Boston Fed study that showed that race was a factor in mortgage declines, and using the classification tree method he finds that a tree that omits race as a variable fits the data just as well as a tree that includes race, which implies that race was not an important factor.
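As best I can tell from Kling’s summary, the exercise amounts to something like the following: fit a tree with the variable, fit one without it, and see whether the fit degrades. Here is my reconstruction on fabricated data; the race_proxy column is a placeholder I invented, and this is neither the Boston Fed dataset nor Varian’s code:

```python
# Sketch of the "omit a variable, does the tree fit just as well?" comparison,
# on made-up data. race_proxy is an invented placeholder column, not the
# Boston Fed data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 5000
income = rng.normal(size=n)
debt_ratio = rng.normal(size=n)
race_proxy = rng.integers(0, 2, size=n)

# In this fake world, race_proxy is correlated with income but has no direct
# effect on the outcome, so the reduced tree should fit about as well.
income = income - 0.5 * race_proxy
denied = (debt_ratio - income + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_full = np.column_stack([income, debt_ratio, race_proxy])
X_reduced = np.column_stack([income, debt_ratio])

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
print("with race_proxy:   ", cross_val_score(tree, X_full, denied, cv=5).mean())
print("without race_proxy:", cross_val_score(tree, X_reduced, denied, cv=5).mean())
```

Whether “the scores are about the same” licenses the conclusion “race was not an important factor” is exactly the leap I’m uneasy about.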

I’m no fan of linear regression, but at least with linear regression I know what assumptions I’m making. Multivariate linear regression is getting into scary territory, but I retain the barest of grips.

Neural nets, which I’ll admit I’ve only used in a classroom setting, are a black box. Take the “proof” Varian offers that the Boston Fed study was wrong-footed: “I can make a model with different variables that interact in ways I don’t understand, and it works!” I say, holy cow, how do you have any idea you’re right?

Now, I haven’t read the Boston Fed study, and I haven’t read Varian’s paper or any paper he may have produced on his reanalysis of that dataset, so I’m basing this speculation purely on Kling’s characterization of the result.

But this sort of process and conclusion is familiar and will become more so: using really complicated tools to analyze really complicated datasets and interpreting the results in really simple ways.

Welcome to the era of big data. We are incapable of understanding any more than the simplest descriptions of a dataset: mean, median, mode, percentiles. Do you understand what variance is? I sure don’t. But if I can calculate it easily, sometimes I can do some neat tricks with the associated statistics, as long as I have some underlying intuition about the data.
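For what it’s worth, the “calculate it easily” part really is trivial: variance is just the average squared distance from the mean. A throwaway sketch with made-up numbers:

```python
# Variance the long way versus the library call, on made-up numbers.
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()
var_by_hand = ((x - mean) ** 2).mean()   # population variance: 4.0
print(var_by_hand, np.var(x))            # same number
print(np.var(x, ddof=1))                 # sample variance, slightly larger
```

Calculating it is the easy part; having an intuition for what the number is telling you is the part that doesn’t scale.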

Multi-dimensional, massive datasets are completely non-intuitive. In catastrophe reinsurance we work every day with gigantic datasets and black box models for measuring the risks of hurricanes and earthquakes.

I like to say that the companies that build these models have a great business: they’ve built a tool that is completely non-falsifiable by humans. There are literally millions of random variables in those catastrophe models, and isolating and analyzing the error is, if not impossible, at least impractical. You can’t even compare the modeled result of an actual historical event (say, Katrina or Sandy) against real claims data in any detail. From what I can tell from talking to them about this kind of exercise, even the model builders themselves calibrate results on an aggregate basis.
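Here’s a toy version of what aggregate calibration can hide, with entirely invented numbers and no vendor model anywhere in sight: every location can be off by a factor of two or three while the event total comes out close to right.

```python
# Toy illustration of aggregate calibration: per-location modeled losses can
# each be off by a large factor while the event total still lands close to
# the actual total. All numbers are invented.
import numpy as np

rng = np.random.default_rng(1)
n_locations = 10_000

# "Actual" claims for one event, heavy-tailed across locations
actual = rng.lognormal(mean=10, sigma=1.5, size=n_locations)

# "Modeled" losses: each location multiplied by a large but unbiased error
error = rng.lognormal(mean=-0.5, sigma=1.0, size=n_locations)
modeled = actual * error

per_location_ratio = modeled / actual
print("typical location-level error factor:",
      np.exp(np.abs(np.log(per_location_ratio)).mean()))  # roughly 2-3x
print("aggregate modeled / aggregate actual:",
      modeled.sum() / actual.sum())                       # within a few percent of 1
```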

Rule #1 in my practical guide to statistics: you don’t understand aggregates.

Claims of mastery of big data modeling are hard to believe.
