Monday, July 10, 2017

The Relationship between Machine Learning and Statistics

UPDATE from original 7/10/17 version to 9/14/17 version: I erroneously maligned confidence intervals for models of big datasets, conflating them with statistical significance; I've fixed that mistake below.


Like anyone who practices data science, I often get asked, by relatives and acquaintances, what "data science" is. Like any such question, it's not very hard to answer this one to the satisfaction of someone who knows little about the topic: in my case, I tend to describe the discipline as applying the principles of traditional statistics to large amounts of data, and I throw in mentions of the importance of writing code and manipulating databases. Nowadays, you can also mention machine learning, and many people will have at least a vague idea of what you're talking about.

However, even if the answer satisfies most listeners, it bothers me—because I've always wondered exactly where "traditional statistics" ends and "machine learning" begins. Defining that boundary turns out to be surprisingly difficult, but also pretty useful: it's one of those cases where the journey is more important than the destination. It's not really important exactly where we draw that line, but thinking about how machine learning differs from traditional statistics leads to further questions about whether (or rather, when) we can apply the accumulated wisdom of decades of statistical practice and quantitative research to the newer domain, and the answers to those further questions prove to be quite valuable.

The Easy Answer

The easy answer to the question is that machine learning is what statistics becomes when there's too much data to manipulate using traditional statistical algorithms. The most obvious illustration of this transformation is linear regression, where gradient descent replaces direct solution. The transformation has other implications as well: while direct solution (and other traditional algorithms) has long been built into statistical packages like SPSS, until recently, someone who wanted to use gradient descent would have to know at least enough code to install the right package and call the right function. (Mind you, we're seeing more and more machine learning algorithms packaged into easy-to-use GUIs nowadays, which will leave the coding for those who want to tweak algorithms, create new ones, or build them into applications--much as most users of SPSS never learn scripting, but experts in statistical methods can use it to create powerful extensions to the original package.) Likewise, storing and processing large datasets lends itself to database applications, which can serve up all that data much more efficiently than the traditional method of reading in a CSV.
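
To make that contrast concrete, here's a minimal sketch, in plain NumPy, of the same least-squares regression solved both ways; the data, learning rate, and iteration count are made up for illustration and aren't taken from any particular package.

# The same ordinary least squares problem solved two ways: the normal
# equations ("direct solution") and batch gradient descent.
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Direct solution: solve (X'X) beta = X'y
beta_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the mean squared error
beta_gd = np.zeros(p)
learning_rate = 0.1
for _ in range(500):
    gradient = (2.0 / n) * X.T @ (X @ beta_gd - y)
    beta_gd -= learning_rate * gradient

print(beta_direct)  # both should land close to [2.0, -1.0, 0.5]
print(beta_gd)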

But the easy answer, while coherent, isn't entirely right. Data scientists actually use a number of techniques that we think of as "machine learning" even when the amounts of data involved are relatively small—indeed, no bigger than what a quantitative researcher in the 1980's would have dealt with. In 2015, my first year as someone with the job title "data scientist", my team worked on a number of demonstration projects for recommender systems. Because we hadn't deployed those systems yet, we usually didn't have real user data, let alone petabytes of it, and even where we did have all of the real data, it wasn't necessarily very big: for example, we used natural language processing (NLP) to measure the similarity between different pages on a website, and that website only had about 1200 pages, each of which included actual content of about two paragraphs. Nonetheless, we never doubted that our applications of collaborative filtering and NLP were "machine learning".

Why? Well, I've never been entirely sure, but I think the answer is that machine learning includes all of those algorithms whose development was prompted by increasing amounts of data and computing power. The Doc2Vec model we used to analyze those 1200 web pages could probably have run on my TRS-80 Color Computer back in the 1980's (it might have been an all-night job), but no one had invented it yet. The same applies to collaborative filters and any number of other recently developed methods that produce useful results even with smallish datasets. All of these algorithms get labeled "machine learning" because they were invented by people who did "machine learning", and, just like the methods used on truly big data, they're usually applied through code rather than through a traditional statistical package.
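
To give a flavor of what that kind of small-corpus similarity task looks like, here's a toy sketch using gensim's Doc2Vec (assuming gensim 4.x); the page identifiers and text here are invented for illustration, not the actual site content.

# Measure similarity between a handful of short "pages" with Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

pages = {
    "returns-policy": "items may be returned within thirty days with a receipt",
    "shipping-info": "orders ship within two business days via standard carriers",
    "refund-faq": "refunds are issued to the original payment method after return",
}

# Each page becomes a TaggedDocument: a token list plus a tag we can look up later.
corpus = [
    TaggedDocument(words=text.split(), tags=[page_id])
    for page_id, text in pages.items()
]

# Tiny vectors and many epochs: with ~1200 short pages, training is cheap.
model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=100)

# Pages most similar to the returns policy, by cosine similarity of their vectors.
print(model.dv.most_similar("returns-policy"))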

However, that's a pretty messy answer, and it raises the question of whether the difference between traditional statistics and machine learning is more a matter of style (or, to put it more nicely, work methods and habits of thought) than of substance.

Interesting Discussion, Scott, But Why Does That Matter?

Yes, there's a point to all this. To wit, the important thing to understand here is that, because there's no bright line between traditional statistics and machine learning, the laws of statistics weren't abolished the first time someone programmed a gradient descent algorithm onto a computer. To me, as a former quantitative researcher in the social sciences, that point has always been blindingly obvious—but in all the machine learning classes I've taken over the years, I've seen only occasional mentions of the relationship between older and newer methods, and I've almost never seen a discussion of the implications of the laws of statistics for machine learning. I've always been struck by this, because really, it's pretty easy to figure out some of those implications.

For example, when your data really is big, you don't have to worry about certain things: the variance due to random sampling is infinitesimal, which means that any differences you find are statistically significant (i.e., if your sample is unbiased, etc., you can be sure the differences are real, though that doesn't in itself imply that they're meaningful). But, as I pointed out above, the data handled by machine learning algorithms isn't always big, and how many data scientists bother to think about exactly how big a dataset has to get before you can stop thinking about significance tests? Confidence intervals present a somewhat more complex problem: with enough data to eliminate error due to random sampling, the confidence intervals around your estimates shrink to almost nothing, but when you've got randomness in the model (that is, your model doesn't account for 100% of the variance in outcomes), you still need confidence intervals (or something equivalent) to express the variability of possible outcomes. I've met data scientists who worry about these problems, but not many of them. Heck, for some of the new techniques, like neural nets, I'm not even sure how you'd go about computing a confidence interval. Feel free to Google it: yes, it can be done, but it's not something that even crosses the mind of the average data scientist, and I've never seen the topic so much as mentioned in a machine learning class I've taken.
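
As a quick illustration of the first point (the numbers here are made up), a difference that is practically negligible becomes "statistically significant" once the groups are large enough:

# With two million observations per group, a true difference of 0.1 on a
# scale whose standard deviation is 15 still produces a tiny p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2_000_000
group_a = rng.normal(loc=100.0, scale=15.0, size=n)
group_b = rng.normal(loc=100.1, scale=15.0, size=n)

result = stats.ttest_ind(group_a, group_b)
print(f"difference in means: {group_b.mean() - group_a.mean():.3f}")
print(f"p-value: {result.pvalue:.2e}")  # tiny: "significant", but hardly meaningful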

The implication of statistics that causes me personally the most grief is regularization: regularization is really, really useful because it allows us to solve a linear regression equation even when the number of independent variables (er, sorry, "features") is greater than the number of cases—for someone trained in traditional statistics, it's nothing short of glorious magic, allowing you to do what should be impossible. So why my grief? Well, there are often cases (remember, data today can get very, very big) when the number of lines of data far exceeds the number of features in the model.

Having put much thought into the problem, I cannot figure out a very good reason why you actually need regularization in such a case, and I can see some real downsides to it: it requires more processing, and it will likely produce a less accurate result. And yet, in all of the machine learning classes I've taken, I've never seen a discussion of this issue, and I rarely see a machine learning package whose functions allow the programmer to decide not to use regularization—you can accomplish the same effect by putting in a tiny number (yes, the model still converges without any meaningful regularization, provided you have enough degrees of freedom), but of course, doing so forgoes the computational advantages of leaving regularization out completely. There's an analogous argument for validation to avoid overfitting: if your dataset is huge, and your training sample is randomly selected, you really shouldn't have overfitting.
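
For what it's worth, here's a rough sanity check of the situation I have in mind, on synthetic data with far more rows than features; scikit-learn's plain LinearRegression serves as the unregularized baseline, and Ridge with a tiny alpha stands in for "putting in a tiny number".

# With rows far outnumbering features, a tiny ridge penalty reproduces the
# unregularized coefficients, and the train/test scores barely differ.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n_rows, n_features = 500_000, 10          # rows far exceed features
X = rng.normal(size=(n_rows, n_features))
coef = rng.normal(size=n_features)
y = X @ coef + rng.normal(scale=1.0, size=n_rows)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_train, y_train)
ridge_tiny = Ridge(alpha=1e-8).fit(X_train, y_train)   # "a tiny number"

# Effectively zero: the tiny penalty changes nothing here.
print(np.max(np.abs(ols.coef_ - ridge_tiny.coef_)))
# Nearly identical R^2 on training and test data: no sign of overfitting.
print(ols.score(X_train, y_train), ols.score(X_test, y_test))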

I may be utterly wrong on both of these points, but the larger concern is that none of the classes I've taken on machine learning has even raised these issues. The silence is so deafening that, when completing the coding exercises that are often required for job applications, I've submitted regularized models when I knew (or at least suspected) that regularization was pointless. (I did note that in my response, though, and in one case I submitted an unregularized model alongside the regularized one--I sometimes wonder if that might have kept me from getting the job.) Even if I'm wrong, and the people teaching classes and coding machine learning packages have thought carefully about whether regularization and validation are actually needed in all cases, it would be useful to learn about the reasons for their decisions; after all, there are always situations in which a given method doesn't apply very well, and if you don't understand the assumptions behind a method, you won't be able to identify those situations.

And don't even get me started on the importance of training in statistical research for distinguishing causation from spurious correlation, as well as avoiding a variety of other analytical pitfalls.

So...when do we start giving every aspiring data scientist real training in statistics?