Monday, July 10, 2017

The Relationship between Machine Learning and Statistics

UPDATE from original 7/10/17 version to 9/14/17 version: I erroneously maligned confidence intervals for models of big datasets, conflating them with statistical significance; I've fixed that mistake below.

Like anyone who practices data science, I often get asked, by relatives and acquaintances, what "data science" is. Like any such question, it's not very hard to answer this one to the satisfaction of someone who knows little about the topic: in my case, I tend to describe the discipline as applying the principles of traditional statistics to large amounts of data, and I throw in mentions of the importance of writing code and manipulating databases. Nowadays, you can also mention machine learning, and many people will have at least a vague idea of what you're talking about.

However, even if the answer satisfies most listeners, it bothers me—because I've always wondered exactly where "traditional statistics" ends and "machine learning" begins. Defining that boundary turns out to be surprisingly difficult, but also pretty useful: it's one of those cases where the journey is more important than the destination. It's not really important exactly where we draw that line, but thinking about how machine learning differs from traditional statistics leads to further questions about whether (or rather, when) we can apply the accumulated wisdom of decades of statistical practice and quantitative research to the newer domain, and the answers to those further questions prove to be quite valuable.

The Easy Answer

The easy answer to the question is that machine learning is what statistics becomes when there's too much data to manipulate using traditional statistical algorithms. The most obvious illustration of this transformation is linear regression, where gradient descent replaces direct solution. The transformation has other implications as well: while direct solution (and other traditional algorithms) has long been built into statistical packages like SPSS, until recently, someone who wanted to use gradient descent would have to know at least enough code to install the right package and call the right function. (Mind you, we're seeing more and more machine learning algorithms packaged into easy-to-use GUIs nowadays, which will leave the coding for those who want to tweak algorithms, create new ones, or build them into applications--much as most users of SPSS never learn scripting, but experts in statistical methods can use it to create powerful extensions to the original package.) Likewise, storing and processing large datasets lends itself to database applications, which can serve up all that data much more efficiently than the traditional method of reading in a CSV.
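To make the contrast concrete, here's a minimal sketch (toy data, NumPy) of the same least-squares problem solved both ways: once with the classical direct solution via the normal equations, and once with gradient descent iterating toward the same answer.

```python
import numpy as np

# Toy data: y = 2 + 3x plus a little noise
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.uniform(-1, 1, 200)])
y = X @ np.array([2.0, 3.0]) + rng.normal(0, 0.1, 200)

# Direct solution (normal equations): solve (X'X) w = X'y
w_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the same mean-squared-error objective
w_gd = np.zeros(2)
lr = 0.1
for _ in range(2000):
    grad = (2 / len(y)) * X.T @ (X @ w_gd - y)  # gradient of the MSE
    w_gd -= lr * grad

print(w_direct)  # roughly [2, 3]
print(w_gd)      # converges to essentially the same coefficients
```

On a problem this small, both methods land on the same answer; the practical difference appears when the data no longer fits in memory and only the iterative, batch-by-batch approach remains feasible.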

But the easy answer, while coherent, isn't entirely right. Data scientists actually use a number of techniques that we think of as "machine learning" even when the amounts of data involved are relatively small—indeed, no bigger than what a quantitative researcher in the 1980's would have dealt with. In 2015, my first year as someone with the job title "data scientist", my team worked on a number of demonstration projects for recommender systems. Because we hadn't deployed those systems yet, we usually didn't have real user data, let alone petabytes of it, and even where we did have all of the real data, it wasn't necessarily very big: for example, we used natural language processing (NLP) to measure the similarity between different pages on a website, and that website only had about 1200 pages, each of which included actual content of about two paragraphs. Nonetheless, we never doubted that our applications of collaborative filtering and NLP were "machine learning".
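For a sense of how little data such similarity measures actually need, here's a toy sketch. It is not the NLP pipeline described above, just a bag-of-words cosine similarity over invented snippets, but it illustrates that "machine learning" techniques work fine at this scale:

```python
from collections import Counter
import math

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Bag-of-words cosine similarity between two short documents."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Invented page snippets for illustration
page1 = "gradient descent minimizes the loss function"
page2 = "gradient descent is used to minimize a loss"
page3 = "recipes for chocolate cake and cookies"

print(cosine_similarity(page1, page2))  # substantial overlap
print(cosine_similarity(page1, page3))  # no shared vocabulary: 0.0
```

A corpus of 1200 two-paragraph pages is trivially small by modern standards, yet the resulting similarity rankings are still useful, which is exactly the point.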

Why? Well, I've never been entirely sure, but I think the answer is that machine learning includes all of those algorithms whose development was prompted by increasing amounts of data and increasing amounts of computing power. The Doc2Vec we used to analyze those 1200 web pages could probably have run on my TRS-80 Color Computer back in the 1980's (it might have been an all-night job), but no one had invented it yet. The same applies to collaborative filters and any number of other recently developed methods that produce useful results even with smallish datasets. All of these algorithms get labeled "machine learning" because they were invented by people who did "machine learning", and, just like the methods used on truly big data, they're usually applied through code rather than through a traditional statistical package.

However, that's a pretty messy answer, and it raises the question of the extent to which the difference between traditional statistics and machine learning is more a matter of style (or, to put it more nicely, of work methods and habits of thought) than of substance.

Interesting Discussion, Scott, But Why Does That Matter?

Yes, there's a point to all this. To wit, the important thing to understand here is that, because there's no bright line between traditional statistics and machine learning, the laws of statistics weren't abolished the first time someone programmed a gradient descent algorithm onto a computer. To me, as a former quantitative researcher in the social sciences, that point has always been blindingly obvious—but in all the machine learning classes I've taken over the years, I've seen only occasional mentions of the relationship between older and newer methods, and I've almost never seen a discussion of the implications of the laws of statistics for machine learning. I've always been struck by this, because really, it's pretty easy to figure out some of those implications.

For example, when your data really is big, you don't have to worry about certain things: the variance due to random sampling is infinitesimal, which means that any differences you find are statistically significant (i.e., if your sample is unbiased, etc., you can be sure the differences are real, though that doesn't in itself imply that they're meaningful). But, as I pointed out above, the data handled by machine learning algorithms isn't always big, and how many data scientists bother to think about exactly how big a dataset has to get before you can stop thinking about significance tests? Confidence intervals present a somewhat more complex problem: with enough data to eliminate error due to random sampling, confidence intervals will be smaller, but when you've got randomness in the model (that is, your model doesn't account for 100% of the variance in outcomes), you still need confidence intervals (or something equivalent) to express the variability of possible outcomes. I've met data scientists who worry about these problems, but not many of them. Heck, for some of the new techniques, like neural nets, I'm not even sure how you'd go about computing a confidence interval. Feel free to Google it: yes, it can be done, but it's not something that even crosses the mind of the average data scientist, and I've never seen the topic so much as mentioned in a machine learning class I've taken.
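To put numbers on "how big is big": the half-width of a 95% confidence interval for a mean shrinks with the square root of the sample size, so you can sketch exactly how fast sampling error fades (here assuming a standard deviation of 1, purely for illustration):

```python
import math

# Half-width of a 95% confidence interval for a mean: 1.96 * sd / sqrt(n)
sd = 1.0
for n in [100, 10_000, 1_000_000, 100_000_000]:
    half_width = 1.96 * sd / math.sqrt(n)
    print(f"n = {n:>11,}: +/- {half_width:.6f}")
```

Sampling error becomes negligible quickly, but it never reaches exactly zero; and, as noted above, none of this shrinks the residual variance that your model fails to explain, which is why prediction intervals don't vanish no matter how big the data gets.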

The implication of statistics that causes me personally the most grief is regularization: regularization is really, really useful because it allows us to solve a linear regression equation even when the number of independent variables (er, sorry, "features") is greater than the number of cases—for someone trained in traditional statistics, it's nothing short of glorious magic, allowing you to do what should be impossible. So why my grief? Well, there are often cases (remember, data today can get very, very big) when the number of lines of data far exceeds the number of features in the model.

Having put much thought into the problem, I cannot figure out a very good reason why you actually need regularization in such a case, and I can see some real downsides to it: it requires more processing, and it will likely produce a less accurate result. And yet, in all of the machine learning classes I've taken, I've never seen a discussion of this issue, and I rarely see a machine learning package whose functions allow the programmer to decide not to use regularization—you can accomplish the same effect by putting in a tiny number (yes, the model still converges without any meaningful regularization, provided you have enough degrees of freedom), but of course, in doing so you can't get the computational advantages of leaving out regularization completely. There's an analogous argument for validation to avoid overfitting: if your dataset is huge, and your training sample is randomly selected, you really shouldn't have overfitting.
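The claim about rows far exceeding features is easy to check empirically. Here's a sketch (toy data, NumPy) comparing plain least squares against ridge regression when n vastly exceeds p; with this much data the penalty term is swamped by X'X, and the two produce essentially identical coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100_000, 5          # many more rows than features
X = rng.normal(size=(n, p))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(0, 1.0, n)

# Plain least squares: solve (X'X) w = X'y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: solve (X'X + lam*I) w = X'y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# With n >> p, X'X is on the order of n while lam is a constant,
# so the regularized and unregularized answers agree to many digits.
print(np.max(np.abs(w_ols - w_ridge)))  # tiny
```

With n = 100,000 and a fixed penalty, the relative perturbation from regularization is on the order of 1/n, which is well below the sampling noise in the coefficients themselves.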

I may be utterly wrong on both of these points, but the larger concern is that none of the classes I've taken on machine learning has even raised these issues. The silence is so deafening that, in executing the coding exercises that are often required for job applications, I've submitted regularized models when I knew (or at least suspected) that regularization was pointless (I did, though, note that in my response, and in one case, I submitted an unregularized model alongside the regularized one--I sometimes wonder if that might have kept me from getting the job.) Even if I'm wrong, and the people teaching classes and coding machine learning packages have thought carefully about whether regularization and validation are actually needed in all cases, it would be useful to learn about the reasons for their decisions; after all, there are always situations in which a given method doesn't apply very well, and if you don't understand the assumptions behind a method, you won't be able to identify those situations.

And don't even get me started about the importance of training in statistical research for distinguishing causation from spurious correlation, as well as avoiding a variety of other analytical pitfalls.

So...when do we start giving every aspiring data scientist real training in statistics?

