The Aspirational Data Scientist

Monday, July 10, 2017

The Relationship between Machine Learning and Statistics

UPDATE from original 7/10/17 version to 9/14/17 version: I erroneously maligned confidence intervals for models of big datasets, conflating them with statistical significance; I've fixed that mistake below.

Like anyone who practices data science, I often get asked, by relatives and acquaintances, what "data science" is. Like any such question, it's not very hard to answer this one to the satisfaction of someone who knows little about the topic: in my case, I tend to describe the discipline as applying the principles of traditional statistics to large amounts of data, and I throw in mentions of the importance of writing code and manipulating databases. Nowadays, you can also mention machine learning, and many people will have at least a vague idea of what you're talking about.

However, even if the answer satisfies most listeners, it bothers me—because I've always wondered exactly where "traditional statistics" ends and "machine learning" begins. Defining that boundary turns out to be surprisingly difficult, but also pretty useful: it's one of those cases where the journey is more important than the destination. It's not really important exactly where we draw that line, but thinking about how machine learning differs from traditional statistics leads to further questions about whether (or rather, when) we can apply the accumulated wisdom of decades of statistical practice and quantitative research to the newer domain, and the answers to those further questions prove to be quite valuable.

The Easy Answer

The easy answer to the question is that machine learning is what statistics becomes when there's too much data to manipulate using traditional statistical algorithms. The most obvious illustration of this transformation is linear regression, where gradient descent replaces direct solution. The transformation has other implications as well: while direct solution (and other traditional algorithms) have long been built into statistical packages like SPSS, until recently, someone who wanted to use gradient descent would have to know at least enough code to install the right package and call the right function. (Mind you, we're seeing more and more machine learning algorithms packaged into easy-to-ease GUI's nowadays, which will leave the coding for those who want to tweak algorithms, create new ones, or build them into applications--much as most users of SPSS never learn scripting, but experts in statistical methods can use it to create powerful extensions to the original package.) Likewise, storing and processing large datasets lends itself to database applications, which can serve up all that data much more efficiently than the traditional method of reading in a CSV.

But the easy answer, while coherent, isn't entirely right. Data scientists actually use a number of techniques that we think of as "machine learning" even when the amounts of data involved are relatively small—indeed, no bigger than what a quantitative researcher in the 1980's would have dealt with. In 2015, my first year as someone with the job title "data scientist", my team worked on a number of demonstration projects for recommender systems. Because we hadn't deployed those systems yet, we usually didn't have real user data, let alone petabytes of it, and even where we did have all of the real data, it wasn't necessarily very big: for example, we used natural language processing (NLP) to measure the similarity between different pages on a website, and that website only had about 1200 pages, each of which included actual content of about two paragraphs. Nonetheless, we never doubted that our applications of collaborative filtering and NLP were "machine learning".

Why? Well, I've never been entirely sure, but I think the answer is that machine learning includes all of those algorithms whose development was prompted by increasing amounts of data and increasing amounts of computing power. The Doc2Vec we used to analyze those 1200 web pages could probably have run on my TRS-80 Color Computer back in the 1980's (it might have been an all-night job), but no one had invented it yet. The same applies to collaborative filters and any number of other recently developed methods that produce useful results even with smallish datasets. All of these algorithms get labeled "machine learning" because they were invented by people who did "machine learning", and, just like the methods used on truly big data, they're usually applied through code rather a traditional statistical package.

However, that's a pretty messy answer, and it really begs the question of the extent to which the difference between traditional statistics and machine learning is a matter of style (or, to put it more nicely, work methods and habits of thought) than of substance.

Interesting Discussion, Scott, But Why Does That Matter?

Yes, there's a point to all this. To wit, the important thing to understand here is that, because there's no bright line between traditional statistics and machine learning, the laws of statistics weren't abolished the first time someone programmed a gradient descent algorithm onto a computer. To me, as a former quantitative researcher in the social sciences, that point has always been blindingly obvious—but in all the machine learning classes I've taken over the years, I've seen only occasional mentions of the relationship between older and newer methods, and I've almost never seen a discussion of the implications of the laws of statistics for machine learning. I've always been struck by this, because really, it's pretty easy to figure out some of those implications.

For example, when your data really is big, you don't have to worry about certain things: the variance due to random sampling is infinitesimal, which means that any differences you find are statistically significant (i.e., if your sample is unbiased, etc., you can be sure the differences are real, though that doesn't in itself imply that they're meaningful). But, as I pointed out above, the data handled by machine learning algorithms isn't always big, and how many data scientists bother to think about exactly how big a dataset has to get before you can stop thinking about significance tests? Confidence intervals present a somewhat more complex problem: with enough data to eliminate error due to random sampling, confidence intervals will be smaller, but when you've got randomness in the model (that is, your model doesn't account for 100% of the variance in outcomes), you still need confidence intervals (or something equivalent) to express the variability of possible outcomes. I've met data scientists who worry about these problems, but not many of them. Heck, for some of the new techniques, like neural nets, I'm not even sure how you'd go about computing a confidence interval. Feel free to Google it: yes, it can be done, but it's not something that even crosses the mind of the average data scientist, and I've never seen the topic so much as mentioned in a machine learning class I've taken.

The implication of statistics that causes me personally the most grief is regularization: regularization is really, really useful because it allows us to solve a linear regression equation even when the number of independent variables (er, sorry, "features") is greater than the number of cases—for someone trained in traditional statistics, it's nothing short of glorious magic, allowing you to do what should be impossible. So why my grief? Well, there are often cases (remember, data today can get very, very big) when the number of lines of data far exceeds the number of features in the model.

Having put much thought into the problem, I cannot figure out a very good reason why you actually need regularization in such a case, and I can see some real downsides to it: it requires more processing, and it will likely produce a less accurate result. And yet, in all of the machine learning classes I've taken, I've never seen a discussion of this issue, and I rarely see a machine learning package whose functions allow the programmer to decide not to use regularization—you can accomplish the same effect by putting in a tiny number (yes, the model still converges without any meaningful regularization, provided you have enough degrees of freedom), but of course, in doing so you can't get the computational advantages of leaving out regularization completely. There's an analogous argument for validation to avoid overfitting: if your dataset is huge, and your training sample is randomly selected, you really shouldn't have overfitting.

I may be utterly wrong on both of these points, but the larger concern is that none of the classes I've taken on machine learning has even raised these issues. The silence is so deafening that, in executing the coding exercises that are often required for job applications, I've submitted regularized models when I knew (or at least suspected) that regularization was pointless (I did, though, note that in my response, and in one case, I submitted an unregularized model alongside the regularized one--I sometimes wonder if that might have kept me from getting the job.) Even if I'm wrong, and the people teaching classes and coding machine learning packages have thought carefully about whether regularization and validation are actually needed in all cases, it would be useful to learn about the reasons for their decisions; after all, there are always situations in which a given method doesn't apply very well, and if you don't understand the assumptions behind a method, you won't be able to identify those situations.

And don't even get me started about the importance of training in statistical research for distinguishing causation from spurious correlationn, as well as avoiding a variety of other analytical pitfalls.

So...when do we start giving every aspiring data scientist real training in statistics?

Tuesday, August 4, 2015

Online Course Review: Coursera and Stanford University's Mining Massive Datasets

It's been quite a while since I last posted—and eight months since I finished the class I'm reviewing. As it happens, I finally got a job as a data scientist in late January, and work has kept me busy. That job will be the subject of my next post, but right now, we're talking about Mining Massive Datasets, offered by Coursera and three professors from Stanford University, Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

The seven-week course covers the same ground as the trio's book Mining of Massive Datasets, though in much less detail. (That link, incidentally, is to the e-book; if you really want the hardcover, you're welcome to follow this sponsored link to Amazon.) It's now been offered twice on Coursera, with a third iteration set to start on September 15th. Oddly, I never intended to take this class: a friend of mine was interested in it, and I signed up so that we could take it together, but initially I had passed on it. I had taken a look at the schedule, seen a few topics that I had covered before in other courses, and decided that I wouldn't get a lot out of it.

What I missed, in my hasty scan of the course description, was that Mining Massive Datasets is not the typical data science course that shows students how to put useful algorithms into practice through code. Instead, this course is about the algorithms themselves: how they work, why they work at scale, and how they've been modified to improve performance or cover different situations. There is a lot of math: this is not a course for someone without a solid background in calculus and linear algebra (it's not like you need to remember how to integrate esoteric functions—but you do need to understand the basics). There are not, on the other hand, any programs to write: many of the exercises absolutely can't be done without writing short scripts or using a statistical language from the command line, but the professors don't require any specific language, and the code isn't graded, only the answers. The point is not to create functioning implementations of the algorithms in question, but rather to understand the nuts and bolts of how they work. The exercises are challenging, and sometimes require consulting the e-book, especially when the complexity of a topic makes it hard to grasp in a short lecture.

Although the bulk of the course is devoted to algorithms, Week 1 provides an excellent description of how HDFS and MapReduce work—without ever giving details on Hadoop or the various languages used to write mappers and reducers. In fact, I came away from Mining Massive Datasets with a far better conceptual grasp of distributed file systems than I got from the Udacity course devoted entirely to the subject. (See my review of that course here.)

It should be said that Jeff Ullman is an excruciatingly monotonic lecturer, who sounds like he's reading everything directly from notes—and you likely wouldn't be at all surprised if he suddenly called out, "Bueller...Bueller...." In addition, his explanations are not as clear as those of Leskovec and Rajaraman. In fairness, though, Ullman tends to cover the most complex topics in the course, and I was always able to figure things out by consulting the book (which covers the material in greater depth, anyway).

In summary, I strongly recommend Mining Massive Datasets for anyone who wants to understand the nitty-gritty of algorithm design for big data. The course is not, however, for the faint of heart. You could make a very successful career in data science without ever taking or it taking anything like it—but taking it will certainly make you better at the profession.

Saturday, September 13, 2014

The Ethical Challenge of "Passive Predation" in Data Science: Can Data Science Provide the Solution, and Not Just the Problem?

I recently ran across an intriguing blog post from Michael Malek, on "Predatory Data Science". Malek notes that data science methods, especially "black box" machine learning, can unintentionally create what he calls "passive predation"—that is, taking advantage of some vulnerable group despite having no intention to do so. He uses the example of a machine learning model, created for a gun manufacturer, that ends up targeting marketing efforts at the suicidal, by identifying keywords associated with depression. The data scientist using the tool in question wouldn't have intended that result, and probably would never even be aware of it, because the group of suicidal depressives would be buried amidst thousands of other micro-segments identified by the same application.

Malek perhaps overdraws his point in the middle part of the post—a historical account of the dehumanizing effects of technology that's reminiscent of Marx's condemnation of working for money in "The Alienation of Labor"—but his main argument is quite sound, and not a little scary.

I wonder, though, if data science itself could provide a solution to this problem. I hereby announce a very unofficial contest, with prizes that will prove trivial at best (I might take a winner out to lunch, or talk about his or her idea at a Data Comunity DC meetup). Pretty much any method of accomplishing this goal, technical or non-technical, is fair game. Any takers?

Thursday, September 11, 2014

Online Course Review: Udacity's Intro to Hadoop and MapReduce

For my first course on Udacity, I decided to take Intro to Hadoop and MapReduce, a course created in conjunction with Cloudera, a company whose business model is based on the open-source Apache Hadoop. To sum up my asseessment, the course was useful, but could have been done much better.

The four-lesson course (short by Udacity standards) is supposed to take about a month to complete—like all Udacity course, and unlike those of Coursera, this is not a true MOOC, taken alongside other students in real time, but rather an interactive tutorial. However, Udacity's model does feature student discussion forums; customers who pay (at the rate of $150/month) also get help from live coaches, feedback on their final projects, and the opportunity to earn a "verified certificate", similar to Coursera's Signature Track, with the difference that Udacity, unlike Coursera, no longer offers certificates for non-paying students. (As I've mentioned before, a verified certificate and two dollars may buy you a cup of coffee, but I wouldn't count on its having any greater worth.)

Before I delve into the specifics of this course, let me say that I'm not a real fan of the Udacity interface. While both providers break each lesson up into a series of short videos, Coursera labels each of those videos with a topic, making it relatively easy to go back and find the material you need; by contrast, Udacity strings all the videos for a particular lesson together under a single heading, and so you have to hunt through all of them to find something (you can click on individual videos, and each one has its own label, but you have to click on or hover over a video to see the label). In addition, whenever the video stops for a quiz, it drops out of fullscreen (assuming you're in fullscreen, of course). Moreover, Udacity's discussion forum (note the singular there) has no organization whatsoever, aside from keyword tags—making a search for specific information rather laborious.

Thr first three lessons of this particular course, which features two instructors from Cloudera, are structured in a manner that the director of a music video would appreciate: many of the videos are very short, and switch jarringly from one instructor to the other. Nonetheless, the instructors are engaging, and there's a nice interview with Doug Cutting about how he helped to create Hadoop, and named it after his toddler son's stuffed elephant. The first two lessons, which explain the basics of how Hadoop and HDFS work, can best be described as "lite"—unchallenging nearly to the point of tedium.

Lesson 3 marks an abrupt change: this is where the programming exercises began. The class requires previous experience with Python, which I lacked, and so the exercises took more time for me than they should have, but I managed. One student in the forum questioned whether this was a course on Hadoop, or a course on Python regular expressions, but doing the exercises helped me learn some Python, and, much as I hate the language, it does have a very powerful vocabulary of regular expressions. Unfortunately, the instructor blew by the concept of Hadoop streaming so fast (in Lesson 2) that I wasn't entirely sure for a while what exactly I was doing, though I was managing to get it to work—and once I looked up Hadoop streaming on my own (it is, for the record, an API that allows Hadoop mappers and reducers to be written to be written in any language), I realized that the interface would work just as well for R.

Although the simpler exercises use an online Python compiler, for the exercises that require large datasets, the course's creators deserve kudos for having students install a virtual UNIX box on which a virtual two-machine Hadoop cluster has already been set up, and then manipulate data and write code in this realistic environment. Unfortunately, the exercises that require this virtual machine seem half-baked.

First off, the instructors haven't actually detailed how to write and execute Python scripts on the UNIX machine (the class discussion forum was very helpful here). Second, the syntax needed to make the scripts work is different from the syntax presented in the video lectures (though, fortunately, there are working sample scripts saved on the virtual machine). Third, and most seriously, one particularly tricky exercise requires knowledge that students could not possibly get from the instructions, or, in all probability, the data itself, but could only get from the hints that emerge from a trial-and-error process of submitting answers to the automated grader—it was an interesting little mystery to solve, but there are no automated graders in real life, and so I'm not sure what I gained from the effort.

Yes, figuring out ambiguous instructions does have some pedagogical value, and in the end, completing the exercises was very satisfying, but, especially in the case of the problem that was insoluble without the automated grader, I got the feeling that the difficultes I faced were the result, not of a pedagological choice, but of a simple lack of effort on the part of the instructors—and I felt like I had wasted part of my time.

According to posts in the forum, Lesson 4 was not part of the original class, though I'm not sure if it was planned all along, or tacked on later. To paraphrase Monty Python and the Holy Grail, the course was completed in an entirely different style at great expense and at the last minute. The lectures feature a different intstructor, a Udacity employee, in place of the Cloudera instructors. This lesson covers design patterns, specifically filtering patterns (more regular expressions), summarization patterns (minimums, maximums, and means, for example), and structural patterns (combining data sets); one lecture also deals with combiners, scripts inserted between mappers and reducers to make things more efficient by doing some of the reduction locally on each machine in the cluster.

I found these lectures better than the previous ones, and the exercises better prepared. I will say, though, that I eventually got bored with writing new and different regular expressions in Python, and didn't finish the last few exercises (or the final project, which isn't graded for non-paying students in any case), though I did watch all of the lectures.

In the end, this half-baked pastiche of a course at least gave me a decent idea of how Hadoop works, and removed the mystique of manipulating data stored on a Hadoop cluster. I wouldn't know how to set up a cluster myself (that wasn't the intent of the class, though I don't think it would be all that hard to do), but I do know how to use Hadoop streaming—and I've realized it's not exactly rocket science.

Monday, August 4, 2014

Online Course Review: Exploratory Data Analysis, from Coursera's Data Science Specialization

Back in May, I reviewed two of the short courses that make up Coursera's Data Science specialization. Although the four-week format greatly limits the content of any one course, I was generally impressed by the scientific approach of the specialization (something all too often lacking in data "science"), and, in the case of Getting and Cleaning Data, by the many pointers provided to R packages and sources of information for further study: the course may not have gone into a lot of depth, but it provided a good overview of what you can do with R.

I recently completed a third course in the specialization, Exploratory Data Analysis, taught by Roger D. Peng (the previous courses I took were taught by Jeff Leek). While I enjoy Peng's lecture style (unlike Leek, he engages the audience by showing his face at the beginnings of lectures), and I learned a lot, the course suffers greatly from the short format.

I initially overlooked this class: from the name (more on this in a minute), I never would have guessed that 3/4 of the lectures would cover graphics in R. Peng teaches the basics of the language's three major graphics packages, the base graphics, lattice, and ggplot2. As is the case for Getting and Cleaning Data, the lectures manage only to skim the surface, particularly for ggplot2, but they do give the student a decent idea of what's possible in R. I do though think that for ggplot2 Peng could do a better job of outling the advanced features than simply pointing students to the book written by the package's author, Hadley Wickham (thankfully, it's possible to find free PDF's of the book online, but I'm not sure it's where I'd want to start for solving a discrete problem, rather than studying ggplot2 in a methodical way).

So what's with the name of the course? Peng presents visualization in R as a way of conducting initial exploration of data, but it's obviously useful for more than that, since R can create decent visualizations of the results of analysis. I suspect that the course name was chosen so that one week of lectures on clustering and dimensionality reduction could be shoehorned into the syllabus. This material probably belongs instead in the Pratical Machine Learning course, but something had to be cut to limit that course to four weeks (cf. the nine-week Machine Learning, also offered by Coursera, and which I've reviewed previously—twice, actually). The fact that clustering and dimensionality reduction can be used for exploratory analysis and visualization is the only thing that ties the entire course together.

What's particular disturbing is the way that all of this combines with the specialization's unique approach to exercises and evaluation. Each course includes a hands-on project, and, because open-ended projects in a MOOC must, for logistical reasons, be graded using a peer grading system, the final project for Exploratory Data Analysis only ends up covering material from the first two weeks of the course, since students need the third week to work on the project, followed by the fourth week to grade it; therefore, half the content of the class doesn't play any role in the project. On top of this—I suppose to avoid overloading students—there's no quiz, homework, or any other form of practice or evaluation covering the material on clustering and dimensionality reduction, which makes it hard for a student to know if he or she really understands those topics.

To sum up, I did find the information on data visualization in R useful, but I would have appreciated a full four weeks on the subject. The coverage of clustering and dimensionality reduction was out of place in the course; nonetheless, many will find it valuable (I had already seen most if not all of it in Machine Learning and another Coursera course, Social Network Analysis, which I've also reviewed).

I do have one more comment, though this applies to the Data Science specialization in general, and to Coursera, rather than solely to this course. Normally, after completing a Coursera course, a student can go back and look at the course archives at any later time; I've found this valuable when I suddenly find myself needing to refresh my memory or find out where I can learn more about a topic. Coursera has apparently disabled this feature for the Data Science courses: their archives are no longer accessible after the grading period is over (about a week after the finish of a course). I say "apparently" because, when I contacted Coursera a few months ago to ask why I could no longer access the archives of Getting and Cleaning Data, I never got a response—this is becoming something of a theme with Coursera, which, as I noted in my second review of Machine Learning, ignores most bug reports for that class. I suppose that paying customers might get better service, but I'm not going to pay just to find out if that's true.

Of course, you can always sign up for the current iteration of a class, since they're offered continuously, but it's annoying to have to do that each month. Fortunately, all of the class materials are also available in a GitHub repository, but it's not as easy to display documents on GitHub as in Coursera's web interface. For a set of courses that only skim the surface, and whose major value is in providing links to deeper information, this is a major failing.

Programming Languages for Big Data, Part 3

And now, one more word on the subject of R's speed. At my prodding, Tommy Jones contacted the authors of the study on programming language speed, and a productive discussion ensued. It turns out that the task in question was one that couldn't be vectorized, which means that R's main strength couldn't be applied in this case. However, it was possible to speed it up by writing C++ functions in R using Rcpp. The authors tried this, and revised their paper, reporting that, using Rcpp, R performed the task only 4-5 times slower than C++. For details, see Tommy's blog post, and the revised paper.

Friday, July 11, 2014

Programming Languages for Big Data, Part 2

I mentioned the recent study on the relative speeds of programming languages to Tommy Jones, a specialist in natural language processing and fellow member of the Data Community DC, and he, being more industrious than I, dove into the code used by the authors of the paper in question. In their R code, he found gems such as a triple-nested "for" loop inside a "while" loop (instead of the much faster "apply" functions), which made the comparisons pretty useless, at least in the case of R. See Tommy's blog, Biased Estimates, for more details.

Nonetheless, it's a pretty interesting question, and I'd love to see someone who's proficient in all of the languages involved try this test again, using better code. I'm still intrigued by the very high speed of MATLAB/Octave—something that leads Andrew Ng to recommend those languages over R for prototyping—though Tommy pointed out to me that, since R is closer to being a full-featured language, it's more flexible than the former languages.

Labels