Tuesday, August 4, 2015

Online Course Review: Coursera and Stanford University's Mining Massive Datasets

It's been quite a while since I last posted—and eight months since I finished the class I'm reviewing. As it happens, I finally got a job as a data scientist in late January, and work has kept me busy. That job will be the subject of my next post, but right now, we're talking about Mining Massive Datasets, offered by Coursera and three professors from Stanford University, Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

The seven-week course covers the same ground as the trio's book Mining of Massive Datasets, though in much less detail. (That link, incidentally, is to the e-book; if you really want the hardcover, you're welcome to follow this sponsored link to Amazon.) It has now been offered twice on Coursera, with a third iteration set to start on September 15th. Oddly enough, I never intended to take this class. I had initially passed on it: a glance at the schedule turned up a few topics I had already covered in other courses, and I decided I wouldn't get a lot out of it. Then a friend of mine expressed interest, and I signed up so that we could take it together.

What I missed, in my hasty scan of the course description, was that Mining Massive Datasets is not the typical data science course that shows students how to put useful algorithms into practice through code. Instead, this course is about the algorithms themselves: how they work, why they work at scale, and how they've been modified to improve performance or cover different situations. There is a lot of math: this is not a course for someone without a solid background in calculus and linear algebra (you don't need to remember how to integrate esoteric functions, but you do need to understand the basics). There are, on the other hand, no programming assignments: many of the exercises can't realistically be done without writing a short script or working in a statistical language from the command line, but the professors don't require any particular language, and the code isn't graded, only the answers. The point is not to create functioning implementations of the algorithms in question, but to understand the nuts and bolts of how they work. The exercises are challenging, and sometimes require consulting the e-book, especially when the complexity of a topic makes it hard to grasp in a short lecture.

Although the bulk of the course is devoted to algorithms, Week 1 provides an excellent description of how HDFS and MapReduce work—without ever giving details on Hadoop or the various languages used to write mappers and reducers. In fact, I came away from Mining Massive Datasets with a far better conceptual grasp of distributed file systems than I got from the Udacity course devoted entirely to the subject. (See my review of that course here.)
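For readers who haven't run into the MapReduce pattern before, here is a minimal word-count sketch, simulated in plain Python. This is my own illustration of the map-shuffle-reduce idea, not an example from the course, which deliberately stays above the level of any particular language or framework.

```python
# A minimal, self-contained simulation of the MapReduce idea (word count).
# Purely illustrative; a real framework would distribute these steps
# across many machines over a distributed file system.
from collections import defaultdict

def mapper(document):
    # Emit one (word, 1) pair per word in the document.
    for word in document.split():
        yield word.lower(), 1

def shuffle(pairs):
    # Group values by key, as the framework does between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reducer(key, values):
    # Combine all values for a key into a single result.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
mapped = (pair for doc in documents for pair in mapper(doc))
counts = dict(reducer(k, v) for k, v in shuffle(mapped))
print(counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```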

It should be said that Jeff Ullman is an excruciatingly monotone lecturer who sounds like he's reading everything directly from notes; you wouldn't be at all surprised if he suddenly called out, "Bueller... Bueller...." In addition, his explanations are not as clear as those of Leskovec and Rajaraman. In fairness, though, Ullman tends to cover the most complex topics in the course, and I was always able to figure things out by consulting the book (which covers the material in greater depth anyway).

In summary, I strongly recommend Mining Massive Datasets for anyone who wants to understand the nitty-gritty of algorithm design for big data. The course is not, however, for the faint of heart. You could have a very successful career in data science without ever taking it or anything like it, but taking it will certainly make you better at the profession.