Tuesday, August 4, 2015

Online Course Review: Coursera and Stanford University's Mining Massive Datasets

It's been quite a while since I last posted—and eight months since I finished the class I'm reviewing. As it happens, I finally got a job as a data scientist in late January, and work has kept me busy. That job will be the subject of my next post, but right now, we're talking about Mining Massive Datasets, offered by Coursera and three professors from Stanford University: Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

The seven-week course covers the same ground as the trio's book Mining of Massive Datasets, though in much less detail. (That link, incidentally, is to the e-book; if you really want the hardcover, you're welcome to follow this sponsored link to Amazon.) It's now been offered twice on Coursera, with a third iteration set to start on September 15th. Oddly enough, I initially passed on this class: I had taken a look at the schedule, seen a few topics that I had covered before in other courses, and decided that I wouldn't get a lot out of it. A friend of mine was interested, though, and I signed up so that we could take it together.

What I missed, in my hasty scan of the course description, was that Mining Massive Datasets is not the typical data science course that shows students how to put useful algorithms into practice through code. Instead, this course is about the algorithms themselves: how they work, why they work at scale, and how they've been modified to improve performance or cover different situations. There is a lot of math: this is not a course for someone without a solid background in calculus and linear algebra (you don't need to remember how to integrate esoteric functions, but you do need to understand the basics). There is, on the other hand, no required programming: many of the exercises simply can't be done without writing short scripts or using a statistical language from the command line, but the professors don't mandate any particular language, and only the answers are graded, not the code. The point is not to create functioning implementations of the algorithms in question, but rather to understand the nuts and bolts of how they work. The exercises are challenging, and sometimes require consulting the e-book, especially when the complexity of a topic makes it hard to grasp in a short lecture.

Although the bulk of the course is devoted to algorithms, Week 1 provides an excellent description of how HDFS and MapReduce work—without ever giving details on Hadoop or the various languages used to write mappers and reducers. In fact, I came away from Mining Massive Datasets with a far better conceptual grasp of distributed file systems than I got from the Udacity course devoted entirely to the subject. (See my review of that course here.)
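
To make that conceptual picture concrete, here's a toy word count in plain Python (my own illustration, not anything from the course or from Hadoop itself): a mapper emits key-value pairs, a shuffle groups them by key, and a reducer collapses each group.

```python
from collections import defaultdict

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every record
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: collapse each group of values to a single count
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 2
```

The whole point of the real thing, of course, is that the map and reduce phases run in parallel on many machines, with HDFS keeping the data local to the computation.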

It should be said that Jeff Ullman is an excruciatingly monotone lecturer who sounds like he's reading everything directly from notes—you likely wouldn't be at all surprised if he suddenly called out, "Bueller...Bueller...." In addition, his explanations are not as clear as those of Leskovec and Rajaraman. In fairness, though, Ullman tends to cover the most complex topics in the course, and I was always able to figure things out by consulting the book (which covers the material in greater depth, anyway).

In summary, I strongly recommend Mining Massive Datasets for anyone who wants to understand the nitty-gritty of algorithm design for big data. The course is not, however, for the faint of heart. You could make a very successful career in data science without ever taking it or anything like it—but taking it will certainly make you better at the profession.

Saturday, September 13, 2014

The Ethical Challenge of "Passive Predation" in Data Science: Can Data Science Provide the Solution, and Not Just the Problem?

I recently ran across an intriguing blog post from Michael Malek, on "Predatory Data Science". Malek notes that data science methods, especially "black box" machine learning, can unintentionally create what he calls "passive predation"—that is, taking advantage of some vulnerable group despite having no intention to do so. He uses the example of a machine learning model, created for a gun manufacturer, that ends up targeting marketing efforts at the suicidal, by identifying keywords associated with depression. The data scientist using the tool in question wouldn't have intended that result, and probably would never even be aware of it, because the group of suicidal depressives would be buried amidst thousands of other micro-segments identified by the same application.

Malek perhaps overdraws his point in the middle part of the post—a historical account of the dehumanizing effects of technology that's reminiscent of Marx's condemnation of working for money in "The Alienation of Labor"—but his main argument is quite sound, and not a little scary.

I wonder, though, if data science itself could provide a solution to this problem. I hereby announce a very unofficial contest, with prizes that will prove trivial at best (I might take a winner out to lunch, or talk about his or her idea at a Data Community DC meetup). Pretty much any method of detecting this kind of passive predation, technical or non-technical, is fair game. Any takers?

Thursday, September 11, 2014

Online Course Review: Udacity's Intro to Hadoop and MapReduce

For my first course on Udacity, I decided to take Intro to Hadoop and MapReduce, a course created in conjunction with Cloudera, a company whose business model is based on the open-source Apache Hadoop. To sum up my assessment, the course was useful, but could have been done much better.

The four-lesson course (short by Udacity standards) is supposed to take about a month to complete—like all Udacity courses, and unlike those of Coursera, this is not a true MOOC, taken alongside other students in real time, but rather an interactive tutorial. However, Udacity's model does feature student discussion forums; customers who pay (at the rate of $150/month) also get help from live coaches, feedback on their final projects, and the opportunity to earn a "verified certificate", similar to Coursera's Signature Track, with the difference that Udacity, unlike Coursera, no longer offers certificates for non-paying students. (As I've mentioned before, a verified certificate and two dollars may buy you a cup of coffee, but I wouldn't count on its having any greater worth.)

Before I delve into the specifics of this course, let me say that I'm not a real fan of the Udacity interface. While both providers break each lesson up into a series of short videos, Coursera labels each of those videos with a topic, making it relatively easy to go back and find the material you need; by contrast, Udacity strings all the videos for a particular lesson together under a single heading, and so you have to hunt through all of them to find something (you can click on individual videos, and each one has its own label, but you have to click on or hover over a video to see the label). In addition, whenever the video stops for a quiz, it drops out of fullscreen (assuming you're in fullscreen, of course). Moreover, Udacity's discussion forum (note the singular there) has no organization whatsoever, aside from keyword tags—making a search for specific information rather laborious.

The first three lessons of this particular course, which features two instructors from Cloudera, are structured in a manner that the director of a music video would appreciate: many of the videos are very short, and switch jarringly from one instructor to the other. Nonetheless, the instructors are engaging, and there's a nice interview with Doug Cutting about how he helped to create Hadoop, and named it after his toddler son's stuffed elephant. The first two lessons, which explain the basics of how Hadoop and HDFS work, can best be described as "lite"—unchallenging nearly to the point of tedium.

Lesson 3 marks an abrupt change: this is where the programming exercises begin. The class requires previous experience with Python, which I lacked, so the exercises took more time for me than they should have, but I managed. One student in the forum questioned whether this was a course on Hadoop or a course on Python regular expressions, but doing the exercises helped me learn some Python, and, much as I hate the language, it does have a very powerful vocabulary of regular expressions. Unfortunately, the instructor blew by the concept of Hadoop streaming so fast (in Lesson 2) that for a while I wasn't entirely sure what exactly I was doing, though I was managing to get it to work—and once I looked up Hadoop streaming on my own (it is, for the record, an API that allows Hadoop mappers and reducers to be written in any language), I realized that the interface would work just as well for R.
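
For the curious: a streaming mapper and reducer really are just ordinary programs that read stdin and write tab-separated key-value lines. The sketch below is my own toy word count, not one of the course's exercises; it simulates the map | sort | reduce pipeline that Hadoop streaming runs for you across the cluster.

```python
import io

def mapper(stdin, stdout):
    # Streaming mapper: read raw lines, emit "key<TAB>value" lines
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word.lower()}\t1\n")

def reducer(stdin, stdout):
    # Streaming reducer: input arrives sorted by key, so we can total
    # a run of identical keys and emit a line when the key changes
    current, total = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = key, 0
        total += int(value)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

# Simulate the pipeline Hadoop would run: map | sort | reduce
mapped = io.StringIO()
mapper(io.StringIO("the quick fox\nthe dog\n"), mapped)
sorted_input = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
out = io.StringIO()
reducer(io.StringIO(sorted_input), out)
print(out.getvalue())
```

Since the contract is just stdin and stdout, the same logic could be written in R, Perl, or anything else—which is exactly why the streaming interface would work just as well for R.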

Although the simpler exercises use an online Python compiler, for the exercises that require large datasets, the course's creators deserve kudos for having students install a virtual UNIX box on which a virtual two-machine Hadoop cluster has already been set up, and then manipulate data and write code in this realistic environment. Unfortunately, the exercises that require this virtual machine seem half-baked.

First off, the instructors never actually explain how to write and execute Python scripts on the UNIX machine (the class discussion forum was very helpful here). Second, the syntax needed to make the scripts work is different from the syntax presented in the video lectures (though, fortunately, there are working sample scripts saved on the virtual machine). Third, and most seriously, one particularly tricky exercise requires knowledge that students could not possibly get from the instructions, or, in all probability, the data itself, but could only get from the hints that emerge from a trial-and-error process of submitting answers to the automated grader—it was an interesting little mystery to solve, but there are no automated graders in real life, and so I'm not sure what I gained from the effort.

Yes, figuring out ambiguous instructions does have some pedagogical value, and in the end, completing the exercises was very satisfying. But, especially in the case of the problem that was insoluble without the automated grader, I got the feeling that the difficulties I faced were the result not of a pedagogical choice, but of a simple lack of effort on the part of the instructors—and I felt like I had wasted part of my time.

According to posts in the forum, Lesson 4 was not part of the original class, though I'm not sure if it was planned all along, or tacked on later. To paraphrase Monty Python and the Holy Grail, the course was completed in an entirely different style at great expense and at the last minute. The lectures feature a different instructor, a Udacity employee, in place of the Cloudera instructors. This lesson covers design patterns, specifically filtering patterns (more regular expressions), summarization patterns (minimums, maximums, and means, for example), and structural patterns (combining data sets); one lecture also deals with combiners, scripts inserted between mappers and reducers to make things more efficient by doing some of the reduction locally on each machine in the cluster.
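
To illustrate the combiner idea with a toy example of my own (not one from the course): a combiner computing means can't merge partial means directly, which is why the summarization pattern carries (sum, count) pairs until the final division.

```python
from collections import defaultdict

def mapper(records):
    # Emit (key, (value, 1)) so partial results can be merged later
    for key, value in records:
        yield key, (value, 1)

def combine(pairs):
    # Combiner: merge (sum, count) pairs locally on one machine,
    # before anything crosses the network. Emitting partial means
    # here instead would give the wrong final answer.
    acc = defaultdict(lambda: (0, 0))
    for key, (s, c) in pairs:
        ps, pc = acc[key]
        acc[key] = (ps + s, pc + c)
    return list(acc.items())

def reducer(groups):
    # Final reducer: merge all partial (sum, count) pairs, then divide
    acc = defaultdict(lambda: (0, 0))
    for key, (s, c) in groups:
        ps, pc = acc[key]
        acc[key] = (ps + s, pc + c)
    return {key: s / c for key, (s, c) in acc.items()}

# Two "machines", each combining locally before the shuffle
node1 = combine(mapper([("a", 2), ("a", 4)]))
node2 = combine(mapper([("a", 6), ("b", 10)]))
means = reducer(node1 + node2)
print(means)  # {'a': 4.0, 'b': 10.0}
```

The savings come from the shuffle: each machine ships one pair per key instead of one pair per record.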

I found these lectures better than the previous ones, and the exercises better prepared. I will say, though, that I eventually got bored with writing new and different regular expressions in Python, and didn't finish the last few exercises (or the final project, which isn't graded for non-paying students in any case), though I did watch all of the lectures.

In the end, this half-baked pastiche of a course at least gave me a decent idea of how Hadoop works, and removed the mystique of manipulating data stored on a Hadoop cluster. I wouldn't know how to set up a cluster myself (that wasn't the intent of the class, though I don't think it would be all that hard to do), but I do know how to use Hadoop streaming—and I've realized it's not exactly rocket science.

Monday, August 4, 2014

Online Course Review: Exploratory Data Analysis, from Coursera's Data Science Specialization

Back in May, I reviewed two of the short courses that make up Coursera's Data Science specialization. Although the four-week format greatly limits the content of any one course, I was generally impressed by the scientific approach of the specialization (something all too often lacking in data "science"), and, in the case of Getting and Cleaning Data, by the many pointers provided to R packages and sources of information for further study: the course may not have gone into a lot of depth, but it provided a good overview of what you can do with R.

I recently completed a third course in the specialization, Exploratory Data Analysis, taught by Roger D. Peng (the previous courses I took were taught by Jeff Leek). While I enjoy Peng's lecture style (unlike Leek, he engages the audience by showing his face at the beginnings of lectures), and I learned a lot, the course suffers greatly from the short format.

I initially overlooked this class: from the name (more on this in a minute), I never would have guessed that 3/4 of the lectures would cover graphics in R. Peng teaches the basics of the language's three major graphics packages: the base graphics, lattice, and ggplot2. As is the case for Getting and Cleaning Data, the lectures manage only to skim the surface, particularly for ggplot2, but they do give the student a decent idea of what's possible in R. I do think, though, that Peng could do a better job of outlining ggplot2's advanced features than simply pointing students to the book written by the package's author, Hadley Wickham (thankfully, it's possible to find free PDFs of the book online, but I'm not sure it's where I'd want to start for solving a discrete problem, rather than studying ggplot2 in a methodical way).

So what's with the name of the course? Peng presents visualization in R as a way of conducting initial exploration of data, but it's obviously useful for more than that, since R can create decent visualizations of the results of analysis. I suspect that the course name was chosen so that one week of lectures on clustering and dimensionality reduction could be shoehorned into the syllabus. This material probably belongs instead in the Practical Machine Learning course, but something had to be cut to limit that course to four weeks (cf. the nine-week Machine Learning, also offered by Coursera, which I've reviewed twice before). The fact that clustering and dimensionality reduction can be used for exploratory analysis and visualization is the only thing that ties the entire course together.
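
For readers who haven't seen the clustering material: the workhorse algorithm is simpler than it sounds. Here's a minimal k-means (Lloyd's algorithm) sketch with made-up toy points—in Python, for what it's worth, though the course itself naturally does all of this in R.

```python
def kmeans(points, k, iters=20):
    # Lloyd's algorithm: repeatedly assign each point to its nearest
    # center, then move each center to the mean of its assigned points.
    # (Deterministic initialization from the first k points, to keep
    # this toy example reproducible; real implementations randomize.)
    centers = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2
                            + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                centers[i] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(pts, 2))
print(centers)  # two centers, one near (0.33, 0.33), one near (10.33, 10.33)
```

Plotting the points colored by cluster assignment is exactly the sort of exploratory visualization the course has in mind.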

What's particularly disturbing is the way that all of this combines with the specialization's unique approach to exercises and evaluation. Each course includes a hands-on project, and, because open-ended projects in a MOOC must, for logistical reasons, be graded using a peer grading system, the final project for Exploratory Data Analysis only ends up covering material from the first two weeks of the course: students need the third week to work on the project, and the fourth week to grade it, so half the content of the class doesn't play any role in the project. On top of this—I suppose to avoid overloading students—there's no quiz, homework, or any other form of practice or evaluation covering the material on clustering and dimensionality reduction, which makes it hard for a student to know if he or she really understands those topics.

To sum up, I did find the information on data visualization in R useful, but I would have appreciated a full four weeks on the subject. The coverage of clustering and dimensionality reduction was out of place in the course; nonetheless, many will find it valuable (I had already seen most if not all of it in Machine Learning and another Coursera course, Social Network Analysis, which I've also reviewed).

I do have one more comment, though this applies to the Data Science specialization in general, and to Coursera, rather than solely to this course. Normally, after completing a Coursera course, a student can go back and look at the course archives at any later time; I've found this valuable when I suddenly find myself needing to refresh my memory or find out where I can learn more about a topic. Coursera has apparently disabled this feature for the Data Science courses: their archives are no longer accessible after the grading period is over (about a week after the finish of a course). I say "apparently" because, when I contacted Coursera a few months ago to ask why I could no longer access the archives of Getting and Cleaning Data, I never got a response—this is becoming something of a theme with Coursera, which, as I noted in my second review of Machine Learning, ignores most bug reports for that class. I suppose that paying customers might get better service, but I'm not going to pay just to find out if that's true.

Of course, you can always sign up for the current iteration of a class, since they're offered continuously, but it's annoying to have to do that each month. Fortunately, all of the class materials are also available in a GitHub repository, but it's not as easy to display documents on GitHub as in Coursera's web interface. For a set of courses that only skim the surface, and whose major value is in providing links to deeper information, this is a major failing.

Programming Languages for Big Data, Part 3

And now, one more word on the subject of R's speed. At my prodding, Tommy Jones contacted the authors of the study on programming language speed, and a productive discussion ensued. It turns out that the task in question was one that couldn't be vectorized, which means that R's main strength couldn't be applied in this case. However, it was possible to speed it up by writing C++ functions in R using Rcpp. The authors tried this, and revised their paper, reporting that, using Rcpp, R performed the task only 4-5 times slower than C++. For details, see Tommy's blog post, and the revised paper.

Friday, July 11, 2014

Programming Languages for Big Data, Part 2

I mentioned the recent study on the relative speeds of programming languages to Tommy Jones, a specialist in natural language processing and fellow member of the Data Community DC, and he, being more industrious than I, dove into the code used by the authors of the paper in question. In their R code, he found gems such as a triple-nested "for" loop inside a "while" loop (instead of the much faster "apply" functions), which made the comparisons pretty useless, at least in the case of R. See Tommy's blog, Biased Estimates, for more details.
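
Tommy's point is about idiomatic style: an explicit element-by-element loop in a high-level language is slow, while vectorized or built-in operations push the looping down into compiled code. A rough analogy in Python (my own toy example; the study itself used R, where the same contrast holds between "for" loops and vectorized arithmetic or the apply family):

```python
import timeit

def loop_dot(xs, ys):
    # Element-by-element loop, analogous to nested R "for" loops
    total = 0.0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total

def builtin_dot(xs, ys):
    # Push the loop into C-level builtins (zip/sum), analogous to
    # vectorized arithmetic in R
    return sum(x * y for x, y in zip(xs, ys))

xs = [0.5] * 100_000
ys = [2.0] * 100_000
assert loop_dot(xs, ys) == builtin_dot(xs, ys) == 100_000.0

# Same answer, but the built-in version typically runs faster
print(timeit.timeit(lambda: loop_dot(xs, ys), number=10))
print(timeit.timeit(lambda: builtin_dot(xs, ys), number=10))
```

Benchmark a language with the slow idiom and you're really benchmarking the programmer, not the language—which is exactly Tommy's objection to the paper's R code.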

Nonetheless, it's a pretty interesting question, and I'd love to see someone who's proficient in all of the languages involved try this test again, using better code. I'm still intrigued by the very high speed of MATLAB/Octave—something that leads Andrew Ng to recommend those languages over R for prototyping—though Tommy pointed out to me that, since R is closer to being a full-featured programming language, it's more flexible than MATLAB and Octave.

Sunday, July 6, 2014

Programming Languages for Big Data

I'm a big fan of R: it just seems intuitive to me, and there's a package available for practically any type of analysis you might want to do. I have some experience with Octave (essentially the open-source version of proprietary MATLAB) and Python (which I detest, especially with its confusing statements-masquerading-as-function-calls), but I find R easiest to work with.

Therefore, I find a new study comparing the speeds of various languages for a statistical problem pretty depressing. When looking at this kind of study, it's important to keep one big thing in mind: the authors tested the various languages on only a single task (albeit a common one, at least in economic modeling), and different languages have different strengths and weaknesses at different tasks.

Nonetheless, the differences in run time are so large that it's not unreasonable to draw some conclusions. Even when compiled, R takes 240 to 340 times as long to run as C++. How about Python and MATLAB? Python with the default CPython interpreter is nearly as slow as compiled R (155 to 269 times), but with PyPy it reaches 1/44 of the speed of C++. MATLAB takes only about 10 times as long to run as C++, or only about 50% longer when using Mex files (C, C++, or Fortran subroutines called by MATLAB). (Octave has Oct files written in C++, which serve a similar purpose; Octave can use Mex files, but not as well as MATLAB. See the GNU documentation on the subject for details.)

Wow. The bottom line is that R might not be the best choice for time-consuming applications—in other words, those that have to crunch through a lot of data, especially if the calculations involved are complex. I had read that it's slower than the alternatives, but I had no idea that the differences were so dramatic. I really should polish my Octave skills, and, judging by many of the job ads I see, knowing some C++ would not only open up possibilities for faster-running code, but would also make me more employable.