Thursday, May 29, 2014

Online Course Review: Coursera's Machine Learning, Part 2

Back in October, I reviewed Coursera's Machine Learning course, taught by Stanford professor and Coursera co-founder Andrew Ng. As I mentioned when I first reviewed the course, I wasn't able to finish it, because I was starting a new job and moving halfway across the country. I've just been able to complete the most recent iteration of Machine Learning (which was nearly identical to earlier versions), and I'd like to add a few more thoughts to my original review.

My last time through the course (the session that began on April 22nd, 2013), I completed almost all of the lessons on supervised machine learning methods, such as regression, logistic regression, and neural networks. In this session (which began on March 3rd, 2014), I repeated those lessons, and also finished the rest, most of which covered unsupervised techniques, such as clustering and recommender systems. I won't repeat the contents of my earlier review, except to note that what I said then remains true: Andrew Ng is a clear and charismatic lecturer, he covers advanced techniques, and he provides a number of practical tips, but the programming exercises are a bit canned, and may not fully prepare students to write their own scripts in Octave.

My new comments mostly reflect comparisons to other MOOC's, particularly the two courses from Coursera's Data Science specialization that I took recently. First of all, I think that Machine Learning could do more with the online format. In fact, most MOOC's consist largely of video-recorded lectures, with the addition of a sprinkling of interactive content, but Machine Learning falls short even by comparison with other online courses. The class does feature a very effective automatic grader, but it lacks any links to additional resources, or, very importantly, notes or slides from the lectures. While the latter omission may seem trivial (I didn't notice it the first time I took the course), a lack of lecture notes makes it difficult to go back later and review material from a lecture, except by watching the whole thing again. It's true that the programming exercises include detailed instructions, but not all of the course's topics are covered by these exercises, and at any rate the organization of the instructions can make it difficult to locate information on a specific subject.

I might also amplify my comment from the earlier review that the programming exercises involve mostly copying and pasting, rather than writing entire scripts. There's a reason for this: the focus of the course is on algorithms, not on other parts of solving machine learning problems. Nonetheless, my experiences taking other courses, especially those from the Data Science specialization, have demonstrated the practical value of forcing students to think about the nuts and bolts of a research project. Machine Learning's lack of a big final project also arguably deprives students of valuable practical experience, especially since these projects usually require students to explore the course material in greater depth than do short exercises; on the other hand, the fact that a final project can only cover a single topic from the course—or at most a handful of them—calls the value of such projects into question.

My final concern is that Machine Learning seems to have gone on autopilot at this point, with little or no attention from Ng or anyone else who helped him prepare the course materials. Questions in the discussion forum are answered instead by "Community TA's", that is, volunteers who took earlier sessions of the course. Most disturbingly, the majority of reports of errors in the course materials go unanswered, and those that are answered are answered by Community TA's, who lack the ability to fix the errors. For example, a month ago I discovered that the automatic grader accepted one version of my code and rejected another, even though the two versions were algebraically equivalent. My report of this apparent bug still hasn't been answered.

Despite these concerns, I still heartily recommend Machine Learning as a valuable starting point for anyone interested in data science. While the course was offered twice in 2013, the start date of the next iteration, on June 16th, 2014, suggests that Coursera may be planning to offer sessions of the 10-week course almost back-to-back, meaning several sessions each year.

What's next for me? I'll soon be posting a review of Udacity's short Intro to Hadoop and MapReduce. After that, I'm considering taking two more courses from the Data Science specialization, first Exploratory Data Analysis, which will give me some practical experience with graphics programing in R, and then Practical Machine Learning, which will provide experience using R for machine learning, as well as a basis for comparing the machine learning course reviewed above (though the course for the Data Science specialization, at four weeks, is much shorter, and can't possibly cover the same ground).

In the meantime, while I'm still looking for work as a data scientist, I've had a number of interviews, and some of the potential employers have read and commented positively on this blog. I hope that provides an example for other social scientists out there that, yes, you can become a data scientist.

Thursday, May 8, 2014

Online Course Reviews: The Data Scientist's Toolbox, and Getting and Cleaning Data, from Coursera's Data Science Specialization

I recently completed Coursera's The Data Scientist's Toolbox and Getting and Cleaning Data, two courses that form part of the online learning provider's new Data Science specialization, taught by Brian Caffo, Jeffrey Leek, and Roger D. Peng, biostatistics professors at Johns Hopkins University, and, in the cases of Leek and Peng, authors of the Simply Statistics blog. Both of the courses I took were taught by Jeff Leek (referred to in my earlier post today). I found Getting and Cleaning Data to be an especially useful course, teaching some practical skills that are quite essential to the real-world practice of data science. However, I probably wouldn't recommend the entire specialization to anyone coming from the world of quantitative research in academia, since a big focus of the program is teaching the scientific method and the logic of statistical inference—that is, things a quantitative social scientist should know already. First, however, a little background on Coursera's specializations....

Coursera has recently introduced a handful of "specializations", each consisting of a series of short courses followed by a capstone project. The specializations continue Coursera's effort to monetize its offerings through the Signature Track, which offers a "Verified Certificate" for those who pay a fee (typically about $50-$100) to take the course.

The Signature Track in itself has dubious value. Allegedly, its purpose is to provide a more useful credential than the certificates Coursera has traditionally offered for its free classes. To make the Verified Certificate more useful (that is, more impressive to potential employers), Coursera takes measures to guarantee you did the work yourself, but these measures seem fairly easy to circumvent. Specializations add an additional sweetener: if you take every class in the specialization on the Signature Track, you can then take the capstone project (offered as an additional class), which is not available to students who take the courses for free (or even, for that matter, to students who pay for only some of the courses). Students completing the specialization also receive a specialization certificate.

The Data Science specialization includes 10 short (four-week) classes, including the capstone, each priced at $49 for the Signature Track. If you stump up the whole $490 at once, you can take any of the courses as many times as you like over the next two years (in case you don't pass the first time); if you pay for the courses one at a time, you can only retake each one once (which is probably enough—honestly, if you can't pass one of these classes, you probably don't belong in the profession, but sometimes life gets busy, and you can't finish the work for a class). Each of the first nine courses will be offered once a month; the first six are available already, and the remaining three will be offered for the first time in June. For a couple of the classes, there's also an option to substitute an alternate course on Coursera. The capstone has yet to be schedule (word in the forums has it that it'll be offered in fall), and I'm not sure how often Coursera plans to offer it.

The course that really interested me was Getting and Cleaning Data, but I signed up for The Data Scientist's Toolbox because it's required for the rest of the specialization; R Programming is also required, but I already had some experience with R, and I had no intention of completing the entire specialization, and so I skipped this one. Taking one of the later courses at the same time as the required intro course didn't pose any difficulties for me, but I think that someone who has no experience with R would probably want to complete that class before tackling any of the others.

Much of The Data Scientist's Toolbox is devoted to introducing the topics of the specialization's other eight courses; frankly, you can skip this if you don't intend to take those courses (or possibly, even if you do intend to take them—you will, after all, cover that information later, though if you're taking the whole specialization, you may need to watch the video lectures in question in order to complete the quiz for Week 1). For me, the most useful content of this class was its introductions to Git, GitHub, and RStudio (I had been using the plain old R Console, and RStudio makes things considerably easier). RStudio is required for the programming necessary to complete the assignments in the later courses, and Git and GitHub are necessary to complete the projects at the end of each course (you have to upload your work to GitHub so that other students can peform peer assessments on it). For the sake of full disclosure, let me say that I skipped the introductory lectures in Week 1 of this course (though I did pass all the quizzes), and did not complete the course project, which consisted of taking screenshots to prove that you'd installed Git, GitHub, and RStudio (I installed all three, but I wasn't really concerned with getting the course certificate).

I found Getting and Cleaning Data invaluable. I took the course because I wanted to learn how to get data off the web. For example, in the project I did for Coursera's Social Network Analysis last year, I ended up saving data from several hundred web pages by hand, which is not a particularly efficient way of doing things. Getting and Cleaning Data promises to teach students how to extract data from common data storage formats (including databases, specifically SQL, XML, JSON, and HDF5), and from the web using API's and web scraping. The syllabus also includes tips on using R to clean and recode data, and, in the last lecture, a long list of links to sources of data. It's also worth noting that the style of the video lectures is a bit different from those of other classses I've taken: there's never any video of the instructor, just the instructor's voice over the lecture notes.

Initially, I was skeptical, because most of the lectures amount to little more than a list of R packages, functions (with a few short examples), and links for further information. The information blows past you so fast that there's no hope of remembering much of it. However, the lecture notes (in both HTML5 and PDF—the HMTL5 is a little awkward to navigate, but the links work, unlike in the PDF) provide a wonderful resource that you'll find yourself referring to again and again. I've often found that the hardest part of a project is knowing where to start, and the lectures in Getting and Cleaning Data point you in the right direction; in fact, I'm using information from the lectures on web-scraping and JSON right now to do an updated version of my project for Social Network Analysis, a statistically informed visualization of which cards in the game Android: Netrunner appear together in the decks designed by players. Look for that to be posted here soon!

Among the data science courses that I've taken online, Getting and Cleaning Data is the first one that taught me how to go out and get data and then put it in a form that's usable for analysis. By contrast, Coursera's Machine Learning, taught by Stanford's Andrew Ng, provides highly practical advice on selecting and using algorithms, but does so uses very much canned programming exercises, in which the data has already been collected and processed. In fact, the two course are highly complementary, at least inasmuch as they give you ideas about how to handle different stages of a data science project. It should though be noted that Machine Learning uses Octave (essentially the open-source version of MATLAB) rather than R; the Data Science specialization includes its own (much shorter) Practical Machine Learning course, as well as an earlier course on Regression Models that delves far more deeply into that topic than does Machine Learning.

I should add that, for this class too, I never completed the final project: it looks like a highly practical exercise, but I was short on time, and more interested in my own project; again, I didn't care much about earning a certificate, with my main concern being to learn the nuts and bolts of getting data from the web.

Finally, let me offer a few comments on the Data Science specialization as a whole. I would not recommend completing the entire specialization for anyone who's well-versed in statistics and the scientfic method: if you're a competent social scientist (as opposed to someone who took one stats course as an undergraduate), you already understand important issues like sampling, causal inference, and reproducibility (though, admittedly, I've read more than a few articles by social scientists who evidently had shaky grasps on these concepts). For a specialization that labels itself as "Data Science", there's also scant coverage of databases. That being said, anyone interested in data science might find Getting and Cleaning Data, R Programming, and Practical Machine Learning useful, and for someone who doesn't have a background as a quantitative researcher, I can't recommend this specialization's focus on the scientific method and applied statistics highly enough.

Why Data Science Needs Statistics

If you've read my earlier posts about why a scientific approach is important to data science, you won't find it surprising that I recommend Jeff Leek's recent post on the Simply Statistics blog, "Why Big Data Is in Trouble: They Forgot about Applied Statistics". Leek, a biostatistics professor at Johns Hopkins, and one of the instructors in Coursera's Data Science specialization, argues that a number of recent big data failures, including that of Google Flu Trends, can be chalked up to a lack of statistical knowledge among the researchers in question. Leek cites sampling, data collection, causal logic, model specification, and sensitivity analysis as areas where a solid knowledge of applied statistics could have prevented serious errors. It's a short but cogent read.