Thursday, May 8, 2014

Online Course Reviews: The Data Scientist's Toolbox, and Getting and Cleaning Data, from Coursera's Data Science Specialization

I recently completed Coursera's The Data Scientist's Toolbox and Getting and Cleaning Data, two courses that form part of the online learning provider's new Data Science specialization, taught by Brian Caffo, Jeffrey Leek, and Roger D. Peng, biostatistics professors at Johns Hopkins University, and, in the cases of Leek and Peng, authors of the Simply Statistics blog. Both of the courses I took were taught by Jeff Leek (referred to in my earlier post today). I found Getting and Cleaning Data to be an especially useful course, teaching some practical skills that are quite essential to the real-world practice of data science. However, I probably wouldn't recommend the entire specialization to anyone coming from the world of quantitative research in academia, since a big focus of the program is teaching the scientific method and the logic of statistical inference—that is, things a quantitative social scientist should know already. First, however, a little background on Coursera's specializations....

Coursera has recently introduced a handful of "specializations", each consisting of a series of short courses followed by a capstone project. The specializations continue Coursera's effort to monetize its offerings through the Signature Track, which offers a "Verified Certificate" for those who pay a fee (typically about $50-$100) to take the course.

The Signature Track in itself has dubious value. Allegedly, its purpose is to provide a more useful credential than the certificates Coursera has traditionally offered for its free classes. To make the Verified Certificate more useful (that is, more impressive to potential employers), Coursera takes measures to guarantee you did the work yourself, but these measures seem fairly easy to circumvent. Specializations add an additional sweetener: if you take every class in the specialization on the Signature Track, you can then take the capstone project (offered as an additional class), which is not available to students who take the courses for free (or even, for that matter, to students who pay for only some of the courses). Students completing the specialization also receive a specialization certificate.

The Data Science specialization includes 10 short (four-week) classes, including the capstone, each priced at $49 for the Signature Track. If you stump up the whole $490 at once, you can take any of the courses as many times as you like over the next two years (in case you don't pass the first time); if you pay for the courses one at a time, you can only retake each one once (which is probably enough; honestly, if you can't pass one of these classes, you probably don't belong in the profession, but sometimes life gets busy, and you can't finish the work for a class). Each of the first nine courses will be offered once a month; the first six are available already, and the remaining three will be offered for the first time in June. For a couple of the classes, there's also an option to substitute an alternate course on Coursera. The capstone has yet to be scheduled (word in the forums has it that it'll be offered in fall), and I'm not sure how often Coursera plans to offer it.

The course that really interested me was Getting and Cleaning Data, but I signed up for The Data Scientist's Toolbox because it's required for the rest of the specialization; R Programming is also required, but I already had some experience with R, and I had no intention of completing the entire specialization, and so I skipped this one. Taking one of the later courses at the same time as the required intro course didn't pose any difficulties for me, but I think that someone who has no experience with R would probably want to complete that class before tackling any of the others.

Much of The Data Scientist's Toolbox is devoted to introducing the topics of the specialization's other eight courses; frankly, you can skip this if you don't intend to take those courses (or possibly, even if you do intend to take them: you will, after all, cover that information later, though if you're taking the whole specialization, you may need to watch the video lectures in question in order to complete the quiz for Week 1). For me, the most useful content of this class was its introductions to Git, GitHub, and RStudio (I had been using the plain old R Console, and RStudio makes things considerably easier). RStudio is required for the programming necessary to complete the assignments in the later courses, and Git and GitHub are necessary to complete the projects at the end of each course (you have to upload your work to GitHub so that other students can perform peer assessments on it). For the sake of full disclosure, let me say that I skipped the introductory lectures in Week 1 of this course (though I did pass all the quizzes), and did not complete the course project, which consisted of taking screenshots to prove that you'd installed Git, GitHub, and RStudio (I installed all three, but I wasn't really concerned with getting the course certificate).

I found Getting and Cleaning Data invaluable. I took the course because I wanted to learn how to get data off the web. For example, in the project I did for Coursera's Social Network Analysis last year, I ended up saving data from several hundred web pages by hand, which is not a particularly efficient way of doing things. Getting and Cleaning Data promises to teach students how to extract data from common data storage formats (including SQL databases, XML, JSON, and HDF5), and from the web using APIs and web scraping. The syllabus also includes tips on using R to clean and recode data, and, in the last lecture, a long list of links to sources of data. It's also worth noting that the style of the video lectures is a bit different from those of other classes I've taken: there's never any video of the instructor, just the instructor's voice over the lecture notes.

Initially, I was skeptical, because most of the lectures amount to little more than a list of R packages, functions (with a few short examples), and links for further information. The information blows past you so fast that there's no hope of remembering much of it. However, the lecture notes (in both HTML5 and PDF; the HTML5 is a little awkward to navigate, but the links work, unlike in the PDF) provide a wonderful resource that you'll find yourself referring to again and again. I've often found that the hardest part of a project is knowing where to start, and the lectures in Getting and Cleaning Data point you in the right direction; in fact, I'm using information from the lectures on web-scraping and JSON right now to do an updated version of my project for Social Network Analysis, a statistically informed visualization of which cards in the game Android: Netrunner appear together in the decks designed by players. Look for that to be posted here soon!
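To give a rough idea of what "which cards appear together" means once the data is in JSON form, here is a minimal sketch. It is written in Python rather than the R used in the course, and it is not the code from my project: the JSON layout, the `cards` field, the placeholder card names, and the `count_card_pairs` helper are all assumptions made up for illustration.

```python
import json
from collections import Counter
from itertools import combinations

# Hypothetical sample data: the field names and card names are placeholders,
# not the format of any real deck-building site.
DECKS_JSON = """
[
  {"deck": "deck_1", "cards": ["card_a", "card_b", "card_c"]},
  {"deck": "deck_2", "cards": ["card_a", "card_b"]},
  {"deck": "deck_3", "cards": ["card_a", "card_c"]}
]
"""


def count_card_pairs(decks):
    """Count how many decks contain each unordered pair of cards."""
    pair_counts = Counter()
    for deck in decks:
        # De-duplicate and sort so each pair is counted once per deck
        # and always appears in the same order.
        unique_cards = sorted(set(deck["cards"]))
        pair_counts.update(combinations(unique_cards, 2))
    return pair_counts


decks = json.loads(DECKS_JSON)
pair_counts = count_card_pairs(decks)

# Print the pairs, most frequent first.
for pair, n_decks in pair_counts.most_common():
    print(pair, n_decks)
```

Counts like these are only the raw material; any statistical treatment or visualization would have to be built on top of them.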

Among the data science courses that I've taken online, Getting and Cleaning Data is the first one that taught me how to go out and get data and then put it in a form that's usable for analysis. By contrast, Coursera's Machine Learning, taught by Stanford's Andrew Ng, provides highly practical advice on selecting and using algorithms, but does so using very much canned programming exercises, in which the data has already been collected and processed. In fact, the two courses are highly complementary, at least inasmuch as they give you ideas about how to handle different stages of a data science project. It should be noted, though, that Machine Learning uses Octave (essentially the open-source version of MATLAB) rather than R; the Data Science specialization includes its own (much shorter) Practical Machine Learning course, as well as an earlier course on Regression Models that delves far more deeply into that topic than does Machine Learning.

I should add that, for this class too, I never completed the final project: it looks like a highly practical exercise, but I was short on time, and more interested in my own project; again, I didn't care much about earning a certificate, with my main concern being to learn the nuts and bolts of getting data from the web.

Finally, let me offer a few comments on the Data Science specialization as a whole. I would not recommend completing the entire specialization for anyone who's well-versed in statistics and the scientific method: if you're a competent social scientist (as opposed to someone who took one stats course as an undergraduate), you already understand important issues like sampling, causal inference, and reproducibility (though, admittedly, I've read more than a few articles by social scientists who evidently had shaky grasps on these concepts). For a specialization that labels itself as "Data Science", there's also scant coverage of databases. That being said, anyone interested in data science might find Getting and Cleaning Data, R Programming, and Practical Machine Learning useful, and for someone who doesn't have a background as a quantitative researcher, I can't recommend this specialization's focus on the scientific method and applied statistics highly enough.


  1. " ... someone who took one stats course as an undergraduate... "

    lol, that's me. currently on the signature track. i'm also completing a separate database administration online course (from o'reilly), that should make up for the lack of database focus from coursera. thanks for the insightful review.

  2. Thanks for the comment, Gene! If you could tell me where to find that course, maybe I can add it to my links page. I also can't recommend highly enough the Introduction to Databases course from Jennifer Widom of Stanford (see my review here). The full form is now available on Coursera as "self study" (that is, you can start the course any time you want, and there are no deadlines). Stanford Online is also offering a renamed version (called "About Databases") on its own platform, as a series of 14 mini-courses covering each of the original course's topics.

  3. Whoops, since I wrote the above comment, Coursera has taken down its version of the course—that makes the decision to take the Stanford Online version (which is actually called simply "Databases") easy.

  4. Can you please elaborate on the following? What do you mean by this...
    "...and for someone who doesn't have a background as a quantitative researcher, I can't recommend this specialization's focus on the scientific and applied statistics highly enough."

  5. First of all, let me correct a typo: that should be "scientific method"--I just edited the post to fix it.

    There are two things required to use statistics effectively. The first is the mathematical part, the statistics themselves. A scientist doesn't really need to understand all mathematical derivations of statistical formulas (unless he wants to develop new methods), but he does need to have a qualitative understanding of what all those funny Greek letters mean, and to understand the assumptions behind each formula, and so on, in order to make sure he's using the right formula in the right situation.

    The second part required to use statistics effectively is the scientific method. We've all been taught the scientific method in school, and used it to perform experiments in chemistry and physics classes. However, once you start dealing with complex, probabilistic systems, you can't simply do an experiment and expect the same results every time, and this is as true in physics, ecology, and geology, as it is in psychology and political science. Sometimes you can't do an experiment at all (you can't re-run the Big Bang, for instance), and even when you can run a real experiment, you won't have the same thing happen every time, and you'll need to understand statistics in order to sort out your results. There are many, many concepts that come into play in making causal inference for this type of research: sampling, controls, multiple causation, quasi-experimental design, and so on. It just takes a certain amount of training and experience to grasp all these.

    Most college or university graduates have been at least exposed to both parts of the statistical equation: they may have had one class in stats, and some of their classes in social science, physics, or some other subject have made reference to research that used statistical analysis. Many engineers and computer scientists will even understand the math behind the stats pretty well. However, most university graduates will not have much practical experience in analyzing, designing, and carrying out statistical research, and these people, I think, could benefit quite a bit from the parts of the Data Science specialization that explore how a scientist uses statistics in practice.

  6. Great review. I am currently taking this certification and found it good, but the teacher differences are bothering me. I also found this review of the Johns Hopkins / Coursera Data Science certification I agree with here:

  7. "honestly, if you can't pass one of these classes, you probably don't belong in the profession"

    What a rude way to put it.

    Good review otherwise.

    1. I'm deleting my original reply and replacing it with this one, because my original was muddled and a little tetchy (I'm going to blame fatigue).

      I could probably have said it more diplomatically, but the point needs to be made: if you can't pass these classes, you should stop trying to become a data scientist.

      Data science is a field that's received a lot of attention. Because of that, and because of the high pay, it's attracted a lot of people who just don't have the talent for it.

      It's true that you don't have to be a "unicorn" in order to succeed in data science--you don't have to be an expert at statistics AND coding AND visualization AND communicating well with business executives. However, you do need at least to understand the basics of all of these things, so that you can cooperate with the people who provide the skills that complement your own--and the Coursera specialization, while quite good, covers just the basics. Therefore, if you can't get through the Coursera courses, you're probably not cut out for being a data scientist.

  8. Did you finish the capstone?

    1. Aside from the issue of cost (you can only take the capstone if you pay for the Specialization), I never planned on taking all the courses in the specialization--a lot of them covered material I was already familiar with.

  9. Having learnt the techniques of acquiring, cleaning, and concentrating data, is it possible to gain practical experience in the same by working for a startup/company for a short span (a 30-45 day internship)?

    1. If you can get someone to offer you that sort of a position, it would certainly be useful. Any kind of real-world experience with using data science methods is going to increase your skill level and make your resume look better: there are many ways to do that, including freelance work and competitions like Kaggle. I do wonder, though, whether a period as short as 30-45 days is going to let you truly sink your teeth into one or more real projects.