Thursday, May 8, 2014

Online Course Reviews: The Data Scientist's Toolbox, and Getting and Cleaning Data, from Coursera's Data Science Specialization

I recently completed Coursera's The Data Scientist's Toolbox and Getting and Cleaning Data, two courses that form part of the online learning provider's new Data Science specialization, taught by Brian Caffo, Jeffrey Leek, and Roger D. Peng, biostatistics professors at Johns Hopkins University, and, in the cases of Leek and Peng, authors of the Simply Statistics blog. Both of the courses I took were taught by Jeff Leek (referred to in my earlier post today). I found Getting and Cleaning Data to be an especially useful course, teaching some practical skills that are quite essential to the real-world practice of data science. However, I probably wouldn't recommend the entire specialization to anyone coming from the world of quantitative research in academia, since a big focus of the program is teaching the scientific method and the logic of statistical inference—that is, things a quantitative social scientist should know already. First, however, a little background on Coursera's specializations....

Coursera has recently introduced a handful of "specializations", each consisting of a series of short courses followed by a capstone project. The specializations continue Coursera's effort to monetize its offerings through the Signature Track, which offers a "Verified Certificate" for those who pay a fee (typically about $50-$100) to take the course.

The Signature Track in itself has dubious value. Allegedly, its purpose is to provide a more useful credential than the certificates Coursera has traditionally offered for its free classes. To make the Verified Certificate more useful (that is, more impressive to potential employers), Coursera takes measures to guarantee you did the work yourself, but these measures seem fairly easy to circumvent. Specializations add an additional sweetener: if you take every class in the specialization on the Signature Track, you can then take the capstone project (offered as an additional class), which is not available to students who take the courses for free (or even, for that matter, to students who pay for only some of the courses). Students completing the specialization also receive a specialization certificate.

The Data Science specialization includes 10 short (four-week) classes, including the capstone, each priced at $49 for the Signature Track. If you stump up the whole $490 at once, you can take any of the courses as many times as you like over the next two years (in case you don't pass the first time); if you pay for the courses one at a time, you can only retake each one once (which is probably enough: honestly, if you can't pass one of these classes, you probably don't belong in the profession, but sometimes life gets busy, and you can't finish the work for a class). Each of the first nine courses will be offered once a month; the first six are available already, and the remaining three will be offered for the first time in June. For a couple of the classes, there's also an option to substitute an alternate course on Coursera. The capstone has yet to be scheduled (word in the forums has it that it'll be offered in fall), and I'm not sure how often Coursera plans to offer it.

The course that really interested me was Getting and Cleaning Data, but I signed up for The Data Scientist's Toolbox because it's required for the rest of the specialization; R Programming is also required, but I already had some experience with R, and I had no intention of completing the entire specialization, and so I skipped this one. Taking one of the later courses at the same time as the required intro course didn't pose any difficulties for me, but I think that someone who has no experience with R would probably want to complete that class before tackling any of the others.

Much of The Data Scientist's Toolbox is devoted to introducing the topics of the specialization's other eight courses; frankly, you can skip this if you don't intend to take those courses (or possibly even if you do intend to take them, since you will, after all, cover that information later, though if you're taking the whole specialization, you may need to watch the video lectures in question in order to complete the quiz for Week 1). For me, the most useful content of this class was its introductions to Git, GitHub, and RStudio (I had been using the plain old R Console, and RStudio makes things considerably easier). RStudio is required for the programming necessary to complete the assignments in the later courses, and Git and GitHub are necessary to complete the projects at the end of each course (you have to upload your work to GitHub so that other students can perform peer assessments on it). For the sake of full disclosure, let me say that I skipped the introductory lectures in Week 1 of this course (though I did pass all the quizzes), and did not complete the course project, which consisted of taking screenshots to prove that you'd installed Git, GitHub, and RStudio (I installed all three, but I wasn't really concerned with getting the course certificate).

I found Getting and Cleaning Data invaluable. I took the course because I wanted to learn how to get data off the web. For example, in the project I did for Coursera's Social Network Analysis last year, I ended up saving data from several hundred web pages by hand, which is not a particularly efficient way of doing things. Getting and Cleaning Data promises to teach students how to extract data from common data storage formats (including databases, specifically SQL, as well as XML, JSON, and HDF5), and from the web using APIs and web scraping. The syllabus also includes tips on using R to clean and recode data, and, in the last lecture, a long list of links to sources of data. It's also worth noting that the style of the video lectures is a bit different from that of other classes I've taken: there's never any video of the instructor, just the instructor's voice over the lecture notes.

Initially, I was skeptical, because most of the lectures amount to little more than a list of R packages, functions (with a few short examples), and links for further information. The information blows past you so fast that there's no hope of remembering much of it. However, the lecture notes (in both HTML5 and PDF; the HTML5 is a little awkward to navigate, but the links work, unlike in the PDF) provide a wonderful resource that you'll find yourself referring to again and again. I've often found that the hardest part of a project is knowing where to start, and the lectures in Getting and Cleaning Data point you in the right direction; in fact, I'm using information from the lectures on web-scraping and JSON right now to do an updated version of my project for Social Network Analysis, a statistically informed visualization of which cards in the game Android: Netrunner appear together in the decks designed by players. Look for that to be posted here soon!
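
To give a flavor of the kind of task described above (parsing JSON and tabulating which cards appear together in decks), here is a minimal sketch. It is written in Python rather than the R used in the course, and it is not the code from my project: the sample data, the field names "deck" and "cards", and the card names are all invented for illustration.

```python
import json
from collections import Counter
from itertools import combinations

# Hypothetical sample data: a JSON array of decks, each with a list of cards.
# The field names ("deck", "cards") and card names are assumptions made up
# for this sketch, not the format of any real deck-list site.
RAW_JSON = """
[
  {"deck": "deck-1", "cards": ["Card A", "Card B", "Card C"]},
  {"deck": "deck-2", "cards": ["Card A", "Card B"]},
  {"deck": "deck-3", "cards": ["Card B", "Card C", "Card C"]}
]
"""


def count_card_pairs(raw_json):
    """Count, for every unordered pair of cards, how many decks contain both."""
    decks = json.loads(raw_json)
    pair_counts = Counter()
    for deck in decks:
        # Deduplicate and sort so each pair is counted once per deck,
        # always in the same (alphabetical) order.
        unique_cards = sorted(set(deck["cards"]))
        pair_counts.update(combinations(unique_cards, 2))
    return pair_counts


pair_counts = count_card_pairs(RAW_JSON)
for (first, second), n in pair_counts.most_common():
    print(f"{first} + {second}: {n} deck(s)")
```

The resulting pair counts are the raw material for a co-occurrence network: each card is a node, and each count is the weight of the edge between two cards.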

Among the data science courses that I've taken online, Getting and Cleaning Data is the first one that taught me how to go out and get data and then put it in a form that's usable for analysis. By contrast, Coursera's Machine Learning, taught by Stanford's Andrew Ng, provides highly practical advice on selecting and using algorithms, but does so using largely canned programming exercises, in which the data has already been collected and processed. In fact, the two courses are highly complementary, at least inasmuch as they give you ideas about how to handle different stages of a data science project. It should be noted, though, that Machine Learning uses Octave (essentially the open-source version of MATLAB) rather than R; the Data Science specialization includes its own (much shorter) Practical Machine Learning course, as well as an earlier course on Regression Models that delves far more deeply into that topic than does Machine Learning.

I should add that, for this class too, I never completed the final project: it looks like a highly practical exercise, but I was short on time, and more interested in my own project; again, I didn't care much about earning a certificate, with my main concern being to learn the nuts and bolts of getting data from the web.

Finally, let me offer a few comments on the Data Science specialization as a whole. I would not recommend completing the entire specialization to anyone who's well-versed in statistics and the scientific method: if you're a competent social scientist (as opposed to someone who took one stats course as an undergraduate), you already understand important issues like sampling, causal inference, and reproducibility (though, admittedly, I've read more than a few articles by social scientists who evidently had shaky grasps on these concepts). For a specialization that labels itself as "Data Science", there's also scant coverage of databases. That being said, anyone interested in data science might find Getting and Cleaning Data, R Programming, and Practical Machine Learning useful, and for someone who doesn't have a background as a quantitative researcher, I can't recommend this specialization's focus on the scientific method and applied statistics highly enough.

32 comments:

  1. " ... someone who took one stats course as an undergraduate... "

    lol, that's me. currently on the signature track. i'm also completing a separate database administration online course (from o'reilly), that should make up for the lack of database focus from coursera. thanks for the insightful review.

  2. Thanks for the comment, Gene! If you could tell me where to find that course, maybe I can add it to my links page. I also can't recommend highly enough the Introduction to Databases course from Jennifer Widom of Stanford (see my review here). The full form is now available on Coursera as "self study" (that is, you can start the course any time you want, and there are no deadlines). Stanford Online is also offering a renamed version (called "About Databases") on its own platform, as a series of 14 mini-courses covering each of the original course's topics.

  3. Whoops, since I wrote the above comment, Coursera has taken down its version of the course—that makes the decision to take the Stanford Online version (which is actually called simply "Databases") easy.

  4. Can you please elaborate on the following? What do you mean by this...
    "...and for someone who doesn't have a background as a quantitative researcher, I can't recommend this specialization's focus on the scientific and applied statistics highly enough."

  5. First of all, let me correct a typo: that should be "scientific method"--I just edited the post to fix it.

    There are two things required to use statistics effectively. The first is the mathematical part, the statistics themselves. A scientist doesn't really need to understand all the mathematical derivations of statistical formulas (unless he wants to develop new methods), but he does need to have a qualitative understanding of what all those funny Greek letters mean, and to understand the assumptions behind each formula, and so on, in order to make sure he's using the right formula in the right situation.

    The second part required to use statistics effectively is the scientific method. We've all been taught the scientific method in school, and used it to perform experiments in chemistry and physics classes. However, once you start dealing with complex, probabilistic systems, you can't simply do an experiment and expect the same results every time, and this is as true in physics, ecology, and geology as it is in psychology and political science. Sometimes you can't do an experiment at all (you can't re-run the Big Bang, for instance), and even when you can run a real experiment, you won't have the same thing happen every time, and you'll need to understand statistics in order to sort out your results. There are many, many concepts that come into play in making causal inferences for this type of research: sampling, controls, multiple causation, quasi-experimental design, and so on. It just takes a certain amount of training and experience to grasp all these.

    Most college or university graduates have been at least exposed to both parts of the statistical equation: they may have had one class in stats, and some of their classes in social science, physics, or some other subject have made reference to research that used statistical analysis. Many engineers and computer scientists will even understand the math behind the stats pretty well. However, most university graduates will not have much practical experience in analyzing, designing, and carrying out statistical research, and these people, I think, could benefit quite a bit from the parts of the Data Science specialization that explore how a scientist uses statistics in practice.

  6. "honestly, if you can't pass one of these classes, you probably don't belong in the profession"

    What a rude way to put it.

    Good review otherwise.

    Replies
    1. I'm deleting my original reply and replacing it with this one, because my original was muddled and a little tetchy (I'm going to blame fatigue).

      I could probably have said it more diplomatically, but the point needs to be made: if you can't pass these classes, you should stop trying to become a data scientist.

      Data science is a field that's received a lot of attention. Because of that, and because of the high pay, it's attracted a lot of people who just don't have the talent for it.

      It's true that you don't have to be a "unicorn" in order to succeed in data science--you don't have to be an expert at statistics AND coding AND visualization AND communicating well with business executives. However, you do need at least to understand the basics of all of these things, so that you can cooperate with the people who provide the skills that complement your own--and the Coursera specialization, while quite good, covers just the basics. Therefore, if you can't get through the Coursera courses, you're probably not cut out for being a data scientist.

  7. Replies
    1. Aside from the issue of cost (you can only take the capstone if you pay for the Specialization), I never planned on taking all the courses in the specialization--a lot of them covered material I was already familiar with.

  8. Having learnt the techniques of data acquisition, cleaning, and concentration, is it possible to gain practical experience in them by working for a startup or company for a short span (a 30-45 day internship)?

    Replies
    1. If you can get someone to offer you that sort of a position, it would certainly be useful. Any kind of real-world experience with using data science methods is going to increase your skill level and make your resume look better: there are many ways to do that, including freelance work and competitions like Kaggle. I do wonder, though, whether a period as short as 30-45 days is going to let you truly sink your teeth into one or more real projects.
