Back in May, I reviewed two of the short courses that make up Coursera's Data Science specialization. Although the four-week format greatly limits the content of any one course, I was generally impressed by the scientific approach of the specialization (something all too often lacking in data "science"), and, in the case of Getting and Cleaning Data, by the many pointers provided to R packages and sources of information for further study: the course may not have gone into a lot of depth, but it provided a good overview of what you can do with R.
I recently completed a third course in the specialization, Exploratory Data Analysis, taught by Roger D. Peng (the previous courses I took were taught by Jeff Leek). While I enjoy Peng's lecture style (unlike Leek, he engages the audience by showing his face at the beginnings of lectures), and I learned a lot, the course suffers greatly from the short format.
I initially overlooked this class: from the name (more on this in a minute), I never would have guessed that 3/4 of the lectures would cover graphics in R. Peng teaches the basics of the language's three major graphics packages: base graphics, lattice, and ggplot2. As is the case for Getting and Cleaning Data, the lectures manage only to skim the surface, particularly for ggplot2, but they do give the student a decent idea of what's possible in R. I do think, though, that for ggplot2 Peng could do a better job of outlining the advanced features than simply pointing students to the book written by the package's author, Hadley Wickham (thankfully, it's possible to find free PDFs of the book online, but I'm not sure it's where I'd want to start for solving a discrete problem, rather than studying ggplot2 in a methodical way).
So what's with the name of the course? Peng presents visualization in R as a way of conducting initial exploration of data, but it's obviously useful for more than that, since R can create decent visualizations of the results of analysis. I suspect that the course name was chosen so that one week of lectures on clustering and dimensionality reduction could be shoehorned into the syllabus. This material probably belongs instead in the Practical Machine Learning course, but something had to be cut to limit that course to four weeks (cf. the nine-week Machine Learning, also offered by Coursera, which I've reviewed previously; twice, actually). The fact that clustering and dimensionality reduction can be used for exploratory analysis and visualization is the only thing that ties the entire course together.
What's particularly disturbing is the way that all of this combines with the specialization's unique approach to exercises and evaluation. Each course includes a hands-on project, and open-ended projects in a MOOC must, for logistical reasons, be graded using a peer grading system. As a result, the final project for Exploratory Data Analysis ends up covering only material from the first two weeks of the course: students need the third week to work on the project and the fourth week to grade it, so half the content of the class plays no role in the project. On top of this (I suppose to avoid overloading students), there's no quiz, homework, or any other form of practice or evaluation covering the material on clustering and dimensionality reduction, which makes it hard for a student to know if he or she really understands those topics.
To sum up, I did find the information on data visualization in R useful, but I would have appreciated a full four weeks on the subject. The coverage of clustering and dimensionality reduction was out of place in the course; nonetheless, many will find it valuable (I had already seen most if not all of it in Machine Learning and another Coursera course, Social Network Analysis, which I've also reviewed).
I do have one more comment, though this applies to the Data Science specialization in general, and to Coursera, rather than solely to this course. Normally, after completing a Coursera course, a student can go back and look at the course archives at any later time; I've found this valuable when I suddenly find myself needing to refresh my memory or find out where I can learn more about a topic. Coursera has apparently disabled this feature for the Data Science courses: their archives are no longer accessible after the grading period is over (about a week after the finish of a course). I say "apparently" because, when I contacted Coursera a few months ago to ask why I could no longer access the archives of Getting and Cleaning Data, I never got a response—this is becoming something of a theme with Coursera, which, as I noted in my second review of Machine Learning, ignores most bug reports for that class. I suppose that paying customers might get better service, but I'm not going to pay just to find out if that's true.
Of course, you can always sign up for the current iteration of a class, since they're offered continuously, but it's annoying to have to do that each month. Fortunately, all of the class materials are also available in a GitHub repository, but it's not as easy to display documents on GitHub as in Coursera's web interface. For a set of courses that only skim the surface, and whose major value is in providing links to deeper information, this is a major failing.
Monday, August 4, 2014
Programming Languages for Big Data, Part 3
And now, one more word on the subject of R's speed. At my prodding, Tommy Jones contacted the authors of the study on programming language speed, and a productive discussion ensued. It turns out that the task in question was one that couldn't be vectorized, which means that R's main strength couldn't be applied in this case. However, it was possible to speed it up by writing C++ functions in R using Rcpp. The authors tried this, and revised their paper, reporting that, using Rcpp, R performed the task only 4-5 times slower than C++. For details, see Tommy's blog post, and the revised paper.
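To make "couldn't be vectorized" concrete, here is a minimal sketch of the kind of loop where each step depends on the result of the previous one. It is a made-up illustration (the logistic map), not the task from the study, and the function name `logistic_path` is my own. It's written as plain C++ so it compiles without R; my assumption is that, with Rcpp, a function like this could be marked with a `// [[Rcpp::export]]` comment and compiled from R with `Rcpp::sourceCpp()`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical example, not the benchmark from the paper.
// Iterates the logistic map x[t] = r * x[t-1] * (1 - x[t-1]).
// Every value depends on the value computed just before it, so the loop
// cannot be replaced by a single elementwise operation on a whole vector.
std::vector<double> logistic_path(double x0, double r, std::size_t n) {
    std::vector<double> x(n);
    if (n == 0) {
        return x;  // nothing to compute
    }
    x[0] = x0;
    for (std::size_t t = 1; t < n; ++t) {
        // Loop-carried dependence: x[t] needs the finished x[t - 1].
        x[t] = r * x[t - 1] * (1.0 - x[t - 1]);
    }
    return x;
}
```

Calling `logistic_path(0.25, 2.0, 4)` walks the recurrence three steps from a starting value of 0.25. In R, the equivalent would be an explicit `for` loop, which is exactly where an interpreted language tends to lose ground to compiled code.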