Saturday, September 13, 2014
The Ethical Challenge of "Passive Predation" in Data Science: Can Data Science Provide the Solution, and Not Just the Problem?
Malek perhaps overdraws his point in the middle part of the post—a historical account of the dehumanizing effects of technology that's reminiscent of Marx's condemnation of working for money in "The Alienation of Labor"—but his main argument is quite sound, and not a little scary.
I wonder, though, if data science itself could provide a solution to this problem. I hereby announce a very unofficial contest, with prizes that will prove trivial at best (I might take a winner out to lunch, or talk about his or her idea at a Data Community DC meetup). Pretty much any method of accomplishing this goal, technical or non-technical, is fair game. Any takers?
Thursday, September 11, 2014
Online Course Review: Udacity's Intro to Hadoop and MapReduce
The four-lesson course (short by Udacity standards) is supposed to take about a month to complete—like all Udacity courses, and unlike those of Coursera, this is not a true MOOC, taken alongside other students in real time, but rather an interactive tutorial. However, Udacity's model does feature student discussion forums; customers who pay (at the rate of $150/month) also get help from live coaches, feedback on their final projects, and the opportunity to earn a "verified certificate", similar to Coursera's Signature Track, with the difference that Udacity, unlike Coursera, no longer offers certificates for non-paying students. (As I've mentioned before, a verified certificate and two dollars may buy you a cup of coffee, but I wouldn't count on its having any greater worth.)
Before I delve into the specifics of this course, let me say that I'm not a real fan of the Udacity interface. While both providers break each lesson up into a series of short videos, Coursera labels each of those videos with a topic, making it relatively easy to go back and find the material you need; by contrast, Udacity strings all the videos for a particular lesson together under a single heading, and so you have to hunt through all of them to find something (you can click on individual videos, and each one has its own label, but you have to click on or hover over a video to see the label). In addition, whenever the video stops for a quiz, it drops out of fullscreen (assuming you're in fullscreen, of course). Moreover, Udacity's discussion forum (note the singular there) has no organization whatsoever, aside from keyword tags—making a search for specific information rather laborious.
The first three lessons of this particular course, which features two instructors from Cloudera, are structured in a manner that the director of a music video would appreciate: many of the videos are very short, and switch jarringly from one instructor to the other. Nonetheless, the instructors are engaging, and there's a nice interview with Doug Cutting about how he helped to create Hadoop, and named it after his toddler son's stuffed elephant. The first two lessons, which explain the basics of how Hadoop and HDFS work, can best be described as "lite"—unchallenging nearly to the point of tedium.
Lesson 3 marks an abrupt change: this is where the programming exercises begin. The class requires previous experience with Python, which I lacked, and so the exercises took more time for me than they should have, but I managed. One student in the forum questioned whether this was a course on Hadoop or a course on Python regular expressions, but doing the exercises helped me learn some Python, and, much as I hate the language, it does have a very powerful vocabulary of regular expressions. Unfortunately, the instructor blew by the concept of Hadoop streaming so fast (in Lesson 2) that I wasn't entirely sure for a while what exactly I was doing, though I was managing to get it to work—and once I looked up Hadoop streaming on my own (it is, for the record, an API that allows Hadoop mappers and reducers to be written in any language), I realized that the interface would work just as well for R.
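To make that concrete, here's a minimal sketch of what a streaming mapper might look like in R. Hadoop streaming hands each input record to the script on standard input and collects tab-separated key/value pairs from standard output; the two-field record layout below is my own invention for illustration, not the course's data.

#!/usr/bin/env Rscript
# Minimal sketch of a Hadoop streaming mapper written in R (hypothetical example).
# Hadoop streaming pipes input records to stdin and expects tab-separated
# key/value pairs on stdout.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  # Assume each record has a key in the first tab-separated field and a
  # numeric value in the second; emit them as key <TAB> value.
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  if (length(fields) >= 2) {
    cat(fields[1], fields[2], sep = "\t")
    cat("\n")
  }
}
close(con)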
The simpler exercises use an online Python environment, but for the exercises that require large datasets, the course's creators deserve kudos: students install a virtual UNIX box on which a two-machine Hadoop cluster has already been set up, and then manipulate data and write code in this realistic environment. Unfortunately, the exercises that require the virtual machine seem half-baked.
First off, the instructors haven't actually detailed how to write and execute Python scripts on the UNIX machine (the class discussion forum was very helpful here). Second, the syntax needed to make the scripts work is different from the syntax presented in the video lectures (though, fortunately, there are working sample scripts saved on the virtual machine). Third, and most seriously, one particularly tricky exercise requires knowledge that students could not possibly get from the instructions, or, in all probability, the data itself, but could only get from the hints that emerge from a trial-and-error process of submitting answers to the automated grader—it was an interesting little mystery to solve, but there are no automated graders in real life, and so I'm not sure what I gained from the effort.
Yes, figuring out ambiguous instructions does have some pedagogical value, and in the end, completing the exercises was very satisfying, but, especially in the case of the problem that was insoluble without the automated grader, I got the feeling that the difficulties I faced were the result, not of a pedagogical choice, but of a simple lack of effort on the part of the instructors—and I felt like I had wasted part of my time.
According to posts in the forum, Lesson 4 was not part of the original class, though I'm not sure if it was planned all along, or tacked on later. To paraphrase Monty Python and the Holy Grail, the course was completed in an entirely different style at great expense and at the last minute. The lectures feature a different instructor, a Udacity employee, in place of the Cloudera instructors. This lesson covers design patterns, specifically filtering patterns (more regular expressions), summarization patterns (minimums, maximums, and means, for example), and structural patterns (combining data sets); one lecture also deals with combiners, scripts inserted between mappers and reducers to make things more efficient by doing some of the reduction locally on each machine in the cluster.
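To illustrate the summarization pattern (and why a combiner can sometimes be a verbatim copy of the reducer), here's a rough sketch of a streaming reducer in R that emits the maximum value seen for each key; the course's own exercises were in Python, and the tab-separated key/value format is my assumption. Because taking a maximum is associative, the same script could be registered as a combiner to pre-reduce output on each mapper node.

#!/usr/bin/env Rscript
# Sketch of a summarization-pattern reducer: maximum value per key.
# Hadoop streaming sorts mapper output by key, so all values for a given key
# arrive on consecutive lines; emit the running maximum when the key changes.
current_key <- NULL
current_max <- -Inf
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  key <- fields[1]
  value <- as.numeric(fields[2])
  if (!is.null(current_key) && key != current_key) {
    cat(current_key, current_max, sep = "\t"); cat("\n")
    current_max <- -Inf
  }
  current_key <- key
  current_max <- max(current_max, value)
}
if (!is.null(current_key)) {
  cat(current_key, current_max, sep = "\t"); cat("\n")
}
close(con)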
I found these lectures better than the previous ones, and the exercises better prepared. I will say, though, that I eventually got bored with writing new and different regular expressions in Python, and didn't finish the last few exercises (or the final project, which isn't graded for non-paying students in any case), though I did watch all of the lectures.
In the end, this half-baked pastiche of a course at least gave me a decent idea of how Hadoop works, and removed the mystique of manipulating data stored on a Hadoop cluster. I wouldn't know how to set up a cluster myself (that wasn't the intent of the class, though I don't think it would be all that hard to do), but I do know how to use Hadoop streaming—and I've realized it's not exactly rocket science.
Monday, August 4, 2014
Online Course Review: Exploratory Data Analysis, from Coursera's Data Science Specialization
I recently completed a third course in the specialization, Exploratory Data Analysis, taught by Roger D. Peng (the previous courses I took were taught by Jeff Leek). While I enjoy Peng's lecture style (unlike Leek, he engages the audience by showing his face at the beginnings of lectures), and I learned a lot, the course suffers greatly from the short format.
I initially overlooked this class: from the name (more on this in a minute), I never would have guessed that three-quarters of the lectures would cover graphics in R. Peng teaches the basics of the language's three major graphics packages: base graphics, lattice, and ggplot2. As with Getting and Cleaning Data, the lectures manage only to skim the surface, particularly for ggplot2, but they do give the student a decent idea of what's possible in R. I do think, though, that Peng could do a better job of outlining ggplot2's advanced features than simply pointing students to the book written by the package's author, Hadley Wickham (thankfully, it's possible to find free PDF's of the book online, but I'm not sure that's where I'd want to start when solving a discrete problem, rather than studying ggplot2 in a methodical way).
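For readers who haven't seen the three systems side by side, here's a quick sketch of roughly the same scatterplot in each of them, using R's built-in mtcars data (my choice of example, not Peng's):

# The same miles-per-gallon vs. weight scatterplot in R's three major graphics systems.

# Base graphics: quick and procedural.
plot(mpg ~ wt, data = mtcars, main = "Base graphics")

# lattice: formula-driven, with built-in conditioning on a factor.
library(lattice)
xyplot(mpg ~ wt | factor(cyl), data = mtcars, main = "lattice")

# ggplot2: the plot is built up from layered components.
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  ggtitle("ggplot2")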
So what's with the name of the course? Peng presents visualization in R as a way of conducting initial exploration of data, but it's obviously useful for more than that, since R can create decent visualizations of the results of analysis. I suspect that the course name was chosen so that one week of lectures on clustering and dimensionality reduction could be shoehorned into the syllabus. This material probably belongs instead in the Practical Machine Learning course, but something had to be cut to limit that course to four weeks (cf. the nine-week Machine Learning, also offered by Coursera, which I've reviewed previously—twice, actually). The fact that clustering and dimensionality reduction can be used for exploratory analysis and visualization is the only thing that ties the entire course together.
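For what it's worth, here's a minimal sketch of what that kind of exploratory use looks like in practice, clustering and projecting R's built-in iris measurements (a toy example of my own, not one from the course):

# Quick exploratory clustering and dimensionality reduction in R, using the
# built-in iris measurements (species labels set aside).
x <- scale(iris[, 1:4])

# Hierarchical clustering on Euclidean distances.
hc <- hclust(dist(x))
plot(hc, labels = FALSE, main = "Hierarchical clustering")

# Principal components: project the four measurements onto two dimensions
# and color the points by a three-group cut of the clustering tree.
pc <- prcomp(x)
groups <- cutree(hc, k = 3)
plot(pc$x[, 1:2], col = groups, pch = 19, main = "First two principal components")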
What's particularly disturbing is the way that all of this combines with the specialization's unique approach to exercises and evaluation. Each course includes a hands-on project, and, because open-ended projects in a MOOC must, for logistical reasons, be graded by peers, the final project for Exploratory Data Analysis ends up covering only material from the first two weeks of the course: students need the third week to work on the project and the fourth week to grade it, so half the content of the class plays no role in the project. On top of this—I suppose to avoid overloading students—there's no quiz, homework, or any other form of practice or evaluation covering the material on clustering and dimensionality reduction, which makes it hard for a student to know if he or she really understands those topics.
To sum up, I did find the information on data visualization in R useful, but I would have appreciated a full four weeks on the subject. The coverage of clustering and dimensionality reduction was out of place in the course; nonetheless, many will find it valuable (I had already seen most if not all of it in Machine Learning and another Coursera course, Social Network Analysis, which I've also reviewed).
I do have one more comment, though this applies to the Data Science specialization in general, and to Coursera, rather than solely to this course. Normally, after completing a Coursera course, a student can go back and look at the course archives at any later time; I've found this valuable when I suddenly find myself needing to refresh my memory or find out where I can learn more about a topic. Coursera has apparently disabled this feature for the Data Science courses: their archives are no longer accessible after the grading period is over (about a week after the finish of a course). I say "apparently" because, when I contacted Coursera a few months ago to ask why I could no longer access the archives of Getting and Cleaning Data, I never got a response—this is becoming something of a theme with Coursera, which, as I noted in my second review of Machine Learning, ignores most bug reports for that class. I suppose that paying customers might get better service, but I'm not going to pay just to find out if that's true.
Of course, you can always sign up for the current iteration of a class, since they're offered continuously, but it's annoying to have to do that each month. Fortunately, all of the class materials are also available in a GitHub repository, but it's not as easy to display documents on GitHub as in Coursera's web interface. For a set of courses that only skim the surface, and whose major value is in providing links to deeper information, this is a major failing.
Programming Languages for Big Data, Part 3
Friday, July 11, 2014
Programming Languages for Big Data, Part 2
Nonetheless, it's a pretty interesting question, and I'd love to see someone who's proficient in all of the languages involved try this test again, using better code. I'm still intrigued by the very high speed of MATLAB/Octave—something that leads Andrew Ng to recommend those languages over R for prototyping—though Tommy pointed out to me that, since R is closer to being a full-featured language, it's more flexible than either of them.
Sunday, July 6, 2014
Programming Languages for Big Data
Therefore, I find a new study comparing the speeds of various languages for a statistical problem pretty depressing. When looking at this kind of study, it's important to keep one big thing in mind: the authors tested the various languages on only a single task (albeit a common task, at least in economic modeling), and different languages will have different strengths and weaknesses at different tasks.
Nonetheless, the differences in run time are so large that it's not unreasonable to draw some conclusions. Even when compiled, R takes 240 to 340 times as long to run as C++. How about Python and MATLAB? Python with the default CPython implementation is nearly as slow as compiled R (155 to 269 times), but with PyPy it reaches 1/44 of the speed of C++. MATLAB takes only about 10 times as long to run as C++, or only about 50% longer when using Mex files (C, C++, or Fortran subroutines called by MATLAB). (Octave has Oct files written in C++, which serve a similar purpose; Octave can use Mex files, but not as well as MATLAB. See the GNU documentation on the subject for details.)
Wow. The bottom line is that R might not be the best choice for time-consuming applications—in other words, those that have to crunch through a lot of data, especially if the calculations involved are complex. I had read that it's slower than the alternatives, but I had no idea that the differences were so dramatic. I really should polish my Octave skills, and, judging by many of the job ads I see, knowing some C++ would not only open up possibilities for faster-running code, but would also make me more employable.
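For anyone curious about what "compiled R" means in practice, here's a rough timing sketch one might run using the compiler package that ships with R; the sum-of-squares loop is a toy problem of my own, not the economic model from the study, so the ratios will differ.

# Rough timing sketch: an interpreted R loop vs. the byte-compiled version
# vs. a vectorized equivalent. Toy example only.
library(compiler)

slow_sum_of_squares <- function(x) {
  total <- 0
  for (i in seq_along(x)) total <- total + x[i]^2
  total
}

fast_sum_of_squares <- cmpfun(slow_sum_of_squares)  # byte-compiled copy

x <- runif(1e7)
system.time(slow_sum_of_squares(x))  # plain interpreted loop
system.time(fast_sum_of_squares(x))  # byte-compiled loop
system.time(sum(x^2))                # vectorized code that calls into C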
Thursday, May 29, 2014
Online Course Review: Coursera's Machine Learning, Part 2
My last time through the course (the session that began on April 22nd, 2013), I completed almost all of the lessons on supervised machine learning methods, such as regression, logistic regression, and neural networks. In this session (which began on March 3rd, 2014), I repeated those lessons, and also finished the rest, most of which covered unsupervised techniques, such as clustering and recommender systems. I won't repeat the contents of my earlier review, except to note that what I said then remains true: Andrew Ng is a clear and charismatic lecturer, he covers advanced techniques, and he provides a number of practical tips, but the programming exercises are a bit canned, and may not fully prepare students to write their own scripts in Octave.
My new comments mostly reflect comparisons to other MOOC's, particularly the two courses from Coursera's Data Science specialization that I took recently. First of all, I think that Machine Learning could do more with the online format. Admittedly, most MOOC's consist largely of video-recorded lectures with a sprinkling of interactive content, but Machine Learning falls short even by comparison with other online courses. The class does feature a very effective automatic grader, but it lacks any links to additional resources, or, very importantly, notes or slides from the lectures. While the latter omission may seem trivial (I didn't notice it the first time I took the course), a lack of lecture notes makes it difficult to go back later and review material from a lecture, except by watching the whole thing again. It's true that the programming exercises include detailed instructions, but not all of the course's topics are covered by these exercises, and at any rate the organization of the instructions can make it difficult to locate information on a specific subject.
I might also amplify my comment from the earlier review that the programming exercises involve mostly copying and pasting, rather than writing entire scripts. There's a reason for this: the focus of the course is on algorithms, not on other parts of solving machine learning problems. Nonetheless, my experiences taking other courses, especially those from the Data Science specialization, have demonstrated the practical value of forcing students to think about the nuts and bolts of a research project. Machine Learning's lack of a big final project also arguably deprives students of valuable practical experience, especially since these projects usually require students to explore the course material in greater depth than do short exercises; on the other hand, the fact that a final project can only cover a single topic from the course—or at most a handful of them—calls the value of such projects into question.
My final concern is that Machine Learning seems to have gone on autopilot at this point, with little or no attention from Ng or anyone else who helped him prepare the course materials. Questions in the discussion forum are answered instead by "Community TA's", that is, volunteers who took earlier sessions of the course. Most disturbingly, the majority of reports of errors in the course materials go unanswered, and those that are answered are answered by Community TA's, who lack the ability to fix the errors. For example, a month ago I discovered that the automatic grader accepted one version of my code and rejected another, even though the two versions were algebraically equivalent. My report of this apparent bug still hasn't been answered.
Despite these concerns, I still heartily recommend Machine Learning as a valuable starting point for anyone interested in data science. While the course was offered twice in 2013, the start date of the next iteration, on June 16th, 2014, suggests that Coursera may be planning to offer sessions of the 10-week course almost back-to-back, meaning several sessions each year.
What's next for me? I'll soon be posting a review of Udacity's short Intro to Hadoop and MapReduce. After that, I'm considering taking two more courses from the Data Science specialization, first Exploratory Data Analysis, which will give me some practical experience with graphics programming in R, and then Practical Machine Learning, which will provide experience using R for machine learning, as well as a basis for comparison with the Machine Learning course reviewed above (though the course from the Data Science specialization, at four weeks, is much shorter, and can't possibly cover the same ground).
In the meantime, while I'm still looking for work as a data scientist, I've had a number of interviews, and some of the potential employers have read and commented positively on this blog. I hope that provides an example for other social scientists out there that, yes, you can become a data scientist.
Thursday, May 8, 2014
Online Course Reviews: The Data Scientist's Toolbox, and Getting and Cleaning Data, from Coursera's Data Science Specialization
I recently completed Coursera's The Data Scientist's Toolbox and Getting and Cleaning Data, two courses that form part of the online learning provider's new Data Science specialization, taught by Brian Caffo, Jeffrey Leek, and Roger D. Peng, biostatistics professors at Johns Hopkins University, and, in the cases of Leek and Peng, authors of the Simply Statistics blog. Both of the courses I took were taught by Jeff Leek (referred to in my earlier post today). I found Getting and Cleaning Data to be an especially useful course, teaching some practical skills that are quite essential to the real-world practice of data science. However, I probably wouldn't recommend the entire specialization to anyone coming from the world of quantitative research in academia, since a big focus of the program is teaching the scientific method and the logic of statistical inference—that is, things a quantitative social scientist should know already. First, however, a little background on Coursera's specializations....
Coursera has recently introduced a handful of "specializations", each consisting of a series of short courses followed by a capstone project. The specializations continue Coursera's effort to monetize its offerings through the Signature Track, which offers a "Verified Certificate" for those who pay a fee (typically about $50-$100) to take the course.
The Signature Track in itself has dubious value. Allegedly, its purpose is to provide a more useful credential than the certificates Coursera has traditionally offered for its free classes. To make the Verified Certificate more useful (that is, more impressive to potential employers), Coursera takes measures to guarantee you did the work yourself, but these measures seem fairly easy to circumvent. Specializations add an additional sweetener: if you take every class in the specialization on the Signature Track, you can then take the capstone project (offered as an additional class), which is not available to students who take the courses for free (or even, for that matter, to students who pay for only some of the courses). Students completing the specialization also receive a specialization certificate.
The Data Science specialization includes 10 short (four-week) classes, including the capstone, each priced at $49 for the Signature Track. If you stump up the whole $490 at once, you can take any of the courses as many times as you like over the next two years (in case you don't pass the first time); if you pay for the courses one at a time, you can only retake each one once (which is probably enough—honestly, if you can't pass one of these classes, you probably don't belong in the profession, but sometimes life gets busy, and you can't finish the work for a class). Each of the first nine courses will be offered once a month; the first six are available already, and the remaining three will be offered for the first time in June. For a couple of the classes, there's also an option to substitute an alternate course on Coursera. The capstone has yet to be scheduled (word in the forums has it that it'll be offered in the fall), and I'm not sure how often Coursera plans to offer it.
The course that really interested me was Getting and Cleaning Data, but I signed up for The Data Scientist's Toolbox because it's required for the rest of the specialization; R Programming is also required, but I already had some experience with R, and I had no intention of completing the entire specialization, and so I skipped this one. Taking one of the later courses at the same time as the required intro course didn't pose any difficulties for me, but I think that someone who has no experience with R would probably want to complete that class before tackling any of the others.
Much of The Data Scientist's Toolbox is devoted to introducing the topics of the specialization's other eight courses; frankly, you can skip this if you don't intend to take those courses (or possibly, even if you do intend to take them—you will, after all, cover that information later, though if you're taking the whole specialization, you may need to watch the video lectures in question in order to complete the quiz for Week 1). For me, the most useful content of this class was its introductions to Git, GitHub, and RStudio (I had been using the plain old R Console, and RStudio makes things considerably easier). RStudio is required for the programming necessary to complete the assignments in the later courses, and Git and GitHub are necessary to complete the projects at the end of each course (you have to upload your work to GitHub so that other students can perform peer assessments on it). For the sake of full disclosure, let me say that I skipped the introductory lectures in Week 1 of this course (though I did pass all the quizzes), and did not complete the course project, which consisted of taking screenshots to prove that you'd installed Git, GitHub, and RStudio (I installed all three, but I wasn't really concerned with getting the course certificate).
I found Getting and Cleaning Data invaluable. I took the course because I wanted to learn how to get data off the web. For example, in the project I did for Coursera's Social Network Analysis last year, I ended up saving data from several hundred web pages by hand, which is not a particularly efficient way of doing things. Getting and Cleaning Data promises to teach students how to extract data from common data storage formats (including SQL databases, XML, JSON, and HDF5), and from the web using API's and web scraping. The syllabus also includes tips on using R to clean and recode data, and, in the last lecture, a long list of links to sources of data. It's also worth noting that the style of the video lectures is a bit different from those of the other classes I've taken: there's never any video of the instructor, just the instructor's voice over the lecture notes.
Initially, I was skeptical, because most of the lectures amount to little more than a list of R packages, functions (with a few short examples), and links for further information. The information blows past you so fast that there's no hope of remembering much of it. However, the lecture notes (in both HTML5 and PDF—the HTML5 is a little awkward to navigate, but the links work, unlike in the PDF) provide a wonderful resource that you'll find yourself referring to again and again. I've often found that the hardest part of a project is knowing where to start, and the lectures in Getting and Cleaning Data point you in the right direction; in fact, I'm using information from the lectures on web scraping and JSON right now to do an updated version of my project for Social Network Analysis, a statistically informed visualization of which cards in the game Android: Netrunner appear together in the decks designed by players. Look for that to be posted here soon!
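To give a flavor of the sort of starting point the lectures provide, here's a minimal sketch of pulling JSON from a web API into R with the jsonlite package; the URL is a placeholder, not an endpoint from the course or from my project.

# Minimal sketch of reading JSON from a web API into R with jsonlite.
library(jsonlite)

url <- "https://api.example.com/cards.json"  # hypothetical endpoint
result <- fromJSON(url)       # parses the JSON into R lists and data frames
str(result, max.level = 1)    # inspect the structure before going further

# If the result contains a data frame, the usual tools apply, e.g.:
# head(result$cards); table(result$cards$faction)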
Among the data science courses that I've taken online, Getting and Cleaning Data is the first one that taught me how to go out and get data and then put it in a form that's usable for analysis. By contrast, Coursera's Machine Learning, taught by Stanford's Andrew Ng, provides highly practical advice on selecting and using algorithms, but does so using very canned programming exercises, in which the data has already been collected and processed. In fact, the two courses are highly complementary, at least inasmuch as they give you ideas about how to handle different stages of a data science project. It should be noted, though, that Machine Learning uses Octave (essentially the open-source version of MATLAB) rather than R; the Data Science specialization includes its own (much shorter) Practical Machine Learning course, as well as an earlier course on Regression Models that delves far more deeply into that topic than does Machine Learning.
I should add that, for this class too, I never completed the final project: it looks like a highly practical exercise, but I was short on time, and more interested in my own project; again, I didn't care much about earning a certificate, with my main concern being to learn the nuts and bolts of getting data from the web.
Finally, let me offer a few comments on the Data Science specialization as a whole. I would not recommend completing the entire specialization for anyone who's well-versed in statistics and the scientific method: if you're a competent social scientist (as opposed to someone who took one stats course as an undergraduate), you already understand important issues like sampling, causal inference, and reproducibility (though, admittedly, I've read more than a few articles by social scientists who evidently had a shaky grasp of these concepts). For a specialization that labels itself as "Data Science", there's also scant coverage of databases. That being said, anyone interested in data science might find Getting and Cleaning Data, R Programming, and Practical Machine Learning useful, and for someone who doesn't have a background as a quantitative researcher, I can't recommend this specialization's focus on the scientific method and applied statistics highly enough.
Why Data Science Needs Statistics
Friday, March 14, 2014
Why Scientists Make Better Data Scientists
I find this refreshing, after spending a great deal of time lately looking at job ads for data scientists: most ads focus on experience with specific software packages, rather than experience conducting rigorous research. I suppose the former is more of an objective measure than the latter, but I'm not sure how useful it is to hire based on what applications a person has used before, especially in a profession where the state of the art changes rapidly. Another problem is that people have started slapping the label "data scientist" on a wide variety of jobs: I've seen it applied frequently to database architect positions, or even to positions that have more to do with software development than data analysis.
At the moment, all of this matters to me because the contract on which I was working ended last December, and I'm now looking for a job again. I've had two good interviews, but I'm finding it very hard to break into a profession with a background different from that of traditional data analysts and business analysts. One thing I have learned is the power of networking: one of my interviews came from a contact my wife made while carpooling, and the other resulted from my submitting a resume to a small-business group recommended by a former co-worker. (Oh, and if anyone has any good job leads, I'm happy to network here, too. :)
Those of you who frequently visit my links page might have noticed that I've updated it quite a bit over the past few weeks, particularly in the sections covering online courses ("Self-teaching Resources" and "Formal Learning Resources"). Coursera and Udacity have some interesting new offerings that you might want to check out. I'm also planning to add a section listing portals and other commercial websites, and I need to go through all the links to make sure the information on them is up to date. As ever, if you have any suggestions for additional resources, please let me know!