Sunday, February 8, 2026

Umamusume and Generative AI

After being roped in by my teenaged daughter, I've been playing a lot of Umamusume lately. It's a Japanese gacha game about "horse girls" (uma musume, or "horse daughter", in Japanese), schoolgirls with horse ears and tails in an alternate universe who inherit their names, personalities, and careers from (Japanese) racehorses in our world. The game obviously and doubtless intentionally appeals to people whose preferences run toward young women with animal features (kemonomimi in Japanese); that's not my cup of tea, but I do know perfectly lovely people who enjoy that particular beverage, and I won't judge. However, in what follows I am going to render some judgments about generative AI, specifically the sometimes ridiculous answers that Google's AI prepends to search results. But first, more on the game....

Rather than the girls, what does appeal to me about the game is the strategy, which involves a lot of probabilistic reasoning—right up my alley, given the love of statistics that led me to become a data scientist. Like with any gacha game, there's a "meta" strategy of deciding how to parcel out scarce (for players who don't "whale" by spending hundreds or thousands of dollars) resources to "pull" for the assets needed to play the game (horse girls and "support" cards in this case): because it's a gacha game, you never know exactly what you're going to get when you pull, but you play the odds, and work with the information you have about probability distributions, while prioritizing the most useful targets.

Unlike the average gacha, Umamusume also features a deep tactical dimension of decision-making in the daily "grind" required to build up your team of horse girls. Each day, you guide at least one trainee through a "career", making training decisions and picking skills that, if you're lucky, will give her the stats and abilities she'll need for a place on your team. Sometimes you're working with explicit probabilities (like the probability that a skill will activate if its activation trigger occurs), and other times you're working with much vaguer contingencies (like the chance that the trigger condition will occur in the first place during an actual race), but in either case you're using probabilistic reasoning to make decisions, and yay, that's fun for a person like me (YMMV).

And though horse girls are not my cup of tea, I do find many of the characters quite appealing (shout-out here to King Halo, Nice Nature, and Narita Taishin), and I have to say that their background stories are well-written. That brings me to the Google search that inspired this post. To wit, I was watching the video story for an uma named Mejiro Dober when I encountered this:

If you're like me, you're wondering, "What's this 'Bell' business?" It's not an obvious nickname for a racer, like "Tiger" or "Beast" or even "Twinkle Toes", but its origin isn't explained within the story—or, for that matter, in any of the official Umamusume lore. So naturally I turn to Google and ask, "Why is Mejiro Dober called Bell?" and I get something like what you see below:

Note that this is the best of three answers Google's AI produced for me at different times. The first time, right after the character came out on the global server (which is ~3 years behind the original Japanese server), the AI flat-out insisted to me that Mejiro Dober is not in fact called Bell. The answer above (and another from a few hours earlier) at least hedges by acknowledging uncertainty, but both of the latter answers are still categorically wrong, in that the nickname does come from an official source. The AI also doesn't explicitly acknowledge that "Bell" does actually show up in search results (note the two circled hits below the AI summary), focusing instead only on the origin of the name.

Now you may be thinking, "Hey, Scott, you asked the wrong question: you asked why she was called 'Bell', not whether she was called 'Bell'." Yeah, that's true, because I already knew she was called that, and just wanted to know why. In all fairness, the "why" question is tougher, and in the first paragraph of the summary the AI quite rightly notes that it can't find an explanation. My problem is more with the second paragraph. But just for jollies, here's the AI's answer when I asked whether she's called Bell:

Still wrong. 

Presumably, the AI is using retrieval-augmented generation (RAG), meaning it's not just spitting out something retrieved from the model's own training (like what would happen if you asked ChatGPT), but rather doing a web search and then summarizing the results. Kudos to Google for following RAG best practice by linking to the top sources next to or underneath the AI summary, but in both screenshots the websites the AI links to are actually the two commonly consulted fan-made wikis, not official sources (and the embedded video in the left-hand one is actually an ad on the wiki site for an interview with the cast of Stranger Things, so not at all helpful). The wikis are reasonably authoritative, and reproduce art and text from the official website, which might confuse a well-meaning AI model, but they aren't official.
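For readers who haven't seen the pattern, the retrieval half of RAG can be sketched in a few lines. Everything here—the documents, the word-overlap scoring, the stubbed-out prompt—is my own toy stand-in; a real system would score with dense embeddings and hand the hits to an actual LLM:

```python
# Toy sketch of retrieve-then-generate: rank documents by word overlap
# with the query, then use the top hits as grounding context for an LLM.
def retrieve(query, documents, k=2):
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [
    "Mejiro Dober fan wiki: profile, skills, and story events",
    "Official site: one short paragraph about Mejiro Dober",
    "Unrelated page about a Stranger Things cast interview",
]
hits = retrieve("why is Mejiro Dober called Bell", docs)

# The generation step would then be prompted with only these sources:
prompt = "Answer using only these sources:\n" + "\n".join(hits)
```

The failure mode in my screenshots lives entirely in that first step: if the retriever never surfaces the official material, the summarizer can't cite it, no matter how good the language model is.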

The obvious reason that Google's AI doesn't link to the official sources is that the official Umamusume webpage for Dober contains all of one short paragraph of information; beyond that, to get official info, you'd have to look at the game's social media accounts, the background story videos (helpfully posted on YouTube), and the anime associated with the game (also on YouTube, but not entirely canonical, as the anime characters are often somewhat different from their game counterparts). I'm not sure whether Google's AI consumes these sources at all, but it's entirely doable with modern multimodal large language models (LLMs) to extract the text from videos (they're subtitled!) and index it along with everything else Google stores from the web; and yes, you can do the same with actual voice, which Google already does by generating subtitles with AI. Given the amount of information currently stored only in video format, you'd think the company would be doing that already. If the Google AI isn't looking at these more exotic sources (and maybe it's not, because "Bell" appears several times in the background story videos), then it's deceptive to state it couldn't find anything in official sources, since it's not actually looking at most of them in the first place.

As you might have guessed, the Dober/Bell mess is not the only questionable result about Umamusume that I've seen from Google's AI. Once, it even gave me advice on when to use skills during a race (not useful, because skill activation is automated and rule-based, not under the player's control), and I wish I'd screenshotted that one. That instance was uniquely bad, but here's a more typical response. I asked what the best skills are for Late Surgers (one of four racing styles, each defined by where they run relative to the rest of the pack for the majority of a race): 

A few of these recommendations are good, and a few are highly questionable (though they probably represent something some ill-informed player posted somewhere), but I'd like to focus on the ones that are flat-out wrong.

Most obviously, the skills circled in red can't actually be used by Late Surgers. Speed Star works only for Pace Chasers (another style, which runs closer to the front); Reeling in the Big One (Seiun Sky's unique skill, which can be inherited by other uma) technically can work on any runner, but to trigger it, you have to be ahead on a corner late in the race, and so it's usually used only by Front Runners.

The errors circled in blue are more subtle. Let's start with Uma Stan and Ramp Up: they're actually completely different skills, with entirely different trigger conditions, but neither one of them is likely to trigger in the late race (Ramp Up must trigger mid-race, and Uma Stan, because it can trigger any time a runner is close to 3 other runners, tends to trigger in the early race). Furious Feat and Position Pilfer seem to be presented as if they're different versions of the same skill (by way of comparison, above that line, you'll see On Your Left!, which is the premium, or "gold", version of Slick Surge, and the same is true of Rising Dragon and Outer Swell), but Position Pilfer is actually the non-premium ("white") version of Fast & Furious, which sounds a lot like Furious Feat, but isn't the same thing at all. Notably, while Position Pilfer and Fast & Furious are restricted to Late Surgers, Furious Feat, though readily usable by Late Surgers (it works on anyone in the back half of the pack), is restricted to Mile-distance races. The conflation of these skills explains the weird reference to "Mile and other distances".

In short, the advice provided by the AI here is effectively useless: there are some sound suggestions, but you need to know Umamusume pretty well to pick the wheat from the chaff, and anyone who knew the game that well wouldn't be asking this question in the first place (or would be asking for far more detailed answers on each skill, considering pros and cons).

It's true that generative AI can excel at so-called "zero-shot" tasks, constructing new things (like a list of good Late Surger skills) by assembling information using well-established rules and relationships. But performing this kind of "transfer-learning" task successfully requires that the model discriminate between what things it can transfer from one domain to another, and what things it can't. That works in domains like politics and economics and even real-life horse-racing, where, as a model is trained, it can extract those rules and relationships from billions of words of text. However, it tends to fall apart in a highly specialized domain about which people have written comparatively little, especially if it's a general-purpose model (like Google's AI), in which case it might try to transfer rules and relationships it really shouldn't. This is how we get the answer I didn't think to screenshot, treating Umamusume as if it were a game that allows players to make decisions during a race (which is how most racing games work). And it's how we get a response like the one in the screenshot above, where the AI can't figure out the rules well enough to plug the nuggets of info it's pulled from the web into the right places.

The common thread between both this and the Bell problem is that the AI model just doesn't have enough information on Umamusume to work with. For a human, there's more than enough information available to figure out the game, but training a generative AI uses a brute-force approach that requires lots and lots and lots of info to learn patterns that humans can pick up with a few minutes of light reading.

Oh, in case you're still wondering why Mejiro Dober is called "Bell", I did finally figure that out. I had a hunch it was one of those things that make more sense in the original language than in translation, so I consulted Google Translate, and sure enough, turns out the Japanese word for bell is "beru", while Dober's name in Japanese is actually (ignoring the subtleties of proper transliteration) "Mejiro Doberu", because "Doberu" is short for "Doberuman", the Japanese version of "Doberman"—all the Mejiro Farm foals that year were named after breeds of dog. The pun would be so obvious to a Japanese-speaker that it likely would rarely be commented on, making it hard for a RAG AI to find references to it even if the AI were pulling from Japanese as well as English sources. An AI language model might be able to recognize and reproduce word play, but figuring out that an English nickname derives from word play in another language appears to be beyond this AI's capabilities.


Monday, July 10, 2017

The Relationship between Machine Learning and Statistics

UPDATE from original 7/10/17 version to 9/14/17 version: I erroneously maligned confidence intervals for models of big datasets, conflating them with statistical significance; I've fixed that mistake below.


Like anyone who practices data science, I often get asked, by relatives and acquaintances, what "data science" is. Like any such question, it's not very hard to answer this one to the satisfaction of someone who knows little about the topic: in my case, I tend to describe the discipline as applying the principles of traditional statistics to large amounts of data, and I throw in mentions of the importance of writing code and manipulating databases. Nowadays, you can also mention machine learning, and many people will have at least a vague idea of what you're talking about.

However, even if the answer satisfies most listeners, it bothers me—because I've always wondered exactly where "traditional statistics" ends and "machine learning" begins. Defining that boundary turns out to be surprisingly difficult, but also pretty useful: it's one of those cases where the journey is more important than the destination. It's not really important exactly where we draw that line, but thinking about how machine learning differs from traditional statistics leads to further questions about whether (or rather, when) we can apply the accumulated wisdom of decades of statistical practice and quantitative research to the newer domain, and the answers to those further questions prove to be quite valuable.

The Easy Answer

The easy answer to the question is that machine learning is what statistics becomes when there's too much data to manipulate using traditional statistical algorithms. The most obvious illustration of this transformation is linear regression, where gradient descent replaces direct solution. The transformation has other implications as well: while direct solution (and other traditional algorithms) have long been built into statistical packages like SPSS, until recently, someone who wanted to use gradient descent would have to know at least enough code to install the right package and call the right function. (Mind you, we're seeing more and more machine learning algorithms packaged into easy-to-use GUIs nowadays, which will leave the coding for those who want to tweak algorithms, create new ones, or build them into applications--much as most users of SPSS never learn scripting, but experts in statistical methods can use it to create powerful extensions to the original package.) Likewise, storing and processing large datasets lends itself to database applications, which can serve up all that data much more efficiently than the traditional method of reading in a CSV.
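To make the linear-regression example concrete, here's a small sketch (my own illustration, with made-up data, not taken from any package) showing that gradient descent arrives at the same coefficients the normal equations deliver directly; the difference is purely computational, not statistical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Direct solution via the normal equations: solve (X'X) w = X'y
w_direct = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on mean squared error
w_gd = np.zeros(3)
lr = 0.1
for _ in range(500):
    grad = (2 / len(y)) * X.T @ (X @ w_gd - y)
    w_gd -= lr * grad

# Both routes converge to (essentially) the same coefficients
print(np.allclose(w_direct, w_gd, atol=1e-3))
```

The direct solution requires inverting (or factoring) a matrix that grows with the number of features and holding all the data at once; gradient descent only ever needs a pass over the data, which is why it scales where the textbook method doesn't.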

But the easy answer, while coherent, isn't entirely right. Data scientists actually use a number of techniques that we think of as "machine learning" even when the amounts of data involved are relatively small—indeed, no bigger than what a quantitative researcher in the 1980's would have dealt with. In 2015, my first year as someone with the job title "data scientist", my team worked on a number of demonstration projects for recommender systems. Because we hadn't deployed those systems yet, we usually didn't have real user data, let alone petabytes of it, and even where we did have all of the real data, it wasn't necessarily very big: for example, we used natural language processing (NLP) to measure the similarity between different pages on a website, and that website only had about 1200 pages, each of which included actual content of about two paragraphs. Nonetheless, we never doubted that our applications of collaborative filtering and NLP were "machine learning".
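As an aside, the page-similarity idea itself can be sketched without any machine learning library at all: represent each page as a word-count vector and compare pages by cosine similarity. (The methods we actually used learn much richer vectors, but the comparison step is the same; this sketch is my own illustration, not our actual system.)

```python
import math
from collections import Counter

def cosine_sim(text_a, text_b):
    # Bag-of-words vectors: one dimension per word, value = count
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

print(cosine_sim("training decisions each day", "decisions about daily training"))
```

Even on a site of only ~1200 short pages, this kind of pairwise comparison is cheap—which is exactly the point: nothing about the dataset forced us to call it "machine learning".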

Why? Well, I've never been entirely sure, but I think the answer is that machine learning includes all of those algorithms whose development was prompted by increasing amounts of data and increasing amounts of computing power. The Doc2Vec we used to analyze those 1200 web pages could probably have run on my TRS-80 Color Computer back in the 1980's (it might have been an all-night job), but no one had invented it yet. The same applies to collaborative filters and any number of other recently developed methods that produce useful results even with smallish datasets. All of these algorithms get labeled "machine learning" because they were invented by people who did "machine learning", and, just like the methods used on truly big data, they're usually applied through code rather than a traditional statistical package.

However, that's a pretty messy answer, and it really begs the question of the extent to which the difference between traditional statistics and machine learning is more a matter of style (or, to put it more nicely, work methods and habits of thought) than of substance.

Interesting Discussion, Scott, But Why Does That Matter?

Yes, there's a point to all this. To wit, the important thing to understand here is that, because there's no bright line between traditional statistics and machine learning, the laws of statistics weren't abolished the first time someone programmed a gradient descent algorithm onto a computer. To me, as a former quantitative researcher in the social sciences, that point has always been blindingly obvious—but in all the machine learning classes I've taken over the years, I've seen only occasional mentions of the relationship between older and newer methods, and I've almost never seen a discussion of the implications of the laws of statistics for machine learning. I've always been struck by this, because really, it's pretty easy to figure out some of those implications.

For example, when your data really is big, you don't have to worry about certain things: the variance due to random sampling is infinitesimal, which means that any differences you find are statistically significant (i.e., if your sample is unbiased, etc., you can be sure the differences are real, though that doesn't in itself imply that they're meaningful). But, as I pointed out above, the data handled by machine learning algorithms isn't always big, and how many data scientists bother to think about exactly how big a dataset has to get before you can stop thinking about significance tests? Confidence intervals present a somewhat more complex problem: with enough data to eliminate error due to random sampling, confidence intervals will be smaller, but when you've got randomness in the model (that is, your model doesn't account for 100% of the variance in outcomes), you still need confidence intervals (or something equivalent) to express the variability of possible outcomes. I've met data scientists who worry about these problems, but not many of them. Heck, for some of the new techniques, like neural nets, I'm not even sure how you'd go about computing a confidence interval. Feel free to Google it: yes, it can be done, but it's not something that even crosses the mind of the average data scientist, and I've never seen the topic so much as mentioned in a machine learning class I've taken.
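Here's a quick back-of-the-envelope illustration (my own numbers, not real data) of how sample size drives significance: the standard error of a difference of two means shrinks as 1/√n, so even a trivially small difference becomes "statistically significant" once n is large enough—real, in the technical sense, but not necessarily meaningful.

```python
import math

# Two groups with standard deviation 1 and a tiny true difference of 0.01
# between their means. SE of the difference of two means of size n each
# is sd * sqrt(2/n); "significant" at the usual level means |z| > 1.96.
sd, diff = 1.0, 0.01
for n in [1_000, 100_000, 10_000_000]:
    se = sd * math.sqrt(2 / n)
    z = diff / se
    print(f"n={n:>10,}  z={z:7.2f}  significant={z > 1.96}")
```

At a thousand cases per group the difference is statistical noise; at ten million it's twenty-odd standard errors from zero. Nothing about the difference changed—only the sample size did.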

The implication of statistics that causes me personally the most grief is regularization: regularization is really, really useful because it allows us to solve a linear regression equation even when the number of independent variables (er, sorry, "features") is greater than the number of cases—for someone trained in traditional statistics, it's nothing short of glorious magic, allowing you to do what should be impossible. So why my grief? Well, there are often cases (remember, data today can get very, very big) when the number of lines of data far exceeds the number of features in the model.

Having put much thought into the problem, I cannot figure out a very good reason why you actually need regularization in such a case, and I can see some real downsides to it: it requires more processing, and it will likely produce a less accurate result. And yet, in all of the machine learning classes I've taken, I've never seen a discussion of this issue, and I rarely see a machine learning package whose functions allow the programmer to decide not to use regularization—you can accomplish the same effect by putting in a tiny number (yes, the model still converges without any meaningful regularization, provided you have enough degrees of freedom), but of course, in doing so you can't get the computational advantages of leaving out regularization completely. There's an analogous argument for validation to avoid overfitting: if your dataset is huge, and your training sample is randomly selected, you really shouldn't have overfitting.

I may be utterly wrong on both of these points, but the larger concern is that none of the classes I've taken on machine learning has even raised these issues. The silence is so deafening that, in executing the coding exercises that are often required for job applications, I've submitted regularized models when I knew (or at least suspected) that regularization was pointless (I did, though, note that in my response, and in one case, I submitted an unregularized model alongside the regularized one--I sometimes wonder if that might have kept me from getting the job.) Even if I'm wrong, and the people teaching classes and coding machine learning packages have thought carefully about whether regularization and validation are actually needed in all cases, it would be useful to learn about the reasons for their decisions; after all, there are always situations in which a given method doesn't apply very well, and if you don't understand the assumptions behind a method, you won't be able to identify those situations.

And don't even get me started about the importance of training in statistical research for distinguishing causation from spurious correlation, as well as avoiding a variety of other analytical pitfalls.

So...when do we start giving every aspiring data scientist real training in statistics?

Tuesday, August 4, 2015

Online Course Review: Coursera and Stanford University's Mining Massive Datasets

It's been quite a while since I last posted—and eight months since I finished the class I'm reviewing. As it happens, I finally got a job as a data scientist in late January, and work has kept me busy. That job will be the subject of my next post, but right now, we're talking about Mining Massive Datasets, offered by Coursera and three professors from Stanford University, Jure Leskovec, Anand Rajaraman, and Jeff Ullman.

The seven-week course covers the same ground as the trio's book Mining of Massive Datasets, though in much less detail. (That link, incidentally, is to the e-book; if you really want the hardcover, you're welcome to follow this sponsored link to Amazon.) It's now been offered twice on Coursera, with a third iteration set to start on September 15th. Oddly, I hadn't intended to take this class: I'd taken a look at the schedule, seen a few topics that I had covered before in other courses, and decided that I wouldn't get a lot out of it. But a friend of mine was interested, and I signed up so that we could take it together.

What I missed, in my hasty scan of the course description, was that Mining Massive Datasets is not the typical data science course that shows students how to put useful algorithms into practice through code. Instead, this course is about the algorithms themselves: how they work, why they work at scale, and how they've been modified to improve performance or cover different situations. There is a lot of math: this is not a course for someone without a solid background in calculus and linear algebra (it's not like you need to remember how to integrate esoteric functions—but you do need to understand the basics). There are not, on the other hand, any programs to write: many of the exercises absolutely can't be done without writing short scripts or using a statistical language from the command line, but the professors don't require any specific language, and the code isn't graded, only the answers. The point is not to create functioning implementations of the algorithms in question, but rather to understand the nuts and bolts of how they work. The exercises are challenging, and sometimes require consulting the e-book, especially when the complexity of a topic makes it hard to grasp in a short lecture.

Although the bulk of the course is devoted to algorithms, Week 1 provides an excellent description of how HDFS and MapReduce work—without ever giving details on Hadoop or the various languages used to write mappers and reducers. In fact, I came away from Mining Massive Datasets with a far better conceptual grasp of distributed file systems than I got from the Udacity course devoted entirely to the subject. (See my review of that course here.)

It should be said that Jeff Ullman is an excruciatingly monotone lecturer, who sounds like he's reading everything directly from notes—and you likely wouldn't be at all surprised if he suddenly called out, "Bueller...Bueller...." In addition, his explanations are not as clear as those of Leskovec and Rajaraman. In fairness, though, Ullman tends to cover the most complex topics in the course, and I was always able to figure things out by consulting the book (which covers the material in greater depth, anyway).

In summary, I strongly recommend Mining Massive Datasets for anyone who wants to understand the nitty-gritty of algorithm design for big data. The course is not, however, for the faint of heart. You could make a very successful career in data science without ever taking it or anything like it—but taking it will certainly make you better at the profession.

Saturday, September 13, 2014

The Ethical Challenge of "Passive Predation" in Data Science: Can Data Science Provide the Solution, and Not Just the Problem?

I recently ran across an intriguing blog post from Michael Malek, on "Predatory Data Science". Malek notes that data science methods, especially "black box" machine learning, can unintentionally create what he calls "passive predation"—that is, taking advantage of some vulnerable group despite having no intention to do so. He uses the example of a machine learning model, created for a gun manufacturer, that ends up targeting marketing efforts at the suicidal, by identifying keywords associated with depression. The data scientist using the tool in question wouldn't have intended that result, and probably would never even be aware of it, because the group of suicidal depressives would be buried amidst thousands of other micro-segments identified by the same application.

Malek perhaps overdraws his point in the middle part of the post—a historical account of the dehumanizing effects of technology that's reminiscent of Marx's condemnation of working for money in "The Alienation of Labor"—but his main argument is quite sound, and not a little scary.

I wonder, though, if data science itself could provide a solution to this problem. I hereby announce a very unofficial contest to find one, with prizes that will prove trivial at best (I might take a winner out to lunch, or talk about his or her idea at a Data Community DC meetup). Pretty much any method of accomplishing this goal, technical or non-technical, is fair game. Any takers?

Thursday, September 11, 2014

Online Course Review: Udacity's Intro to Hadoop and MapReduce

For my first course on Udacity, I decided to take Intro to Hadoop and MapReduce, a course created in conjunction with Cloudera, a company whose business model is based on the open-source Apache Hadoop. To sum up my assessment, the course was useful, but could have been done much better.

The four-lesson course (short by Udacity standards) is supposed to take about a month to complete—like all Udacity courses, and unlike those of Coursera, this is not a true MOOC, taken alongside other students in real time, but rather an interactive tutorial. However, Udacity's model does feature student discussion forums; customers who pay (at the rate of $150/month) also get help from live coaches, feedback on their final projects, and the opportunity to earn a "verified certificate", similar to Coursera's Signature Track, with the difference that Udacity, unlike Coursera, no longer offers certificates for non-paying students. (As I've mentioned before, a verified certificate and two dollars may buy you a cup of coffee, but I wouldn't count on its having any greater worth.)

Before I delve into the specifics of this course, let me say that I'm not a real fan of the Udacity interface. While both providers break each lesson up into a series of short videos, Coursera labels each of those videos with a topic, making it relatively easy to go back and find the material you need; by contrast, Udacity strings all the videos for a particular lesson together under a single heading, and so you have to hunt through all of them to find something (you can click on individual videos, and each one has its own label, but you have to click on or hover over a video to see the label). In addition, whenever the video stops for a quiz, it drops out of fullscreen (assuming you're in fullscreen, of course). Moreover, Udacity's discussion forum (note the singular there) has no organization whatsoever, aside from keyword tags—making a search for specific information rather laborious.

The first three lessons of this particular course, which features two instructors from Cloudera, are structured in a manner that the director of a music video would appreciate: many of the videos are very short, and switch jarringly from one instructor to the other. Nonetheless, the instructors are engaging, and there's a nice interview with Doug Cutting about how he helped to create Hadoop, and named it after his toddler son's stuffed elephant. The first two lessons, which explain the basics of how Hadoop and HDFS work, can best be described as "lite"—unchallenging nearly to the point of tedium.

Lesson 3 marks an abrupt change: this is where the programming exercises begin. The class requires previous experience with Python, which I lacked, and so the exercises took more time for me than they should have, but I managed. One student in the forum questioned whether this was a course on Hadoop or a course on Python regular expressions, but doing the exercises helped me learn some Python, and, much as I hate the language, it does have a very powerful vocabulary of regular expressions. Unfortunately, the instructor blew by the concept of Hadoop streaming so fast (in Lesson 2) that I wasn't entirely sure for a while what exactly I was doing, though I was managing to get it to work—and once I looked up Hadoop streaming on my own (it is, for the record, an API that allows Hadoop mappers and reducers to be written in any language), I realized that the interface would work just as well for R.
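To show what Hadoop streaming actually asks of you, here's a minimal word-count mapper and reducer in the standard streaming style (my own sketch, not the course's exercises). In a real job, Hadoop pipes text through each script's stdin and stdout and sorts the mapper output by key before the reducer sees it; the last lines simulate that sort-and-pipe plumbing:

```python
# Streaming mapper: emit "word<TAB>1" for every word on every input line
def mapper(lines):
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

# Streaming reducer: input arrives sorted by key, so all counts for a
# given word are consecutive; sum each run and emit "word<TAB>count"
def reducer(lines):
    current, count = None, 0
    for line in lines:
        word, n = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

# Simulate Hadoop's shuffle-and-sort between the two phases
mapped = sorted(mapper(["the cat sat", "the cat ran"]))
for out in reducer(mapped):
    print(out)
```

Because the framework handles distribution, sorting, and fault tolerance, each script only ever deals with plain lines of text—which is why the reducer can get away with the "consecutive keys" assumption, and why any language that reads stdin (R included) works just as well.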

Although the simpler exercises use an online Python compiler, for the exercises that require large datasets, the course's creators deserve kudos for having students install a virtual UNIX box on which a virtual two-machine Hadoop cluster has already been set up, and then manipulate data and write code in this realistic environment. Unfortunately, the exercises that require this virtual machine seem half-baked.

First off, the instructors haven't actually detailed how to write and execute Python scripts on the UNIX machine (the class discussion forum was very helpful here). Second, the syntax needed to make the scripts work is different from the syntax presented in the video lectures (though, fortunately, there are working sample scripts saved on the virtual machine). Third, and most seriously, one particularly tricky exercise requires knowledge that students could not possibly get from the instructions, or, in all probability, the data itself, but could only get from the hints that emerge from a trial-and-error process of submitting answers to the automated grader—it was an interesting little mystery to solve, but there are no automated graders in real life, and so I'm not sure what I gained from the effort.

Yes, figuring out ambiguous instructions does have some pedagogical value, and in the end, completing the exercises was very satisfying. But, especially in the case of the problem that was insoluble without the automated grader, I got the feeling that the difficulties I faced were the result not of a pedagogical choice but of a simple lack of effort on the part of the instructors, and I felt like I had wasted part of my time.

According to posts in the forum, Lesson 4 was not part of the original class, though I'm not sure if it was planned all along, or tacked on later. To paraphrase Monty Python and the Holy Grail, the course was completed in an entirely different style at great expense and at the last minute. The lectures feature a different instructor, a Udacity employee, in place of the Cloudera instructors. This lesson covers design patterns, specifically filtering patterns (more regular expressions), summarization patterns (minimums, maximums, and means, for example), and structural patterns (combining data sets); one lecture also deals with combiners, scripts inserted between mappers and reducers to make things more efficient by doing some of the reduction locally on each machine in the cluster.
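The idea behind combiners is easy to sketch (again, my own illustration, not the lesson's code): each machine pre-aggregates its mapper output before anything crosses the network, so the reducer receives far fewer pairs. The function names here are hypothetical:

```python
from collections import Counter

def combine_local(pairs):
    """Combiner: pre-aggregate (word, count) pairs on one machine,
    shrinking the data shuffled across the network to the reducer."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

def reduce_global(per_machine_outputs):
    """Reducer: merge the already-combined partial counts from all machines."""
    totals = Counter()
    for pairs in per_machine_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

# Raw mapper output on two machines, combined locally, then reduced:
machine_a = [("hadoop", 1), ("python", 1), ("hadoop", 1)]
machine_b = [("python", 1), ("hadoop", 1)]
partials = [combine_local(machine_a), combine_local(machine_b)]
print(reduce_global(partials))
```

Note that this trick only works when the reduction is associative and commutative, as counting and summing are; a combiner for a mean, say, has to carry partial sums and counts rather than partial means.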

I found these lectures better than the previous ones, and the exercises better prepared. I will say, though, that I eventually got bored with writing new and different regular expressions in Python, and didn't finish the last few exercises (or the final project, which isn't graded for non-paying students in any case), though I did watch all of the lectures.

In the end, this half-baked pastiche of a course at least gave me a decent idea of how Hadoop works, and removed the mystique of manipulating data stored on a Hadoop cluster. I wouldn't know how to set up a cluster myself (that wasn't the intent of the class, though I don't think it would be all that hard to do), but I do know how to use Hadoop streaming—and I've realized it's not exactly rocket science.

Monday, August 4, 2014

Online Course Review: Exploratory Data Analysis, from Coursera's Data Science Specialization

Back in May, I reviewed two of the short courses that make up Coursera's Data Science specialization. Although the four-week format greatly limits the content of any one course, I was generally impressed by the scientific approach of the specialization (something all too often lacking in data "science"), and, in the case of Getting and Cleaning Data, by the many pointers provided to R packages and sources of information for further study: the course may not have gone into a lot of depth, but it provided a good overview of what you can do with R.

I recently completed a third course in the specialization, Exploratory Data Analysis, taught by Roger D. Peng (the previous courses I took were taught by Jeff Leek). While I enjoy Peng's lecture style (unlike Leek, he engages the audience by showing his face at the beginnings of lectures), and I learned a lot, the course suffers greatly from the short format.

I initially overlooked this class: from the name (more on this in a minute), I never would have guessed that 3/4 of the lectures would cover graphics in R. Peng teaches the basics of the language's three major graphics packages: the base graphics, lattice, and ggplot2. As with Getting and Cleaning Data, the lectures manage only to skim the surface, particularly for ggplot2, but they do give the student a decent idea of what's possible in R. I do think, though, that Peng could do a better job of outlining ggplot2's advanced features than simply pointing students to the book written by the package's author, Hadley Wickham (thankfully, it's possible to find free PDFs of the book online, but I'm not sure it's where I'd want to start for solving a discrete problem, rather than studying ggplot2 in a methodical way).

So what's with the name of the course? Peng presents visualization in R as a way of conducting initial exploration of data, but it's obviously useful for more than that, since R can create decent visualizations of the results of analysis. I suspect that the course name was chosen so that one week of lectures on clustering and dimensionality reduction could be shoehorned into the syllabus. This material probably belongs instead in the Practical Machine Learning course, but something had to be cut to limit that course to four weeks (cf. the nine-week Machine Learning, also offered by Coursera, and which I've reviewed previously—twice, actually). The fact that clustering and dimensionality reduction can be used for exploratory analysis and visualization is the only thing that ties the entire course together.

What's particularly disturbing is the way that all of this combines with the specialization's unique approach to exercises and evaluation. Each course includes a hands-on project, and, because open-ended projects in a MOOC must, for logistical reasons, be graded using a peer grading system, the final project for Exploratory Data Analysis ends up covering only material from the first two weeks of the course: students need the third week to work on the project, followed by the fourth week to grade it, so half the content of the class doesn't play any role in the project. On top of this (I suppose to avoid overloading students), there's no quiz, homework, or any other form of practice or evaluation covering the material on clustering and dimensionality reduction, which makes it hard for a student to know if he or she really understands those topics.

To sum up, I did find the information on data visualization in R useful, but I would have appreciated a full four weeks on the subject. The coverage of clustering and dimensionality reduction was out of place in the course; nonetheless, many will find it valuable (I had already seen most if not all of it in Machine Learning and another Coursera course, Social Network Analysis, which I've also reviewed).

I do have one more comment, though this applies to the Data Science specialization in general, and to Coursera, rather than solely to this course. Normally, after completing a Coursera course, a student can go back and look at the course archives at any later time; I've found this valuable when I suddenly find myself needing to refresh my memory or find out where I can learn more about a topic. Coursera has apparently disabled this feature for the Data Science courses: their archives are no longer accessible after the grading period is over (about a week after the finish of a course). I say "apparently" because, when I contacted Coursera a few months ago to ask why I could no longer access the archives of Getting and Cleaning Data, I never got a response—this is becoming something of a theme with Coursera, which, as I noted in my second review of Machine Learning, ignores most bug reports for that class. I suppose that paying customers might get better service, but I'm not going to pay just to find out if that's true.

Of course, you can always sign up for the current iteration of a class, since they're offered continuously, but it's annoying to have to do that each month. Fortunately, all of the class materials are also available in a GitHub repository, but it's not as easy to display documents on GitHub as in Coursera's web interface. For a set of courses that only skim the surface, and whose major value is in providing links to deeper information, this is a major failing.

Programming Languages for Big Data, Part 3

And now, one more word on the subject of R's speed. At my prodding, Tommy Jones contacted the authors of the study on programming language speed, and a productive discussion ensued. It turns out that the task in question was one that couldn't be vectorized, which means that R's main strength couldn't be applied in this case. However, it was possible to speed it up by writing C++ functions in R using Rcpp. The authors tried this, and revised their paper, reporting that, using Rcpp, R performed the task only 4-5 times slower than C++. For details, see Tommy's blog post, and the revised paper.