Tuesday, February 26, 2013

Is There an Academic Role for the Social Sciences in Data Science?

I've spent most of the space in this blog exploring routes for a social scientist to become a data scientist. However, Justin Kern's recent article "The State of Business Intelligence in Academics" has turned my thoughts briefly back to academia. Kern reports on a survey by the organizers of BI Congress that found that, although business intelligence courses are taught primarily by information technology and management information systems departments, they're increasingly being offered outside those disciplines, in particular in finance, marketing, and accounting.

I wonder what role social scientists should be playing here. Should we merely be taking classes from other departments, or should economics, political science, and sociology departments have their own offerings in data science? (Since this survey reported specifically on business intelligence courses, it's possible that similar courses in the broader field of data science were missed, but I would suspect that such offerings are pretty rare in the social sciences.) I'd love to get some comments on this subject.

Monday, February 18, 2013

Political Ideology and Consumer Brand Choice: Applying Social Science to a Marketing Problem

This morning Shep Parke sent me a copy of the Harvard Business School newsletter The Daily Stat, which links to "Ideology and Brand Consumption", a 2011 paper from marketing professors Romana Khan, Kanishka Misra, and Vishal Singh. The gist is that buyers of consumer packaged goods (CPGs) in counties that have voted more Republican or where more people go to church buy more goods from established brands and fewer generic or store-branded goods, and also fewer goods from newer brands.

The authors' explanation of these findings is that people who have conservative ideologies are more fond of tradition and the status quo, and wary of change. The paper obviously caught my eye because of the political component, especially as it touches on voting behavior, which is one of my specialties. I thought that this article might provide me a chance to give an example of how a social scientist can be valuable in analyzing business data.

The value of a social scientist here might not be obvious: after all, three social scientists already did the hard work, and now, don't we have an actionable insight for producers of consumer products? Not quite: the paper's findings are interesting, and I'd have no qualms publishing them in an academic journal as a starting point for discussion, but I wouldn't risk money and brand loyalty on something this vague and uncertain. As a social scientist, not only can I identify the questions that this research doesn't answer, but I can also suggest some practical ways to answer those questions. The short version is that we need more detailed data, preferably at the individual level, and that we might even want to run a few experiments to test our hypotheses.

There are two basic issues here. The first is whether we can make the jump from county-level data to individual consumers. The second is whether conservative personality traits actually influence shoppers' buying choices, or whether there's something else going on that just happens to be related to both factors. Let's address the problem of county-level data first. The problem we have is that we know more products of certain types are leaving the shelves in conservative counties, but we don't know exactly who's buying them. For example, it's logically possible (if unlikely) that the liberals in conservative counties buy more goods from established brands than the liberals in other places.
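
To make the county-level problem concrete, here is a minimal Python sketch. Every number in it is invented for illustration (none comes from the paper), and the names `counties`, `county_summary`, and `within_county_rates` are hypothetical. It builds three counties in which the established-brand share rises with the conservative share of shoppers, even though liberals are the heavier buyers of established brands inside every county.

```python
# Hypothetical counts per county: for each ideology,
# (number of shoppers, number of them who bought the established brand).
counties = {
    "A": {"conservative": (20, 6), "liberal": (80, 32)},
    "B": {"conservative": (50, 25), "liberal": (50, 30)},
    "C": {"conservative": (80, 56), "liberal": (20, 16)},
}


def county_summary(groups):
    """Return (conservative share of shoppers, established-brand share of shoppers)."""
    total = sum(n for n, _ in groups.values())
    buyers = sum(b for _, b in groups.values())
    return groups["conservative"][0] / total, buyers / total


def within_county_rates(groups):
    """Return each ideology's established-brand rate inside one county."""
    return {ideology: b / n for ideology, (n, b) in groups.items()}


for name, groups in counties.items():
    cons_share, est_share = county_summary(groups)
    rates = within_county_rates(groups)
    print(
        f"county {name}: conservative share={cons_share:.2f}, "
        f"established share={est_share:.2f}, "
        f"conservative rate={rates['conservative']:.2f}, "
        f"liberal rate={rates['liberal']:.2f}"
    )
```

The county totals alone would suggest that conservatives drive established-brand sales, but the within-county rates in this made-up data say the opposite, which is why aggregate data can't tell us who is actually buying.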

A more realistic concern is that, as any student of market segmentation can tell you, human psychology doesn't divide us neatly into two big groups, "conservatives" and "liberals". The fact that a candidate has to win a majority of the electorate leads voters naturally to bunch up into two competing groups, but there's a lot of diversity within each of those groups, as all of us who went to college have probably seen in the two-dimensional political graph that college Libertarian clubs like to trot out. But though we only have two choices as voters, as consumers, we have many more, and people who vote together might not shop together.

In the nineteenth century, when the big issues were things like freeing the slaves or allowing men without property to vote, we could safely say that conservative people favored the status quo, and no doubt many "conservatives" still do, but are those people who seek the safety of the well-known really the same as Tea Party members who want a revolution to roll back decades of big government, or the libertarians who favor gay marriage as ardently as they do low taxes? If it's actually just one group of Republican voters that favors the tried and true, we'd get a lot more bang for our marketing buck by focusing directly on them, or, at least, on places where they make up the biggest part of the population. There are other questions we could ask here, but you get the idea.

Even if we can identify the relevant segment of conservative voters who buy established brands, how can we be sure their conservative personality traits are what lead them to make those buying decisions? This question of causality is the one that, more than any other, keeps social scientists up at night. Sure, the psychological explanation offered by the paper's authors is a plausible story, but you can create lots of different plausible stories to explain any given set of facts (anyone who doesn't believe that should consider how quickly the latest management and marketing advice changes).

Here's one plausible story: the authors looked at sales from the same chain of stores in different counties, but the same chain typically offers a different mix of products at different stores. Places that are less densely populated typically have smaller stores, which offer a narrower range of goods, and I'd be willing to bet that where stores offer a narrower range, those offerings are dominated by established brands. And do you know what else is true of less populated places (that is, rural areas vs. big cities)? They tend to be more conservative. In other words, it's entirely possible that people in conservative counties buy more established brands because they don't have much of a choice.
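
One way to check a story like this is to compare stores of the same size. The sketch below uses invented store records (the `stores` list, its values, and the helper `mean_share_by` are all hypothetical) in which small stores sell mostly established brands and happen to sit mostly in conservative counties. Comparing by county leaning alone shows a gap; comparing within each store size makes it vanish.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical stores:
# (county leaning, store size, share of sales from established brands)
stores = [
    ("conservative", "small", 0.80),
    ("conservative", "small", 0.78),
    ("conservative", "small", 0.82),
    ("conservative", "large", 0.60),
    ("liberal", "small", 0.80),
    ("liberal", "large", 0.61),
    ("liberal", "large", 0.59),
    ("liberal", "large", 0.60),
]


def mean_share_by(records, key):
    """Average the established-brand share within each group defined by key."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {group: mean(shares) for group, shares in groups.items()}


# Naive comparison: county leaning only.
by_leaning = mean_share_by(stores, key=lambda r: r[0])

# Stratified comparison: county leaning within each store size.
by_leaning_and_size = mean_share_by(stores, key=lambda r: (r[0], r[1]))

print(by_leaning)
print(by_leaning_and_size)
```

In this made-up data the whole difference between conservative and liberal counties comes from the mix of store sizes, not from anything about the shoppers themselves.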

How does a social scientist address these issues to pull out some information that we can act on? First of all, I'd try to find some individual-level data. We may already have data on individual brand choices from store loyalty cards, but to make that useful, we need individual-level psychological data: that is, we need to know that specific individuals with personality trait X buy brand Y. The closest thing to that we're likely to have is demographic data on the holders of loyalty cards (both the data we gather from the loyalty card program, and data we can obtain from other sources and join with the loyalty-card data), but that may actually be counterproductive: sure, people with high incomes are more likely to vote Republican, but what actually interests us is people who vote Republican because they favor tradition, and their demographic data doesn't tell us a whole lot about that or any other personality trait. We're not, after all, actually interested in whether or not people vote Republican, but in the personality traits that make them both vote Republican and buy one brand rather than another.

In the end, assuming I work for a retailer with a loyalty-card program, I would try to survey a sample of card-holders (perhaps we could offer them coupons or some other incentives to participate). Actually, I probably wouldn't even ask political questions in the survey, because personality traits are what we're actually after (even if the observation about politics was what inspired us in the first place), and political questions might well offend our shoppers.

Getting individual-level data would get us closer to showing that personality traits cause consumers to make particular brand choices, both by looking directly at the traits and choices, and by allowing us to rule out other possible causes (for example, we could look at demographic data and psychological data at the same time in order to see which are related more closely to brand choices). And frankly, in the social sciences, that's about the best we can usually do. The "gold standard", though, is randomized experiments, because, if you divide people into two randomly chosen and essentially identical groups, and then do X to one and Y to the other, you can be pretty sure that any differences you see after that point are due to the difference between X and Y. We rarely do experiments that look at behavior in the real world (as opposed to a psychology lab), because they're expensive and they pose ethical questions, but they're pretty viable for a big retailer with outlets all over the place.
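
The mechanics of random assignment are simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the store IDs, the seed, and the function names `randomize` and `difference_in_means` are hypothetical, and a real analysis would also need a significance test and enough units for the groups to balance out.

```python
import random
from statistics import mean


def randomize(units, seed=None):
    """Shuffle the units and split them into two groups of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def difference_in_means(treated_outcomes, control_outcomes):
    """Estimate the effect of X versus Y as the gap between group averages."""
    return mean(treated_outcomes) - mean(control_outcomes)


# Hypothetical store IDs; any list of shoppers or stores would work the same way.
store_ids = [f"store_{i:02d}" for i in range(20)]
treatment, control = randomize(store_ids, seed=2013)

print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides which group a unit lands in, nothing about the units (store size, location, shopper politics) can systematically line up with the treatment, so the difference in average outcomes can be attributed to the treatment itself.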

With personality traits, a true experiment is never possible, because you can't force people to have certain traits (how many parents, teachers, and managers have wished it were otherwise?), but we could, for example, make sure that two (or more) stores in places with different sorts of shoppers carried exactly the same selections in one or a few product groups, thus ruling out different lineups as a cause of different brand choices. This might cost us money (lost sales or extra inventory), but unlike an academic researcher, we can recoup that cost in higher sales that result from the new information.

It's interesting to note that this experimental approach can yield results even without individual-level data—that's the logic behind introducing products in test markets, after all—but if we can combine the experiment with a survey of the shoppers at the stores taking part in the experiment (even though using both the survey and the experiment is our most expensive option), we can leverage the data from the experiment and the survey to get more benefit out of both.

I'm an Author? A Musician? Who Knew?

I've noticed that many of the books advertised by Amazon on the sidebar of this blog are by one "Scott Orr". Just in case it isn't already clear, that Scott Orr is not the same Scott Orr writing this blog. Heck, I don't even know who that guy is, though I'm sure he's perfectly nice.

UPDATE: Actually, yes, I am an author. I wrote this, after all. I've written other things, too, for that matter, many of them published. More to the point, my namesake appears, on closer examination, to be selling music.

Friday, February 15, 2013

Stanford's Introduction to Databases vs. Big Data University's SQL Fundamentals I: A Comparison and Review of Online Courses

It's been too long since I last posted here. To put things simply, I found that I was spending more time maintaining the blog as a resource for others to learn data science than I was spending actually learning data science myself. Now that I've got more experience with online courses under my belt, I'd like to share my insights.

When I first became interested in data science, I began to take Big Data University's SQL Fundamentals I course. This course uses IBM's free DB2 Express-C platform, and content and links on the DB2 webpages, as well as the branding in some of the older course material, indicate a connection of some sort between Big Data University and IBM itself, though I couldn't find any statement of the nature of this connection. Posts in the Big Data forums mention that SQL Fundamentals I was originally a true course, with a schedule and interaction between teachers and students; it's now a self-paced course that makes use of video lectures by a variety of instructors, exercises (downloaded in PDF form), required reading (from free e-books produced by the DB2 community), and a final exam. It covers not only the basics of SQL (including queries and database modification), but theoretical modules as well, specifically, relational algebra and relational design theory.

Stanford University's Class2Go offers only a small number of courses (three, at present), but these include a 10-week Introduction to Databases, taught by Professor Jennifer Widom, which was originally offered in Fall Quarter of 2011. It's being offered again now, in Winter Quarter of 2013. Rather than partnering with a third-party provider of massive open online courses (MOOCs), such as Coursera or edX, Stanford has opted to go it alone, hosting the course on its recently established Class2Go (interestingly, Stanford professor Andrew Ng's popular Machine Learning course is still on Coursera, which Ng co-founded). Introduction to Databases uses video lectures, interactive online quizzes and exercises, and exams; supplemental readings are suggested but not required. This review will address the parts of the course that cover the same subjects as the Big Data University course: parts of the introduction, relational algebra (most of week 2), SQL (week 3), and relational design theory (week 4); the 10-week course also covers XML, UML, OLAP, NoSQL, and some advanced SQL topics, such as triggers, views, and authorizations.

First of all, full disclosure: I didn't get past "Getting Started" and "Lesson 1" in the Big Data University course before I started taking the Stanford one. That may not sound like a lot, but it includes most of the reading (seven of the eight chapters) in the course, including the most difficult reading, and I got bogged down with that, though I did finish all of it. That means, ironically enough, that I never completed the actual SQL portions of the Big Data course, though I've since examined some of the lectures and exercises for those portions. I've been through only the first four weeks of the Stanford course, but I wanted to publish this review in time for readers to join the course late and still be able to get something out of it.

On the whole, I think the Stanford course is the better one, mostly because the lectures contain more material, and the exercises are more demanding. Specifically, the Stanford course features lectures that go into greater depth, and exercises in relational algebra and writing SQL queries that require a lot more thought than the Big Data exercises on the same subjects; there's more emphasis on the logic being applied, rather than merely learning rules and syntax, but, as is often the case, struggling with difficult problems helps to solidify memory of rules and syntax. It also helps that the Stanford course is more interactive: I haven't made any use of the virtual office hours provided by the course's teaching assistant, and both courses have forums, but the Stanford course has a few nice extra features, such as short quizzes during lectures, and automated online exercises and quizzes that let you check whether your answers are right or wrong, often many times over, without revealing the correct answers, so you can still work them out for yourself. The Big Data University course does cover some syntactical nuances of SQL that the Stanford course misses. Moreover, as someone with more training in statistics than in computers, I already have a decent grasp on the logic used in relational algebra and SQL queries, because it's quite similar to what a statistician uses in recoding variables and filtering cases. Nonetheless, I think that the Stanford course, because its exercises ask more of a student, does a better job of teaching rules and syntax.
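
For readers coming from statistics, here is a small sketch of that parallel, using Python's built-in `sqlite3` module and an in-memory database. The `respondents` table and its rows are invented for illustration; they aren't taken from either course. `WHERE` plays the role of filtering cases, and `CASE` plays the role of recoding a variable.

```python
import sqlite3

# In-memory SQLite database holding a small, made-up survey table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE respondents (id INTEGER PRIMARY KEY, age INTEGER, party TEXT)"
)
conn.executemany(
    "INSERT INTO respondents (id, age, party) VALUES (?, ?, ?)",
    [(1, 23, "R"), (2, 41, "D"), (3, 67, "R"), (4, 35, "I")],
)

# Filtering cases: WHERE keeps only the rows that meet a condition.
over_30 = conn.execute(
    "SELECT id FROM respondents WHERE age > 30 ORDER BY id"
).fetchall()

# Recoding a variable: CASE maps a numeric age onto labeled categories.
age_groups = conn.execute(
    """
    SELECT id,
           CASE WHEN age < 30 THEN 'under 30'
                WHEN age < 65 THEN '30-64'
                ELSE '65+'
           END AS age_group
    FROM respondents
    ORDER BY id
    """
).fetchall()

print(over_30)
print(age_groups)
conn.close()
```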

The Stanford course also handles theoretical subjects better. In the Big Data University course, "Lesson 1" includes relational algebra and relational design theory, but the lectures cover only the basics of relational design theory, and skip relational algebra altogether. The rest of the material is relegated to the readings, and while I would normally prefer this, since my reading speed is much faster than the speed of a recorded lecture, these readings, while clearly the result of loving hard work, are not well written. The primary problem seems to be that most if not all of the authors are non-native speakers (one thing you learn grading papers is that native speakers and non-native speakers tend to make entirely different errors), and it's extremely difficult to write coherent text in a language other than your native tongue. Indeed, the reading on relational calculus (which, in fairness, is a subject not covered by the Stanford course) was so difficult to follow that I never did glean even the most basic principles from it, and I'm a person who got 800s on the math and logic sections of the GRE. My criticism, by the way, doesn't apply to the recorded lectures, which are quite understandable even though some of the lecturers are also authors of the written materials, and all of the lecturers appear to be non-native speakers. The Stanford course is not without its own weaknesses: judging by both my own experience and posts in the course's forums, the lectures on relational design theory, especially the sections on decomposition and normalization, simply don't go into enough detail to allow students to grasp the subjects in question and complete the exercises: the ideas are all there, but not always spelled out. Nonetheless, the Stanford course covers even this material better than does the Big Data course.
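
For anyone stuck on the same material, the core idea of decomposition can be spelled out in a short Python sketch. This is my own illustration using the standard textbook definitions of attribute closure and Boyce-Codd normal form (BCNF), not code from either course; the relation, the functional dependencies, and the function names are all hypothetical.

```python
def closure(attributes, dependencies):
    """Return every attribute determined by `attributes` under the dependencies."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in dependencies:
            # If we already know the whole left side, we also know the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(relation, dependencies):
    """List nontrivial dependencies whose left side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in dependencies
        if not set(rhs) <= set(lhs)
        and not set(relation) <= closure(lhs, dependencies)
    ]


def decompose(relation, violation, dependencies):
    """Split a relation in two around one violating dependency."""
    lhs, _ = violation
    determined = closure(lhs, dependencies) & set(relation)
    return determined, set(lhs) | (set(relation) - determined)


# Hypothetical relation: one row per student per course.
relation = {"student_id", "name", "course", "grade"}
fds = [
    (("student_id",), ("name",)),
    (("student_id", "course"), ("grade",)),
]

violations = bcnf_violations(relation, fds)
first, second = decompose(relation, violations[0], fds)
print(sorted(first), sorted(second))
```

Here `student_id` determines `name` without being a key for the whole relation, so the name would be repeated on every course row; splitting it out into its own table removes the redundancy.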

The main drawback of the Stanford course, obviously, is that it's only offered at specific times. As of this writing, the course is in week 5, and it looks like you can still register (the registration page is still up, but I didn't create a fake account just to make sure that it works) and try to catch up; given the amount of work each week requires (several hours a week, with weeks 2-4 being especially tough), this could be difficult for those lacking time or dedication, but Professor Widom does stress that the course is suitable for "a la carte" learning, picking and choosing the topics of interest. The course materials were available online after the close of the last instance (Look, I used database jargon!) of the class, in Fall 2011, but they were taken down and re-used for this one; hopefully, the materials will again be offered online for those who need to learn about databases before the next round of the course is offered.

Finally, one strong point of the Big Data University course that bears mentioning is that the lectures cover downloading and installing a specific SQL platform, DB2 Express-C. By contrast, the Stanford course relies on a web front-end superimposed on SQLite; that means that students don't have to worry about installing any software (they're invited to install software and download the exercise databases if they like, but there's no requirement to do so, and I've been quite successful in doing everything online), but it is nice that I learned the basics of DB2 in the Big Data course. The flip side of that, of course, is that DB2 is only one of many SQL platforms, though it's admittedly one of the most popular.

UPDATE: Widom has announced that after the current iteration of Introduction to Databases concludes, all of the class materials (including interactive exercises) will remain online.