Tuesday, October 8, 2013
Whoops, I haven't posted in a while.
In May, I started a new job. It has nothing to do with data science, but it has given me experience in supervising other writers, and it has kept me quite busy. That busyness explains not only why I haven't posted recently, but also the one caveat I have to add to the reviews I'm about to give you: I was never able to finish all the material for either course. Between the new job and moving my family into a new house, by the time I came up for air, it was too late to finish.
As I mentioned in my April post, I signed up for Coursera's Probabilistic Graphical Models, Machine Learning, and An Introduction to Interactive Programming in Python. I dropped An Introduction to Interactive Programming in Python almost immediately, after realizing that the course's focus on programming video games made it less useful for my purposes than I had hoped.
Probabilistic Graphical Models was taught by Stanford Professor and Coursera co-founder Daphne Koller. Coursera hasn't yet listed a new iteration of it, but if the previous pattern holds up, it should be offered again next year. As I mentioned before, I took this course because it includes Bayesian and Markov models, both of which show up in many job ads for data scientists. I decided not to take the optional programming track, figuring that it probably wasn't a good idea to be writing programs for two different courses in a language I was just learning (both this course and Machine Learning use MATLAB and/or the very similar Octave).
Machine Learning was taught by Andrew Ng, also a Stanford professor and Coursera co-founder, and is one of Coursera's best-known and most popular courses. It's also been taught by the University of Washington's Pedro Domingos, but Ng's version will be offered again starting October 14th. I signed up for the course because machine learning is one of the basic skills of data science, but I also wanted the chance to learn one of the most commonly used statistical programming languages, MATLAB/Octave.
As I said in my last post, Daphne Koller is not the most charismatic lecturer, and her explanations can be confusing. What I didn't say last time is that I don't think Koller entirely understands the medium in which she's working. In the classroom, asking questions of the professor can make up for a confusing lecture; Koller seems to be giving the same lecture she would give in the classroom, but without the opportunity to stop her and ask questions about each topic before moving on to the next, that same lecture doesn't work very well.
While the lectures are less than ideal, the quizzes are particularly troubling: rather than presenting a simple test of the material covered in the lecture, the quizzes ask students to move beyond the lecture material, drawing out implications on their own. Asking students to do this is a great pedagogical technique, especially in a graduate-level class. However, it works a lot better when the students have discussed the material in class, giving them the opportunity to start down that path together, with the professor's guidance. None of this is possible in an online class, and, even with discussion forums, rules against students providing answers to one another prevent full exploration of the quiz topics. Part of the problem is that students can see the quiz questions before beginning their discussion, rather than receiving a quiz or homework assignment only after the classroom discussion is over.
It might help to begin each quiz with more straightforward questions, giving students a little practice, before moving on to the ones that require additional thinking. Far from adopting this model, Koller actually exacerbates the problem with an unusually strict rule (by MOOC standards) for retaking quizzes: any attempt after the second is penalized. Because of this, I found myself taking quizzes I had no way to prepare for, because they introduced concepts for the first time, and I had no way to practice applying those concepts beforehand.
I want to stress here that I'm not simply some idiot who was in over my head. I'm trained in statistics, and I have experience using structural equation and time series models, both of which share similarities with probabilistic graphical models—and I was really interested in the course material. Koller acknowledges in her lectures that the course is challenging, and even seems to take pride in that fact. However, while the material is indeed challenging, the course is hard partly because it's badly taught. It's also possible that Koller is trying to cover too much material for the online format—the lack of classroom discussion not only makes individual topics more difficult, but increases the time required to cover each topic, since the teacher has to provide a much more detailed lecture, rather than relying on student questions to fill in holes.
While I didn't pursue the programming track, other reviewers have complained that they spent more time trying to figure out how to read in the data than they did conducting the analysis. Mind you, this is a problem that data scientists face in the real world, and so the criticism might not be completely fair.
Andrew Ng's Machine Learning is another beast altogether. Ng is in fact a charismatic, and very clear, lecturer; indeed, Koller uses a couple of his lectures in areas where the material in the two courses overlaps. Not only does Ng convey his topics clearly, but he stresses the practical aspects of the methods he's teaching, and provides useful tips about how to apply them in the real world. While Ng pulls students along at a pace much gentler than Koller's, he's still able to teach methods that, he insists, are advanced enough to be unfamiliar to many practicing data scientists. I should add that the automated system used to grade programming assignments works quite well. If I do have one criticism, it's that the programming assignments probably involve a little more copying and pasting than might be ideal for learning Octave, but then, copying and pasting isn't uncommon in real programming.
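To give a taste of the sort of method the course starts with, here's a minimal sketch of batch gradient descent for simple linear regression, one of the first techniques Ng teaches. I've written it in Python rather than the course's Octave, and the toy data and learning rate are my own inventions, so treat it as an illustration of the idea, not a course assignment.

```python
# A toy sketch of batch gradient descent for simple linear regression.
# Data, learning rate, and iteration count are all invented.

# Toy data: y is roughly 2x + 1 with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

w, b, lr = 0.0, 0.0, 0.05  # slope, intercept, learning rate
n = len(xs)
for _ in range(2000):
    # Gradients of mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    w -= lr * grad_w
    b -= lr * grad_b

print(f"fitted: y = {w:.2f}x + {b:.2f}")  # lands close to y = 2x + 1
```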
In short, this is a very good course, and I strongly recommend signing up for the session that starts October 14th. Now that things have calmed down a bit for me, I might even sign up for it myself.
Wednesday, April 24, 2013
Online Course Reviews: Coursera's Social Network Analysis and Foundations of Business Strategy—Plus New Courses to Check Out
I've recently completed Foundations of Business Strategy, taught by the University of Virginia's Michael J. Lenox, and submitted the final project for Social Network Analysis, taught by the University of Michigan's Lada Adamic. I'd like to share some comments on both of these Coursera offerings, as well as give readers a heads-up on other courses that have just started.
Social Network Analysis provided a good survey of the methods and applications in the field, covering random networks, measures of centrality, small world networks (and other topics related to the question of optimization), and the dynamic aspects of networks, such as contagion and opinion formation. Adamic's explanations were usually clear, and even a student with little knowledge of probability could have gotten the gist of most of the course material (and made use of Gephi to perform basic analysis), but equations were presented for those who wanted them, and the readings gave further detail. In fact, this is the only course I've had so far that made extensive use of academic journal articles (and a few written for a wider audience), some of them required and some recommended—they gave a much better impression of the history of social network analysis and the current state of the art than Professor Adamic could have given by herself. From a personal perspective, this topic particularly interests me because I can see how social network analysis might be applied to the study of ethnic politics, my previous area of research.
The course's only weakness lay in the (optional) programming track: the first three programming assignments, two in R and one in NetLogo, were largely exercises in copy-and-paste, rather than posing full-fledged coding tasks; they were, however, enough to give students basic familiarity with the two programming languages, and with the igraph package for R. In contrast to these "canned" assignments, the final project was almost completely unstructured, and while this provided welcome freedom to explore whatever topic a student wished, it also meant a steep learning curve for someone whose experience with R or NetLogo was limited to the earlier exercises.
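For readers who want a concrete taste of these methods, here's a minimal sketch of a few of the measures mentioned above. I've used Python's networkx package rather than the course's R/igraph (the two offer similar functionality), and a randomly generated network rather than course data, so this is an illustration, not course material.

```python
# A sketch of basic network measures, assuming the networkx package
# (pip install networkx); the network itself is randomly generated.
import networkx as nx

# An Erdős–Rényi random network, as in the course's unit on random graphs.
G = nx.erdos_renyi_graph(n=100, p=0.05, seed=42)

# Restrict to the largest connected component so path lengths are defined.
G = G.subgraph(max(nx.connected_components(G), key=len)).copy()

# Two of the centrality measures the course covers.
deg = nx.degree_centrality(G)
btw = nx.betweenness_centrality(G)
hub = max(deg, key=deg.get)
print(f"most central node: {hub} "
      f"(degree {deg[hub]:.3f}, betweenness {btw[hub]:.3f})")

# Small-world diagnostics: high clustering plus short average paths.
print(f"average clustering:    {nx.average_clustering(G):.3f}")
print(f"average shortest path: {nx.average_shortest_path_length(G):.2f}")
```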
Compared to the other courses I've taken, Foundations of Business Strategy proved much less time-consuming, with required readings limited to very short chapters from a forthcoming book by Lenox (and when I say "short", I mean it's more a pamphlet than a book, with chapters only a few pages long), and a business case each week. The professor encouraged discussion of each case, both in small groups and in the discussion forum, and each week recorded a debriefing that made reference to students' comments in the forum; however, the only assignments that needed to be turned in before the final project were (relatively easy) weekly quizzes covering the lecture topics. Quizzes, and hence the lectures they covered, could be completed at any time during the course, though completing the lectures late meant that a student had no chance to participate in the discussions of the associated cases. The final project was a 1500-word "executive summary" of a strategic analysis of an organization of each student's choice.
Despite the sparseness of the course material, the class provided a useful framework for business strategy, a framework—and this is the part that surprised and impressed me, after all the scurrilous rumors I've heard about business schools, and the weak business students I've taught in my own classes—that was solidly grounded in microeconomics, with no mention at all made of the latest management fads. No, someone who took this course isn't guaranteed to become a strategic genius, or even, necessarily, an effective strategic thinker, but that's because strategy requires making decisions in an environment that's inherently complex and ambiguous, the upshot of which is that giving students a good framework for organizing thought—and a chance to practice strategic thinking on real cases—is about the best that a teacher can do.
My one serious concern with the course was the rubric used for peer review of the final project: the assignment presented a set of criteria for grading that focused largely on the quality of analysis and writing, and specifically warned against trying to use all the methods of analysis that had been covered in the course; by contrast, the rubric that was actually used for peer grading of the assignment amounted to a checklist of topics in the course, and penalized students for not covering each one, while leaving little room for judging the quality of the report. As a former professor, I certainly understand that detailed rubrics, while beloved by students, tend to push attention in grading towards mechanical aspects of the assignment, and this probably goes double for peer assessment, but it is possible to create rubrics that give more or less clear guidance for making qualitative judgments. More importantly, whatever rubric is used needs to match the criteria that are spelled out in the original assignment.
New Courses to Check Out
I've signed up for three courses that have just started, though odds are I'll need to withdraw from one of them, due to time constraints. I stumbled upon Probabilistic Graphical Models—the term "graphical" didn't suggest anything I was interested in, but a look at the course description revealed that probabilistic graphical models (PGM's) include Bayesian and Markov networks, both of which feature in decision-making and machine learning, and both of which show up repeatedly in job ads for data scientists. Coursera co-founder Daphne Koller, of Stanford University, lacks the charisma and clear explanations of the three MOOC professors I've previously learned from; she also comes off (if I may be subjective here) as a bit pretentious, an impression that makes her inclusion of the Simpsons' family tree as an example of a genealogical network more cringeworthy than cool. I've found most of the material so far to be readily understandable, but I'm guessing it wouldn't be for someone without my background in statistics (especially since I've used both structural equation and time series models in my research, and these feature many of the same concepts found in PGM's). This is a graduate-level course, and Koller herself describes it as challenging even by those standards. Needless to say, the other side of that coin is that anyone who gets through the course will have a solid foundation in PGM's, and also, for those taking the programming track, knowledge of Octave and/or MATLAB (the two are close relatives), especially given weekly programming assignments in an 11-week course.
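For readers to whom "Bayesian network" is as opaque as "graphical" was to me, here's a minimal sketch of what such a model does: the textbook rain/sprinkler/wet-grass network, in plain Python, with invented probabilities, answering a conditional query by brute-force enumeration. (Real PGM software exploits the graph structure precisely to avoid this kind of enumeration, which blows up quickly as variables are added.)

```python
# The classic rain/sprinkler/wet-grass Bayesian network in plain Python,
# with invented probabilities. The joint distribution factors along the
# network's edges, and we answer P(Rain | GrassWet) by enumeration.

P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {  # P(Sprinkler | Rain): rain usually means no sprinkler
    True:  {True: 0.01, False: 0.99},
    False: {True: 0.40, False: 0.60},
}
P_wet = {  # P(GrassWet | Sprinkler, Rain)
    (True, True):   {True: 0.99, False: 0.01},
    (True, False):  {True: 0.90, False: 0.10},
    (False, True):  {True: 0.80, False: 0.20},
    (False, False): {True: 0.00, False: 1.00},
}

def joint(rain, sprinkler, wet):
    """Joint probability, factored along the network's edges."""
    return P_rain[rain] * P_sprinkler[rain][sprinkler] * P_wet[(sprinkler, rain)][wet]

# P(Rain | GrassWet): sum out the hidden variable (Sprinkler).
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(f"P(Rain | GrassWet) = {num / den:.3f}")  # about 0.358
```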
I've also just started An Introduction to Interactive Programming in Python, taught by multiple instructors from Rice University, and Machine Learning, this iteration taught by Coursera co-founder Andrew Ng of Stanford, one of two professors on Coursera to teach this course; like Probabilistic Graphical Models, Machine Learning makes use of Octave. I doubt, however, that I'll have time for all three courses, meaning that I'll likely have to withdraw from one of them, probably Probabilistic Graphical Models or Machine Learning, given their overlap in programming language and, to a lesser extent, subject matter.
Thursday, March 14, 2013
Online Courses to Check Out
Having nearly completed Stanford Professor Jennifer Widom's Introduction to Databases, I've recently begun two more massive open online courses (MOOC's) that you, the reader, might want to take a look at, both of them offered by Coursera.
The first course is Social Network Analysis, taught by Lada Adamic of the University of Michigan. This methodology, which can be applied to topics as divergent as infrastructure and epidemiology (as well as the more obvious targets such as Facebook), obviously plays a prominent role in data science, which is one reason to take the course. A second reason is that the course features an optional programming track with four assignments (including a peer-graded final project), some using NetLogo and some using R; I'm taking the course partly as a way to learn R. The course also makes use of Gephi for basic network analysis. In the second week, there are two versions of the lectures, with an advanced version for students with a background in probability distributions and differential equations; it's not clear whether this will be the case in later weeks. This is a nine-week class, and if you're reading this soon after I've posted it, you can still sign up and get full credit, since the first assignment isn't due until Friday night (March 15th).
Taking the advice of one of my contacts to learn something about business, I've also signed up for a non-technical course, Foundations of Business Strategy, taught by Michael J. Lenox of the University of Virginia. This six-week class features a textbook that Lenox is currently developing, as well as the case method typical of business-school education (Lenox recommends small-group discussion to get the full impact of this method). The most interesting feature of the course is a peer-graded final assignment in which each student writes a short but well-researched strategy memo for the CEO of a company of his or her choice; more interesting still, Lenox has invited organizations that would like their strategy assessed to join the course and offer themselves as cases for the students' final projects. Though we're already about 25% of the way through the course, the assignments all have the same deadline of April 14th, and so it's easy to catch up.
You might also keep a lookout for two courses starting the latter half of April, An Introduction to Interactive Programming in Python, taught by a team from Rice University, and the perennially popular Machine Learning, taught by Coursera co-founder Andrew Ng of Stanford (this one uses Octave, a close relative of MATLAB, for those keeping track of programming languages).
Tuesday, February 26, 2013
Is There an Academic Role for the Social Sciences in Data Science?
I've spent most of the space in this blog exploring routes for a social scientist to become a data scientist. However, Justin Kern's recent article "The State of Business Intelligence in Academics" has turned my thoughts briefly back to academia. Kern reports on a survey by the organizers of BI Congress that found that, although business intelligence courses are taught primarily by information technology and management information systems departments, they're increasingly being offered outside those disciplines, in particular in finance, marketing, and accounting.
What role should social scientists be playing here? Should we merely be taking classes from other departments, or should economics, political science, and sociology departments have their own offerings in data science? (Since this survey reported specifically on business intelligence courses, it's possible that similar courses in the broader field of data science were missed, but I suspect that such offerings are pretty rare in the social sciences.) I'd love to get some comments on this subject.
Monday, February 18, 2013
Political Ideology and Consumer Brand Choice: Applying Social Science to a Marketing Problem
This morning Shep Parke sent me a copy of the Harvard Business School newsletter The Daily Stat, which links to "Ideology and Brand Consumption", a 2011 paper by marketing professors Romana Khan, Kanishka Misra, and Vishal Singh. The gist is that buyers of consumer packaged goods (CPG's) in counties that vote more Republican, or where more people go to church, buy more goods from established brands and fewer generic or store-branded goods, as well as fewer goods from newer brands.
The authors' explanation of these findings is that people who have conservative ideologies are more fond of tradition and the status quo, and wary of change. The paper obviously caught my eye because of the political component, especially as it touches on voting behavior, which is one of my specialties. I thought that this article might provide me a chance to give an example of how a social scientist can be valuable in analyzing business data.
The value of a social scientist here might not be obvious: after all, three social scientists already did the hard work, and now, don't we have an actionable insight for producers of consumer products? Not quite: the paper's findings are interesting, and I'd have no qualms publishing them in an academic journal as a starting point for discussion, but I wouldn't risk money and brand loyalty on something this vague and uncertain. As a social scientist, not only can I identify the questions that this research doesn't answer, but I can also suggest some practical ways to answer those questions. The short version is that we need more detailed data, preferably at the individual level, and that we might even want to run a few experiments to test our hypotheses.
There are two basic issues here. The first is whether we can make the jump from county-level data to individual consumers. The second is whether conservative personality traits actually influence shoppers' buying choices, or whether there's something else going on that just happens to be related to both factors. Let's address the problem of county-level data first. We know that more products of certain types are leaving the shelves in conservative counties, but we don't know exactly who's buying them. For example, it's logically possible (if unlikely) that the liberals in conservative counties buy more goods from established brands than the liberals in other places.
A more realistic concern is that, as any student of market segmentation can tell you, human psychology doesn't divide us neatly into two big groups, "conservatives" and "liberals". The fact that a candidate has to win a majority of the electorate leads voters naturally to bunch up into two competing groups, but there's a lot of diversity within each of those groups, as all of us who went to college have probably seen in the two-dimensional political graph that college Libertarian clubs like to trot out. But though we only have two choices as voters, as consumers, we have many more, and people who vote together might not shop together.
In the nineteenth century, when the big issues were things like freeing the slaves or allowing men without property to vote, we could safely say that conservative people favored the status quo, and no doubt many "conservatives" still do, but are those people who seek the safety of the well-known really the same as Tea Party members who want a revolution to roll back decades of big government, or the libertarians who favor gay marriage as ardently as they do low taxes? If it's actually just one group of Republican voters that favors the tried and true, we'd get a lot more bang for our marketing buck by focusing directly on them, or, at least, on places where they make up the biggest part of the population. There are other questions we could ask here, but you get the idea.
Even if we can identify the relevant segment of conservative voters who buy established brands, how can we be sure their conservative personality traits are what lead them to make those buying decisions? This question of causality is the one that, more than any other, keeps social scientists up at night. Sure, the psychological explanation offered by the paper's authors is a plausible story, but you can create lots of different plausible stories to explain any given set of facts (anyone who doesn't believe that should consider how quickly the latest management and marketing advice changes).
Here's one plausible story: the authors looked at sales from the same chain of stores in different counties, but the same chain typically offers a different mix of products at different stores. Places that are less densely populated typically have smaller stores, which offer a narrower range of goods, and I'd be willing to bet that where stores offer a narrower range, those offerings are dominated by established brands. And do you know what else is true of less populated places (that is, rural areas vs. big cities)? They tend to be more conservative. In other words, it's entirely possible that people in conservative counties buy more established brands because they don't have much of a choice.
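A toy simulation makes the point. In the sketch below, individual ideology has no effect at all on brand choice (shoppers pick uniformly from whatever is on the shelf), yet the county-level numbers still show the conservative county buying more established brands, purely because of assortment. All numbers are invented for illustration.

```python
# Toy simulation of the assortment confound: shoppers are ideology-blind
# (they pick uniformly from the shelf), but rural/conservative counties
# get smaller assortments dominated by established brands.
import random

random.seed(0)

def county_share(n_established, n_generic, shoppers=10_000):
    """Share of purchases going to established brands when each shopper
    chooses uniformly at random from the local assortment."""
    shelf = ["established"] * n_established + ["generic"] * n_generic
    buys = [random.choice(shelf) for _ in range(shoppers)]
    return buys.count("established") / shoppers

# Small rural store: 8 established vs. 2 generic products on the shelf.
print("conservative county:", county_share(8, 2))    # about 0.80
# Large urban store: 10 established vs. 10 generic products.
print("liberal county:     ", county_share(10, 10))  # about 0.50
```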
How does a social scientist address these issues to pull out some information that we can act on? First of all, I'd try to find some individual-level data. We may already have data on individual brand choices from store loyalty cards, but to make that useful, we need individual-level psychological data—that is, we need to know that specific individuals with personality trait X buy brand Y. The closest thing to that we're likely to have is demographic data on the holders of loyalty cards (both the data we gather from the loyalty card program, and data we can obtain from other sources and join with the loyalty-card data), but that may actually be counterproductive: sure, people with high incomes are more likely to vote Republican, but what actually interests us is people who vote Republican because they favor tradition, and their demographic data doesn't tell us a whole lot about that or any other personality trait. We're not, after all, actually interested in whether or not people vote Republican, but in the personality traits that make them both vote Republican and buy one brand rather than another.
In the end, assuming I work for a retailer with a loyalty-card program, I would try to survey a sample of card-holders (perhaps we could offer them coupons or some other incentives to participate). Actually, I probably wouldn't even ask political questions in the survey, because personality traits are what we're actually after (even if the observation about politics was what inspired us in the first place), and political questions might well offend our shoppers.
Getting individual-level data would get us closer to showing that personality traits cause consumers to make particular brand choices, both by looking directly at the traits and choices, and by allowing us to rule out other possible causes—for example, we could look at demographic data and psychological data at the same time in order to see which are related more closely to brand choices. And frankly, in the social sciences, that's about the best we can usually do. The "gold standard", though, is randomized experiments, because, if you divide people into two randomly chosen, essentially identical groups, and then do X to one and Y to the other, you can be pretty sure that any differences you see after that point are due to the difference between X and Y. We rarely do experiments that look at behavior in the real world (as opposed to a psychology lab), because they're expensive and they pose ethical questions, but they're pretty viable for a big retailer with outlets all over the place.
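Here's a minimal sketch of that randomization logic: subjects get unobserved baseline spending propensities, random assignment splits them into two arms, and because the baselines balance across the arms, the simple difference in means recovers the (invented) treatment effect.

```python
# A toy demonstration of why random assignment licenses causal claims:
# baseline differences balance across arms, so the difference in means
# isolates the treatment. The effect size and noise are invented.
import random
import statistics

random.seed(1)

# Each subject has an unobserved baseline propensity to spend.
baselines = [random.gauss(50, 10) for _ in range(1000)]

# Random assignment: shuffle, then split into two arms.
random.shuffle(baselines)
control, treated = baselines[:500], baselines[500:]

TRUE_EFFECT = 2.5  # the treatment adds $2.50 of spending, by construction
treated = [b + TRUE_EFFECT for b in treated]

est = statistics.mean(treated) - statistics.mean(control)
print(f"estimated treatment effect: {est:.2f}")  # within sampling noise of 2.5
```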
With personality traits, a true experiment is never possible, because you can't force people to have certain traits (how many parents, teachers, and managers have wished it were otherwise?), but we could, for example, make sure that two (or more) stores in places with different sorts of shoppers carried exactly the same selections in one or a few product groups, thus ruling out different lineups as a cause of different brand choices. This might cost us money (lost sales or extra inventory), but unlike an academic researcher, we can recoup that cost in higher sales that result from the new information.
It's interesting to note that this experimental approach can yield results even without individual-level data—that's the logic behind introducing products in test markets, after all—but if we can combine the experiment with a survey of the shoppers at the stores taking part in the experiment (even though using both the survey and the experiment is our most expensive option), we can leverage the data from the experiment and the survey to get more benefit out of both.
I'm an Author? A Musician? Who Knew?
I've noticed that many of the books advertised by Amazon on the sidebar of this blog are by one "Scott Orr". Just in case it isn't already clear, that Scott Orr is not the same Scott Orr writing this blog. Heck, I don't even know who that guy is, though I'm sure he's perfectly nice.
UPDATE: Actually, yes, I am an author. I wrote this, after all. I've written other things, too, for that matter, many of them published. More to the point, my namesake appears, on closer examination, to be selling music.
Friday, February 15, 2013
Stanford's Introduction to Databases vs. Big Data University's SQL Fundamentals I: A Comparison and Review of Online Courses
It's been too long since I last posted here. To put things simply, I found that I was spending more time maintaining the blog as a resource for others to learn data science than I was spending actually learning data science myself. Now that I've got more experience with online courses under my belt, I'd like to share my insights.
When I first became interested in data science, I began to take Big Data University's SQL Fundamentals I course. This course uses IBM's free DB2 Express-C platform, and content and links on the DB2 webpages, as well as the branding in some of the older course material, indicate a connection of some sort between Big Data University and IBM itself, though I couldn't find any statement of the nature of this connection. Posts in the Big Data forums mention that SQL Fundamentals I was originally a true course, with a schedule and interaction between teachers and students; it's now a self-paced course that makes use of video lectures by a variety of instructors, exercises (downloaded in PDF form), required reading (from free e-books produced by the DB2 community), and a final exam. It covers not only the basics of SQL (including queries and database modification), but theoretical modules as well, specifically, relational algebra and relational design theory.
Stanford University's Class2Go offers only a small number of courses (three, at present), but these include a 10-week Introduction to Databases, taught by Professor Jennifer Widom, which was originally offered in Fall Quarter of 2011 and is being offered again now, in Winter Quarter of 2013. Rather than partnering with a third-party provider of massive open online courses (MOOC's), such as Coursera or edX, Stanford has opted to go it alone, hosting the course on its recently established Class2Go platform (interestingly, Stanford professor Andrew Ng's popular Machine Learning course is still on Coursera, which Ng co-founded). Introduction to Databases uses video lectures, interactive online quizzes and exercises, and exams; supplemental readings are suggested but not required. This review will address the parts of the course that cover the same subjects as the Big Data University course: parts of the introduction, relational algebra (most of week 2), SQL (week 3), and relational design theory (week 4); the 10-week course also covers XML, UML, OLAP, NoSQL, and some advanced SQL topics, such as triggers, views, and authorizations.
First of all, full disclosure: I didn't get past "Getting Started" and "Lesson 1" in the Big Data University course before I started taking the Stanford one. That may not sound like a lot, but it includes most of the reading (seven of the eight chapters) in the course—and the most difficult reading—and I got bogged down with that, though I did finish all of it. That means, ironically enough, that I never completed the actual SQL portions of the Big Data course, though I've since examined some of the lectures and exercises for those portions. I've been through only the first four weeks of the Stanford course, but I wanted to publish this review in time for readers to join the course late and still be able to get something out of it.
On the whole, I think the Stanford course is the better one, mostly because the lectures contain more material, and the exercises are more demanding. Specifically, the Stanford course features lectures that go into greater depth, and exercises in relational algebra and writing SQL queries that require a lot more thought than the Big Data exercises on the same subjects; there's more emphasis on the logic being applied, rather than merely learning rules and syntax, but, as is often the case, struggling with difficult problems helps to solidify memory of rules and syntax. It also helps that the Stanford course is more interactive: I haven't made any use of the virtual office hours provided by the course's teaching assistant, and both courses have forums, but the Stanford course has a few nice extra features, such as short quizzes during lectures, and automated online exercises and quizzes that allow you to check to see if your answers are right or wrong—often many times over—without revealing the correct answers and thereby preventing you from working them out for yourself. The Big Data University course does cover some syntactical nuances of SQL that the Stanford course misses. Moreover, as someone with more training in statistics than in computers, I already have a decent grasp on the logic used in relational algebra and SQL queries, because it's quite similar to what a statistician uses in recoding variables and filtering cases. Nonetheless, I think that the Stanford course, because its exercises ask more of a student, does a better job of teaching rules and syntax.
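For readers curious about the sort of query both courses drill, here's a small self-contained sketch using Python's built-in sqlite3 module (SQLite being the engine behind the Stanford course's web front-end); the table and data are invented.

```python
# A toy SQL exercise: an in-memory SQLite database, a small table,
# and a query combining aggregation, grouping, and ordering.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE enrollment (student TEXT, course TEXT, grade REAL);
    INSERT INTO enrollment VALUES
        ('alice', 'Databases', 95), ('alice', 'ML', 88),
        ('bob',   'Databases', 72), ('carol', 'ML', 91);
""")

# Average grade per course, highest first.
for course, avg in conn.execute(
    "SELECT course, AVG(grade) FROM enrollment "
    "GROUP BY course ORDER BY AVG(grade) DESC"
):
    print(course, round(avg, 1))
```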
The Stanford course also handles theoretical subjects better. In the Big Data University course, "Lesson 1" includes relational algebra and relational design theory, but the lectures cover only the basics of relational design theory, and skip relational algebra altogether. The rest of the material is relegated to the readings, and while I would normally prefer this, since my reading speed is much faster than the speed of a recorded lecture, these readings, while clearly the result of loving hard work, are not well written. The primary problem seems to be that most if not all of the authors are non-native speakers (one thing you learn grading papers is that native speakers and non-native speakers tend to make entirely different errors), and it's extremely difficult to write coherent text in a language other than your native tongue. Indeed, the reading on relational calculus (which, in fairness, is a subject not covered by the Stanford course) was so difficult to follow that I never did glean even the most basic principles from it—and I'm a person who got 800's on the math and logic sections of the GRE. My criticism, by the way, doesn't apply to the recorded lectures, which are quite understandable even though some of the lecturers are also authors of the written materials, and all of the lecturers appear to be non-native speakers. The Stanford course is not without its own weaknesses: judging by both my own experience and posts in the course's forums, the lectures on relational design theory, especially the sections on decomposition and normalization, simply don't go into enough detail to allow students to grasp the subjects in question and complete the exercises—the ideas are all there, but not always spelled out. Nonetheless, the Stanford course covers even this material better than does the Big Data course.
The main drawback of the Stanford course, obviously, is that it's only offered at specific times. As of this writing, the course is in week 5, and it looks like you can still register (the registration page is still up, but I didn't create a fake account just to make sure it works) and try to catch up; given the amount of work each week requires (several hours, with weeks 2-4 being especially tough), this could be difficult for those lacking time or dedication, but Professor Widom does stress that the course is suitable for "a la carte" learning, picking and choosing the topics of interest. The course materials were available online after the close of the last instance (look, I used database jargon!) of the class, in Fall 2011, but they were taken down and re-used for this one; hopefully, the materials will again be offered online for those who need to learn about databases before the next round of the course.
Finally, one strong point of the Big Data University course that bears mentioning is that the lectures cover downloading and installing a specific SQL platform, DB2 Express-C. By contrast, the Stanford course relies on a web front-end superimposed on SQLite; that means that students don't have to worry about installing any software (they're invited to install software and download the exercise databases if they like, but there's no requirement to do so, and I've been quite successful in doing everything online), but it is nice that I learned the basics of DB2 in the Big Data Course. The flip side of that, of course, is that DB2 is only one of many SQL platforms, though it's admittedly one of the most popular.
UPDATE: Widom has announced that after the current iteration of Introduction to Databases concludes, all of the class materials (including interactive exercises) will remain online.