Below, I've included links to a number of useful resources for social scientists who want to become data scientists; to make the list even more useful, I've added a little bit of information about each link. If you have any suggestions for additional resources I can list, please let me know! I'd also like to be alerted to broken links, of course.
Note that you can use the links in the sidebar to jump to a specific section of this page from here or anywhere else in the blog.
Defining the Data Scientist—and Why Anyone Would Hire You as One
The articles below move from defining data science, to describing what data scientists do, to giving practical advice on becoming a data scientist:
"What Is Data Science? What Is Analytics? What Is a Data Scientist?"
Davenport and Patil, "Data Scientist: The Sexiest Job of the 21st Century (Executive Summary)"
Press, "Big Data News of the Week: Sexy and Social Data Scientists"
Miller, "Data as a Social Science"
Dyche, "Big Data 'Eurekas!' Don't Just Happen"
Roe, "So You Want to Be a Data Scientist?"
Koploy, "Three Career Secrets for Aspiring Data Scientists"
Koploy, "Advice for the Aspiring Data Scientist"
Roy, "What to Look for When Hiring a BI Specialist"
"On Becoming a Data Scientist", "Part 1—The Destination", "Part 2—The Technical Core, for Free", "Part 3—The Softer Side", and "Part 4—Managing". (Part 2, incidentally, has a long list of resources to help in learning R.)
And last but not least, some advice (not specific to data scientists) for moving from academia to industry:
Khalilov, "A Guideline to Move from Academia to Industry—Part 1" and "A Guideline to Move from Academia to Industry—Part 2"
Wood, "The Ph.D.'s Guide to a Nonfaculty Job Search"
As the articles linked above make obvious, there's considerable disagreement over what a data scientist is and where they come from. Everyone agrees that data scientists attempt to extract useful information from big data (another term open to interpretation, by the way). However, some writers focus on how data scientists approach problems; they see data scientists as researchers who bring a particular way of seeking knowledge (the scientific method, with its rigorous approach to testing hypotheses through statistical analysis of experimental and quasi-experimental designs) to a new domain. Other writers focus on the computer skills required to use the tools that data scientists use; they see data scientists as an improved version of the traditional data analyst, adding new tools such as MapReduce and Hadoop to the traditional skill set of SQL, Java, C++, and the like. There's also a third school of thought, which I've seen in job ads, that focuses on experience in marketing research; individuals with this experience often possess business degrees, though they typically have picked up some statistical and database skills as well.
Real data scientists fit all of these molds. More to the point, for an aspiring data scientist, real employers may subscribe to one vision or another, and post job ads that reflect that point of view. If you've been trained as a social scientist, odds are you don't have the 5+ years of SQL or C++ experience that some employers seek. On the other hand, even if we're talking about technical skills, someone trained as a database administrator probably doesn't know a whole lot about advanced econometrics or sampling theory. Data scientists from all three sorts of background bring useful skills to the table, and so, in the grand scheme of things, this isn't a matter of who's right and who's wrong. However, you need to find the employers and job postings that match the background you have, or at least, to make the case to an employer that you have the skills and aptitude to do the job.
Yeah, I know, this is the section you really wanted to see. Logically, jobs are something you should look at after preparing yourself with all the other resources linked below, but I'll concede to reality here. :)
For tech jobs in general, there's Dice.
More specialized sites:
Of course, you'll find more job postings on, say, Indeed, which aggregates ads from many different sites, including Dice (I'm not sure if it includes AnalyticTalent or KDnuggets as well). The advantage of the more specialized job sites is that you can use more inclusive search terms without being overwhelmed by the number of hits. That being said, the search terms I've found most useful on Indeed are "'social science' research" (note the quotes), "statistician", and, obviously, "'data scientist'" (note the quotes again); "'data analyst'" tends to produce mostly jobs for traditional data analysts, specializing in computer skills.
For a different, and quite interesting, approach, try Hired, a headhunting firm that specializes in tech jobs, including data scientists; have a look at this article in Forbes for a full description. The catch is that Hired picks fewer than 10% of its applicants as candidates for its employer clients—but it promises 5 to 15 offers within a week for those who make the cut.
Finally, I've been asked by a representative of Lavastorm Analytics (a producer of business intelligence software) to add a link to the company's "Careers" page.
Professional Associations and Social Network Groups
The recently formed Data Science Association is the first professional association specifically devoted to the field. Yearly membership is free "for a limited time". The association plans a Forum and Ski Retreat in Vail, Colorado in March. Its website features a decent online library, a weekly list of data science news stories, and, of considerable importance for the professionalization of the field, a code of conduct.
Two active LinkedIn groups deal with data science and big data:
Big Data / Analytics / Strategy / FP&A / S&OP / Strategic Planning / Predictive & Business Analytics
Data Mining, Statistics, Big Data, and Data Visualization
The former group was established by IE. Analytics, and is the bigger of the two. The latter group, while smaller, seems to have more members from outside the U.S. There's naturally some overlap between the two in content and membership.
Group members do a great job of posting links to the latest articles in the field, and of course, the groups are wonderful for social networking. The unstructured nature of LinkedIn group discussions, though (think Twitter without hashtags), can make it hard to look for information on specific topics, and links to popular articles are often posted several times by different group members. I also find the job listings less than useful: those in the first group are not necessarily focused on data science, while those in the second, while more relevant, are also pretty small in number. Also, the "Big Data" group is moderated, and seems to suffer a bit from strange moderation decisions that sometimes see informational posts placed in the little-viewed "Promotions" section rather than "Discussions" (a fate not shared by IE. Analytics' own promotions). Moreover, the presence in the group of several female employees of IE. Analytics who tend to "Like" practically every post seems reminiscent of the pharmaceutical companies' former practice of hiring ex-cheerleaders to sell drugs to (mostly male) doctors.
The Data Science community on Google+ also covers the topic, though so far it has only a handful of members.
This list is very much incomplete, and I would certainly appreciate suggestions of additional blogs I can add.
We'll start with a site that aggregates multiple big data blogs, planet Big Data.
Ryan Swanstrom's Data Science 101, much like this blog, seeks to help those who want to become data scientists.
Gil Press's What's the Big Data? offers a wealth of current information on the field, with sections devoted to events, startups, interviews, and courses and graduate programs, in addition to the blog itself.
Academics may take particular interest in Zero Intelligence Agents, the blog of Drew Conway, co-author of Machine Learning for Hackers (see "Self-teaching Resources", below). This blog has been relatively quiet for the past year, but contains good examples of the application of data science methods to practical problems, and, especially, creative visualizations of the results.
The eponymous blog of Conway's co-author, John Myles White, also makes an interesting read.
The MIKE 2.0 blogs explore a wide variety of topics. I find Phil Simon's posts particularly interesting.
Carl Anderson's blog p-value.info features "[m]usings on data science, machine learning, and statistics". Anderson addresses these topics from a practical perspective, sometimes even including code in his posts.
The anonymous BInalytics blog focuses mainly on technical subjects, but also features some big-picture posts.
Noam Ross's eponymous blog offers quite a bit of advice on using R, and also some examples of the author's own research.
Jenna Dutcher writes the datascience@berkeley Blog for Berkeley's Data Science program. The blog features short commentaries on interesting articles, books, and videos in the data science field, and links to the original works.
Jeff Leek and Roger Peng, two of the three Johns Hopkins biostatistics professors who teach Coursera's new Data Science "Specialization", collaborate with Rafa Irizarry on the Simply Statistics blog. The authors have promised to feature top students from the Specialization in the blog.
Odds are, if you're a social scientist working to become a data scientist, you're going to have to teach yourself quite a bit. Of course, if you've got a PhD, you're pretty smart to begin with, and the fact that you want to become a data scientist suggests that you're pretty technically savvy; the upshot is that you could learn all of the things you need to know pretty quickly on the job. Unfortunately, most employers don't think like that, and write job ads as if they think potential employees are incapable of learning once hired. This is not entirely stupid, as an employer can be sure that someone who already knows a given skill will be able to use it, without having to worry about how well that employee can learn new ones. But it does unnecessarily filter out a lot of people who might be very useful, and given the speed at which data science and its associated computer applications are evolving, anyone who can't learn new skills quickly is not going to be very useful anyway.
In any event, the reality is that the more you know going in, the more employable you'll be. You might not have years of experience, but you can easily teach yourself enough to pass certification exams, and by teaching yourself you can learn more cheaply and, usually, more quickly than if you took a formal course. Below, I've assembled a list of free or cheap learning resources (if you do take certification exams, they should be your main expense), and I hope to add to this list as time goes on.
The best place to start, in my opinion, is Stanford University's massive open online course (MOOC) Introduction to Databases, taught by Jennifer Widom and offered on the OpenEdX platform. Stanford is offering the course for the third time, in a session that started January 7. As with most MOOC's, you won't encounter much of a problem if you sign up a few days or even a couple of weeks late, but if you miss the Winter 2014 version, all of the materials for the Winter 2013 run, including interactive exercises, are available on Stanford's older Class2Go platform, creating what amounts to a self-guided tutorial. I would be surprised if the Winter 2014 version isn't archived in this fashion as well.
I was initially skeptical about this course, figuring that any broad survey of the field would touch on each topic too lightly to be of any practical use. It turns out that I was wrong: yes, the course is an introductory one, and it covers a lot of topics, but on many of those topics, it goes into greater depth than specialized tutorials found elsewhere online, and it's particularly strong on SQL and database theory. The course provides exercises that pose real challenges, and presents them via an interactive platform that helps students to correct and learn from mistakes. Widom touts the course as being suitable for "a la carte" learning, but a novice will find all of the topics useful. You can see my full review of both this course and the Big Data University SQL course mentioned in the next paragraph here. Incidentally, I've listed more MOOC's below.
Big Data University has greatly expanded its course lineup in the past year. It offers a number of useful courses on SQL, Java, Pig, Hive, and Hadoop (Remember, it's fun to say "Hadoop"!); most of the courses are free, but very introductory. These are not MOOC's, but rather self-directed, at-your-own-pace tutorials. Big Data University's first SQL course is adequate for introducing the fundamentals of the query language, and also covers a lot of database theory, but the Stanford course is better on both scores, and the Big Data course has nothing to compare to the interactive exercises in the Stanford offering. One merit of the Big Data University course is that it gives the student practical familiarity with setting up and using a common SQL package, IBM's DB2 Express-C, which can be downloaded for free. Another merit of the SQL course is that it's offered in Polish, Portuguese, Russian, and Spanish, as well as English; a second course is offered in English, Portuguese, and Spanish.
MySQL Tutorial, as one might expect from the name, features many useful tutorials on Oracle's popular open-source (and free) database program. Don't be put off by the writer's questionable grammar (he's obviously not a native speaker) or sometimes odd organization.
Download MySQL and the MySQL Reference Manual at the MySQL Developer Zone.
Dubois, Hinz, & Pedersen's MySQL 5.0 Certification Study Guide (available from both Amazon and Barnes & Noble—there are also Kindle and Nook editions) comes highly recommended, once you've learned MySQL and are ready to take the Developer exams; its biggest selling point is that it was written by the authors of the exams. (In the interest of full disclosure: yes, those are affiliate links.)
Oracle's New to Java Programming Center provides a good start with that language, including tutorials.
Codecademy offers a number of short, interactive tutorials that cover each of several programming languages, including Python.
Kevin Sheppard's Introduction to Python for Econometrics, Statistics and Data Analysis, a free ebook, provides a guide to the popular scripting language that's especially relevant for our purposes. Note that there's an earlier, incomplete version of the book on the web; make sure to use this link for the most recent version.
For learning R, the BInalytics blog recommends Jones, Maillardet, and Robinson's slightly pricey textbook Introduction to Scientific Programming and Simulation Using R, available from Amazon (and also in a Kindle version) and Barnes & Noble (there's no Nook version available).
Conway and White's Machine Learning for Hackers insists it's not a guide for learning R, but you wouldn't be the first person to use it as a way to learn R while also studying machine learning; it has an accompanying website with code samples and other goodies. You can of course get it at Amazon or Barnes & Noble; both Kindle and Nook editions are available, and O'Reilly also offers an upgrade option to receive updates and non-DRM copies (for the Nook edition, at least, which is the one I own).
In a similar vein is Torgo's Data Mining with R: Learning with Case Studies, which the author of the BInalytics blog recommends highly for its case approach, though he does find it less challenging than Conway and White's book. It too is available at Amazon (and also in a Kindle version) and Barnes & Noble (again, there's no Nook version).
The R Project for Statistical Computing is a source for all things R.
RSeek can help you find additional information on a language whose name gives search engines fits.
Code School offers a very basic but very accessible online course called Try R, complete with a pirate theme and badges for completing each chapter. Completing the course takes only a couple of hours, at most, and on completion you'll be offered discounts on O'Reilly ebooks (50%) and print books (40%). A course on Ruby and a zombie-themed course on Ruby on Rails might also be relevant to a data scientist. All three of these are free, but Code School also offers paid courses.
You might also check out two recent posts by Noam Ross, one an account of a talk on debugging tools in R, and the other a very practical set of recommendations for speeding up R code.
For using R with big data, BInalytics recommends a series of tutorials posted by Jeffrey Breen on his Things I Tend to Forget blog. These tutorials make use of the RHadoop packages published by Revolution Analytics.
Obviously, Conway and White's Machine Learning for Hackers gives an introduction to machine learning as well as an introduction to R. Recently, a free alternative has become available for studying both of these subjects: James, Witten, Hastie, and Tibshirani's An Introduction to Statistical Learning, with Applications in R.
One intriguing option is Coursera, which offers free "Massive Open Online Courses" (MOOC's) from well-known universities. Most of these courses are not for college credit, you may have to wait a while until the course you need starts, and many of the courses are surveys (including two courses on big data, Introduction to Data Science and Web Intelligence and Big Data), rather than focusing on more specific, practically useful topics, but Coursera promises a degree of academic rigor not found in the average online tutorial, as well as extra features such as machine- and peer-grading. The fact that co-founder Andrew Ng researches machine learning bodes well for offerings related to data science.
The anonymous author of the BInalytics blog has identified a number of Coursera courses on data science topics that teach their students to program in R as part of the curriculum: Statistics One (probably not of much use to a social scientist with a quantitative background), Data Analysis, Computing for Data Analysis, Social Network Analysis, Mathematical Biostatistics Boot Camp, and Introduction to Computational Finance and Financial Econometrics. For other languages, he recommends Computational Methods for Data Analysis (MATLAB), Probabilistic Graphical Models (Octave/MATLAB), Passion Driven Statistics (SAS), and Computational Investing (Python); I should add that Andrew Ng's own Machine Learning course also uses Octave. The opportunity to study a substantive data science topic while learning a programming language at the same time is a good two-for-one deal.
Coursera also offers (non-programming) courses on business, which is an area that those interested in becoming data scientists should not neglect. BInalytics mentions Financial Engineering and Risk Management as being potentially useful to data scientists, and I'm currently enrolled in the almost entirely non-technical Foundations of Business Strategy.
Recently, Coursera has moved toward paying for extra features or, in a few cases, toward courses that cannot be taken for free. For more details, see below under "Formal Learning Resources".
In the same vein are Udacity and edX, the latter of which, like Coursera, offers courses in partnership with prestigious universities. At the moment, both services have more limited offerings than Coursera, but their catalogs are growing quickly, and Udacity offers a short course on Hadoop, a topic not covered by Coursera. In the beginning, their programming courses tended either to be very basic or to survey their topics at a general, conceptual level, but these offerings will be useful to those without a lot of programming experience, and their newer offerings include more advanced courses (Udacity also has one business course). Udacity's courses, it should be noted, are not strictly speaking MOOC's: like an online tutorial, you work at your own pace. However, for a fee, Udacity offers some of the interactivity that comes with a MOOC; see below, under "Formal Learning Resources".
Stanford University has adopted a policy of offering online classes through a variety of different outlets, including Coursera. Stanford's in-house efforts began with Class2Go, which offered three courses, including Widom's Winter 2013 Introduction to Databases, which I described above. The University's latest effort uses the open-source OpenEdX platform; a growing list of courses includes the latest iteration of Introduction to Databases, as well as a new offering called Statistical Learning, starting January 21. The latter course, taught by Trevor Hastie and Rob Tibshirani, seems to cover much the same ground as Ng's Machine Learning, but using R rather than MATLAB/Octave. Hastie and Tibshirani (with Gareth James and Daniela Witten) co-authored a book on the same material, An Introduction to Statistical Learning, with Applications in R, and in conjunction with the course, the book's publisher has made a PDF version available for free, to members of the public as well as students in the course.
Stanford also offers free courses through iTunes U, including a version of Ng's Machine Learning.
Finally, both KDnuggets and Udemy list a few free online courses among many more paid ones.
For those interested in more advanced topics, a list of free e-books can be found in Carl Anderson's blog p-value.info.
In a similar vein, Ryan Swanstrom has posted a list of free data science e-journals on his Data Science 101 blog.
Machine Learning Surveys bills itself as a "list of literature surveys, reviews, and tutorials on Machine Learning and related topics". As of January 2013, it listed 123 resources.
Formal Learning Resources
Mind you, there's also something to be said for taking formal classes: you've got something to put on your resume, you don't have to worry as much about motivating yourself, and you might even be able to get a recommendation from a teacher impressed by your aptitude. On top of all that, many courses include certification exams for "free". If you can afford the time and money for classes, they might well be a good option.
KDnuggets also lists a wide range of short training courses and university programs, as well as featuring sections on software, news, conferences, and even publicly available datasets, among a number of other things.
Udemy offers many paid but inexpensive (and occasionally free) courses on subjects relevant to data science. The site is notable for offering several, competing courses on many topics, with user reviews to help you make choices. One of my readers has recommended this $59 course on Java, though Udemy has a number of other offerings on that programming language.
Both Coursera and Udacity have moved recently toward pay models. Reports of cheating have darkened Coursera's reputation somewhat. Its "Signature Track" purportedly seeks to address this issue by requiring stringent identification procedures (including typing style detection) and a fee of $30-$100 in exchange for a "verified" certificate that can be shared with a college or employer (this is currently available only for a few courses, though you can still take those courses in the normal, free way). Given the doubtful efficacy of these identification measures (they wouldn't, for example, prevent a student from uploading an assignment file created by someone else), whether employers will place any stock in these verifiable certificates (as opposed to a simple line on a resume listing the course) remains an open question; their real value at present probably lies in a concurrent effort that has obtained American Council on Education (ACE) College Credit Recommendation Service (CREDIT; no, I don't know how that acronym works) recommendation for some of the introductory-level courses, a step that will allow students to gain college credit, for an additional fee assessed for an online Credit Exam. However, most readers of this blog are probably not undergraduate students taking introductory courses, and it's doubtful that credit will be offered by universities for the more advanced courses of interest to an aspiring data scientist (indeed, there are several courses offered with a Signature Track that aren't among those recommended by ACE CREDIT).
More recently, Coursera has begun to offer paid tutoring through Google's Helpouts (I'm not clear on whether Coursera is getting any revenue from this).
In addition, the company has begun to offer "Specializations", each of which features a series of short courses, followed by a capstone project of some sort. If you take all of the courses on the Signature Track, and then finish the capstone project, you'll receive a certificate for the Specialization. For example, the new Data Science specialization, offered in conjunction with Johns Hopkins, includes nine short courses and a capstone, each for $49, for a total of $490, including the opportunity to retake failed courses for up to two years; the first iteration will run from Apr. 7 to roughly the end of July. All of the courses are also available for free, but you can't enroll for the capstone project unless you pay for the entire Specialization. The Data Science specialization looks to be fairly useful, but, at a cursory glance, seems to have little coverage of databases; on the other hand, the Getting and Cleaning Data course offers to impart some very practical skills, such as using API's to extract data from the web (this in particular is something that I personally have been wanting to learn).
Udacity offers its at-your-own-pace courses for free, but for a third of its courses, including the ones most useful for aspiring data scientists, offers extra services (called "Full Courses"), such as coaching and product feedback, for a "subscription" fee, typically on the order of $150 per month per course (meaning that the faster you finish the course, the less you pay).
The Georgia R School offers 10 courses on R for $95 each, with monthly and yearly memberships available. There's a discount for students, and the school offers a 14-day free trial.
Cloudera offers not only online courses, but certifications as well, for its Apache Hadoop platform.
An increasing number of universities offer master's degrees or graduate certificates in data science, data mining, or business analytics. These programs offer the allure of a solid credential, but, as with business degrees, costs tend to be high, and financial aid (other than student loans) scarce.
Doug Henschen of Information Week has catalogued what he regards as the top 20 master's degrees in the field, with a mention of 10 additional programs, and the promise of more to come. Many of the universities offer part-time and/or online curricula, and many offer (shorter, cheaper) certificates as well as degrees.
Other lists can be found in Gil Press's What's the Big Data? blog, in Ryan Swanstrom's Data Science 101 blog, and on the homepage of North Carolina State University's Institute for Advanced Analytics.
You might also consider applying for the Insight Data Science Fellows Program, especially if you're a PhD candidate or new graduate who wants to work in the Bay Area of California. This is a six-week program that's project-based (rather than classroom-based) and includes mentoring and interviews with top Silicon Valley companies.
Do you find it hard to motivate yourself to practice your skills on canned exercises? If so, check the following websites for real-world data you can download and get your hands dirty with.
One problem with finding data is that many websites that provide public access to databases do so only through web interfaces that, while often sporting impressive visualization tools, don't allow for serious statistical analysis (probably by design—organizations often have no desire to give up proprietary data for free). If you really want to analyze data, you need to be able to download an entire dataset (or a subset of it). All of the sites below allow free downloads of databases in one form or another. Incidentally, I do need to note that I shamelessly took the first three of these sites from Gil Press's October, 2012 article on Foreign Policy's website, "10 Big Data Sites to Watch".
The U.S. government's Data.gov offers approximately three zillion datasets for your analytical pleasure (well, actually, close to 400,000, but that's still more than you can examine in your lifetime).
DataMarket offers an intriguing variety of both government and industry data. As far as I can tell, datasets can be downloaded only through an API (at least, for free users), rather than in ASCII or Excel form, but data from different sets can be combined, and, quite usefully, DataMarket provides links to the providers of the data, making alternate methods of download possible in many cases.
As mentioned above, KDnuggets lists a number of sites with publicly available datasets. These include images, blog posts, and even songs.
The U.S. Census Bureau provides a variety of demographic and economic data. The website's organization leaves something to be desired, but look here, here, and here for downloadable datasets.
The UCI Machine Learning Repository, maintained by UC Irvine's Center for Machine Learning and Intelligent Systems, houses 235 datasets as of January 2013.
If you're interested in politics, try the data from the American National Election Studies, widely considered the most important series of surveys of U.S. voters.
If health is more your cup of tea, the National Center for Health Statistics, part of the Centers for Disease Control and Prevention, provides an interesting range of data on its FTP server.
Many devotees of data science swear by public competitions as a means to hone their craft, as well as to gain public notice that might lead to employment. This is obviously not really an option for a beginner, but the thrill of competition can certainly provide good motivation for learning.
Kaggle, of course, is the best-known host of data competitions, with 75 competitions, 10 of which are currently active, as of January 2013.
Innocentive hosts challenges from a broad array of disciplines. Many of these problems are amenable to data science methods.
TunedIT hosts competitions (it lists 32 as of January 2013) as a means of promoting its data mining platform.