The Aspirational Data Scientist: Useful Links

Below, I've included links to a number of useful resources for social scientists who want to become data scientists; to make the list even more useful, I've added a little bit of information about each link. If you have any suggestions for additional resources I can list, please let me know! I'd also like to be alerted to broken links, of course.

Note that you can use the links in the sidebar to jump to a specific section of this page from here or anywhere else in the blog.

Defining the Data Scientist—and Why Anyone Would Hire You as One

The articles below move from defining data science, to describing what data scientists do, to giving practical advice on becoming a data scientist:

"What Is Data Science? What Is Analytics? What Is a Data Scientist?"

Davenport and Patil, "Data Scientist: The Sexiest Job of the 21st Century (Executive Summary)"

Press, "Big Data News of the Week: Sexy and Social Data Scientists"

Miller, "Data as a Social Science"

Dyche, "Big Data 'Eurekas!' Don't Just Happen"

Roe, "So You Want to Be a Data Scientist?"

Koploy, "Three Career Secrets for Aspiring Data Scientists"

Koploy, "Advice for the Aspiring Data Scientist"

Roy, "What to Look for When Hiring a BI Specialist"

"On Becoming a Data Scientist", "Part 1—The Destination", "Part 2—The Technical Core, for Free", "Part 3—The Softer Side", and "Part 4—Managing". (Part 2, incidentally, has a long list of resources to help in learning R.)

A really great article offering practical advice on making a career change to data science:

Jain, "Planning a Late Career Shift to Analytics/Big Data? Better Be Prepared!"

(I can confirm the wisdom of a lot of Jain's points from my own experience.)

And last but not least, some advice (not specific to data scientists) for moving from academia to industry:

Khalilov, "A Guideline to Move from Academia to Industry—Part 1" and "A Guideline to Move from Academia to Industry—Part 2"

Wood, "The Ph.D.'s Guide to a Nonfaculty Job Search"

As the articles linked above make obvious, there's considerable disagreement over what a data scientist is and where they come from. Everyone agrees that data scientists attempt to extract useful information from big data (another term open to interpretation, by the way). However, some writers focus on how data scientists approach problems; they see data scientists as researchers who bring a particular way of seeking knowledge (the scientific method, with its rigorous approach to testing hypotheses through statistical analysis of experimental and quasi-experimental designs) to a new domain. Other writers focus on the computer skills required to use the tools that data scientists use; they see the data scientist as an improved version of the traditional data analyst, adding new tools such as MapReduce and Hadoop to the traditional skill set of SQL, Java, C++, and the like. There's also a third school of thought, which I've seen in job ads, that focuses on experience in marketing research; individuals with this experience often possess business degrees, though they typically have picked up some statistical and database skills as well.

Real data scientists fit all of these molds. More to the point, for an aspiring data scientist, real employers may subscribe to one vision or another, and post job ads that reflect that point of view. If you've been trained as a social scientist, odds are you don't have the 5+ years of SQL or C++ experience that some employers seek. On the other hand, even if we're talking about technical skills, someone trained as a database administrator probably doesn't know a whole lot about advanced econometrics or sampling theory. Data scientists from all three sorts of background bring useful skills to the table, and so, in the grand scheme of things, this isn't a matter of who's right and who's wrong. However, you need to find the employers and job postings that match the background you have, or at least, to make the case to an employer that you have the skills and aptitude to do the job.

Job Postings

Yeah, I know, this is the section you really wanted to see. Logically, jobs are something you should look at after preparing yourself with all the other resources linked below, but I'll concede to reality here. :)

For tech jobs in general, there's Dice.

More specialized sites:

DataJobs

AnalyticTalent

KDnuggets Jobs

Of course, you'll find more job postings on, say, Indeed, which aggregates ads from many different sites, including Dice (I'm not sure if it includes DataJobs, AnalyticTalent, or KDnuggets as well). The advantage of the more specialized job sites is that you can use more inclusive search terms without being overwhelmed by the number of hits. That being said, the search terms I've found most useful on Indeed are "'social science' research" (note the quotes), "statistician", and, obviously, "'data scientist'" (note the quotes again); "'data analyst'" tends to produce mostly jobs for traditional data analysts, specializing in computer skills.

For a different, and quite interesting, approach, try Hired, a headhunting firm that specializes in tech jobs, including data scientists; have a look at this article in Forbes for a full description. The catch is that Hired picks fewer than 10% of its applicants as candidates for its employer clients—but it promises 5 to 15 offers within a week for those who make the cut.

Professional Associations and Social Network Groups

The recently formed Data Science Association is the first professional association specifically devoted to the field. Yearly membership is free "for a limited time". The association holds events from time to time, but the website doesn't list any for 2017. The site features a decent online library, a weekly list of data science news stories, and, of considerable importance for the professionalization of the field, a code of conduct.

Three active LinkedIn groups deal with data science and big data:

Big Data / Analytics / Strategy / FP&A / S&OP / Strategic Planning / Predictive & Business Analytics

Data Mining, Statistics, Big Data, and Data Visualization

Research Methods and Data Science (RMDS)

The first group was established by IE. Analytics, and is the biggest of the three. The second group, while smaller, seems to have more members from outside the U.S. There's naturally some overlap between the two in content and membership. The third is the smallest of the three, and seems to lean a little towards academic topics of discussion.

Group members do a great job of posting links to the latest articles in the field, and of course, the groups are wonderful for social networking. The unstructured nature of LinkedIn group discussions (think Twitter without hashtags), though, can make it hard to look for information on specific topics, and links to popular articles are often posted several times by different group members, and in all three groups. I also find the job listings less than useful: those in the first group are not necessarily focused on data science, while those in the second, though more relevant, are also pretty small in number, and the third group has only a handful.

Also, the "Big Data" group is moderated, and seems to suffer a bit from strange moderation decisions that sometimes see informational posts placed in the little-viewed "Promotions" section rather than "Discussions" (a fate not shared by IE. Analytics' own promotions). Moreover, the presence in the group of several female employees of IE. Analytics who tend to "Like" practically every post seems reminiscent of the pharmaceutical companies' former practice of hiring ex-cheerleaders to sell drugs to (mostly male) doctors.

The Data Science community on Google+ also covers the topic, though it has only a handful of members.

Blogs

This list is very much incomplete, and I would certainly appreciate suggestions of additional blogs I can add.

We'll start with a site that aggregates multiple big data blogs, planet Big Data.

Ryan Swanstrom's Data Science 101, much like this blog, seeks to help those who want to become data scientists.

Gil Press's What's the Big Data? offers a wealth of current information on the field, with sections devoted to events, startups, interviews, and courses and graduate programs, in addition to the blog itself.

Academics may take particular interest in Zero Intelligence Agents, the blog of Drew Conway, co-author of Machine Learning for Hackers (see "Free and Cheap Learning Resources", formerly "Self-teaching Resources", below). This blog hasn't seen a new post since 2014, but contains good examples of the application of data science methods to practical problems, and, especially, creative visualizations of the results.

The eponymous blog of Conway's co-author, John Myles White, also makes an interesting read.

The MIKE 2.0 blogs explore a wide variety of topics. I find Phil Simon's posts particularly interesting.

Carl Anderson's blog p-value.info features "[m]usings on data science, machine learning, and statistics". Anderson addresses these topics from a practical perspective, sometimes even including code in his posts.

The anonymous BInalytics blog focuses mainly on technical subjects, but also features some big-picture posts.

Noam Ross's eponymous blog offers quite a bit of advice on using R, and also some examples of the author's own research.

Jenna Dutcher writes the datascience@berkeley Blog for Berkeley's Data Science program. The blog features short commentaries on interesting articles, books, and videos in the data science field, and links to the original works.

Jeff Leek and Roger Peng, two of the three Johns Hopkins biostatistics professors who teach Coursera's new Data Science "specialization", collaborate with Rafa Irizarry on the Simply Statistics blog. The authors have promised to feature top students from the specialization in the blog.

Tommy Jones' Biased Estimates covers a variety of data science topics, including good coverage of the goings-on of Data Community DC, which unites data scientists and their ilk in the Washington, DC area.

Free and Cheap Learning Resources (aka, Self-teaching Resources)

Odds are, if you're a social scientist working to become a data scientist, you're going to have to teach yourself quite a bit. Of course, if you've got a PhD, you're pretty smart to begin with, and the fact that you want to become a data scientist suggests that you're pretty technically savvy; the upshot is that you could learn all of the things you need to know pretty quickly on the job. Unfortunately, most employers don't think like that, and write job ads as if they think potential employees are incapable of learning once hired. This is not entirely stupid, as an employer can be sure that someone who already knows a given skill will be able to use it, without having to worry about how well that employee can learn new ones, but it does unnecessarily filter out a lot of people who might be very useful—and given the speed at which data science and its associated computer applications are evolving, anyone who can't learn new skills quickly is not going to be very useful anyway.

In any event, the reality is that the more you know going in, the more employable you'll be. You might not have years of experience, but you can easily teach yourself enough to pass certification exams, and by teaching yourself you can learn more cheaply and, usually, more quickly than if you took a formal course. Below, I've assembled a list of free or cheap learning resources (if you do take certification exams, they should be your main expense), and I hope to add to this list as time goes on.

The best place to start, in my opinion, is Stanford University's Databases (formerly "Introduction to Databases"), taught by Jennifer Widom and offered on the OpenEdX platform. This was originally a "massive open online course" (MOOC)—that is, thousands of people took it together, during a defined time period—but Stanford now offers the course material as a series of 14 self-paced mini-courses.

I was initially skeptical about this course, figuring that any broad survey of the field would touch on each topic too lightly to be of any practical use. It turns out that I was wrong: yes, the course is an introductory one, and it covers a lot of topics, but on many of those topics, it goes into greater depth than specialized tutorials found elsewhere online, and it's particularly strong on SQL and database theory. The course provides exercises that pose real challenges, and presents them via an interactive platform that helps students to correct and learn from mistakes. Widom touts the course as being suitable for "a la carte" learning (hence the 14 mini-courses), but a novice will find all of the topics useful. You can see my full review of both this course and the Big Data University SQL course mentioned below here.

For me, one of the most useful resources for learning data science topics has been Coursera, which offers free MOOCs from well-known universities. Yes, there are plenty of tutorials and self-paced courses out there, but I find the deadlines provided by a MOOC to be a useful way to keep myself on track, even if there are no real consequences to missing them. Most of these courses are not for college credit, you may have to wait a while until the course you need starts, and many of the courses are surveys (including two courses on big data, Introduction to Data Science and Web Intelligence and Big Data), rather than focusing on more specific, practically useful topics, but Coursera promises a degree of academic rigor not found in the average online tutorial, as well as extra features such as machine- and peer-grading. The fact that co-founders Andrew Ng and Daphne Koller research machine learning bodes well for offerings related to data science.

The anonymous author of the BInalytics blog has identified a number of Coursera courses on data science topics that teach their students to program in R as part of the curriculum: Statistics One (probably not of much use to a social scientist with a quantitative background), Data Analysis, Computing for Data Analysis, Social Network Analysis, Mathematical Biostatistics Boot Camp, and Introduction to Computational Finance and Financial Econometrics. For other languages, he recommends Computational Methods for Data Analysis (MATLAB), Probabilistic Graphical Models (Octave/MATLAB), Passion Driven Statistics (SAS), and Computational Investing (Python); I should add that Andrew Ng's own Machine Learning course also uses Octave. The opportunity to study a substantive data science topic while learning a programming language at the same time is a good two-for-one deal.

Coursera also offers (non-programming) courses on business, which is an area that those interested in becoming data scientists should not neglect. BInalytics mentions Financial Engineering and Risk Management as being potentially useful to data scientists, and I took the almost entirely non-technical, but quite interesting, Foundations of Business Strategy.

Recently, Coursera has moved toward paying for extra features or, in a few cases, toward courses that cannot be taken for free. The capstone course of Coursera's new Data Science specialization falls into this category, but the other courses in the specialization can still be taken for free. For more details, see below under "Paid Courses, Certificates, and Degrees". In addition, check out the specialization's GitHub repository to find copies of the courses' lecture notes, which make up an excellent reference source for anyone using R to solve data science problems.

In the same vein as Coursera are Udacity and edX, the latter of which, like Coursera, offers courses in partnership with prestigious universities. At the moment, both services have more limited offerings than Coursera, but their catalogs are growing quickly, and Udacity offers a short course on Hadoop, a topic not covered by Coursera. In the beginning, their programming courses tended either to be very basic or to survey their topics at a general, conceptual level, but these offerings will be useful to those without a lot of programming experience, and their newer offerings include more advanced courses (Udacity also has one business course). Udacity's courses, it should be noted, are not strictly speaking MOOCs: like an online tutorial, you work at your own pace. However, for a fee, Udacity offers some of the interactivity that comes with a MOOC; see below, under "Paid Courses, Certificates, and Degrees".

Stanford University has adopted a policy of offering online classes through a variety of different outlets, including Coursera. Stanford's in-house efforts began with Class2Go, which offered three courses, including Widom's Introduction to Databases, which I described above. The University's latest effort uses the open-source OpenEdX platform; a growing list of courses began with an iteration of Databases, as well as a new offering called Statistical Learning. The latter course, taught by Trevor Hastie and Rob Tibshirani, seems to cover much the same ground as Ng's Machine Learning, but using R rather than MATLAB/Octave. Hastie and Tibshirani (with Gareth James and Daniela Witten) co-authored a book on the same material, An Introduction to Statistical Learning, with Applications in R, and in conjunction with the course, the book's publisher made a PDF version available for free, to members of the public as well as students in the course.

Stanford also offers free courses through iTunes U, including a version of Ng's Machine Learning.

Finally, both KDnuggets and Udemy list a few free online courses among many more paid ones.

Big Data University offers a number of useful courses, including free courses on SQL, Java, Pig, Hive, and Hadoop (Remember, it's fun to say "Hadoop"!); most of the courses are free, but very introductory. These are not MOOCs, but rather self-directed, at-your-own-pace tutorials. Big Data University's first SQL course is adequate for introducing the fundamentals of the query language, and also covers a lot of database theory, but the Stanford course is better on both scores, and the Big Data course has nothing to compare to the interactive exercises in the Stanford offering. One merit of the Big Data University course is that it gives the student practical familiarity with setting up and using a common SQL package, IBM's DB2 Express-C, which can be downloaded for free. Another merit of the SQL course is that it's offered in Polish, Portuguese, Russian, and Spanish, as well as English; a second course is offered in English, Portuguese, and Spanish.

MySQL Tutorial, as one might expect from the name, features many useful tutorials on Oracle's popular open-source (and free) database program. Don't be put off by the writer's questionable grammar (he's obviously not a native speaker) or sometimes odd organization.

Download MySQL and the MySQL Reference Manual at the MySQL Developer Zone.

Dubois, Hinz, & Pedersen's MySQL 5.0 Certification Study Guide (available from both Amazon and Barnes & Noble—there are also Kindle and Nook editions) comes highly recommended, once you've learned MySQL and are ready to take the Developer exams; its biggest selling point is that it was written by the authors of the exams. (In the interest of full disclosure: yes, those are affiliate links.)

Oracle's New to Java Programming Center provides a good start with that language, including tutorials.

Code Academy offers a number of short, interactive tutorials that cover each of several programming languages, including Python.

Kevin Sheppard's Introduction to Python for Econometrics, Statistics and Data Analysis, a free ebook, provides a guide to the popular scripting language that's especially relevant for our purposes. Note that there's an earlier, incomplete version of the book on the web; make sure to use this link for the most recent version.

For learning R, Data Camp offers a small but growing collection of interactive tutorials, two of which constitute the programming exercises of Coursera courses: Eric Zivot's Introduction to Computational Finance and Financial Econometrics and Mine Çetinkaya-Rundel's Data Analysis and Statistical Inference. (The statistics taught in the latter course are probably a little basic for anyone coming from a social science background—the most advanced topic is multiple regression.) Taking the Coursera courses signs you up for their Data Camp components, but you can also take the Data Camp courses by themselves, whether or not the Coursera courses are being offered at the time. Data Camp also plans offerings with Revolution Analytics and RStudio in the near future.

The BInalytics blog recommends Jones, Maillardet, and Robinson's slightly pricey textbook Introduction to Scientific Programming and Simulation Using R, available from Amazon (and also in a Kindle version) and Barnes & Noble (there's no Nook version available).

Conway and White's Machine Learning for Hackers insists it's not a guide for learning R, but you wouldn't be the first person to use it as a way to learn R while also studying machine learning; it has an accompanying website with code samples and other goodies. You can of course get it at Amazon or Barnes & Noble; both Kindle and Nook editions are available, and O'Reilly also offers an upgrade option to receive updates and non-DRM copies (for the Nook edition, at least, which is the one I own).

In a similar vein is Torgo's Data Mining with R: Learning with Case Studies, which the author of the BInalytics blog recommends highly for its case approach, though he does find it less challenging than Conway and White's book. It too is available at Amazon (and also in a Kindle version) and Barnes & Noble (again, there's no Nook version).

A free alternative for studying both machine learning and R is James, Witten, Hastie, and Tibshirani's An Introduction to Statistical Learning, with Applications in R.

The R Project for Statistical Computing is a source for all things R.

RSeek can help you find additional information on a language whose name gives search engines fits.

Code School offers a very basic but very accessible online course called Try R, complete with a pirate theme and badges for completing each chapter. Completing the course takes only a couple of hours, at most, and on completion you'll be offered discounts on O'Reilly ebooks (50%) and print books (40%). A course on Ruby and a zombie-themed course on Ruby on Rails might also be relevant to a data scientist. All three of these are free, but Code School also offers paid courses.

You might also check out two posts by Noam Ross, one an account of a talk on debugging tools in R, and the other a very practical set of recommendations for speeding up R code.

For using R with big data, BInalytics recommends a series of tutorials posted by Jeffrey Breen on his Things I Tend to Forget blog. These tutorials make use of the RHadoop packages published by Revolution Analytics.

For those interested in more advanced topics, a list of free e-books can be found in Carl Anderson's blog p-value.info.

In a similar vein, Ryan Swanstrom has posted a list of free data science e-journals on his Data Science 101 blog.

Machine Learning Surveys bills itself as a "list of literature surveys, reviews, and tutorials on Machine Learning and related topics". As of January 2013, it listed 123 resources.

Paid Courses, Certificates, and Degrees

Mind you, there's also something to be said for taking formal classes: you've got something to put on your resume, you don't have to worry as much about motivating yourself, and you might even be able to get a recommendation from a teacher impressed by your aptitude. On top of all that, many courses include certification exams for "free". If you can afford the time and money for classes, they might well be a good option.

KDnuggets also lists a wide range of short training courses and university programs, as well as featuring sections on software, news, conferences, and even publicly available datasets, among a number of other things.

Udemy offers many paid but inexpensive (and occasionally free) courses on subjects relevant to data science. The site is notable for offering several, competing courses on many topics, with user reviews to help you make choices. One of my readers has recommended this $59 course on Java, though Udemy has a number of other offerings on that programming language.

Both Coursera and Udacity have moved recently toward pay models. Reports of cheating have darkened Coursera's reputation somewhat. Its "Signature Track" purportedly seeks to address this issue by requiring stringent identification procedures (including typing style detection) and a fee of $30-$100 in exchange for a "verified" certificate that can be shared with a college or employer (this is currently available only for a few courses, though you can still take those courses in the normal, free way).

Given the doubtful efficacy of these identification measures (they wouldn't, for example, prevent a student from uploading an assignment file created by someone else), whether employers will place any stock in these verifiable certificates (as opposed to a simple line on a resume listing the course) remains an open question; their real value at present probably lies in a concurrent effort that has obtained American Council on Education (ACE) College Credit Recommendation Service (CREDIT; no, I don't know how that acronym works) recommendation for some of the introductory-level courses, a step that will allow students to gain college credit, for an additional fee assessed for an online Credit Exam. However, most readers of this blog are probably not undergraduate students taking introductory courses, and it's doubtful that credit will be offered by universities for the more advanced courses of interest to an aspiring data scientist (indeed, there are several courses offered with a Signature Track that aren't among those recommended by ACE CREDIT).

More recently, Coursera has begun to offer paid tutoring through Google's Helpouts (I'm not clear on whether Coursera is getting any revenue from this).

In addition, the company has begun to offer "specializations", each of which features a series of short courses, followed by a capstone project of some sort. If you take all of the courses on the Signature Track, and then finish the capstone project, you'll receive a certificate for the Specialization. For example, the new Data Science specialization, offered in conjunction with Johns Hopkins, includes nine short courses and a capstone, each for $49, for a total of $490, including the opportunity to retake failed courses for up to two years; the first iteration will run from Apr. 7 to roughly the end of July. All of the courses are also available for free, but you can't enroll for the capstone project unless you pay for the entire specialization.
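For anyone who wants to check the arithmetic, or rerun it if the pricing changes, here's a minimal sketch in Python. The numbers are the ones quoted above; the function and variable names are my own, purely for illustration.

```python
# Figures quoted in the paragraph above; the names are illustrative only.
PRICE_PER_UNIT = 49   # dollars per short course or capstone
SHORT_COURSES = 9
CAPSTONES = 1


def specialization_cost(courses=SHORT_COURSES, capstones=CAPSTONES,
                        price=PRICE_PER_UNIT):
    """Total cost in dollars of paying for every course plus the capstone."""
    return (courses + capstones) * price


total = specialization_cost()
print(f"Total for the paid track: ${total}")
```

Ten paid units at $49 apiece gives the $490 total.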

The Data Science specialization, which uses R and Git throughout, looks to be fairly useful, but much of the specialization concentrates on statistics and good scientific practice, which, while quite valuable, are already familiar to most social scientists. It also has little coverage of databases, other than how to read data from them using R. On the other hand, the Getting and Cleaning Data course offers to impart some very practical skills, such as using APIs and web scraping to extract data from the web; it's also the course that covers extracting data from databases.

Udacity offers its at-your-own-pace courses for free, but for a third of its courses, including the ones most useful for aspiring data scientists, offers extra services (called "Full Courses"), such as coaching and product feedback, for a "subscription" fee, typically on the order of $150 per month per course (meaning that the faster you finish the course, the less you pay).

Similar to Coursera's specializations, Udacity has recently announced four upcoming "nanodegrees". The Data Analytics nanodegree, which is "[p]roduced in collaboration with" AT&T, Cloudera, Facebook, and MongoDB, is likely aimed at the same audience as Coursera's Data Science specialization. Nanodegrees will take 6-12 months to complete, at a cost of $200 a month.

Even more intriguing is Coursera's Master of Science in Computer Science, offered in conjunction with Georgia Tech and AT&T. The selection of courses on offer is more limited than that found on campus, but the cost, at under $7,000, is about a third of what students would pay for the on-campus version. Students must gain admittance to Tech (coincidentally, my own alma mater), and the degree program will soon be taking applications for Fall 2015. I'm not clear on why applications have to be taken nearly a year in advance, and I've certainly seen other online programs that don't require that much lead time for applicants.

The Georgia R School offers 10 courses on R for $95 each, with monthly and yearly memberships available. There's a discount for students, and the school offers a 14-day free trial.

Cloudera offers not only online courses, but certifications as well, for its Apache Hadoop platform.

An increasing number of universities offer master's degrees or graduate certificates in data science, data mining, or business analytics. These programs offer the allure of a solid credential, but, as with business degrees, costs tend to be high, and financial aid (other than student loans) scarce.

Doug Henschen of Information Week has catalogued what he regards as the top 20 master's degrees in the field, with a mention of 10 additional programs, and the promise of more to come. Many of the universities offer part-time and/or online curricula, and many offer (shorter, cheaper) certificates as well as degrees.

Probably the most complete list (with over 200 programs) is Ryan Swanstrom's Data Science Colleges, spun out of his Data Science 101 blog. Other lists can be found in Gil Press's What's the Big Data? blog and on the homepage of North Carolina State University's Institute for Advanced Analytics.

You might also consider applying for the Insight Data Science Fellows Program, especially if you're a PhD candidate or new graduate who wants to work in the Bay Area of California. This is a six-week program that's project-based (rather than classroom-based) and includes mentoring and interviews with top Silicon Valley companies.

Data Sources

Do you find it hard to motivate yourself to practice your skills on canned exercises? If so, download some real-world data and get your hands dirty.

One of the easiest ways to access large amounts of data is Quandl, which has collected over 10 million economic and social datasets from over 500 sources from around the world. You can browse data in Quandl's web interface, and download it either in standard formats or through an API supported by libraries and plugins for a wide variety of languages and applications. Data Camp (mentioned above, under "Free and Cheap Learning Resources") offers a course called "How to Work with Quandl in R".

Below, you'll find a list of data sources that I assembled some time ago. Since then, I've found a list that puts mine (and probably anyone else's) to shame: Jeffrey Leek provides it in the last lecture of his Getting and Cleaning Data course on Coursera, part of the Data Science specialization. You can find his long list of links in the notes for that lecture, which is titled "Data Resources". The notes, incidentally, are an HTML5 document; you'll need to use the "Page Down" button to scroll through them. You can access these without signing up on Coursera or joining the class, though you would have to do so to watch the lecture itself.

One problem with finding data is that many websites that provide public access to databases do so only through web interfaces that, while often sporting impressive visualization tools, don't allow for serious statistical analysis (probably by design—organizations often have no desire to give up proprietary data for free). If you really want to analyze data, you need to be able to download an entire dataset (or a subset of it). All of the sites below allow free downloads of databases in one form or another. Incidentally, I do need to note that I shamelessly took the first three of these sites from Gil Press's October, 2012 article on Foreign Policy's website, "10 Big Data Sites to Watch".

The U.S. government's Data.gov offers approximately three zillion datasets for your analytical pleasure (well, actually, close to 400,000, but that's still more than you can examine in your lifetime).

DataMarket offers an intriguing variety of both government and industry data. As far as I can tell, datasets can be downloaded only through an API (at least, for free users), rather than in ASCII or Excel form, but data from different sets can be combined, and, quite usefully, DataMarket provides links to the providers of the data, making alternate methods of download possible in many cases.

As mentioned above, KDnuggets lists a number of sites with publicly available datasets. These include images, blog posts, and even songs.

The U.S. Census Bureau provides a variety of demographic and economic data. The website's organization leaves something to be desired, but look here, here, and here for downloadable datasets.

The UCI Machine Learning Repository, maintained by UC Irvine's Center for Machine Learning and Intelligent Systems, houses 235 datasets as of January 2013.

If you're interested in politics, try the data from the American National Election Studies, widely considered the most important series of surveys of U.S. voters.

If health is more your cup of tea, the National Center for Health Statistics, part of the Centers for Disease Control and Prevention, provides an interesting range of data on its FTP server.
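Once you've downloaded a dataset from one of these sites, a first look at it takes only a few lines of code. What follows is a minimal sketch in Python rather than a recommended workflow: it assumes the file is a plain CSV with a header row, and the function name summarize_column, the column names, and the sample values are all made up for illustration.

```python
import csv
import io
import statistics


def summarize_column(csv_file, column):
    """Return (count, mean, median) for one numeric column of an open CSV file."""
    values = []
    for row in csv.DictReader(csv_file):
        cell = (row.get(column) or "").strip()
        if not cell:
            continue  # skip empty cells
        try:
            values.append(float(cell))
        except ValueError:
            continue  # skip non-numeric markers such as "NA"
    if not values:
        raise ValueError(f"no numeric values found in column {column!r}")
    return len(values), statistics.mean(values), statistics.median(values)


# Made-up sample standing in for a downloaded file (not real data).
SAMPLE = """state,year,rate
A,2010,4.0
B,2010,6.0
C,2010,NA
D,2010,8.0
"""

count, mean, median = summarize_column(io.StringIO(SAMPLE), "rate")
print(count, mean, median)
```

With a real download, you would pass an open file instead of the sample string, for example by calling summarize_column inside a with open("your_file.csv", newline="") as f: block, where the file name is whatever you saved the dataset as.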

Contests

Many devotees of data science swear by public competitions as a means to hone their craft, as well as to gain public notice that might lead to employment. This is obviously not really an option for a beginner, but the thrill of competition can certainly provide good motivation for learning.

Kaggle, of course, is the best-known host of data competitions. At any given time, the company generally hosts a handful of "real" competitions (that is, those that give money prizes and count for Kaggle ranking points and tiers), plus a larger number of "Knowledge" competitions intended for practice. The Knowledge competitions may award only bragging rights, but they're accessible to individuals or small groups with little experience in data science.

Innocentive hosts challenges from a broad array of disciplines. Many of these problems are tractable to data science.

TunedIT hosts competitions (it listed 32 as of January 2013) as a means of promoting its data mining platform.

21 comments:

Anonymous, December 7, 2012 at 12:08 PM
This blog is a really helpful resource for people who wish to become data scientists. Good pieces of advice. Keep writing!
Anonymous, December 18, 2012 at 1:29 AM
Someone needing some data sets to experiment with could check out the USA CDC (Centers for Disease Control and Prevention) link below. You will have to first a

CDC/National Center for Health Statistics
-----------------------------------------
The data files can be found here:

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/

The terms (i.e. the "Data Access - FTP - Data Users Agreement") and conditions under which the data sets are being made freely available to you are found here:

http://bit.ly/xecvTS

Anonymous, December 18, 2012 at 3:26 PM
Code School - Try R ( http://bit.ly/UA7DbU )

"R is a tool for statistics and data modeling. The R programming language is elegant, versatile, and has a highly expressive syntax designed around working with data. R is more than that, though — it also includes extremely powerful graphics capabilities. If you want to easily manipulate your data and present it in compelling ways, R is the tool for you."

Link: http://bit.ly/UA7DbU, http://tryr.codeschool.com/

JohnatLavastorm, April 5, 2013 at 5:07 PM
Hi Scott,

I came across your blog page here. Great list of resources!

I work at Lavastorm Analytics and big data analytics is what we do. We are always looking to hire. Would you mind posting our careers page as a resource for this page - http://www.lavastorm.com/company/careers/

Thanks,

John
Jenna Dutcher, September 22, 2013 at 11:15 PM
Hi, Scott,

Love the list of blogs you've got going here! Data Science 101 is one of my personal favorites. I work with the datascience@berkeley Master of Data Science degree program, and we have a blog that features interviews from thought leaders in the field as well as some background on different projects and areas that might be of interest to your readers. Would you mind adding the link to your list, in case your visitors want to check it out? The blog can be found at: http://datascience.berkeley.edu/blog

Keep up the great work!

Thanks,
Jenna
MTKnife, August 23, 2014 at 7:58 PM
I will be sure to add that. Thanks!
MTKnife, September 13, 2014 at 3:57 PM
Sorry for the late reply, but I don't update this page as regularly as I should. While a computer science background can indeed be useful for data scientists, I'm not sure I see why someone entering the data science field would pick this degree, rather than one that's focused on data science topics. In checking a couple of the lists linked from this page, however, I noticed that Syracuse does offer a certificate program in data science.
Arjun, April 24, 2015 at 3:43 AM
Hi. I work at CrowdAnalytix and we host data science contests too. Would you mind adding our site and information?
ajit roy, March 9, 2016 at 12:55 AM
Informative indeed
Unknown, November 28, 2016 at 10:58 AM
Another addition could be SIRE Life Sciences, https://sire-search.com. Job postings & recruitment.
One Federal Solution, February 7, 2020 at 3:21 AM
Thanks for the useful links. Such a great blog !
data science, March 6, 2022 at 10:04 PM
All things considered I read it yesterday yet I had a few musings about it and today I needed to peruse it again in light of the fact that it is very elegantly composed.
