Useful Links

Below, I've included links to a number of useful resources for social scientists who want to become data scientists; to make the list even more useful, I've added a little bit of information about each link. If you have any suggestions for additional resources I can list, please let me know! I'd also like to be alerted to broken links, of course.

Note that you can use the links in the sidebar to jump to a specific section of this page from here or anywhere else in the blog.


Defining the Data Scientist—and Why Anyone Would Hire You as One

The articles below move from defining data science, to describing what data scientists do, to giving practical advice on becoming a data scientist:

"What Is Data Science? What Is Analytics? What Is a Data Scientist?"

Davenport and Patil, "Data Scientist: The Sexiest Job of the 21st Century (Executive Summary)"

Press, "Big Data News of the Week: Sexy and Social Data Scientists"

Miller, "Data as a Social Science"

Dyche, "Big Data 'Eurekas!' Don't Just Happen"

Roe, "So You Want to Be a Data Scientist?"

Koploy, "Three Career Secrets for Aspiring Data Scientists"

Koploy, "Advice for the Aspiring Data Scientist"

Roy, "What to Look for When Hiring a BI Specialist"

"On Becoming a Data Scientist", "Part 1—The Destination", "Part 2—The Technical Core, for Free", "Part 3—The Softer Side", and "Part 4—Managing". (Part 2, incidentally, has a long list of resources to help in learning R.)

A really great article offering practical advice on making a career change to data science:

Jain, "Planning a Late Career Shift to Analytics/Big Data? Better Be Prepared!"

(I can confirm the wisdom of a lot of Jain's points from my own experience.)

And last but not least, some advice (not specific to data scientists) for moving from academia to industry:

Khalilov, "A Guideline to Move from Academia to Industry—Part 1" and "A Guideline to Move from Academia to Industry—Part 2"

Wood, "The Ph.D.'s Guide to a Nonfaculty Job Search"

As the articles linked above make obvious, there's considerable disagreement over what a data scientist is and where they come from. Everyone agrees that data scientists attempt to extract useful information from big data (another term open to interpretation, by the way). However, some writers focus on how data scientists approach problems; they see data scientists as researchers who bring a particular way of seeking knowledge (the scientific method, with its rigorous approach to testing hypotheses through statistical analysis of experimental and quasi-experimental designs) to a new domain. Other writers focus on the computer skills required to use the tools that data scientists use; they see the data scientist as an improved version of the traditional data analyst, adding new tools such as MapReduce and Hadoop to the traditional skill set of SQL, Java, C++, and the like. There's also a third school of thought, which I've seen in job ads, that focuses on experience in marketing research; individuals with this experience often possess business degrees, though they typically have picked up some statistical and database skills as well.

Real data scientists fit all of these molds. More to the point, for an aspiring data scientist, real employers may subscribe to one vision or another, and post job ads that reflect that point of view. If you've been trained as a social scientist, odds are you don't have the 5+ years of SQL or C++ experience that some employers seek. On the other hand, even if we're talking about technical skills, someone trained as a database administrator probably doesn't know a whole lot about advanced econometrics or sampling theory. Data scientists from all three sorts of background bring useful skills to the table, and so, in the grand scheme of things, this isn't a matter of who's right and who's wrong. However, you need to find the employers and job postings that match the background you have, or at least, to make the case to an employer that you have the skills and aptitude to do the job.


Job Postings

Yeah, I know, this is the section you really wanted to see. Logically, jobs are something you should look at after preparing yourself with all the other resources linked below, but I'll concede to reality here. :)

For tech jobs in general, there's Dice.

More specialized sites:

DataJobs

AnalyticTalent

KDnuggets Jobs

Of course, you'll find more job postings on, say, Indeed, which aggregates ads from many different sites, including Dice (I'm not sure if it includes DataJobs, AnalyticTalent, or KDnuggets as well). The advantage of the more specialized job sites is that you can use more inclusive search terms without being overwhelmed by the number of hits. That being said, the search terms I've found most useful on Indeed are "'social science' research" (note the quotes), "statistician", and, obviously, "'data scientist'" (note the quotes again); "'data analyst'" tends to produce mostly jobs for traditional data analysts, specializing in computer skills.

For a different, and quite interesting, approach, try Hired, a headhunting firm that specializes in tech jobs, including data scientists; have a look at this article in Forbes for a full description. The catch is that Hired picks fewer than 10% of its applicants as candidates for its employer clients—but it promises 5 to 15 offers within a week for those who make the cut.


Professional Associations and Social Network Groups

The recently formed Data Science Association is the first professional association specifically devoted to the field. Yearly membership is free "for a limited time". The association holds events from time to time, but the website doesn't list any for 2017. The site features a decent online library, a weekly list of data science news stories, and, of considerable importance for the professionalization of the field, a code of conduct.

Three active LinkedIn groups deal with data science and big data:

Big Data / Analytics / Strategy / FP&A / S&OP / Strategic Planning / Predictive & Business Analytics

Data Mining, Statistics, Big Data, and Data Visualization

Research Methods and Data Science (RMDS)

The first group was established by IE. Analytics, and is the biggest of the three. The second group, while smaller, seems to have more members from outside the U.S. There's naturally some overlap between the two in content and membership. The third is the smallest of the three, and seems to lean a little towards academic topics of discussion.

Group members do a great job of posting links to the latest articles in the field, and of course, the groups are wonderful for social networking. The unstructured nature of LinkedIn group discussions, though (think Twitter without hashtags) can make it hard to look for information on specific topics, and links to popular articles are often posted several times by different group members, and in all three groups. I also find the job listings less than useful: those in the first group are not necessarily focused on data science, while those in the second, though more relevant, are also pretty small in number, and the third group has only a handful.

Also, the "Big Data" group is moderated, and seems to suffer a bit from strange moderation decisions that sometimes see informational posts placed in the little-viewed "Promotions" section rather than "Discussions" (a fate not shared by IE. Analytics' own promotions). Moreover, the presence in the group of several female employees of IE. Analytics who tend to "Like" practically every post seems reminiscent of the pharmaceutical companies' former practice of hiring ex-cheerleaders to sell drugs to (mostly male) doctors.

The Data Science community on Google+ also covers the topic, though it has only a handful of members.


Blogs

This list is very much incomplete, and I would certainly appreciate suggestions of additional blogs I can add.

We'll start with a site that aggregates multiple big data blogs, planet Big Data.

Ryan Swanstrom's Data Science 101, much like this blog, seeks to help those who want to become data scientists.

Gil Press's What's the Big Data? offers a wealth of current information on the field, with sections devoted to events, startups, interviews, and courses and graduate programs, in addition to the blog itself.

Academics may take particular interest in Zero Intelligence Agents, the blog of Drew Conway, co-author of Machine Learning for Hackers (see "Free and Cheap Learning Resources", formerly "Self-teaching Resources", below). This blog hasn't seen a new post since 2014, but contains good examples of the application of data science methods to practical problems, and, especially, creative visualizations of the results.

The eponymous blog of Conway's co-author, John Myles White, also makes an interesting read.

The MIKE 2.0 blogs explore a wide variety of topics. I find Phil Simon's posts particularly interesting.

Carl Anderson's blog p-value.info features "[m]usings on data science, machine learning, and statistics". Anderson addresses these topics from a practical perspective, sometimes even including code in his posts.

The anonymous BInalytics blog focuses mainly on technical subjects, but also features some big-picture posts.

Noam Ross's eponymous blog offers quite a bit of advice on using R, and also some examples of the author's own research.

Jenna Dutcher writes the datascience@berkeley Blog for Berkeley's Data Science program. The blog features short commentaries on interesting articles, books, and videos in the data science field, and links to the original works.

Jeff Leek and Roger Peng, two of the three Johns Hopkins biostatistics professors who teach Coursera's new Data Science "specialization", collaborate with Rafa Irizarry on the Simply Statistics blog. The authors have promised to feature top students from the specialization in the blog.

Tommy Jones' Biased Estimates covers a variety of data science topics, including good coverage of the goings-on of Data Community DC, which unites data scientists and their ilk in the Washington, DC area.


Free and Cheap Learning Resources (aka, Self-teaching Resources)

Odds are, if you're a social scientist working to become a data scientist, you're going to have to teach yourself quite a bit. Of course, if you've got a PhD, you're pretty smart to begin with, and the fact that you want to become a data scientist suggests that you're pretty technically savvy; the upshot is that you could learn all of the things you need to know pretty quickly on the job. Unfortunately, most employers don't think like that, and write job ads as if they think potential employees are incapable of learning once hired. This is not entirely stupid, as an employer can be sure that someone who already knows a given skill will be able to use it, without having to worry about how well that employee can learn new ones, but it does unnecessarily filter out a lot of people who might be very useful—and given the speed at which data science and its associated computer applications are evolving, anyone who can't learn new skills quickly is not going to be very useful anyway.

In any event, the reality is that the more you know going in, the more employable you'll be. You might not have years of experience, but you can easily teach yourself enough to pass certification exams, and by teaching yourself you can learn more cheaply and, usually, more quickly than if you took a formal course. Below, I've assembled a list of free or cheap learning resources (if you do take certification exams, they should be your main expense), and I hope to add to this list as time goes on.

The best place to start, in my opinion, is Stanford University's Databases (formerly "Introduction to Databases"), taught by Jennifer Widom and offered on the OpenEdX platform. This was originally a "massive open online course" (MOOC)—that is, thousands of people took it together, during a defined time period—but Stanford now offers the course material as a series of 14 self-paced mini-courses.

I was initially skeptical about this course, figuring that any broad survey of the field would touch on each topic too lightly to be of any practical use. It turns out that I was wrong: yes, the course is an introductory one, and it covers a lot of topics, but on many of those topics, it goes into greater depth than specialized tutorials found elsewhere online, and it's particularly strong on SQL and database theory. The course provides exercises that pose real challenges, and presents them via an interactive platform that helps students to correct and learn from mistakes. Widom touts the course as being suitable for "a la carte" learning (hence the 14 mini-courses), but a novice will find all of the topics useful. You can see my full review of both this course and the Big Data University SQL course mentioned below here.

For me, one of the most useful resources for learning data science topics has been Coursera, which offers free MOOC's from well-known universities. Yes, there are plenty of tutorials and self-paced courses out there, but I find the deadlines provided by a MOOC to be a useful way to keep myself on track—even if there are no real consequences to missing them. Most of these courses are not for college credit, you may have to wait a while until the course you need starts, and many of the courses are surveys (including two courses on big data, Introduction to Data Science and Web Intelligence and Big Data), rather than focusing on more specific, practically useful topics, but Coursera promises a degree of academic rigor not found in the average online tutorial, as well as extra features such as machine- and peer-grading. The fact that co-founders Andrew Ng and Daphne Koller research machine learning bodes well for offerings related to data science.

The anonymous author of the BInalytics blog has identified a number of Coursera courses on data science topics that teach their students to program in R as part of the curriculum: Statistics One (probably not of much use to a social scientist with a quantitative background), Data Analysis, Computing for Data Analysis, Social Network Analysis, Mathematical Biostatistics Boot Camp, and Introduction to Computational Finance and Financial Econometrics. For other languages, he recommends Computational Methods for Data Analysis (MATLAB), Probabilistic Graphical Models (Octave/MATLAB), Passion Driven Statistics (SAS), and Computational Investing (Python); I should add that Andrew Ng's own Machine Learning course also uses Octave. The opportunity to study a substantive data science topic while learning a programming language at the same time is a good two-for-one deal.

Coursera also offers (non-programming) courses on business, which is an area that those interested in becoming data scientists should not neglect. BInalytics mentions Financial Engineering and Risk Management as being potentially useful to data scientists, and I took the almost entirely non-technical, but quite interesting, Foundations of Business Strategy.

Recently, Coursera has moved toward paying for extra features or, in a few cases, toward courses that cannot be taken for free. The capstone course of Coursera's new Data Science specialization falls into this category, but the other courses in the specialization can still be taken for free. For more details, see below under "Paid Courses, Certificates, and Degrees". In addition, check out the specialization's GitHub repository to find copies of the courses' lecture notes, which make up an excellent reference source for anyone using R to solve data science problems.

In the same vein as Coursera are Udacity and edX, the latter of which, like Coursera, offers courses in partnership with prestigious universities. At the moment, both services have more limited offerings than Coursera, but their catalogs are growing quickly, and Udacity offers a short course on Hadoop, a topic not covered by Coursera. In the beginning, their programming courses tended either to be very basic or to survey their topics at a general, conceptual level, but these offerings will be useful to those without a lot of programming experience, and their newer offerings include more advanced courses (Udacity also has one business course). Udacity's courses, it should be noted, are not strictly speaking MOOC's: like an online tutorial, you work at your own pace. However, for a fee, Udacity offers some of the interactivity that comes with a MOOC; see below, under "Paid Courses, Certificates, and Degrees".

Stanford University has adopted a policy of offering online classes through a variety of different outlets, including Coursera. Stanford's in-house efforts began with Class2Go, which offered three courses, including Widom's Introduction to Databases, which I described above. The University's latest effort uses the open-source OpenEdX platform; a growing list of courses began with an iteration of Databases, as well as a new offering called Statistical Learning. The latter course, taught by Trevor Hastie and Rob Tibshirani, seems to cover much the same ground as Ng's Machine Learning, but using R rather than MATLAB/Octave. Hastie and Tibshirani (with Gareth James and Daniela Witten) co-authored a book on the same material, An Introduction to Statistical Learning, with Applications in R, and in conjunction with the course, the book's publisher made a PDF version available for free, to members of the public as well as students in the course.

Stanford also offers free courses through iTunes U, including a version of Ng's Machine Learning.

Finally, both KDnuggets and Udemy list a few free online courses among many more paid ones.

Big Data University offers a number of useful courses, including courses on SQL, Java, Pig, Hive, and Hadoop (remember, it's fun to say "Hadoop"!); most of the courses are free, but very introductory. These are not MOOC's, but rather self-directed, at-your-own-pace tutorials. Big Data University's first SQL course is adequate for introducing the fundamentals of the query language, and also covers a lot of database theory, but the Stanford course is better on both scores, and the Big Data course has nothing to compare to the interactive exercises in the Stanford offering. One merit of the Big Data University course is that it gives the student practical familiarity with setting up and using a common SQL package, IBM's DB2 Express-C, which can be downloaded for free. Another merit of the SQL course is that it's offered in Polish, Portuguese, Russian, and Spanish, as well as English; a second course is offered in English, Portuguese, and Spanish.

MySQL Tutorial, as one might expect from the name, features many useful tutorials on Oracle's popular open-source (and free) database program. Don't be put off by the writer's questionable grammar (he's obviously not a native speaker) or sometimes odd organization.

Download MySQL and the MySQL Reference Manual at the MySQL Developer Zone.

Dubois, Hinz, & Pedersen's MySQL 5.0 Certification Study Guide (available from both Amazon and Barnes & Noble—there are also Kindle and Nook editions) comes highly recommended, once you've learned MySQL and are ready to take the Developer exams; its biggest selling point is that it was written by the authors of the exams. (In the interest of full disclosure: yes, those are affiliate links.)

Oracle's New to Java Programming Center provides a good start with that language, including tutorials.

Codecademy offers short, interactive tutorials on several programming languages, including Python.

Kevin Sheppard's Introduction to Python for Econometrics, Statistics and Data Analysis, a free ebook, provides a guide to the popular scripting language that's especially relevant for our purposes. Note that there's an earlier, incomplete version of the book on the web, so make sure to use this link for the most recent version.

For learning R, Data Camp offers a small but growing collection of interactive tutorials, two of which constitute the programming exercises of Coursera courses: Eric Zivot's Introduction to Computational Finance and Financial Econometrics and Mine Çetinkaya-Rundel's Data Analysis and Statistical Inference. (The statistics taught in the latter course are probably a little basic for anyone coming from a social science background—the most advanced topic is multiple regression.) Taking the Coursera courses signs you up for their Data Camp components, but you can also take the Data Camp courses by themselves, whether or not the Coursera courses are being offered at the time. Data Camp also plans offerings with Revolution Analytics and RStudio in the near future.

The BInalytics blog recommends Jones, Maillardet, and Robinson's slightly pricey textbook Introduction to Scientific Programming and Simulation Using R, available from Amazon (and also in a Kindle version) and Barnes & Noble (there's no Nook version available).

Conway and White's Machine Learning for Hackers insists it's not a guide for learning R, but you wouldn't be the first person to use it as a way to learn R while also studying machine learning; it has an accompanying website with code samples and other goodies. You can of course get it at Amazon or Barnes & Noble; both Kindle and Nook editions are available, and O'Reilly also offers an upgrade option to receive updates and non-DRM copies (for the Nook edition, at least, which is the one I own).

In a similar vein is Torgo's Data Mining with R: Learning with Case Studies, which the author of the BInalytics blog recommends highly for its case approach, though he does find it less challenging than Conway and White's book. It too is available at Amazon (and also in a Kindle version) and Barnes & Noble (again, there's no Nook version).

A free alternative for studying both machine learning and R is James, Witten, Hastie, and Tibshirani's An Introduction to Statistical Learning, with Applications in R.

The R Project for Statistical Computing is a source for all things R.

RSeek can help you find additional information on a language whose name gives search engines fits.

Code School offers a very basic but very accessible online course called Try R, complete with a pirate theme and badges for completing each chapter. Completing the course takes only a couple of hours, at most, and on completion you'll be offered discounts on O'Reilly ebooks (50%) and print books (40%). A course on Ruby and a zombie-themed course on Ruby on Rails might also be relevant to a data scientist. All three of these are free, but Code School also offers paid courses.

You might also check out two posts by Noam Ross, one an account of a talk on debugging tools in R, and the other a very practical set of recommendations for speeding up R code.

For using R with big data, BInalytics recommends a series of tutorials posted by Jeffrey Breen on his Things I Tend to Forget blog. These tutorials make use of the RHadoop packages published by Revolution Analytics.

For those interested in more advanced topics, a list of free e-books can be found in Carl Anderson's blog p-value.info.

In a similar vein, Ryan Swanstrom has posted a list of free data science e-journals on his Data Science 101 blog.

Machine Learning Surveys bills itself as a "list of literature surveys, reviews, and tutorials on Machine Learning and related topics". As of January 2013, it listed 123 resources.


Paid Courses, Certificates, and Degrees

Mind you, there's also something to be said for taking formal classes: you've got something to put on your resume, you don't have to worry as much about motivating yourself, and you might even be able to get a recommendation from a teacher impressed by your aptitude. On top of all that, many courses include certification exams for "free". If you can afford the time and money for classes, they might well be a good option.

KDnuggets also lists a wide range of short training courses and university programs, as well as featuring sections on software, news, conferences, and even publicly available datasets, among a number of other things.

Udemy offers many paid but inexpensive (and occasionally free) courses on subjects relevant to data science. The site is notable for offering several, competing courses on many topics, with user reviews to help you make choices. One of my readers has recommended this $59 course on Java, though Udemy has a number of other offerings on that programming language.

Both Coursera and Udacity have moved recently toward pay models. Reports of cheating have darkened Coursera's reputation somewhat. Its "Signature Track" purportedly seeks to address this issue by requiring stringent identification procedures (including typing style detection) and a fee of $30-$100 in exchange for a "verified" certificate that can be shared with a college or employer (this is currently available only for a few courses, though you can still take those courses in the normal, free way).

Given the doubtful efficacy of these identification measures (they wouldn't, for example, prevent a student from uploading an assignment file created by someone else), whether employers will place any stock in these verifiable certificates (as opposed to a simple line on a resume listing the course) remains an open question; their real value at present probably lies in a concurrent effort that has obtained American Council on Education (ACE) College Credit Recommendation Service (CREDIT; no, I don't know how that acronym works) recommendation for some of the introductory-level courses, a step that will allow students to gain college credit, for an additional fee assessed for an online Credit Exam. However, most readers of this blog are probably not undergraduate students taking introductory courses, and it's doubtful that credit will be offered by universities for the more advanced courses of interest to an aspiring data scientist (indeed, there are several courses offered with a Signature Track that aren't among those recommended by ACE CREDIT).

More recently, Coursera has begun to offer paid tutoring through Google's Helpouts (I'm not clear on whether Coursera is getting any revenue from this).

In addition, the company has begun to offer "specializations", each of which features a series of short courses, followed by a capstone project of some sort. If you take all of the courses on the Signature Track, and then finish the capstone project, you'll receive a certificate for the specialization. For example, the new Data Science specialization, offered in conjunction with Johns Hopkins, includes nine short courses and a capstone, each for $49, for a total of $490, including the opportunity to retake failed courses for up to two years; the first iteration will run from Apr. 7 to roughly the end of July. All of the courses are also available for free, but you can't enroll for the capstone project unless you pay for the entire specialization.

The Data Science specialization, which uses R and Git throughout, looks to be fairly useful, but much of the specialization concentrates on statistics and good scientific practice, which, while quite valuable, are already familiar to most social scientists. It also has little coverage of databases, other than how to read data from them using R. On the other hand, the Getting and Cleaning Data course offers to impart some very practical skills, such as using API's and web scraping to extract data from the web; it's also the course that covers extracting data from databases.

Udacity offers its at-your-own-pace courses for free, but for a third of its courses, including the ones most useful for aspiring data scientists, it offers extra services (called "Full Courses"), such as coaching and product feedback, for a "subscription" fee, typically on the order of $150 per month per course (meaning that the faster you finish the course, the less you pay).

Similar to Coursera's specializations, Udacity has recently announced four upcoming "nanodegrees". The Data Analytics nanodegree, which is "[p]roduced in collaboration with" AT&T, Cloudera, Facebook, and MongoDB, is likely aimed at the same audience as Coursera's Data Science specialization. Nanodegrees will take 6-12 months to complete, at a cost of $200 a month.

Even more intriguing is Udacity's Master of Science in Computer Science, offered in conjunction with Georgia Tech and AT&T. The selection of courses on offer is more limited than that found on campus, but the cost, at under $7,000, is about a third of what students would pay for the on-campus version. Students must gain admittance to Tech (coincidentally, my own alma mater), and the degree program will soon be taking applications for Fall 2015. I'm not clear on why applications have to be taken nearly a year in advance, and I've certainly seen other online programs that don't require that much lead time for applicants.

The Georgia R School offers 10 courses on R for $95 each, with monthly and yearly memberships available. There's a discount for students, and the school offers a 14-day free trial.

Cloudera offers not only online courses, but certifications as well, for its Apache Hadoop platform.

An increasing number of universities offer master's degrees or graduate certificates in data science, data mining, or business analytics. These programs offer the allure of a solid credential, but, as with business degrees, costs tend to be high, and financial aid (other than student loans) scarce.

Doug Henschen of Information Week has catalogued what he regards as the top 20 master's degrees in the field, with a mention of 10 additional programs, and the promise of more to come. Many of the universities offer part-time and/or online curricula, and many offer (shorter, cheaper) certificates as well as degrees.

Probably the most complete list (with over 200 programs) is Ryan Swanstrom's Data Science Colleges, spun out of his Data Science 101 blog. Other lists can be found in Gil Press's What's the Big Data? blog and on the homepage of North Carolina State University's Institute for Advanced Analytics.

You might also consider applying for the Insight Data Science Fellows Program, especially if you're a PhD candidate or new graduate who wants to work in the Bay Area of California. This is a six-week program that's project-based (rather than classroom-based) and includes mentoring and interviews with top Silicon Valley companies.


Data Sources

Do you find it hard to motivate yourself to practice your skills on canned exercises? If so, download some real-world data and get your hands dirty.

One of the easiest ways to access large amounts of data is Quandl, which has collected over 10 million economic and social datasets from over 500 sources from around the world. You can browse data in Quandl's web interface, and download it either in standard formats or through an API supported by libraries and plugins for a wide variety of languages and applications. Data Camp (mentioned above, under "Free and Cheap Learning Resources") offers a course called "How to Work with Quandl in R".

Below, you'll find a list of data sources that I assembled some time ago. Since then, I've found a list that puts mine (and probably anyone else's) to shame: Jeffrey Leek provides it in the last lecture of his Getting and Cleaning Data course on Coursera, part of the Data Science specialization. You can find his long list of links in the notes for that lecture, which is titled "Data Resources". The notes, incidentally, are an HTML5 document, so you'll need to use the "Page Down" button to scroll through them. You can access these without signing up on Coursera or joining the class, though you would have to do so to watch the lecture itself.

One problem with finding data is that many websites that provide public access to databases do so only through web interfaces that, while often sporting impressive visualization tools, don't allow for serious statistical analysis (probably by design—organizations often have no desire to give up proprietary data for free). If you really want to analyze data, you need to be able to download an entire dataset (or a subset of it). All of the sites below allow free downloads of databases in one form or another. Incidentally, I do need to note that I shamelessly took the first three of these sites from Gil Press's October, 2012 article on Foreign Policy's website, "10 Big Data Sites to Watch".

The U.S. government's Data.gov offers approximately three zillion datasets for your analytical pleasure (well, actually, close to 400,000, but that's still more than you can examine in your lifetime).

DataMarket offers an intriguing variety of both government and industry data. As far as I can tell, datasets can be downloaded only through an API (at least, for free users), rather than in ASCII or Excel form, but data from different sets can be combined, and, quite usefully, DataMarket provides links to the providers of the data, making alternate methods of download possible in many cases.

As mentioned above, KDnuggets lists a number of sites with publicly available datasets. These include images, blog posts, and even songs.

The U.S. Census Bureau provides a variety of demographic and economic data. The website's organization leaves something to be desired, but look here, here, and here for downloadable datasets.

The UCI Machine Learning Repository, maintained by UC Irvine's Center for Machine Learning and Intelligent Systems, houses 235 datasets as of January 2013.

If you're interested in politics, try the data from the American National Election Studies, widely considered the most important series of surveys of U.S. voters.

If health is more your cup of tea, the National Center for Health Statistics, part of the Centers for Disease Control and Prevention, provides an interesting range of data on its FTP server.
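For a source that is served over FTP, such as the NCHS server just mentioned, the directory listing can be fetched from a script rather than a browser. The sketch below uses Python's standard ftplib and takes its URL from a reader comment on this page; whether that address is still live is an assumption, and the helper names split_ftp_url and list_remote_files are made up for the example.

```python
from ftplib import FTP
from urllib.parse import urlparse

# URL quoted from a reader comment on this page; it may have moved or
# been retired since it was posted.
NSFG_URL = "ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, directory)."""
    parts = urlparse(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError(f"not an FTP URL: {url!r}")
    return parts.hostname, parts.path or "/"


def list_remote_files(url, timeout=30):
    """Log in anonymously and return the names in the remote directory."""
    host, directory = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()         # no arguments means an anonymous login
        ftp.cwd(directory)  # move to the dataset directory
        return ftp.nlst()   # plain list of file and directory names


# No network access happens here; this only shows how the URL is split.
host, directory = split_ftp_url(NSFG_URL)
print(host, directory)
```

Calling list_remote_files(NSFG_URL) would then attempt the actual connection. Individual files could be fetched with ftplib's retrbinary method, after reading the data users agreement that the server asks you to accept.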


Contests

Many devotees of data science swear by public competitions as a means to hone their craft, as well as to gain public notice that might lead to employment. This is obviously not really an option for a beginner, but the thrill of competition can certainly provide good motivation for learning.

Kaggle, of course, is the best-known host of data competitions. At any given time, the company generally hosts a handful of "real" competitions (that is, those that give money prizes and count for Kaggle ranking points and tiers), plus a larger number of "Knowledge" competitions intended for practice. The Knowledge competitions may award only bragging rights, but they're accessible to individuals or small groups with little experience in data science.

Innocentive hosts challenges from a broad array of disciplines. Many of these problems are tractable to data science.

TunedIT hosts competitions (it listed 32 as of January 2013) as a means of promoting its data mining platform.

116 comments:

  1. This blog is a really helpful resource for people who wish to become data scientists. Good pieces of advice. Keep writing!

  2. Someone needing some data sets to experiment with could check out the USA CDC (Centers for Disease Control and Prevention) link below. You will have to first a

    CDC/National Center for Health Statistics
    -----------------------------------------
    The data files can be found here:

    ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/


    The terms (i.e. the "Data Access - FTP - Data Users Agreement") and conditions under which the data sets are being made freely available to you are found here:

    http://bit.ly/xecvTS


    Replies
    1. I think you might have pasted the wrong URL to the Data Users Agreement, but there's a link to that agreement on the National Center for Health Statistics page mentioned above.

    2. The correct URL for the UA is: http://1.usa.gov/XUp5Na

  3. Code School - Try R ( http://bit.ly/UA7DbU )

    "R is a tool for statistics and data modeling. The R programming language is elegant, versatile, and has a highly expressive syntax designed around working with data. R is more than that, though — it also includes extremely powerful graphics capabilities. If you want to easily manipulate your data and present it in compelling ways, R is the tool for you."

    Link: http://bit.ly/UA7DbU, http://tryr.codeschool.com/


    Replies
    1. Oh, that's cute, complete with little badges for completing the chapters. I'll add it.

  4. Hi Scott,

    I came across your blog page here. Great list of resources!

    I work at Lavastorm Analytics and big data analytics is what we do. We are always looking to hire. Would you mind posting our careers page as a resource for this page - http://www.lavastorm.com/company/careers/

    Thanks,

    John

    Replies
    1. Sure—I've just added the link. I'm a little curious, though: Lavastorm is a producer of business intelligence/analytics software, which is something that would seem to call for people more on the programming than on the social science side of data science. In that case, why the interest in a blog like mine?

  5. Hi, Scott,

    Love the list of blogs you've got going here! Data Science 101 is one of my personal favorites. I work with the datascience@berkeley Master of Data Science degree program, and we have a blog that features interviews from thought leaders in the field as well as some background on different projects and areas that might be of interest to your readers. Would you mind adding the link to your list, in case your visitors want to check it out? The blog can be found at: http://datascience.berkeley.edu/blog

    Keep up the great work!

    Thanks,
    Jenna

    Replies
    1. It took me a while to update this page, but I've added a link to your blog now. :)

    2. Hi, Scott,

      Dropping you a line again to share that we recently published a project called "What is Big Data?" which compiles a comprehensive list of "Big Data" definitions from 40+ thought-leaders in the data science field. We feature people like Hilary Mason, Drew Conway, Hal Varian, Gregory Piatetsky, and many others. We knew the term was vague so we figured the best way to get a handle on it was to ask those who are immersed in the field.

      You can check it out here: http://datascience.berkeley.edu/what-is-big-data/

      This seemed very much in line with what you write about on your blog, so I wanted to make sure you were aware of it. :) Hope you enjoy.

      Thanks,
      Jenna

  6. I will be sure to add that. Thanks!

  7. Sorry for the late reply, but I don't update this page as regularly as I should. While a computer science background can indeed be useful for data scientists, I'm not sure I see why someone entering the data science field would pick this degree, rather than one that's focused on data science topics. In checking a couple of the lists linked from this page, however, I noticed that Syracuse does offer a certificate program in data science.

  8. Hi. I work at CrowdAnalytix and we host data science contests too. Would you mind adding our site and information?

  9. Another addition could be SIRE Life Sciences, https://sire-search.com. Job postings & recruitment.

  10. If you are interested in working on data science datasets that can help refine your methods and techniques you may want to investigate http://societyofdatascientists.com/datasets/ there are no restrictions on data usage.
