Below, I've included links to a number of useful resources for social scientists who want to become data scientists; to make the list even more useful, I've added a little bit of information about each link. If you have any suggestions for additional resources I can list, please let me know! I'd also like to be alerted to broken links, of course.
Note that you can use the links in the sidebar to jump to a specific section of this page from here or anywhere else in the blog.
Defining the Data Scientist—and Why Anyone Would Hire You as One
The articles below move from defining data science, to describing what data scientists do, to giving practical advice on becoming a data scientist:
"What Is Data Science? What Is Analytics? What Is a Data Scientist?"
Davenport and Patil, "Data Scientist: The Sexiest Job of the 21st Century (Executive Summary)"
Press, "Big Data News of the Week: Sexy and Social Data Scientists"
Miller, "Data as a Social Science"
Dyche, "Big Data 'Eurekas!' Don't Just Happen"
Roe, "So You Want to Be a Data Scientist?"
Koploy, "Three Career Secrets for Aspiring Data Scientists"
Koploy, "Advice for the Aspiring Data Scientist"
Roy, "What to Look for When Hiring a BI Specialist"
"On Becoming a Data Scientist", "Part 1—The Destination", "Part 2—The Technical Core, for Free", "Part 3—The Softer Side", and "Part 4—Managing". (Part 2, incidentally, has a long list of resources to help in learning R.)
A really great article offering practical advice on making a career change to data science:
Jain, "Planning a Late Career Shift to Analytics/Big Data? Better Be Prepared!"
(I can confirm the wisdom of a lot of Jain's points from my own experience.)
And last but not least, some advice (not specific to data scientists) for moving from academia to industry:
Khalilov, "A Guideline to Move from Academia to Industry—Part 1" and
"A Guideline to Move from Academia to Industry—Part 2"
Wood, "The Ph.D.'s Guide to a Nonfaculty Job Search"
As the articles linked above make obvious, there's considerable disagreement over what a data scientist is and where they come from. Everyone agrees that data scientists attempt to extract useful information from big data (another term open to interpretation, by the way). However, some writers focus on how data scientists approach problems; they see data scientists as researchers who bring a particular way of seeking knowledge (the scientific method, with its rigorous approach to testing hypotheses through statistical analysis of experimental and quasi-experimental designs) to a new domain. Other writers focus on the computer skills required to use the tools that data scientists use; they see the data scientist as an improved version of the traditional data analyst, adding new tools such as MapReduce and Hadoop to the traditional skill set of SQL, Java, C++, and the like. There's also a third school of thought, which I've seen in job ads, that focuses on experience in marketing research; individuals with this experience often possess business degrees, though they typically have picked up some statistical and database skills as well.
Real data scientists fit all of these molds. More to the point, for an aspiring data scientist, real employers may subscribe to one vision or another, and post job ads that reflect that point of view. If you've been trained as a social scientist, odds are you don't have the 5+ years of SQL or C++ experience that some employers seek. On the other hand, even if we're talking about technical skills, someone trained as a database administrator probably doesn't know a whole lot about advanced econometrics or sampling theory. Data scientists from all three sorts of background bring useful skills to the table, and so, in the grand scheme of things, this isn't a matter of who's right and who's wrong. However, you need to find the employers and job postings that match the background you have, or at least, to make the case to an employer that you have the skills and aptitude to do the job.
Job Postings
Yeah, I know, this is the section you really wanted to see. Logically, jobs are something you should look at after preparing yourself with all the other resources linked below, but I'll concede to reality here. :)
For tech jobs in general, there's Dice.
More specialized sites:
DataJobs
AnalyticTalent
KDnuggets Jobs
Of course, you'll find more job postings on, say, Indeed, which aggregates ads from many different sites, including Dice (I'm not sure if it includes DataJobs, AnalyticTalent, or KDnuggets as well). The advantage of the more specialized job sites is that you can use more inclusive search terms without being overwhelmed by the numbers of hits. That being said, the search terms I've found most useful on Indeed are "'social science' research" (note the quotes), "statistician", and, obviously, "'data scientist'" (note the quotes again); "'data analyst'" tends to produce mostly jobs for traditional data analysts, specializing in computer skills.
For a different, and quite interesting, approach, try Hired, a headhunting firm that specializes in tech jobs, including data scientists; have a look at this article in Forbes for a full description. The catch is that Hired picks fewer than 10% of its applicants as candidates for its employer clients—but it promises 5 to 15 offers within a week for those who make the cut.
Professional Associations and Social Network Groups
The recently formed Data Science Association is the first professional association specifically devoted to the field. Yearly membership is free "for a limited time". The association holds events from time to time, but the website doesn't list any for 2017. The site features a decent online library, a weekly list of data science news stories, and, of considerable importance for the professionalization of the field, a code of conduct.
Three active LinkedIn groups deal with data science and big data:
Big Data / Analytics / Strategy / FP&A / S&OP / Strategic Planning / Predictive & Business Analytics
Data Mining, Statistics, Big Data, and Data Visualization
Research Methods and Data Science (RMDS)
The first group was established by IE. Analytics, and is the biggest of the three. The second group, while smaller, seems to have more members from outside the U.S. There's naturally some overlap between the two in content and membership. The third is the smallest of the three, and seems to lean a little towards academic topics of discussion.
Group members do a great job of posting links to the latest articles in the field, and of course, the groups are wonderful for social networking. However, the unstructured nature of LinkedIn group discussions (think Twitter without hashtags) can make it hard to look for information on specific topics, and links to popular articles are often posted several times by different group members, and in all three groups. I also find the job listings less than useful: those in the first group are not necessarily focused on data science, while those in the second, though more relevant, are also pretty small in number, and the third group has only a handful.
Also, the "Big Data" group is moderated, and seems to suffer a bit from strange moderation decisions that sometimes see informational posts placed in the little-viewed "Promotions" section rather than "Discussions" (a fate not shared by IE. Analytics' own promotions). Moreover, the presence in the group of several female employees of IE. Analytics who tend to "Like" practically every post seems reminiscent of the pharmaceutical companies' former practice of hiring ex-cheerleaders to sell drugs to (mostly male) doctors.
The Data Science community on Google+ also covers the topic, though it has only a handful of members.
Blogs
This list is very much incomplete, and I would certainly appreciate suggestions of additional blogs I can add.
We'll start with a site that aggregates multiple big data blogs, planet Big Data.
Ryan Swanstrom's Data Science 101, much like this blog, seeks to help those who want to become data scientists.
Gil Press's What's the Big Data? offers a wealth of current information on the field, with sections devoted to events, startups, interviews, and courses and graduate programs, in addition to the blog itself.
Academics may take particular interest in Zero Intelligence Agents, the blog of Drew Conway, co-author of Machine Learning for Hackers (see "Free and Cheap Learning Resources", formerly "Self-teaching Resources", below). This blog hasn't seen a new post since 2014, but contains good examples of the application of data science methods to practical problems, and, especially, creative visualizations of the results.
The eponymous blog of Conway's co-author, John Myles White, also makes an interesting read.
The MIKE 2.0 blogs explore a wide variety of topics. I find Phil Simon's posts particularly interesting.
Carl Anderson's blog p-value.info features "[m]usings on data science, machine learning, and statistics". Anderson addresses these topics from a practical perspective, sometimes even including code in his posts.
The anonymous BInalytics blog focuses mainly on technical subjects, but also features some big-picture posts.
Noam Ross's eponymous blog offers quite a bit of advice on using R, and also some examples of the author's own research.
Jenna Dutcher writes the datascience@berkeley Blog for Berkeley's Data Science program. The blog features short commentaries on interesting articles, books, and videos in the data science field, and links to the original works.
Jeff Leek and Roger Peng, two of the three Johns Hopkins biostatistics professors who teach Coursera's new Data Science "specialization", collaborate with Rafa Irizarry on the Simply Statistics blog. The authors have promised to feature top students from the specialization in the blog.
Tommy Jones' Biased Estimates covers a variety of data science topics, including good coverage of the goings-on of Data Community DC, which unites data scientists and their ilk in the Washington, DC area.
Free and Cheap Learning Resources (aka, Self-teaching Resources)
Odds are, if you're a social scientist working to become a data scientist, you're going to have to teach yourself quite a bit. Of course, if you've got a PhD, you're pretty smart to begin with, and the fact that you want to become a data scientist suggests that you're pretty technically savvy; the upshot is that you could learn all of the things you need to know pretty quickly on the job. Unfortunately, most employers don't think like that, and write job ads as if they think potential employees are incapable of learning once hired. This is not entirely stupid, as an employer can be sure that someone who already knows a given skill will be able to use it, without having to worry about how well that employee can learn new ones, but it does unnecessarily filter out a lot of people who might be very useful—and given the speed at which data science and its associated computer applications are evolving, anyone who can't learn new skills quickly is not going to be very useful anyway.
In any event, the reality is that the more you know going in, the more employable you'll be. You might not have years of experience, but you can easily teach yourself enough to pass certification exams, and by teaching yourself you can learn more cheaply and, usually, more quickly than if you took a formal course. Below, I've assembled a list of free or cheap learning resources (if you do take certification exams, they should be your main expense), and I hope to add to this list as time goes on.
The best place to start, in my opinion, is Stanford University's Databases (formerly "Introduction to Databases"), taught by Jennifer Widom and offered on the OpenEdX platform. This was originally a "massive open online course" (MOOC)—that is, thousands of people took it together, during a defined time period—but Stanford now offers the course material as a series of 14 self-paced mini-courses.
I was initially skeptical about this course, figuring that any broad survey of the field would touch on each topic too lightly to be of any practical use. It turns out that I was wrong: yes, the course is an introductory one, and it covers a lot of topics, but on many of those topics, it goes into greater depth than specialized tutorials found elsewhere online, and it's particularly strong on SQL and database theory. The course provides exercises that pose real challenges, and presents them via an interactive platform that helps students to correct and learn from mistakes. Widom touts the course as being suitable for "a la carte" learning (hence the 14 mini-courses), but a novice will find all of the topics useful. You can see my full review of both this course and the Big Data University SQL course mentioned below here.
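If you've never written SQL at all, here's a rough taste of the sort of query an introductory database course starts with. This is only a sketch, not material from the Stanford course: it uses Python's built-in sqlite3 module because SQLite needs no setup, and the `respondents` table, its columns, and the sample rows are all made up for illustration.

```python
import sqlite3


def average_income_by_region(rows):
    """Load (id, region, income) rows into an in-memory SQLite table
    and return the mean income per region, sorted by region name."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing is written to disk
    try:
        # Hypothetical table, invented for this example
        conn.execute(
            "CREATE TABLE respondents (id INTEGER PRIMARY KEY, region TEXT, income REAL)"
        )
        conn.executemany("INSERT INTO respondents VALUES (?, ?, ?)", rows)
        # GROUP BY collapses the rows for each region into a single summary row
        cursor = conn.execute(
            "SELECT region, AVG(income) FROM respondents "
            "GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample data
sample = [(1, "North", 40000.0), (2, "North", 60000.0), (3, "South", 30000.0)]
results = average_income_by_region(sample)
for region, mean_income in results:
    print(region, mean_income)
```

The same SELECT ... GROUP BY statement should carry over to MySQL or DB2 with little or no change, which is part of why SQL is worth learning early.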
For me, one of the most useful resources for learning data science topics has been Coursera, which offers free MOOC's from well-known universities. Yes, there are plenty of tutorials and self-paced courses out there, but I find the deadlines provided by a MOOC to be a useful way to keep myself on track—even if there are no real consequences to missing them. Most of these courses are not for college credit, you may have to wait a while until the course you need starts, and many of the courses are surveys (including two courses on big data, Introduction to Data Science and Web Intelligence and Big Data), rather than focusing on more specific, practically useful topics, but Coursera promises a degree of academic rigor not found in the average online tutorial, as well as extra features such as machine- and peer-grading. The fact that co-founders Andrew Ng and Daphne Koller research machine learning bodes well for offerings related to data science.
The anonymous author of the BInalytics blog has identified a number of Coursera courses on data science topics that teach their students to program in R as part of the curriculum: Statistics One (probably not of much use to a social scientist with a quantitative background), Data Analysis, Computing for Data Analysis, Social Network Analysis, Mathematical Biostatistics Boot Camp, and Introduction to Computational Finance and Financial Econometrics. For other languages, he recommends Computational Methods for Data Analysis (MATLAB), Probabilistic Graphical Models (Octave/MATLAB), Passion Driven Statistics (SAS), and Computational Investing (Python); I should add that Andrew Ng's own Machine Learning course also uses Octave. The opportunity to study a substantive data science topic while learning a programming language at the same time is a good two-for-one deal.
Coursera also offers (non-programming) courses on business, which is an area that those interested in becoming data scientists should not neglect. BInalytics mentions Financial Engineering and Risk Management as being potentially useful to data scientists, and I took the almost entirely non-technical, but quite interesting, Foundations of Business Strategy.
Recently, Coursera has moved toward paying for extra features or, in a few cases, toward courses that cannot be taken for free. The capstone course of Coursera's new Data Science specialization falls into this category, but the other courses in the specialization can still be taken for free. For more details, see below under "Paid Courses, Certificates, and Degrees". In addition, check out the specialization's GitHub repository to find copies of the courses' lecture notes, which make up an excellent reference source for anyone using R to solve data science problems.
In the same vein as Coursera are Udacity and edX, the latter of which, like Coursera, offers courses in partnership with prestigious universities. At the moment, both services have more limited offerings than Coursera, but their catalogs are growing quickly, and Udacity offers a short course on Hadoop, a topic not covered by Coursera. In the beginning, their programming courses tended either to be very basic or to survey their topics at a general, conceptual level, but these offerings will be useful to those without a lot of programming experience, and their newer offerings include more advanced courses (Udacity also has one business course). Udacity's courses, it should be noted, are not strictly speaking MOOC's: as with an online tutorial, you work at your own pace. However, for a fee, Udacity offers some of the interactivity that comes with a MOOC; see below, under "Paid Courses, Certificates, and Degrees".
Stanford University has adopted a policy of offering online classes through a variety of different outlets, including Coursera. Stanford's in-house efforts began with Class2Go, which offered three courses, including Widom's Introduction to Databases, which I described above. The University's latest effort uses the open-source OpenEdX platform; a growing list of courses began with an iteration of Databases, as well as a new offering called Statistical Learning. The latter course, taught by Trevor Hastie and Rob Tibshirani, seems to cover much the same ground as Ng's Machine Learning, but using R rather than MATLAB/Octave. Hastie and Tibshirani (with Gareth James and Daniela Witten) co-authored a book on the same material, An Introduction to Statistical Learning, with Applications in R, and in conjunction with the course, the book's publisher made a PDF version available for free, to members of the public as well as students in the course.
Stanford also offers free courses through iTunes U, including a version of Ng's Machine Learning.
Finally, both KDnuggets and Udemy list a few free online courses among many more paid ones.
Big Data University offers a number of useful courses, including courses on SQL, Java, Pig, Hive, and Hadoop (Remember, it's fun to say "Hadoop"!); most of the courses are free, but very introductory. These are not MOOC's, but rather self-directed, at-your-own-pace tutorials. Big Data University's first SQL course is adequate for introducing the fundamentals of the query language, and also covers a lot of database theory, but the Stanford course is better on both scores, and the Big Data course has nothing to compare to the interactive exercises in the Stanford offering. One merit of the Big Data University course is that it gives the student practical familiarity with setting up and using a common SQL package, IBM's DB2 Express-C, which can be downloaded for free. Another merit of the SQL course is that it's offered in Polish, Portuguese, Russian, and Spanish, as well as English; a second course is offered in English, Portuguese, and Spanish.
MySQL Tutorial, as one might expect from the name, features many useful tutorials on Oracle's popular open-source (and free) database program. Don't be put off by the writer's questionable grammar (he's obviously not a native speaker) or sometimes odd organization.
Download MySQL and the MySQL Reference Manual at the MySQL Developer Zone.
Dubois, Hinz, & Pedersen's MySQL 5.0 Certification Study Guide (available from both Amazon and Barnes & Noble—there are also Kindle and Nook editions) comes highly recommended, once you've learned MySQL and are ready to take the Developer exams; its biggest selling point is that it was written by the authors of the exams. (In the interest of full disclosure: yes, those are affiliate links.)
Oracle's New to Java Programming Center provides a good start with that language, including tutorials.
Code Academy offers a number of short, interactive tutorials that cover each of several programming languages, including Python.
Kevin Sheppard's Introduction to Python for Econometrics, Statistics and Data Analysis, a free ebook, provides a guide to the popular scripting language that's especially relevant for our purposes. Note that there's an earlier, incomplete version of the book on the web; make sure to use this link for the most recent version.
For learning R, Data Camp offers a small but growing collection of interactive tutorials, two of which constitute the programming exercises of Coursera courses: Eric Zivot's Introduction to Computational Finance and Financial Econometrics and Mine Çetinkaya-Rundel's Data Analysis and Statistical Inference. (The statistics taught in the latter course are probably a little basic for anyone coming from a social science background—the most advanced topic is multiple regression.) Taking the Coursera courses signs you up for their Data Camp components, but you can also take the Data Camp courses by themselves, whether or not the Coursera courses are being offered at the time. Data Camp also plans offerings with Revolution Analytics and RStudio in the near future.
The BInalytics blog recommends Jones, Maillardet, and Robinson's slightly pricey textbook Introduction to Scientific Programming and Simulation Using R, available from Amazon (and also in a Kindle version) and Barnes & Noble (there's no Nook version available).
Conway and White's Machine Learning for Hackers insists it's not a guide for learning R, but you wouldn't be the first person to use it as a way to learn R while also studying machine learning; it has an accompanying website with code samples and other goodies. You can of course get it at Amazon or Barnes & Noble; both Kindle and Nook editions are available, and O'Reilly also offers an upgrade option to receive updates and non-DRM copies (for the Nook edition, at least, which is the one I own).
In a similar vein is Torgo's Data Mining with R: Learning with Case Studies, which the author of the BInalytics blog recommends highly for its case approach, though he does find it less challenging than Conway and White's book. It too is available at Amazon (and also in a Kindle version) and Barnes & Noble (again, there's no Nook version).
A free alternative for studying both machine learning and R is James, Witten, Hastie, and Tibshirani's An Introduction to Statistical Learning, with Applications in R.
The R Project for Statistical Computing is a source for all things R.
RSeek can help you find additional information on a language whose name gives search engines fits.
Code School offers a very basic but very accessible online course called Try R, complete with a pirate theme and badges for completing each chapter. Completing the course takes only a couple of hours, at most, and on completion you'll be offered discounts on O'Reilly ebooks (50%) and print books (40%). A course on Ruby and a zombie-themed course on Ruby on Rails might also be relevant to a data scientist. All three of these are free, but Code School also offers paid courses.
You might also check out two posts by Noam Ross, one a recap of a talk on debugging tools in R, and the other a very practical set of recommendations for speeding up R code.
For using R with big data, BInalytics recommends a series of tutorials posted by Jeffrey Breen on his Things I Tend to Forget blog. These tutorials make use of the RHadoop packages published by Revolution Analytics.
For those interested in more advanced topics, a list of free e-books can be found in Carl Anderson's blog p-value.info.
In a similar vein, Ryan Swanstrom has posted a list of free data science e-journals on his Data Science 101 blog.
Machine Learning Surveys bills itself as a "list of literature surveys, reviews, and tutorials on Machine Learning and related topics". As of January 2013, it listed 123 resources.
Paid Courses, Certificates, and Degrees
Mind you, there's also something to be said for taking formal classes: you've got something to put on your resume, you don't have to worry as much about motivating yourself, and you might even be able to get a recommendation from a teacher impressed by your aptitude. On top of all that, many courses include certification exams for "free". If you can afford the time and money for classes, they might well be a good option.
KDnuggets also lists a wide range of short training courses and university programs, as well as featuring sections on software, news, conferences, and even publicly available datasets, among a number of other things.
Udemy offers many paid but inexpensive (and occasionally free) courses on subjects relevant to data science. The site is notable for offering several, competing courses on many topics, with user reviews to help you make choices. One of my readers has recommended this $59 course on Java, though Udemy has a number of other offerings on that programming language.
Both Coursera and Udacity have moved recently toward pay models. Reports of cheating have darkened Coursera's reputation somewhat. Its "Signature Track" purportedly seeks to address this issue by requiring stringent identification procedures (including typing style detection) and a fee of $30-$100 in exchange for a "verified" certificate that can be shared with a college or employer (this is currently available only for a few courses, though you can still take those courses in the normal, free way).
Given the doubtful efficacy of these identification measures (they wouldn't, for example, prevent a student from uploading an assignment file created by someone else), whether employers will place any stock in these verifiable certificates (as opposed to a simple line on a resume listing the course) remains an open question. Their real value at present probably lies in a concurrent effort that has obtained American Council on Education (ACE) College Credit Recommendation Service (CREDIT; no, I don't know how that acronym works) recommendation for some of the introductory-level courses, a step that will allow students to gain college credit, for an additional fee assessed for an online Credit Exam. However, most readers of this blog are probably not undergraduate students taking introductory courses, and it's doubtful that credit will be offered by universities for the more advanced courses of interest to an aspiring data scientist (indeed, there are several courses offered with a Signature Track that aren't among those recommended by ACE CREDIT).
More recently, Coursera has begun to offer paid tutoring through Google's Helpouts (I'm not clear on whether Coursera is getting any revenue from this).
In addition, the company has begun to offer "specializations", each of which features a series of short courses, followed by a capstone project of some sort. If you take all of the courses on the Signature Track, and then finish the capstone project, you'll receive a certificate for the specialization. For example, the new Data Science specialization, offered in conjunction with Johns Hopkins, includes nine short courses and a capstone, each for $49, for a total of $490, including the opportunity to retake failed courses for up to two years; the first iteration will run from Apr. 7 to roughly the end of July. All of the courses are also available for free, but you can't enroll for the capstone project unless you pay for the entire specialization.
The Data Science specialization, which uses R and Git throughout, looks to be fairly useful, but much of the specialization concentrates on statistics and good scientific practice, which, while quite valuable, are already familiar to most social scientists. It also has little coverage of databases, other than how to read data from them using R. On the other hand, the Getting and Cleaning Data course offers to impart some very practical skills, such as using API's and web scraping to extract data from the web; it's also the course that covers extracting data from databases.
Udacity offers its at-your-own-pace courses for free, but for a third of its courses, including the ones most useful for aspiring data scientists, it offers extra services (called "Full Courses"), such as coaching and product feedback, for a "subscription" fee, typically on the order of $150 per month per course (meaning that the faster you finish the course, the less you pay).
Similar to Coursera's specializations, Udacity has recently announced four upcoming "nanodegrees". The Data Analytics nanodegree, which is "[p]roduced in collaboration with" AT&T, Cloudera, Facebook, and MongoDB, is likely aimed at the same audience as Coursera's Data Science specialization. Nanodegrees will take 6-12 months to complete, at a cost of $200 a month.
Even more intriguing is Coursera's Master of Science in Computer Science, offered in conjunction with Georgia Tech and AT&T. The selection of courses on offer is more limited than that found on campus, but the cost, at under $7,000, is about a third of what students would pay for the on-campus version. Students must gain admittance to Tech (coincidentally, my own alma mater), and the degree program will soon be taking applications for Fall 2015. I'm not clear on why applications have to be taken nearly a year in advance, and I've certainly seen other online programs that don't require that much lead time for applicants.
The Georgia R School offers 10 courses on R for $95 each, with monthly and yearly memberships available. There's a discount for students, and the school offers a 14-day free trial.
Cloudera offers not only online courses, but certifications as well, for its Apache Hadoop platform.
An increasing number of universities offer master's degrees or graduate certificates in data science, data mining, or business analytics. These programs offer the allure of a solid credential, but, as with business degrees, costs tend to be high, and financial aid (other than student loans) scarce.
Doug Henschen of Information Week has catalogued what he regards as the top 20 master's degrees in the field, with a mention of 10 additional programs, and the promise of more to come. Many of the universities offer part-time and/or online curricula, and many offer (shorter, cheaper) certificates as well as degrees.
Probably the most complete list (with over 200 programs) is Ryan Swanstrom's Data Science Colleges, spun out of his Data Science 101 blog. Other lists can be found in Gil Press's What's the Big Data? blog and on the homepage of North Carolina State University's Institute for Advanced Analytics.
You might also consider applying for the Insight Data Science Fellows Program, especially if you're a PhD candidate or new graduate who wants to work in the Bay Area of California. This is a six-week program that's project-based (rather than classroom-based) and includes mentoring and interviews with top Silicon Valley companies.
Data Sources
Do you find it hard to motivate yourself to practice your skills on canned exercises? If so, download some real-world data and get your hands dirty.
One of the easiest ways to access large amounts of data is Quandl, which has collected over 10 million economic and social datasets from over 500 sources from around the world. You can browse data in Quandl's web interface, and download it either in standard formats or through an API supported by libraries and plugins for a wide variety of languages and applications. Data Camp (mentioned above, under "Free and Cheap Learning Resources") offers a course called "How to Work with Quandl in R".
Below, you'll find a list of data sources that I assembled some time ago. Since then, I've found a list that puts mine (and probably anyone else's) to shame: Jeffrey Leek provides it in the last lecture of his Getting and Cleaning Data course on Coursera, part of the Data Science specialization. You can find his long list of links in the notes for that lecture, which is titled "Data Resources". The notes, incidentally, are an HTML5 document; you'll need to use the "Page Down" button to scroll through them. You can access these notes without signing up on Coursera or joining the class, though you would have to do so to watch the lecture itself.
One problem with finding data is that many websites that provide public access to databases do so only through web interfaces that, while often sporting impressive visualization tools, don't allow for serious statistical analysis (probably by design—organizations often have no desire to give up proprietary data for free). If you really want to analyze data, you need to be able to download an entire dataset (or a subset of it). All of the sites below allow free downloads of databases in one form or another. Incidentally, I do need to note that I shamelessly took the first three of these sites from Gil Press's October, 2012 article on Foreign Policy's website, "10 Big Data Sites to Watch".
The U.S. government's Data.gov offers approximately three zillion datasets for your analytical pleasure (well, actually, close to 400,000, but that's still more than you can examine in your lifetime).
DataMarket offers an intriguing variety of both government and industry data. As far as I can tell, datasets can be downloaded only through an API (at least, for free users), rather than in ASCII or Excel form, but data from different sets can be combined, and, quite usefully, DataMarket provides links to the providers of the data, making alternate methods of download possible in many cases.
As mentioned above, KDnuggets lists a number of sites with publicly available datasets. These include images, blog posts, and even songs.
The U.S. Census Bureau provides a variety of demographic and economic data. The website's organization leaves something to be desired, but look here, here, and here for downloadable datasets.
The UCI Machine Learning Repository, maintained by UC Irvine's Center for Machine Learning and Intelligent Systems, houses 235 datasets as of January 2013.
If you're interested in politics, try the data from the American National Election Studies, widely considered the most important series of surveys of U.S. voters.
If health is more your cup of tea, the National Center for Health Statistics, part of the Centers for Disease Control and Prevention, provides an interesting range of data on its FTP server.
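Once you've downloaded a dataset from one of the sites above, a first pass at analysis can be quite short. The sketch below is only an illustration, not a recipe tied to any particular site: it assumes the download is a CSV file, and the sample data, column names, and helper functions (`load_rows`, `mean_by_group`) are all hypothetical stand-ins. It uses only Python's standard library.

```python
import csv
import io
import statistics

# Hypothetical stand-in for the contents of a CSV file downloaded from
# one of the sites listed above; the columns and values are made up.
SAMPLE_CSV = """state,year,median_income
Ohio,2010,45000
Ohio,2011,46000
Texas,2010,48000
Texas,2011,50000
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def mean_by_group(rows, group_col, value_col):
    """Return a dict mapping each group to the mean of value_col."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_col], []).append(float(row[value_col]))
    return {group: statistics.mean(values) for group, values in groups.items()}


rows = load_rows(SAMPLE_CSV)
means = mean_by_group(rows, "state", "median_income")
print(means)
```

With a real file, you would replace `SAMPLE_CSV` with the text read from the file you downloaded (for example, via `open(path, newline="").read()`), and swap in that dataset's actual column names.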
Contests
Many devotees of data science swear by public competitions as a means of honing their craft, as well as of gaining public notice that might lead to employment. Winning is obviously not really an option for a beginner, but the thrill of competition can certainly provide good motivation for learning.
Kaggle, of course, is the best-known host of data competitions. At any given time, the company generally hosts a handful of "real" competitions (that is, those that award cash prizes and count toward Kaggle ranking points and tiers), plus a larger number of "Knowledge" competitions intended for practice. The Knowledge competitions may award only bragging rights, but they're accessible to individuals or small groups with little experience in data science.
Innocentive hosts challenges from a broad array of disciplines. Many of these problems are amenable to data science methods.
TunedIT hosts competitions (it listed 32 as of January 2013) as a means of promoting its data mining platform.
Someone needing some data sets to experiment with could check out the USA CDC (Centers for Disease Control and Prevention) link below. You will have to first a
CDC/National Center for Health Statistics
-----------------------------------------
The data files can be found here:
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/
The terms (i.e. the "Data Access - FTP - Data Users Agreement") and conditions under which the data sets are being made freely available to you are found here:
http://bit.ly/xecvTS
I think you might have pasted the wrong URL to the Data Users Agreement, but there's a link to that agreement on the National Center for Health Statistics page mentioned above.
The correct URL for the UA is: http://1.usa.gov/XUp5Na
Code School - Try R ( http://bit.ly/UA7DbU )
"R is a tool for statistics and data modeling. The R programming language is elegant, versatile, and has a highly expressive syntax designed around working with data. R is more than that, though: it also includes extremely powerful graphics capabilities. If you want to easily manipulate your data and present it in compelling ways, R is the tool for you."
Link: http://bit.ly/UA7DbU, http://tryr.codeschool.com/
Oh, that's cute, complete with little badges for completing the chapters. I'll add it.
Hi Scott,
I came across your blog page here. Great list of resources!
I work at Lavastorm Analytics and big data analytics is what we do. We are always looking to hire. Would you mind posting our careers page as a resource for this page - http://www.lavastorm.com/company/careers/
Thanks,
John
Sure—I've just added the link. I'm a little curious, though: Lavastorm is a producer of business intelligence/analytics software, which is something that would seem to call for people more on the programming than on the social science side of data science. In that case, why the interest in a blog like mine?
Hi, Scott,
Love the list of blogs you've got going here! Data Science 101 is one of my personal favorites. I work with the datascience@berkeley Master of Data Science degree program, and we have a blog that features interviews from thought leaders in the field as well as some background on different projects and areas that might be of interest to your readers. Would you mind adding the link to your list, in case your visitors want to check it out? The blog can be found at: http://datascience.berkeley.edu/blog
Keep up the great work!
Thanks,
Jenna
It took me a while to update this page, but I've added a link to your blog now. :)
Hi, Scott,
Dropping you a line again to share that we recently published a project called "What is Big Data?" which compiles a comprehensive list of "Big Data" definitions from 40+ thought-leaders in the data science field. We feature people like Hilary Mason, Drew Conway, Hal Varian, Gregory Piatetsky, and many others. We knew the term was vague so we figured the best way to get a handle on it was to ask those who are immersed in the field.
You can check it out here: http://datascience.berkeley.edu/what-is-big-data/
This seemed very much in line with what you write about on your blog, so I wanted to make sure you were aware of it. :) Hope you enjoy.
Thanks,
Jenna
I will be sure to add that. Thanks!
Sorry for the late reply, but I don't update this page as regularly as I should. While a computer science background can indeed be useful for data scientists, I'm not sure I see why someone entering the data science field would pick this degree, rather than one that's focused on data science topics. In checking a couple of the lists linked from this page, however, I noticed that Syracuse does offer a certificate program in data science.
Hi. I work at CrowdAnalytix and we host data science contests too. Would you mind adding our site and information?
Informative indeed.
Another addition could be SIRE Life Sciences, https://sire-search.com. Job postings & recruitment.
If you are interested in working on data science datasets that can help refine your methods and techniques, you may want to investigate http://societyofdatascientists.com/datasets/. There are no restrictions on data usage.