Thursday, November 29, 2012

My Next Steps

Once I had decided I wanted to become a data scientist (see my previous post), I had to figure out exactly how to accomplish that.

My first step was figuring out what I needed to learn. I know statistics, and I'm good at writing and speaking; that meant I needed to concentrate on programming languages (if you want to call some of these "scripting languages" or "query languages", go right ahead). Reading a few articles on the subject gave me a good idea of what languages data scientists use most. After conversations with a couple of computer scientists, I decided that my first priorities would be Java and Python (for application programming), R (for statistical programming), and SQL (for database queries). Why not learn C++? Knowing it would be nice, but, judging by job ads, Java, Python, and R will see more use, and I had to set priorities.

Where to learn these languages? I really didn't have much money to spend (I am, after all, unemployed), which limited my options. I briefly considered applying for training through the Workforce Investment Act, but most of the programs available in my area focused on web development or database administration, most of them would have taken too long to finish, and, at any rate, paying for additional education for someone who already has a Ph.D. was going to be a hard sell. I therefore started to explore resources for teaching myself.

For SQL, I initially picked MySQL, because it's both popular and free, and I chose an online tutorial; however, I then discovered a free SQL course at Big Data University, which employs DB2 Express-C. While this course essentially amounts to self-teaching, it provides a little more structure than the tutorial, especially an exam at the end (you even get a certificate upon passing the exam—for whatever that's worth).

Big Data University also offers a course in Java, but this one is beta, and only about half the lessons are presently available. Still, it may be worth taking a look at. Probably more useful is Oracle's New to Java Programming Center, which features a number of tutorials.

I bought Conway and White's Machine Learning for Hackers, unsurprisingly, to study machine learning (another skill I lack), but the book turned out to be a half-decent means of learning R as well, with instructions for downloading the main application and extensions, and examples for you to follow on your own computer.

I'll let Python wait for a bit, until I've mastered the basics of Java and R. One of my friends gave me a book on the language, but Kevin Sheppard's Introduction to Python for Econometrics, Statistics and Data Analysis, a free ebook, seems more focused on what I need to learn.


Would it be worth it to get certification for one or more languages? I've encountered mixed opinions on the subject: certifications or not, I won't have work experience, and many employers are only impressed by work experience. Certifications certainly can't hurt, but their costs are also non-trivial. Still, I've had a look at the Oracle certifications, particularly the ones for MySQL, and bought the MySQL 5.0 Certification Study Guide. After discussing the subject with my programmer friends, I've concluded that the developer certifications (as opposed to the database administrator ones) would be most appropriate.

If you, the reader, have any good ideas or know of useful learning resources, please comment below—I could benefit from your wisdom, and so could everyone else reading. In the meantime, please visit my "Useful Links" page for more information on the websites I've mentioned here.


  1. Like you, I'm trying to be a data scientist.... look at there are interesting things....
    I'm an unemployed Spanish statistician.
    Good luck!!!

    1. Good catch, Ana—I forgot to add that one to the list. Coursera offers free online courses from well-known universities. The trick is finding a course you need offered when you need it, but I've just signed up for this one on Python, though there's been no announcement yet on when it will start.

  2. Hi Scott,
    Stumbled upon this via one of your posts from Linkedin.
    Great stuff...and ironically I'm the other half of the coin. I have the Database/Business Intelligence and Business experience skillsets.
    I can pretty much manipulate and massage data to be represented a specific way to align it to a business case/problem. What I lack however are statistical analytical chops that will help extract the story behind the data.
    I've taken Statistics courses in undergrad and a data mining course in my current Graduate certificate...however I think I'm still struggling to bridge the gap between theory and application. What method to apply when..and why?

    Have you looked at a few courses @

    Best of luck.

    1. Udemy does indeed look like a useful site—I'll add it to my links page.

      You're absolutely right that learning what method to apply when is the hard part. It can take a lot of training and/or experience to do well, which, I think, is why scientists are currently in demand. In all honesty, even among experienced researchers, it's common to specialize in only a few methods, and, having those hammers, to treat everything as nails, or at least to look for problems that are amenable to hammering.

      However, I think what would help someone like you the most is to take a course (or better yet, courses) in research methods: in a good course, you'd learn a little about the philosophy of science, the nature of causality, and why you choose one method over another in a given situation. Some good research methods courses don't even get very deep into statistics, and in fact, a graduate program might introduce you to the basic research issues before getting to the stats, so that when you do learn the statistical methods, their practical applications are more obvious.

  3. I came from the stats and econometrics side of the world. Smartest thing I ever did was learn SQL (never mentioned in several years of grad school). I also went with MySQL. Took a 3-day workshop at the local community college to get a decent foundation, then worked out the rest on my own. But always more to learn.

    Now diving heavily into data mining. Although I know some of the simpler trees, etc., I have lots to learn. I've begun to learn Rapid-I (RapidMiner). Seems to be a nice GUI with some powerful algorithms behind it.

    That will take a good long time; then I think a little Python might be on my list.