Thursday, November 29, 2012

My Next Steps

Once I had decided I wanted to become a data scientist (see my previous post), I had to figure out exactly how to accomplish that.

My first step was figuring out what I needed to learn. I know statistics, and I'm good at writing and speaking; that meant I needed to concentrate on programming languages (if you want to call some of these "scripting languages" or "query languages", go right ahead). Reading a few articles on the subject gave me a good idea of what languages data scientists use most. After conversations with a couple of computer scientists, I decided that my first priorities would be Java and Python (for application programming), R (for statistical programming), and SQL (for database queries). Why not learn C++? Knowing it would be nice, but, judging by job ads, Java, Python, and R will see more use, and I had to set priorities.

Where to learn these languages? I really didn't have much money to spend (I am, after all, unemployed), which limited my options. I briefly considered applying for training through the Workforce Investment Act, but most of the programs available in my area focused on web development or database administration, most of them would have taken too long to finish, and, at any rate, paying for additional education for someone who already has a Ph.D. was going to be a hard sell. I therefore started to explore resources for teaching myself.

For SQL, I initially picked MySQL, because it's both popular and free, and I chose an online tutorial; however, I then discovered a free SQL course at Big Data University, which employs DB2 Express-C. While this course essentially amounts to self-teaching, it provides a little more structure than the tutorial, especially an exam at the end (you even get a certificate upon passing the exam—for whatever that's worth).

Big Data University also offers a course in Java, but this one is beta, and only about half the lessons are presently available. Still, it may be worth taking a look at. Probably more useful is Oracle's New to Java Programming Center, which features a number of tutorials.

I bought Conway and White's Machine Learning for Hackers, unsurprisingly, to study machine learning (another skill I lack), but the book turned out to be a half-decent means of learning R as well, with instructions for downloading the main application and extensions, and examples for you to follow on your own computer.

I'll let Python wait for a bit, until I've mastered the basics of Java and R. One of my friends gave me a book on the language, but Kevin Sheppard's Introduction to Python for Econometrics, Statistics and Data Analysis, a free ebook, seems more focused on what I need to learn.


Certifications?

Would it be worth it to get certification for one or more languages? I've encountered mixed opinions on the subject: certifications or not, I won't have work experience, and many employers are only impressed by work experience. Certifications certainly can't hurt, but their costs are also non-trivial. Still, I've had a look at the Oracle certifications, particularly the ones for MySQL, and bought the MySQL 5.0 Certification Study Guide. After discussing the subject with my programmer friends, I've concluded that the developer certifications (as opposed to the database administrator ones) would be most appropriate.

If you, the reader, have any good ideas or know of useful learning resources, please comment below—I could benefit from your wisdom, and so could everyone else reading. In the meantime, please visit my "Useful Links" page for more information on the websites I've mentioned here.

Sunday, November 25, 2012

Why Become a Data Scientist?

Having decided teaching was not the right career for me (see my previous post), I began a career transition. I submitted a few applications before my professor job ended in May, mostly with the U.S. government (I figured that private-sector employers would want to hire someone immediately, not wait until I was available). In May, I started in earnest. I haven't had a lot of luck—not surprising given the current economy.

As it happens, my main region of study, Eastern and Central Europe, is not terribly fashionable right now, and, while my other specialization, ethnic politics, is pretty trendy, most of the government and private-sector jobs that would make use of my area and subject knowledge require an active security clearance. I've interviewed for a few jobs in those specialties, but none has panned out. I do however have considerable experience with advanced statistical methods, and also survey methods, and most of the jobs I've applied to would make use of those skills.

A couple of months ago, a friend of mine who, judging by the strange links he sends me (most involving cats), gets paid to browse random stuff on the internet, sent me a link to the infamous Harvard Business Review article that declared the data scientist to be "the sexiest job of the 21st century". He thought it was something that I could do. Judging by the clothes my wife makes me wear when we go out dancing, I'm a reasonably hip guy; I read the job description, and I thought, "Yeah, I could do that." Not only that, but working in a new field to tackle novel problems sounded like a whole lot of fun.

I therefore began to evaluate my strengths and weaknesses as a potential data scientist. Other social scientists considering careers as data scientists may share these.

Strengths:
  1. I'm a trained researcher. Davenport and Patil (the authors of the "sexiest job" article) point out how valuable scientific researchers can be in making sense of big data; articles by Press and Miller make the same point. To put it simply, we're very good at figuring out whether X really causes Y, or whether an apparent relationship between the two is either just dumb luck, or the result of the fact that Z actually causes both X and Y (yes, this is a longer version of the old saw that "correlation isn't causation"). This process involves coming up with good hypotheses, and clever ways to test them, using experimental or quasi-experimental methods.
  2. I know advanced statistical techniques. I am not a methods specialist (to me, stats are just a tool for answering interesting questions), but I've done research using structural equation modeling, time series modeling, and analysis of variance (ANOVA), and I've taught multiple regression and survey methods to undergraduates. I've been exposed to a lot more methods, and, most importantly, I understand how each method, and the assumptions behind it, can and can't be used to draw causal conclusions (see the above point).
  3. I have a decent technical background. I did coding on the job in college (using FORTRAN), I've used scripting languages for statistical packages (Mplus and gretl) as a researcher, and I've manipulated and cleaned survey databases (using SPSS). I've spent a lot of time with computers, even working as a tech support agent for a while, and, for fun, I've even coded in the obscure and not-so-practically-useful OOP language MUSHcode (for my overly detailed thoughts on MUSHcode, click here).
  4. I know psychology. Like most survey researchers, I understand the psychology of asking and responding to questions. Unlike most, I also know a great deal about social identity, which plays an important role in social networks.
  5. I have a great deal of international and intercultural experience. Not only have I conducted research in foreign countries, but the subject of my research has been ethnic politics. Conducting interviews on the topic of intergroup relations has given me experience dealing with sensitive issues.
Weaknesses:
  1. I don't know business database applications. Big data may be all about NoSQL, but the assumption is that everyone knows SQL in the first place. I'm studying MySQL right now.
  2. I don't know modern programming languages. As I mentioned above, I've coded using FORTRAN and even an OOP language, and I've used Pascal as well, but I don't know Java, Python, or R (let alone C++). I'm studying Java and R right now, and planning to tackle Python next. I should add that, based on both past experience and talking with friends in IT, I don't think that learning any of these languages will prove at all difficult.
  3. I don't have business experience. I read (or rather, listen to) the business section of The Economist, and, as a lifelong board and computer gamer, I'm a past master of strategy and resource allocation, but I have neither formal training nor work experience in business.
What's next, then? What I have to do now is to learn the things I don't know, and, in the meantime, convince potential employers that what I do know is valuable, and what I don't know I can learn. This blog will chronicle my efforts to do so, and, in the process, I hope to provide good advice to others in the same position.

Click here to see the useful resources I've found for bringing myself up to speed.

Saturday, November 24, 2012

About Me

This is not really about me. However, the purpose of this blog is to help people like me—social scientists who want to become data scientists—and describing myself and the steps I take to become a data scientist will provide a useful example.

Until recently, I was a tenure-track professor of political science at a small liberal arts college. That meant I was spending most of my time on teaching, though I never stopped doing research; the small size of the department also meant that I was a generalist, teaching subjects outside my own specialty, including research and survey methods (that is, I was the statistics teacher for the college's political science and sociology majors).

In November of 2011, I found out I was going to lose my job after that school year. The normal thing to do in such a situation would have been to find a new academic job; in fact, I wasn't the only one in this situation, and the other professor who was released found a better job. However, I had begun to doubt the wisdom of pursuing an academic career, for two reasons. First of all, I wonder about my ability as a teacher; my student evaluations, especially in the beginning, were not as good as I might have hoped, though my policy of grading hard (and, worse, making students think I would grade even harder than I actually did) probably had a lot to do with this. Evaluations aside, I think, from observing my students' progress over time, that I'm a competent teacher, but I don't think I'm a great teacher, and I want to apply my talents in an area where I'll excel. (Yes, a job at a research university has always been a possibility, but I have ethical concerns about working somewhere that pays lip service to the educational mission, but rewards its faculty for research instead.)

My second concern is the future of higher education in general. Soaring costs, decreasing numbers of tenured professors, and a possible shift to online (and even internationally outsourced) courses make a job as a professor a bad bet, to my mind at least.

In addition, I have simply found the task of grading in particular to be supremely frustrating: students are ill-prepared by public education, and I tie myself in knots trying to help them overcome this, rather than just kicking the can down the road like everyone else (and the students don't necessarily appreciate being told they need to up their games, either).

So where did that leave me? I set out to find a job in the private sector or government, a job that focused more on research. That has proved a lot harder than I expected....

To see the resources I've assembled for the aspiring data scientist, please visit my "Useful Links" page.


Note

I would have linked an article on expected grades and student evaluations, but the subject is so controversial that no single article could give a full picture. Also, the best articles are generally in PDF form and posted on researchers' websites; while such reposting is acceptable under the publication agreements of most academic publishers, these links tend to look suspicious to people checking for copyright violations.