Friday, July 11, 2014

Programming Languages for Big Data, Part 2

I mentioned the recent study on the relative speeds of programming languages to Tommy Jones, a specialist in natural language processing and fellow member of the Data Community DC, and he, being more industrious than I, dove into the code used by the authors of the paper in question. In their R code, he found gems such as a triple-nested "for" loop inside a "while" loop (instead of the much faster "apply" functions), which made the comparisons pretty useless, at least in the case of R. See Tommy's blog, Biased Estimates, for more details.
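To give a sense of what Tommy means, here's a small, purely hypothetical R example (not the authors' actual code) comparing an explicit nested loop with a vectorized equivalent; the function and grid size are made up for illustration:

    # Sum sin(i) * cos(j) over an n-by-n grid, once with nested "for" loops
    # and once with a single vectorized expression.
    n <- 2000

    # Loop version: the interpreter evaluates the body n^2 times
    loop_sum <- function(n) {
      total <- 0
      for (i in 1:n) {
        for (j in 1:n) {
          total <- total + sin(i) * cos(j)
        }
      }
      total
    }

    # Vectorized version: the same sum via whole-vector operations
    vec_sum <- function(n) {
      sum(outer(sin(1:n), cos(1:n)))
    }

    system.time(loop_sum(n))  # the interpreted double loop runs far slower
    system.time(vec_sum(n))   # than the vectorized version on the same machine

Both functions return the same number; the only difference is how much work is left to R's interpreter, which is exactly the kind of thing that can sink a language in a benchmark.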

Nonetheless, it's a pretty interesting question, and I'd love to see someone who's proficient in all of the languages involved try this test again with better code. I'm still intrigued by the very high speed of MATLAB/Octave, which is what leads Andrew Ng to recommend those languages over R for prototyping, though Tommy pointed out to me that R is closer to being a full-featured programming language and is therefore more flexible than MATLAB or Octave.

Sunday, July 6, 2014

Programming Languages for Big Data

I'm a big fan of R: it just seems intuitive to me, and there's a package available for practically any type of analysis you might want to do. I have some experience with Octave (essentially the open-source counterpart of the proprietary MATLAB) and Python (which I detest, especially its confusing statements masquerading as function calls), but I find R the easiest to work with.

Therefore, I find a new study comparing the speeds of various languages on a statistical problem pretty depressing. When looking at this kind of study, it's important to keep one big thing in mind: the authors tested the languages on only a single task (albeit a common one, at least in economic modeling), and different languages have different strengths and weaknesses at different tasks.

Nonetheless, the differences in run time are so large that it's not unreasonable to draw some conclusions. Even when compiled, R takes 240 to 340 times as long to run as C++. How about Python and MATLAB? Python under the default CPython interpreter is nearly as slow as compiled R (155 to 269 times), but with PyPy it reaches about 1/44 of the speed of C++. MATLAB takes only about 10 times as long to run as C++, or only about 50% longer when using Mex files (C, C++, or Fortran subroutines called from MATLAB). (Octave has Oct files written in C++, which serve a similar purpose; Octave can use Mex files, but not as well as MATLAB does. See the GNU Octave documentation on the subject for details.)
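I take the "compiled" R in those figures to mean byte-compiling an R function with the base compiler package. A minimal sketch of how that works, using a placeholder function rather than the paper's model:

    library(compiler)

    # Placeholder function standing in for whatever needs speeding up
    slow_fn <- function(x) {
      total <- 0
      for (i in seq_along(x)) total <- total + x[i]^2
      total
    }

    # Byte-compile it; the result is called exactly like the original
    fast_fn <- cmpfun(slow_fn)

    x <- runif(1e6)
    system.time(slow_fn(x))
    system.time(fast_fn(x))

Byte compilation usually helps, but, judging by the study's numbers, it doesn't come close to closing the gap with C++.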

Wow. The bottom line is that R might not be the best choice for time-consuming applications, that is, those that have to crunch through a lot of data, especially if the calculations involved are complex. I had read that it's slower than the alternatives, but I had no idea that the differences were so dramatic. I really should polish my Octave skills, and, judging by many of the job ads I see, knowing some C++ would not only open up possibilities for faster-running code but would also make me more employable.