Sunday, July 6, 2014

Programming Languages for Big Data

I'm a big fan of R: it just seems intuitive to me, and there's a package available for practically any type of analysis you might want to do. I have some experience with Octave (essentially the open-source version of proprietary MATLAB) and Python (which I detest, especially with its confusing statements-masquerading-as-function-calls), but I find R easiest to work with.

So I find a new study comparing the speeds of various languages on a statistical problem pretty depressing. When looking at this kind of study, it's important to keep one big thing in mind: the authors tested the various languages on only a single task (albeit a common task, at least in economic modeling), and different languages will have different strengths and weaknesses at different tasks.

Nonetheless, the differences in run time are so large that it's not unreasonable to draw some conclusions. Even when compiled, R takes 240 to 340 times as long to run as C++. How about Python and MATLAB? Python with the default CPython interpreter is nearly as slow as compiled R (155 to 269 times as long as C++), but with PyPy it takes only about 44 times as long. MATLAB takes only about 10 times as long to run as C++, or only about 50% longer when using Mex files (C, C++, or Fortran subroutines called by MATLAB). (Octave has Oct files written in C++, which serve a similar purpose; Octave can use Mex files, but not as well as MATLAB. See the GNU documentation on the subject for details.)
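To get a feel for what those multiples mean in practice, here is a rough sketch that turns the figures quoted above into projected run times. The one-minute C++ job is a made-up baseline, the function and variable names are my own, and the single-number entries (PyPy, MATLAB, Mex) are just the approximate figures from this post repeated as both ends of the range, so treat the output as back-of-the-envelope arithmetic rather than a benchmark.

```python
# Slowdown factors relative to C++ (run time divided by C++ run time),
# taken from the figures quoted in this post. Each entry is (low, high);
# where the post gives a single approximate figure, it is used for both ends.
SLOWDOWN_VS_CPP = {
    "R (compiled)": (240, 340),
    "Python (CPython)": (155, 269),
    "Python (PyPy)": (44, 44),
    "MATLAB": (10, 10),
    "MATLAB (Mex)": (1.5, 1.5),  # "about 50% longer" than C++
}


def projected_runtime(cpp_seconds, language):
    """Return (low, high) projected run time in seconds for a job
    that takes cpp_seconds when written in C++."""
    low, high = SLOWDOWN_VS_CPP[language]
    return cpp_seconds * low, cpp_seconds * high


if __name__ == "__main__":
    cpp_seconds = 60.0  # hypothetical baseline: a job that takes one minute in C++
    for language in SLOWDOWN_VS_CPP:
        low, high = projected_runtime(cpp_seconds, language)
        print(f"{language}: {low / 3600:.2f} to {high / 3600:.2f} hours")
```

On those assumptions, a job that finishes in a minute in C++ would tie up compiled R for several hours, which is the difference between rerunning a model over coffee and rerunning it overnight.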

Wow. The bottom line is that R might not be the best choice for time-consuming applications, in other words, those that have to crunch through a lot of data, especially if the calculations involved are complex. I had read that it's slower than the alternatives, but I had no idea that the differences were so dramatic. I really should polish my Octave skills, and, judging by many of the job ads I see, knowing some C++ would not only open up possibilities for faster-running code, but would also make me more employable.
