Sunday, July 6, 2014

Programming Languages for Big Data

I'm a big fan of R: it just seems intuitive to me, and there's a package available for practically any type of analysis you might want to do. I have some experience with Octave (essentially the open-source version of proprietary MATLAB) and Python (which I detest, especially with its confusing statements-masquerading-as-function-calls), but I find R easiest to work with.

Therefore, I find a new study comparing the speeds of various languages on a statistical problem pretty depressing. When looking at this kind of study, it's important to keep one big caveat in mind: the authors tested the languages on only a single task (albeit a common one, at least in economic modeling), and different languages have different strengths and weaknesses at different tasks.

Nonetheless, the differences in run time are so large that it's not unreasonable to draw some conclusions. Even when compiled, R takes 240 to 340 times as long to run as C++. How about Python and MATLAB? Python with the default CPython interpreter is nearly as slow as compiled R (155 to 269 times as long as C++), but with PyPy it takes only about 44 times as long, that is, it reaches 1/44 of the speed of C++. MATLAB takes only about 10 times as long to run as C++, or only about 50% longer when using Mex files (C, C++, or Fortran subroutines called by MATLAB). (Octave has Oct files written in C++, which serve a similar purpose; Octave can use Mex files, but not as well as MATLAB. See the GNU documentation on the subject for details.)
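To get a feel for what those multiples mean in practice, here is a rough sketch that scales a hypothetical C++ job by the figures quoted above. The 60-second baseline and the function names are my own assumptions for illustration, not something from the study, and the ratios come from a single task, so treat the output as a back-of-the-envelope estimate only.

```python
# Run-time multiples relative to C++ (C++ = 1.0), as quoted in this post.
# Ranges are (low, high); single figures simply repeat the value.
slowdown = {
    "R (compiled)": (240.0, 340.0),
    "Python (CPython)": (155.0, 269.0),
    "Python (PyPy)": (44.0, 44.0),
    "MATLAB": (10.0, 10.0),
    "MATLAB with Mex files": (1.5, 1.5),  # "about 50% longer"
}


def runtime_range(cpp_seconds, language):
    """Estimated (low, high) run time in seconds for a job that takes
    cpp_seconds in C++, assuming the quoted multiples carry over."""
    low, high = slowdown[language]
    return (cpp_seconds * low, cpp_seconds * high)


def relative_speed(language):
    """Speed as a fraction of C++ speed: (slowest case, fastest case)."""
    low, high = slowdown[language]
    return (1.0 / high, 1.0 / low)


if __name__ == "__main__":
    cpp_seconds = 60.0  # hypothetical baseline: a one-minute C++ job
    for name in slowdown:
        low, high = runtime_range(cpp_seconds, name)
        print(f"{name}: {low / 60:.1f} to {high / 60:.1f} minutes")
```

With that one-minute baseline, the compiled R figures work out to somewhere between four and nearly six hours, which is what makes the gap hard to ignore.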

Wow. The bottom line is that R might not be the best choice for time-consuming applications: in other words, those that have to crunch through a lot of data, especially if the calculations involved are complex. I had read that it's slower than the alternatives, but I had no idea that the differences were so dramatic. I really should polish my Octave skills, and, judging by many of the job ads I see, knowing some C++ would not only open up possibilities for faster-running code, but would also make me more employable.

