Monday, April 11, 2011

Speeding tickets for R and Stata

How fast is R? Does it execute routines as quickly as other off-the-shelf software, such as Stata? After some comparative experimentation, I found Stata to be 5 to 8 times faster than R.

For me, speed has not been a concern in the past. I had used R with smaller data sets of roughly 5,000 to 10,000 observations and found it to be as fast as other statistical software. More recently, I have been working with a still relatively small data set of 63,122 observations. After realizing that R was very slow in executing the built-in routines for multinomial and ordinal logit models, I ran similar models in Stata on the same data set and found Stata to be much faster.

Before I go any further, I must confess that I did not try to find ways to improve speed in R by, for instance, choosing faster-converging algorithms. I hope readers will send me comments on how to speed up execution of the routines I tested in R.

My data set comprised an ordinal dependent variable (5 categories) and categorical explanatory variables, with 63,122 observations. I used a computer running Windows 7 on an Intel Core 2 Quad CPU Q9300 @ 2.5 GHz with 8 GB of RAM. Further details about the test are listed in the following table.

Model                    Stata 11 (duo core)    R (2.12.0)

Multinomial logit        mlogit, 9.06 sec       multinom, 50.59 sec
                                                zelig (mlogit), 77.89 sec
                                                VGLM (multinomial), 64.4 sec
Proportional odds model  ologit, 1.69 sec       VGLM (parallel = T), 16.26 sec
                                                polr, 22.62 sec
Generalized logit        gologit2, 18.67 sec    VGLM (parallel = F), 64.71 sec

I first estimated the standard multinomial logit model in R using the multinom routine. R took almost 51 seconds to return the results. The subsequent call to summarise the model took another 52.29 seconds, bringing the total execution time in R to about 103 seconds. Surprised at the slow speed, I tried other options in R to estimate the same model. I first tested the mlogit option in Zelig, which took even longer at 78 seconds. I followed up with the VGAM package, which at 64.4 seconds was faster than Zelig but still slower than multinom.

The other examples listed above show similarly slower times for R in comparison with Stata.
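The timings in the table can be turned into ratios by dividing each R time by the Stata time for the same model. The short sketch below does that arithmetic using only the numbers reported in the table. It is written in Python purely for illustration, and the names `stata`, `r_timings`, and `slowdown_ratios` are made up for this example.

```python
# Timings in seconds, copied from the table in this post.
stata = {
    "multinomial logit": 9.06,   # mlogit
    "proportional odds": 1.69,   # ologit
    "generalized logit": 18.67,  # gologit2
}
r_timings = {
    "multinomial logit": {
        "multinom": 50.59,
        "zelig (mlogit)": 77.89,
        "VGLM (multinomial)": 64.4,
    },
    "proportional odds": {"VGLM (parallel = T)": 16.26, "polr": 22.62},
    "generalized logit": {"VGLM (parallel = F)": 64.71},
}


def slowdown_ratios(stata_times, r_times):
    """Return {(model, r_routine): R time divided by Stata time}."""
    return {
        (model, routine): seconds / stata_times[model]
        for model, routines in r_times.items()
        for routine, seconds in routines.items()
    }


ratios = slowdown_ratios(stata, r_timings)

# Print the ratios from smallest to largest.
for (model, routine), ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{model:18s} {routine:22s} {ratio:5.1f}x")
```

On these numbers the ratios run from roughly 3.5 (generalized logit) to roughly 13 (polr against ologit), with the three multinomial logit routines falling between about 5.6 and 8.6.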

What could explain a speed difference approaching an order of magnitude between R and Stata? Unfortunately, I don't have the answer. I do know that Revolution Analytics publishes similar benchmark comparisons between its souped-up version of R (Revolution R) and the generic R. In those comparisons, Revolution R was found to be five to eight times faster than regular R.

Other performance benchmarks revealed even greater speed differentials between Revolution R and the generic R.

There must be ways to make routines execute faster in R. A few weeks ago, Professor John Fox (a long-time contributor to R and the programmer of the R GUI, R Commander) delivered a guest lecture at the GTA R Users' Group meeting, held at the Ted Rogers School of Management in Toronto. His talk focussed on how to program in R, using the binary logit model as an example. His code for the binary logit was found to be much faster than the implementation in R's built-in glm routine.
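For readers who have not seen what hand-coding a binary logit involves, here is a minimal sketch of Newton-Raphson maximum likelihood estimation for a model with an intercept and a single predictor. It is not Professor Fox's code, which is not reproduced in this post. It is written in Python rather than R, the data are made up, and the function name `fit_binary_logit` and its parameters are hypothetical.

```python
import math


def fit_binary_logit(x, y, tol=1e-10, max_iter=50):
    """Newton-Raphson MLE for P(y = 1) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        # Accumulate the score (gradient of the log-likelihood) and the
        # information matrix (negative Hessian) over all observations.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (information * step = score) by Cramer's rule.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            return b0, b1, iteration
    raise RuntimeError("Newton-Raphson did not converge")


# Made-up illustrative data (not from the post). The 0s and 1s overlap
# along x, so the maximum likelihood estimate is finite.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1, n_iter = fit_binary_logit(x, y)
print(f"intercept={b0:.4f} slope={b1:.4f} iterations={n_iter}")
```

Each iteration builds the score and the information matrix and solves one small linear system. For the logit link this is equivalent to the iteratively reweighted least squares scheme that glm uses, so any speed gain from a hand-written version would have to come from skipping the general-purpose bookkeeping rather than from a different algorithm.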

This makes me wonder: are there ways to make the generic R run faster?

8 comments:

  1. Did you try to run R's mlogit package without zelig in your Multinomial Logit comparison (http://cran.r-project.org/web/packages/mlogit/index.html)? Maybe zelig has a negative impact.

    Best,
    Marcel Gerds

  2. The bulk of the speed gains in Revolution R are obtained by using the Intel MKL library, which offers the typical BLAS and LAPACK functions but is highly optimized and multithreaded. I could not find the info in my quick Google search, but I'm quite confident that Stata also ships with its own optimized math library, which should also be multithreaded (hence the "duo core" in the software title). You should get considerably better benchmark results if you link your version of R to a library that is optimized for your system. I don't know if it will beat Stata, but it will almost certainly bring you closer.

  3. Interesting! Both R and Stata keep their data in random access memory. Have you compared the two on how much data they can analyze? I read somewhere that Stata was extremely efficient in the way it handles memory. Ross Ihaka mentioned in a talk at JSM that R's lm function made several copies of the data as it worked. It would be interesting to see if they really differed much in this regard. Cheers, Bob

  4. Although the focus of the R core team is not on performance optimization, I believe that the speed of the two languages should not be too different. The difference in performance on specific functions is not a feature of the language itself, but of the underlying numerical routines, which are usually written in C or F77. Their performance is not primarily driven by BLAS and LAPACK implementations, provided the latter are compiled with the right flags. Once this is done (yielding a speedup of 3x on matrix functions), ATLAS and MKL can differ by 20%, not 500%. Most likely the main cause of the remaining gap in performance is the algorithm for maximum likelihood optimization. There can be dramatic speedups (even 20x) from choosing the right algorithm and customizing it for the function at hand. I think improving the performance of the glm solver would be an excellent "summer of code" project. There is definitely room for improvement there.

  5. Received anonymously:

    You might try again with an optimized BLAS from <http://prs.ism.ac.jp/~nakama/SurviveGotoBLAS2/binary/>.

  6. @rara avis

    Seems like the whole point is that the standard R installation is not compiled with all the right flags. This makes the single download more portable to different system configurations, but has a performance cost. My assumption was that the original poster had used a vanilla R install from the r-project website. This suggests that his math libraries were not multithreaded, or not as optimized as the Intel MKL, or that not all the optimization flags were used. Taken together, these factors probably explain a good chunk of the disparity between results, no?

  7. Care must be taken when claiming that one software package is faster than another in general. You claim that Stata is 5-8 times faster than R in GENERAL, when the actual finding is that it is 5-8 times faster at one specific task, i.e. fitting multinomial models.

  8. Would be nice to see the output. Maybe Stata is working to a lower precision.
