Sunday, December 8, 2013

Summarize statistics by Groups in R & R Commander

R is great at accomplishing complex tasks. Doing simple things with R though takes some effort. Consider the simple task of producing summary statistics for continuous variables over some factor variables. Using Stata, I’d write a brief one-liner to get the mean for one or more variables using another variable as a factor. For instance, tabstat Horsepower RPM, by(Type)  in Stata produces the following:

image

The doBy package in R offers similar functionality and more. Of particular interest for those who teach R based statistics courses in the undergraduate programs is the doBy plugin for R Commander. The plugin was developed by Jonathan Lee and it is a great tool for teaching and for quick data analysis. To get the same output as the one listed above, I’d click on the doBy plugin to get the following dialogue box:

image

The dialogue box results in the following simple syntax:

summaryBy(Horsepower+RPM~Type, data=Cars93, FUN=c(mean))

You may first have to load the data set:
data(Cars93, package="MASS")

And the results are presented below:

image

Jonathan has also created GUIs for order by, sample by, and split by within the same plug-in. A must use plug-in for data scientists.

Comparing mnlogit and mlogit for discrete choice models

Earlier this weekend (Dec. 7, 2013), mnlogit was released on CRAN by Wang Zhiyu and Asad Hasan (asad.hasan@sentrana.com) claiming that mnlogit uses “parallel C++ library to achieve fast computation of Hessian matrices”.

Here is a comparison of mnlogit with mlogit by Yves Croissant whose package seems to be the inspiration for mnlogit.

I will estimate the same model using the same data set and will compare the two packages for execution speed, specification flexibility, and ease of use.

Data set

I use the Fish data set to estimate mnlogit and mlogit. mnlogit defines the data set as follows:

A data frame containing :

  • mode - The choice set: beach, pier, boat, and charter
  • price - price for a mode for an individual
  • catch - fish catch rate for a mode for an individual
  • income - monthly income of the individual decision-maker
  • chid - decision maker ID

The authors mention that they have sourced the data from R package mlogit by Yves Croissant, which lists the source as:

  • Herriges, J. A. and C. L. Kling (1999) “Nonlinear Income Effects in Random Utility Models”, Review of Economics and Statistics, 81, 62-72.

Estimation with mnlogit

library(mnlogit)


## Warning: package 'mnlogit' was built under R version 3.0.2


## Package: mnlogit Version: 1.0 Multinomial Logit Choice Models. Scientific
## Computing Group, Sentrana Inc, 2013.



data(Fish, package = "mnlogit")
fm <- formula(mode ~ 1 | income | price + catch)
summary(mnlogit(fm, Fish, "alt"))


## 
## Call:
## mnlogit(formula = fm, data = Fish, choiceVar = "alt")
##
## Frequencies of alternatives in input data:
## beach boat charter pier
## 0.113 0.354 0.382 0.151
##
## Number of observations in training data = 1182
## Number of alternatives = 4
## Intercept turned: ON.
## Number of parameters in model = 14
## # individual specific variables = 2
## # choice specific coeff variables = 2
## # generic coeff variables = 0
##
## Maximum likelihood estimation using Newton-Raphson iterations.
## Number of iterations: 7
## Number of linesearch iterations: 7
## At termination:
## Gradient 2-norm = 8.19688645245397e-05
## Diff between last 2 loglik values = 4.15543581766542e-08
## Stopping reason: Succesive loglik difference < ftol (1e-06).
## Total estimation time (sec): 0.04
## Time for Hessian calculations (sec): 0.04 using 1 processors.
##
## Coefficients :
## Estimate Std.Error t-value Pr(>|t|)
## (Intercept):boat 8.64e-01 3.15e-01 2.74 0.00607 **
## (Intercept):charter 1.85e+00 3.10e-01 5.97 2.4e-09 ***
## (Intercept):pier 1.13e+00 3.05e-01 3.71 0.00021 ***
## income:boat -1.10e-04 6.02e-05 -1.84 0.06636 .
## income:charter -2.78e-04 6.03e-05 -4.61 4.0e-06 ***
## income:pier -1.28e-04 5.33e-05 -2.41 0.01605 *
## price:beach -3.80e-02 3.33e-03 -11.41 < 2e-16 ***
## price:boat -2.09e-02 2.23e-03 -9.33 < 2e-16 ***
## price:charter -1.60e-02 2.02e-03 -7.94 2.0e-15 ***
## price:pier -3.92e-02 3.26e-03 -12.02 < 2e-16 ***
## catch:beach 4.95e+00 8.20e-01 6.04 1.5e-09 ***
## catch:boat 2.47e+00 5.19e-01 4.76 1.9e-06 ***
## catch:charter 7.61e-01 1.52e-01 4.99 6.0e-07 ***
## catch:pier 4.88e+00 8.99e-01 5.43 5.5e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1160, df = 14
## AIC: 13.8875709217033


system.time((mnlogit(fm, Fish, "alt")))


##    user  system elapsed 
## 0.11 0.00 0.11


Estimation with mlogit



I estimate the same model using mlogit.



library(mlogit)


## Loading required package: Formula Loading required package: statmod
## Loading required package: lmtest Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following object is masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: maxLik Loading required package: miscTools
## Loading required package: MASS



data2 <- mlogit.data(Fish, choice = "alt", shape = "long", id.var = "chid"
,
alt.levels
= c("beach", "boat", "charter", "pier"))
summary(mod1 <- mlogit(mode ~ 1 | income | price + catch, data2, reflevel = "beach"))


## 
## Call:
## mlogit(formula = mode ~ 1 | income | price + catch, data = data2,
## reflevel = "beach", method = "nr", print.level = 0)
##
## Frequencies of alternatives:
## beach boat charter pier
## 0.113 0.354 0.382 0.151
##
## nr method
## 7 iterations, 0h:0m:0s
## g'(-H)^-1g = 8.31E-08
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## boat:(intercept) 8.64e-01 3.15e-01 2.74 0.00607 **
## charter:(intercept) 1.85e+00 3.10e-01 5.97 2.4e-09 ***
## pier:(intercept) 1.13e+00 3.05e-01 3.71 0.00021 ***
## boat:income -1.10e-04 6.02e-05 -1.84 0.06636 .
## charter:income -2.78e-04 6.03e-05 -4.61 4.0e-06 ***
## pier:income -1.28e-04 5.33e-05 -2.41 0.01605 *
## beach:price -3.80e-02 3.33e-03 -11.41 < 2e-16 ***
## boat:price -2.09e-02 2.23e-03 -9.33 < 2e-16 ***
## charter:price -1.60e-02 2.02e-03 -7.94 2.0e-15 ***
## pier:price -3.92e-02 3.26e-03 -12.02 < 2e-16 ***
## beach:catch 4.95e+00 8.20e-01 6.04 1.5e-09 ***
## boat:catch 2.47e+00 5.19e-01 4.76 1.9e-06 ***
## charter:catch 7.61e-01 1.52e-01 4.99 6.0e-07 ***
## pier:catch 4.88e+00 8.99e-01 5.43 5.5e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1160
## McFadden R^2: 0.225
## Likelihood ratio test : chisq = 675 (p.value = <2e-16)


system.time(mlogit(mode ~ 1 | income | price + catch, data2, reflevel = "beach"))


##    user  system elapsed 
## 0.27 0.02 0.28


Speed Tests



I conducted a simple test for execution times with the command system.time. The results are reported above after each model summary.



Findings



  • I obtain identical results for the models estimated with mnlogit and mlogit.


  • Estimation speeds appear faster for mnlogit.


  • caveat: The same command (system.time) when run independently of the R Markdown environment shows mlogit to be faster than mnlogit!


  • Additional Comments: I restarted RStudio and estimated the two models again outside of R Markdown. mnlogit took  0.12 seconds versus 0.31 seconds for mlogit.


  • Verdict: mnlogit reports shorter execution times than mlogit.


  • Also, Estimation speeds may differ with large and complex data sets.


  • The model specification is simpler in mnlogit.


  • The commands for specifying the data set and the model seem easier in mnlogit.


  • mlogit syntax seems relatively complex, but offers more choices in model specification.

    Final comment



    I am delighted to see yet another package for discrete choice models. As the field evolves and more advanced problems are subjected to discrete choice models, the need for robust and computationally efficient packages will be felt more. Both mlogit and mnlogit are indeed valuable to those interested in how humans make good and bad choices!