Monday, November 3, 2014

R: What to do when help doesn't arrive

R is great when it works, and not so much when it doesn’t. Specifically, this becomes a concern when a package’s options are not fully illustrated in the accompanying help documentation, and the package’s authors don’t respond (in time).

I am not suggesting that package authors should respond to every email they receive. My request is that the documentation be complete enough that the authors’ help is no longer required on a day-to-day basis.

Recently, a colleague in the US and I became interested in the mlogit package. We wanted to use the weights option in the package. Just like most other packages, mlogit does not illustrate how to use weights, but advises that the option is available. We assumed that the weights would work in a certain way (see page 26 of the hyperlinked document). However, when I estimated the model with weights, mlogit did not replicate the results from a popular textbook on econometrics. Here are the details.

We wanted to see whether the weights option could be used in an alternative-specific logit formulation when the sampled data do not conform to the market shares observed in the underlying population. For instance, in a travel choice model, one may be tempted to over-sample train commuters and under-sample car commuters, because car commuters often far outnumber train commuters for inter-city travel in the underlying population. This is true for most of Canada and the US. In such circumstances, we would weight the data set so that the estimated model reproduces the population market shares rather than the sample shares.
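As a concrete illustration of the weighting scheme we had in mind, here is a minimal R sketch of how choice-based weights are typically constructed: each observation is weighted by the ratio of its chosen alternative's population share to its sample share. The shares and modes below are made up for illustration; this is not code from any package's documentation.

```r
# Choice-based weights: population share / sample share per alternative.
# The shares below are hypothetical, chosen only to illustrate the idea.
pop_share  <- c(car = 0.80, train = 0.20)  # assumed population market shares
samp_share <- c(car = 0.50, train = 0.50)  # assumed shares in the (choice-based) sample
wt <- pop_share / samp_share               # per-alternative weight

# attach a weight to each observation according to its chosen mode
chosen  <- c("car", "train", "train", "car")
weights <- wt[chosen]
```

In principle, a vector like `weights` is what one would pass to an estimation routine's weights argument; whether mlogit interprets its weights option this way is precisely the question its documentation leaves open.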

The commercially available software NLogit/LimDep can do this with ease. I wanted to replicate the results for choice-based weights for the conditional logit model in Professor Bill Greene's book, Econometric Analysis. This is illustrated on page 853 of the 6th edition of the text, where Table 23.24 presents the parameter estimates for a conditional (McFadden) logit model in un-weighted and choice-based weighted versions. I replicated the results using NLogit by simply adding the population market shares to the two-line syntax. However, the results generated by the mlogit package bear no resemblance to the ones listed in Econometric Analysis.

It turns out that Stata is also limited in the way it handles weights for its two estimation options, asclogit and clogit. I know this because colleagues at Stata were quite diligent in responding to my requests. It’s not the same with mlogit, which may or may not be able to handle the weights. We will only know when the author responds.

I am recommending that individual authors should not bear sole responsibility for supporting R packages. An individual could be ill, busy, or unavailable for a variety of reasons. This limitation could be dealt with proactively if the R community collectively generated help documentation with detailed worked-out examples of all available options (including weights), not just the few frequently used ones.

Improving documentation will be key to helping R branch out to the everyday users of statistical analysis. The tech-savvy can iron out the kinks; they have the curiosity, patience, and time on their hands. The rest of the world is not that fortunate.

I propose that users of the packages, and not just the authors, collaborate to generate help documentation as vignettes and YouTube videos. This will do more to popularize R than another 6,000 new packages that few may know how to work with.

Sunday, December 8, 2013

Summarize statistics by Groups in R & R Commander

R is great at accomplishing complex tasks. Doing simple things with R, though, takes some effort. Consider the simple task of producing summary statistics for continuous variables over some factor variables. Using Stata, I’d write a brief one-liner to get the mean for one or more variables using another variable as a factor. For instance, tabstat Horsepower RPM, by(Type) in Stata produces the following:


The doBy package in R offers similar functionality and more. Of particular interest to those who teach R-based statistics courses in undergraduate programs is the doBy plugin for R Commander. The plugin, developed by Jonathan Lee, is a great tool for teaching and for quick data analysis. To get the same output as the one listed above, I’d click on the doBy plugin to get the following dialogue box:


The dialogue box results in the following simple syntax:

summaryBy(Horsepower+RPM~Type, data=Cars93, FUN=c(mean))

You may first have to load the data set:
data(Cars93, package="MASS")

And the results are presented below:


Jonathan has also created GUIs for order by, sample by, and split by within the same plug-in. A must-use plug-in for data scientists.
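For readers without R Commander or the plugin, base R's aggregate can produce a rough equivalent of the grouped means. A small sketch using the same Cars93 data from MASS (no doBy required):

```r
# load the Cars93 data shipped with the MASS package
data(Cars93, package = "MASS")

# mean Horsepower and RPM by vehicle Type, in base R
res <- aggregate(cbind(Horsepower, RPM) ~ Type, data = Cars93, FUN = mean)
res
```

The doBy call in the post remains the more flexible route, since summaryBy accepts several functions at once via FUN = c(mean, sd) and friends.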

Comparing mnlogit and mlogit for discrete choice models

Earlier this weekend (Dec. 7, 2013), mnlogit was released on CRAN by Wang Zhiyu and Asad Hasan, who claim that mnlogit uses a “parallel C++ library to achieve fast computation of Hessian matrices”.

Here is a comparison of mnlogit with Yves Croissant's mlogit, which seems to be the inspiration for mnlogit.

I will estimate the same model using the same data set and will compare the two packages for execution speed, specification flexibility, and ease of use.

Data set

I use the Fish data set to estimate the same model with mnlogit and mlogit. The mnlogit documentation defines the data set as follows:

A data frame containing:

  • mode - The choice set: beach, pier, boat, and charter
  • price - price for a mode for an individual
  • catch - fish catch rate for a mode for an individual
  • income - monthly income of the individual decision-maker
  • chid - decision maker ID

The authors mention that they have sourced the data from the R package mlogit by Yves Croissant, which lists the source as:

  • Herriges, J. A. and C. L. Kling (1999) “Nonlinear Income Effects in Random Utility Models”, Review of Economics and Statistics, 81, 62-72.

Estimation with mnlogit


## Warning: package 'mnlogit' was built under R version 3.0.2

## Package: mnlogit Version: 1.0 Multinomial Logit Choice Models. Scientific
## Computing Group, Sentrana Inc, 2013.

data(Fish, package = "mnlogit")
fm <- formula(mode ~ 1 | income | price + catch)
summary(mnlogit(fm, Fish, "alt"))

## Call:
## mnlogit(formula = fm, data = Fish, choiceVar = "alt")
## Frequencies of alternatives in input data:
## beach boat charter pier
## 0.113 0.354 0.382 0.151
## Number of observations in training data = 1182
## Number of alternatives = 4
## Intercept turned: ON.
## Number of parameters in model = 14
## # individual specific variables = 2
## # choice specific coeff variables = 2
## # generic coeff variables = 0
## Maximum likelihood estimation using Newton-Raphson iterations.
## Number of iterations: 7
## Number of linesearch iterations: 7
## At termination:
## Gradient 2-norm = 8.19688645245397e-05
## Diff between last 2 loglik values = 4.15543581766542e-08
## Stopping reason: Succesive loglik difference < ftol (1e-06).
## Total estimation time (sec): 0.04
## Time for Hessian calculations (sec): 0.04 using 1 processors.
## Coefficients :
## Estimate Std.Error t-value Pr(>|t|)
## (Intercept):boat 8.64e-01 3.15e-01 2.74 0.00607 **
## (Intercept):charter 1.85e+00 3.10e-01 5.97 2.4e-09 ***
## (Intercept):pier 1.13e+00 3.05e-01 3.71 0.00021 ***
## income:boat -1.10e-04 6.02e-05 -1.84 0.06636 .
## income:charter -2.78e-04 6.03e-05 -4.61 4.0e-06 ***
## income:pier -1.28e-04 5.33e-05 -2.41 0.01605 *
## price:beach -3.80e-02 3.33e-03 -11.41 < 2e-16 ***
## price:boat -2.09e-02 2.23e-03 -9.33 < 2e-16 ***
## price:charter -1.60e-02 2.02e-03 -7.94 2.0e-15 ***
## price:pier -3.92e-02 3.26e-03 -12.02 < 2e-16 ***
## catch:beach 4.95e+00 8.20e-01 6.04 1.5e-09 ***
## catch:boat 2.47e+00 5.19e-01 4.76 1.9e-06 ***
## catch:charter 7.61e-01 1.52e-01 4.99 6.0e-07 ***
## catch:pier 4.88e+00 8.99e-01 5.43 5.5e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Log-Likelihood: -1160, df = 14
## AIC: 13.8875709217033

system.time((mnlogit(fm, Fish, "alt")))

##    user  system elapsed 
## 0.11 0.00 0.11

Estimation with mlogit

I estimate the same model using mlogit.


## Loading required package: Formula Loading required package: statmod
## Loading required package: lmtest Loading required package: zoo
## Attaching package: 'zoo'
## The following object is masked from 'package:base':
## as.Date, as.Date.numeric
## Loading required package: maxLik Loading required package: miscTools
## Loading required package: MASS

data2 <- = Fish, choice = "mode", shape = "long", id.var = "chid",
                   alt.levels = c("beach", "boat", "charter", "pier"))
summary(mod1 <- mlogit(mode ~ 1 | income | price + catch, data2, reflevel = "beach"))

## Call:
## mlogit(formula = mode ~ 1 | income | price + catch, data = data2,
## reflevel = "beach", method = "nr", print.level = 0)
## Frequencies of alternatives:
## beach boat charter pier
## 0.113 0.354 0.382 0.151
## nr method
## 7 iterations, 0h:0m:0s
## g'(-H)^-1g = 8.31E-08
## gradient close to zero
## Coefficients :
## Estimate Std. Error t-value Pr(>|t|)
## boat:(intercept) 8.64e-01 3.15e-01 2.74 0.00607 **
## charter:(intercept) 1.85e+00 3.10e-01 5.97 2.4e-09 ***
## pier:(intercept) 1.13e+00 3.05e-01 3.71 0.00021 ***
## boat:income -1.10e-04 6.02e-05 -1.84 0.06636 .
## charter:income -2.78e-04 6.03e-05 -4.61 4.0e-06 ***
## pier:income -1.28e-04 5.33e-05 -2.41 0.01605 *
## beach:price -3.80e-02 3.33e-03 -11.41 < 2e-16 ***
## boat:price -2.09e-02 2.23e-03 -9.33 < 2e-16 ***
## charter:price -1.60e-02 2.02e-03 -7.94 2.0e-15 ***
## pier:price -3.92e-02 3.26e-03 -12.02 < 2e-16 ***
## beach:catch 4.95e+00 8.20e-01 6.04 1.5e-09 ***
## boat:catch 2.47e+00 5.19e-01 4.76 1.9e-06 ***
## charter:catch 7.61e-01 1.52e-01 4.99 6.0e-07 ***
## pier:catch 4.88e+00 8.99e-01 5.43 5.5e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Log-Likelihood: -1160
## McFadden R^2: 0.225
## Likelihood ratio test : chisq = 675 (p.value = <2e-16)

system.time(mlogit(mode ~ 1 | income | price + catch, data2, reflevel = "beach"))

##    user  system elapsed 
## 0.27 0.02 0.28

Speed Tests

I conducted a simple test for execution times with the command system.time. The results are reported above after each model summary.


  • I obtain identical parameter estimates from the models estimated with mnlogit and mlogit.

  • Estimation appears faster with mnlogit.

  • Caveat: the same command (system.time), when run independently of the R Markdown environment, showed mlogit to be faster than mnlogit!

  • Additional comment: I restarted RStudio and estimated the two models again outside of R Markdown. mnlogit took 0.12 seconds versus 0.31 seconds for mlogit.

  • Verdict: mnlogit reports shorter execution times than mlogit.

  • Estimation speeds may differ with larger and more complex data sets.

  • Model specification is simpler in mnlogit.

  • The commands for specifying the data set and the model are easier in mnlogit.

  • mlogit's syntax is relatively complex, but it offers more choices in model specification.
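Given how noisy a single system.time call proved to be, a fairer comparison would average the elapsed time over several runs. A minimal base-R sketch; time_avg is my own helper, not part of either package:

```r
# Average the elapsed time of several runs of a zero-argument function.
# A single system.time() call is noisy; averaging smooths out that noise.
time_avg <- function(f, reps = 5) {
  mean(replicate(reps, system.time(f())["elapsed"]))
}
```

One would then compare, say, time_avg(function() mnlogit(fm, Fish, "alt")) against the corresponding mlogit call.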

Final comment

I am delighted to see yet another package for discrete choice models. As the field evolves and more advanced problems are subjected to discrete choice models, the need for robust and computationally efficient packages will be felt more acutely. Both mlogit and mnlogit are valuable to those interested in how humans make good and bad choices!

Wednesday, May 22, 2013

What happened to six million voters?

The recent elections in Pakistan on May 11 were by all measures a great success. In spite of threats of violence from Al-Qaeda and its local franchises in Pakistan against those who would vote, millions of Pakistanis stepped out to vote for an elected government. The Election Commission of Pakistan (ECP) claimed a voter turnout of 60%.

One would have hoped to see 50.5 million votes polled for a 60% turnout by the 84.2 million registered voters in the 262 ridings of the National Assembly for which the ECP reported results. However, the ECP’s own data reported 44.9 million votes, a gap of approximately 5.7 million votes. The actual turnout was thus closer to 53%.

I used R to scrape the data for the 262 ridings, which the ECP reported on separate web pages. The R code is presented below.


library(XML)

# Get the URL prefix (the actual ECP address was elided in the original post)
url1 <- "http://..."

# loop through the 272 ridings
for (i in 1:272) {
  # get the riding number
  u2 <- i
  # complete the URL address
  url2 <- paste0(url1, u2)
  # read the results table (the eighth table on the page)
  ridedata <- readHTMLTable(url2, header = TRUE, which = 8, stringsAsFactors = FALSE)
  # read the HTML page
  web_page <- readLines(url2)
  # pull out the appropriate line with the riding name using the identifier "Specialheading"
  ridename <- web_page[grep("Specialheading", web_page)]
  # get the starting position of the riding name
  startx <- regexpr("(", ridename, fixed = TRUE)
  # get the position just past the riding name
  endx <- regexpr("<span", ridename)
  # generate the riding name
  ridename <- substr(ridename, startx, endx - 1)
  # store each riding's data in its own object, tagged with riding number and name
  assign(paste0("fname", u2), cbind(ridedata, riding = i, rname = ridename))
}

I used a simple rbind command to assemble the data into one large file after first storing each riding's data in a separate object. I did this because the server timed out several times during execution; storing each riding separately allowed me to restart from the riding where the system failed rather than from the beginning each time.
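The reassembly step can be sketched as follows, with two mock per-riding data frames standing in for the actual objects created by the loop above:

```r
# mock versions of the per-riding objects created in the scraping loop
fname1 <- data.frame(votes = c(10, 20), riding = 1)
fname2 <- data.frame(votes = c(30, 40), riding = 2)

# collect all fnameN objects by name and stack them into one data frame
all_ridings <-"rbind", mget(paste0("fname", 1:2)))
```

With the real data the sequence would run over all scraped ridings rather than 1:2; mget fetches each object by its constructed name, and supplies the resulting list to rbind in one call.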

Tuesday, April 9, 2013

Second edition of Crawley's The R Book

The second edition of Michael Crawley's The R Book is available from Wiley. According to the publisher, the new edition boasts the following features:

• "Features full colour text and extensive graphics throughout.
• Introduces a clear structure with numbered section headings to help readers locate information more efficiently.
• Looks at the evolution of R over the past five years.
• Features a new chapter on Bayesian Analysis and Meta-Analysis.
• Presents a fully revised and updated bibliography and reference section.
• Is supported by an accompanying website allowing examples from the text to be run by the user."

At 1,076 pages, this continues to be the most comprehensive text on R.

Wednesday, February 27, 2013

Workshops on Modelling Choices using R in Toronto

Making choices is inherently human. We choose between brands of cereal or amongst candidates in an election. At times, choices may be influenced by the characteristics of the decision-maker, such as age, income, and sex. Choices may also be influenced by the attributes of competing alternatives, such as the cost of travelling between two cities by air or rail. At other times, choices are influenced by both.

Analyzing choices can be tricky. Practitioners and researchers have developed numerous statistical techniques to analyze and model choices. This workshop will offer applied, hands-on training in analyzing choices.

The workshops will be offered in two sessions. The first session will focus on binary (yes/no) choices and introduce the basic assumptions of choice analysis. It will provide hands-on training in exploratory data analysis. The second session will focus on advanced topics in choice modelling, including multiple (multinomial) choices, elasticities, and estimating market shares.

Participants are expected to bring their own laptops. Basic concepts will be illustrated in SPSS, Stata, and R.

Title: Workshop on Modelling Choices

Dates and Time: Session One - Friday, March 22, 2013 (2pm-5pm)

Session Two - Friday, March 29, 2013 (2pm-5pm)

Instructor: Murtaza Haider, Ph.D.

Location: Ted Rogers School of Management, Ryerson University, 55 Dundas Street West, Room 3-119, Toronto M5G 2C3

Registration fee: The workshop is sponsored by the Dean’s office at the Ted Rogers School of Management and is offered free of cost to the Ryerson community.

Please RSVP by emailing

Registration will be restricted to 25 participants.

Monday, February 4, 2013

Help needed with sample selection biases

We are searching for a graduate student to assist us on a very short assignment about sample selection biases and Heckman probit models. The help is not needed for estimating the models, but rather for reviewing the scenarios in which the use of such models is theoretically appropriate or otherwise. For instance, we are particularly interested in determining whether Heckman-probit-type models can be applied in situations where the response variable has a don’t know/refused option, which has been used for the selection equation in some published research. We seek help in understanding the assumptions in the model that would permit or restrict the use of the Heckman probit model in such circumstances.

If interested, please email Murtaza Haider at