Monday, November 3, 2014

R: What to do when help doesn't arrive

R is great when it works. Not so much, when it doesn’t. Specifically, this becomes a concern when the packages are not fully illustrated in the accompanied help documentation, and the author/s of the package don’t respond (in time).

I am not suggesting that the package authors should respond to every email that they receive. My request is that the documentation should be complete enough so that the authors’ help is no longer required on a day-to-day basis.

Recently, a colleague in the US and I became interested in the mlogit package. We wanted to use the weights option in the package. Just like most other packages, mlogit does not illustrate how to use weights, but advises that the option is available. We assumed that the weights would work in a certain way (see page 26 of the hyperlinked document). However, when I estimated the model with weights, mlogit did not replicate the results from a popular textbook on econometrics. Here are the details.

We wanted to see if the the weights option could be used in an alternative specific logit formulation when the sampled data do not conform to the market shares observed in the underlying population? For instance, in a travel choice model, one may be tempted to over sample train commuters and under-sample car commuters because often car commuters far outnumber the train commuters for inter-city travel in the underlying population. This is true for most of Canada and the US.  In such circumstances, we would weight the data set so that the estimated model reproduces the population market shares rather than the sample shares.

The commercially available software, NLogit/LimDep can do this with ease. I wanted to replicate the results for choice-based weights for the conditional logit model in Professor Bill Greene's book, Econometric Analysis. This is illustrated on page 853 of the 6th edition of the text where Table 23.24 presents the parameter estimates for a conditional (McFadden) logit model for the un-weighted and the choice-based weighted versions. I replicated the results using NLogit with a simple addition of population market shares in the two-line syntax. However, the results generated by mlogit package bear no resemblance to the ones listed in Econometric Analysis.

It turns out that Stata is also limited in the way it handles weights for the two estimation options: asclogit and clogit. I know this because colleagues at Stata were quite diligent in responding to my requests. It’s not the same with the mlogit, which may or may not be able to handle the weights. We will only know when the author responds.

I am recommending that it should not be left to the individual authors to bear the sole responsibility for supporting the R packages. The individual could be ill, busy, or unavailable for a variety of reasons. This limitation could be proactively dealt with if the R community collectively generates help documentation by detailed worked-out examples of all available options (including weights), and not the few frequently used ones.

Improving documentation will be key to helping R branch out to the everyday users of statistical analysis. The tech-savvy can iron out the kinks. They have the curiosity, patience, and time on their hand. The rest of the world is not that fortunate.

I propose that users of the packages, and not just the authors, should collaborate to generate help documentation as vignettes and YouTube videos. This will do more in popularizing R than another 6,000 new packages that few may know how to work with.