tag:blogger.com,1999:blog-18164887972139540662024-03-05T17:18:04.731-05:00eKonometricsA commentary on consumer marketsMurtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.comBlogger320125tag:blogger.com,1999:blog-1816488797213954066.post-50143710748512891472022-09-27T14:16:00.006-04:002022-09-27T14:17:57.515-04:00How to append two tables in R Markdown?<p> Here is the task: <b>how to append two tables using <span style="font-family: courier;">R Markdown? </span></b>The need arose because
I was demonstrating to graduate students in a research methods course how to
prepare Table 1, which often covers descriptive statistics in an empirical
paper.</p>
<p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;">I used the <span style="font-family: courier;"><a href="https://cran.r-project.org/web/packages/tableone/vignettes/introduction.html" target="_blank">tableone</a> </span>package in R to compute the summary statistics. The task was to replicate the
first table from <a href="https://www.nber.org/system/files/working_papers/w9853/w9853.pdf" target="_blank">Prof. Daniel Hamermesh’s paper</a> that explored whether
instructors’ appearance and looks influenced the teaching evaluation score
assigned by the students. Since Prof. Hamermesh computed some summary
statistics using weighted data, such as weighted mean and weighted standard
deviations, and non-weighted data using regular means and standard deviations, I relied on
two different commands in <span style="font-family: courier;">tableone </span>to compute summary statistics.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;">The
challenge was to combine the output from the two tables into one table. Once I
generated the two tables separately, I used the <span style="font-family: courier;">kables()</span> and <span style="font-family: courier;">list()</span> functions to generate
the appended table. I needed the <span style="font-family: courier;"><a href="https://www.r-project.org/nosvn/pandoc/knitr.html" target="_blank">knitr</a> </span>and <span style="font-family: courier;"><a href="https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html" target="_blank">kableExtra</a> </span>packages to format the table. Here is how the appended table looks.</span></p><p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;"><o:p></o:p></span></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEju-sgjARbFf-FhVgHiqs0HKyqQ-PZplJKgu-YM2voVdElGhmQ9hjnEgyRFQUtPxlt5xxvtDRZaiEIHFqXiOqkedNQ77oN5lyFkp9ThehRxw0wBteG-7WZPAIavBKlSXf1246nuTvVun7523f838HGknmZipBFwgxc8qL32Z96gSFDkt3DbAUpiBcDV" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="262" data-original-width="650" height="161" src="https://blogger.googleusercontent.com/img/a/AVvXsEju-sgjARbFf-FhVgHiqs0HKyqQ-PZplJKgu-YM2voVdElGhmQ9hjnEgyRFQUtPxlt5xxvtDRZaiEIHFqXiOqkedNQ77oN5lyFkp9ThehRxw0wBteG-7WZPAIavBKlSXf1246nuTvVun7523f838HGknmZipBFwgxc8qL32Z96gSFDkt3DbAUpiBcDV=w400-h161" width="400" /></a></div><br /><p></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;">Here are
the steps involved.<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left: 0.25in;"><span lang="EN-US" style="mso-ansi-language: EN-US;">Assume that you have two tables generated by either <span style="font-family: courier;">svyCreateTableOne</span> or
<span style="font-family: courier;">CreateTableOne </span>commands. Let’s store the results in objects <span style="font-family: courier;">tab1 </span>and <span style="font-family: courier;">tab2</span>.<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left: 0.25in;"><span lang="EN-US" style="mso-ansi-language: EN-US;">In R Markdown using RStudio, print the tables to objects named
arbitrarily as p and p2. See the code below. The <span style="font-family: courier;">results='hide' </span>option is needed if
you do not want the tables to appear as text in the draft.<o:p></o:p></span></p>
<p class="MsoNormal" style="margin-left: 0.25in;"><span lang="EN-US" style="mso-ansi-language: EN-US;"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left: 0.25in;"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">```{r, echo=FALSE, results='hide'}<o:p></o:p></span></span></p>
<p class="MsoNormal" style="margin-left: 0.25in;"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">p <- print(tab1)<o:p></o:p></span></span></p>
<p class="MsoNormal" style="margin-left: 0.25in;"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">p2 <-print(tab2)<o:p></o:p></span></span></p>
<p class="MsoNormal" style="margin-left: 0.25in;"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">```</span><o:p></o:p></span></p>
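<p class="MsoNormal">For readers who want to try the step above end to end, here is a minimal sketch of how <span style="font-family: courier;">tab1 </span>and <span style="font-family: courier;">tab2 </span>might be created. The data frame, its column names, and the choice of weights are hypothetical stand-ins and not the data from the paper. The sketch also uses <span style="font-family: courier;">printToggle = FALSE</span>, which returns the formatted matrix without printing it and so may be an alternative to <span style="font-family: courier;">results='hide'</span>.</p>

```r
library(tableone)  # CreateTableOne(), svyCreateTableOne()
library(survey)    # svydesign()

# Toy data with hypothetical column names (an assumption, not the paper's data)
set.seed(1)
df <- data.frame(
  age      = round(rnorm(100, mean = 48, sd = 10)),
  beauty   = rnorm(100),
  eval     = runif(100, min = 2, max = 5),
  students = sample(10:200, 100, replace = TRUE)  # hypothetical weights
)
vars <- c("age", "beauty", "eval")

# Unweighted summary statistics
tab1 <- CreateTableOne(vars = vars, data = df)

# Weighted summary statistics require a survey design object
des  <- svydesign(ids = ~1, weights = ~students, data = df)
tab2 <- svyCreateTableOne(vars = vars, data = des)

# printToggle = FALSE returns the formatted matrix without printing it
p  <- print(tab1, printToggle = FALSE)
p2 <- print(tab2, printToggle = FALSE)
```

<p class="MsoNormal">Both <span style="font-family: courier;">p </span>and <span style="font-family: courier;">p2 </span>are character matrices, which is the form <span style="font-family: courier;">kable() </span>expects.</p>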
<p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;">The
amalgamated table used the following script. Note some important considerations.<o:p></o:p></span></p>
<ol style="text-align: left;"><li>I used <span style="font-family: courier;">bottomrule=NULL</span> to suppress the horizontal line for the table on the top. </li><li>I used <span style="font-family: courier;">column_spec(1, width = '1.75in') </span>for both tables so that the second and subsequent columns line up vertically. Otherwise, they will appear staggered. </li><li>I used <span style="font-family: courier;">col.names = NULL</span> to suppress column names for the bottom table because the column names are the same for both tables. </li><li>I used <span style="font-family: courier;">column_spec(5, width = '.7in') </span>to ensure that the horizontal lines drawn for the bottom table match the width of the horizontal line on top of the first table. </li><li>I used <span style="font-family: courier;">kable_styling(latex_options = "HOLD_position") </span>to ensure that the table appears at the correct place in the text.</li></ol><p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;"><o:p> </o:p></span>I wish
there were an easy command to fix the table width, but I didn’t find one. Still, I am quite pleased with the final output. I look forward to seeing ideas on improving the layout of appended tables.</p>
<p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">```{r,
echo=FALSE}<o:p></o:p></span></span></p>
<p class="MsoNormal"><span style="font-family: courier;">kables(</span></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;"><span style="mso-spacerun: yes;"> </span>list(<o:p></o:p></span></span></p>
<blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p class="MsoNormal" style="text-align: left;"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">kable(p, booktabs=TRUE, format =
"latex",valign='t',<span style="mso-spacerun: yes;">
</span>bottomrule=NULL) %>% </span></span></p></blockquote>
<blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p class="MsoNormal" style="text-align: left;"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;"><span style="mso-spacerun: yes;"> </span>column_spec(1, width = '1.75in'),</span></span></p></blockquote></blockquote>
<blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><p class="MsoNormal" style="text-align: left;"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">kable(p2, booktabs=TRUE, format =
"latex", valign='t', col.names = NULL) %>% </span></span></p></blockquote>
<blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">column_spec(1, width = '1.75in') %>% </span></span></p></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">column_spec(5, width = '.7in')</span></span><span style="font-family: courier;">)</span><span style="font-family: courier;">, </span></p></blockquote></blockquote>
<blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><p class="MsoNormal"><span style="font-family: courier;">caption="Weighted and unweighted
data"</span><span style="font-family: courier;">) %>%</span></p><p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;"><span style="font-family: courier;">kable_styling(latex_options =
"HOLD_position")</span></span></p></blockquote>
<p class="MsoNormal"><span style="font-family: courier;">```</span></p>
<p class="MsoNormal"><span lang="EN-US" style="mso-ansi-language: EN-US;"><o:p> </o:p></span></p>Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-37310858818632615742019-05-20T23:29:00.000-04:002019-05-21T10:59:59.097-04:00Modern Data Science with R: A review<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div class="MsoNoSpacing">
Some say
data is the new oil. Others equate its worth to water. And then there are those
who believe that data scientists will be (in fact, they already are) one of the
most sought-after workers in knowledge economies.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">Millions
of data-centric jobs require millions of trained data scientists. However, the
installed capacity of graduate and undergraduate programs in data science is
nowhere near meeting this demand over the next many years. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">So how
do we produce data scientists?<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">Given
the enormous demand for data scientists and the fixed supply from higher
education institutions, it is quite likely that one must look beyond colleges
and universities to train a large number of data scientists desired by the
marketplace. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://images-na.ssl-images-amazon.com/images/I/51lue2aNNUL._SX350_BO1,204,203,200_.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="499" data-original-width="352" height="320" src="https://images-na.ssl-images-amazon.com/images/I/51lue2aNNUL._SX350_BO1,204,203,200_.jpg" width="225" /></a></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">Getting
trained on the job is one possible route. This will require repurposing the existing
workforce. To prepare the current workforce for data science, one needs training
manuals. One such manual is <i style="mso-bidi-font-style: normal;"><a href="https://mdsr-book.github.io/">Modern Data Science with R</a> (</i>MDSR<i style="mso-bidi-font-style: normal;">). </i><o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">Published
by the CRC Press (Taylor and Francis Group) and authored by three leading
academics: Professors Baumer, Kaplan, and Horton, MDSR is the missing manual
for data science with R. The book is equally relevant to data science programs
in higher ed as it is to practitioners who would like to embark on a career in
data science or to get a taste of an aspect of data science that they have not
explored in the past.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">As the
book’s name suggests, the text is based on R, one of the most popular and versatile
computing platforms. <a href="https://cran.r-project.org/">R is freeware</a>
and is being developed by thousands of volunteers in real time. In addition to base
R, which comes bundled with thousands of commands and functions, the user-written
packages, whose number has exceeded 14,000 (as of May 2019), further expand the
universe of features making R perhaps the most diverse computing platform.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">Despite the
immense popularity of data science, only a handful of titles focus exclusively
on the topic. Hadley Wickham’s <i style="mso-bidi-font-style: normal;"><a href="https://r4ds.had.co.nz/">R for Data Science</a></i> and <i style="mso-bidi-font-style: normal;"><a href="http://shop.oreilly.com/product/9780596809164.do">R Cookbook</a></i> by
Paul Teetor are two other worthy texts. MDSR is unique in the sense
that it serves as an introduction to a whole host of analytic techniques that are
seldom covered in one title.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">In the
following paragraphs, I’ll discuss the salient features of the textbook. I begin
with my favourite attribute of the book that deals with its organization. Instead
of muddling with theories and philosophies, the book gets straight to business and
starts the conversation with data visualization. A graphic is worth a thousand
words, and MDSR is proof of it.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">And
since <a href="https://en.wikipedia.org/wiki/Hadley_Wickham">Hadley Wickham’s</a>
influence on data science is ubiquitous, MDSR also embraces Wickham’s implementation
of <a href="https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448">Grammar
of Graphics</a> in R with one of the most popular R packages, <a href="https://ggplot2.tidyverse.org/">ggplot2</a>. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">Another
avenue where Wickham’s influence is widely felt is data wrangling. A suite of R
packages bundled under the broader rubric of <a href="https://www.tidyverse.org/">Tidyverse</a> is influencing how data
scientists manipulate small and big data. Chapter 4 in MDSR is perhaps one of
the best and most succinct introductions to data wrangling with R and Tidyverse. From
the simplest to more advanced examples, MDSR equips the beginner with the
basics and the advanced user with new ways to think about analyzing data.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">A key feature
of MDSR is that it’s not another book on statistics or econometrics with R.
Yours truly is guilty of authoring one such book. Instead, MDSR is a book
focused squarely on data manipulation. The treatment of statistical topics is
not absent from the book; however, it’s not the book’s focus. It is for this
reason that the discussion on Regression models is in the appendices.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">But make
no mistake, even when statistics is not the focus, MDSR offers sound advice on the
practice of statistics. Section 7.7, The perils of p-values, warns novice statisticians
against becoming the unsuspecting victims of hypothesis testing.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">The
book’s distinguishing feature remains the diversity of data science challenges
it addresses. For instance, in addition to data visualization, the book offers
an introduction to interactive data graphics and dynamic data visualization. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">At the
same time, the book covers other diverse topics, such as database management
with SQL, working with spatial data, analyzing text-based (non-structured)
data, and the analysis of networks. A discussion about ethics in data science
is undoubtedly a welcome feature in the book.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">The book
is punctuated with hundreds of useful and hands-on data science examples and exercises,
providing ample opportunities to put concepts into practice. The book’s <a href="https://mdsr-book.github.io/">accompanying website</a> offers additional
resources and code examples. At the time of this review, not all code was
available for download. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">Also,
while I was able to reproduce more straightforward examples, I ran into trouble
with complex ones. For instance, I could not generate advanced spatial maps showing
flight origins and destinations. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">My
recommendation to the authors would be to maintain an active supporting website
because R packages are known to evolve, and some functionality may change or
disappear over time. For instance, the mapping functions that are part of the <a href="https://cran.r-project.org/web/packages/ggmap/index.html">ggmap</a>
package now require a Google Maps API key or else the maps will not display. This change
has likely occurred after the book was published.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;">In summary,
for aspiring and experienced data scientists, <i style="mso-bidi-font-style: normal;">Modern Data Science with R</i> is a book that deserves a place in their
personal libraries.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US" style="mso-ansi-language: EN-US;"><a href="https://www.ryerson.ca/tedrogersschool/bm/programs/real-estate-management/murtaza-haider/">Murtaza
Haider</a> lives in Toronto and teaches in the Department of Real Estate
Management at Ryerson University. He is the author of <i style="mso-bidi-font-style: normal;"><a href="http://www.informit.com/store/getting-started-with-data-science-making-sense-of-data-9780133991024">Getting
Started with Data Science: Making Sense of Data with Analytics</a></i>, which
was published by the IBM Press/Pearson.</span></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-49873097160724187432018-10-08T12:56:00.001-04:002018-10-08T14:28:33.936-04:00A question and an answer about recoding several factors simultaneously in R<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="MsoNoSpacing">
<span lang="EN-US">Data manipulation is a breeze with amazing packages like </span><b><span lang="EN-US" style="font-family: "courier new"; font-size: 10.0pt;">plyr</span></b><span lang="EN-US"> and </span><b><span lang="EN-US" style="font-family: "courier new"; font-size: 10.0pt;">dplyr</span></b><span lang="EN-US">. Recoding factors, which could prove to be a daunting task especially for variables that have many categories, can easily be accomplished with these packages. However, it is important for those learning Data Science to understand how base R works.</span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US">In this regard, I seek help from R specialists about recoding factors using base R. My question is about why one notation in recoding factors works while the other doesn’t. I’m sure for R enthusiasts, the answer and solution are straightforward. So, here’s the question.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-US">In the following code, I generate a vector with six categories and 300 observations. I convert the vector to a factor and tabulate it.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi74h2kxvZ7A-jH7YbuuZGpuUcc1840LTZxZGeifSjdF0IgAwk7QmX-61KU8257ZaTKcs9KjiD3Cs6Egpgde2nAAsbg4KbprzeMEBtqfUqkCNeZ5sxOgPULovrODBuaZ23gxKezr2tqbHg/s1600/Capture+1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="188" data-original-width="597" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi74h2kxvZ7A-jH7YbuuZGpuUcc1840LTZxZGeifSjdF0IgAwk7QmX-61KU8257ZaTKcs9KjiD3Cs6Egpgde2nAAsbg4KbprzeMEBtqfUqkCNeZ5sxOgPULovrODBuaZ23gxKezr2tqbHg/s1600/Capture+1.PNG" /></a></div>
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Note that by using the <b><span lang="EN-US" style="font-family: "courier new"; font-size: 10.0pt;">as.numeric</span></b> function, I could see the internal integer codes for the respective character levels. Let’s say I would like to recode categories <span lang="EN-US" style="font-family: "courier new"; font-size: 10pt;"><b>a</b> </span>and <b><span lang="EN-US" style="font-family: "courier new"; font-size: 10.0pt;">f </span></b>as missing. I can accomplish this with the following code.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_bcJAUjF5a4bm4fuQjXjHZdI_s1sUU5sxtZs0ze9jFg-zYj8EGZrunicFdlkRnkhR3X2tMlHW42rZMm05ZvY3nF8e3I_ehcndccG7QRCGuP1JU0JXk4H3qbYhz3jp9j1GZ13FYiHKPHU/s1600/Capture+2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="83" data-original-width="355" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_bcJAUjF5a4bm4fuQjXjHZdI_s1sUU5sxtZs0ze9jFg-zYj8EGZrunicFdlkRnkhR3X2tMlHW42rZMm05ZvY3nF8e3I_ehcndccG7QRCGuP1JU0JXk4H3qbYhz3jp9j1GZ13FYiHKPHU/s1600/Capture+2.PNG" /></a></div>
<br />
<br /></div>
<div class="MsoNoSpacing">
Where 1 and 6 correspond to <b><span lang="EN-US" style="font-family: "courier new"; font-size: 10.0pt;">a </span></b>and <b><span lang="EN-US" style="font-family: "courier new"; font-size: 10.0pt;">f.</span></b><o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Note that I have used the position of the levels rather than the levels themselves to convert the values to missing. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So far so good.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Now let’s assume that I would like to convert categories <b><span lang="EN-US">a </span></b>and <b><span lang="EN-US">f </span></b>to <i>grades</i>. The following code, I thought, would work, but it didn’t. It returns varying and erroneous answers.<o:p></o:p></div>
<div class="MsoNoSpacing">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHr-SWcLaiOnqMZ0bybdis2gXa02v54F1VEll8b9UoPG5mEkFFC6f10Ob1UfR8jbhFkFjbBqtNfCFePBBY9RP2xG8Ik6hOSNeX6yDvSZSdvbDYzkKJ7-S-rpVWwGNHtM1n9G5bns_13RY/s1600/Capture+3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="157" data-original-width="611" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHr-SWcLaiOnqMZ0bybdis2gXa02v54F1VEll8b9UoPG5mEkFFC6f10Ob1UfR8jbhFkFjbBqtNfCFePBBY9RP2xG8Ik6hOSNeX6yDvSZSdvbDYzkKJ7-S-rpVWwGNHtM1n9G5bns_13RY/s1600/Capture+3.PNG" /></a></div>
However, when I refer to levels explicitly, the script works as intended. See the script below.</div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivi6QsX3cE4ILIeF0gjX14V7KzBhNjg2LXBcKolzLyR4oUGVVvCGbXDWOHd-oE-DmamAe-JpJXA6F0CmhwaNp4pfYlmCoZUjtnNkL17hsvJxVXe1Frmx5wcii6upN-dKxOM2DUzFRijL4/s1600/Capture+4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="151" data-original-width="599" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivi6QsX3cE4ILIeF0gjX14V7KzBhNjg2LXBcKolzLyR4oUGVVvCGbXDWOHd-oE-DmamAe-JpJXA6F0CmhwaNp4pfYlmCoZUjtnNkL17hsvJxVXe1Frmx5wcii6upN-dKxOM2DUzFRijL4/s1600/Capture+4.PNG" /></a></div>
Hence the question: why does one method work while the other doesn’t? I look forward to responses from R experts.<br />
<br />
<h2 style="text-align: left;">
The Answer</h2>
<br />
lebatsnok (https://stackoverflow.com/users/2787952/lebatsnok) answered the question on <a href="https://stackoverflow.com/questions/52695857/a-question-about-recoding-multiple-factor-levels-simultaneously-in-r" target="_blank">stackoverflow</a>. The solution is simple. The following code works:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjXcGA2Rn4WasXN_q3VYLlaX7CppmbywTXTSXzVcCgiHq_MVc0hJR1O73XfLZ4jYkscYYne9JlKnhOl9wkd7QoYJLSLNwCYYXU0AfwGQii2z0cqGHpTdpdkeQKaI3yAUUFJntssHcNBz8/s1600/Capture+5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="45" data-original-width="245" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjXcGA2Rn4WasXN_q3VYLlaX7CppmbywTXTSXzVcCgiHq_MVc0hJR1O73XfLZ4jYkscYYne9JlKnhOl9wkd7QoYJLSLNwCYYXU0AfwGQii2z0cqGHpTdpdkeQKaI3yAUUFJntssHcNBz8/s1600/Capture+5.PNG" /></a></div>
<br />
<div class="MsoNoSpacing" style="text-decoration-color: initial; text-decoration-style: initial;">
<div style="margin: 0px;">
<br /></div>
<div style="margin: 0px;">
The problem with my approach, as explained by lebatsnok, is the following:</div>
<div style="margin: 0px;">
<br /></div>
<blockquote class="tr_bq">
"levels(x) is a character vector with length 6, as.numeric(x) is a logical vector with length 300. So you're trying to index a short vector with a much longer logical vector. In such an indexing, the index vector acts like a "switch", TRUE indicating that you want to see an item in this position in the output, and FALSE indicating that you don't. So which elements of levels(x) are you asking for? (This will be random, you can make it reproducible with set.seed if that matters."</blockquote>
</div>
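<div class="MsoNoSpacing">
Because the code in this post appears only as screenshots, here is a minimal sketch of the idea in text form. It is a reconstruction under assumptions (the seed, the object names, and the exact calls are mine, not necessarily what the screenshots or the answer show). It illustrates the length mismatch described above and two ways of indexing the levels directly.</div>

```r
# Six categories (a to f) and 300 observations; fixing the levels keeps
# positions 1..6 stable regardless of the random draw
set.seed(42)
x <- factor(sample(letters[1:6], 300, replace = TRUE), levels = letters[1:6])

# The two vectors differ in length, so one cannot sensibly index the other
length(levels(x))                    # 6
length(as.numeric(x) %in% c(1, 6))   # 300

# Recoding observations to missing works because the logical index
# has the same length as the factor itself
y <- x
y[as.numeric(y) %in% c(1, 6)] <- NA

# Recoding levels: index levels() with something of length 6 ...
z <- x
levels(z)[levels(z) %in% c("a", "f")] <- "grades"

# ... or with the positions of the levels
w <- x
levels(w)[c(1, 6)] <- "grades"

table(z)
```

<div class="MsoNoSpacing">
Assigning the same label to two levels merges them, so <b>a </b>and <b>f </b>end up in a single level named <i>grades</i>.</div>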
<div class="MsoNoSpacing" style="text-decoration-color: initial; text-decoration-style: initial;">
<br /></div>
<div class="MsoNoSpacing" style="text-decoration-color: initial; text-decoration-style: initial;">
</div>
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<br />
<div class="MsoNoSpacing">
<br /></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-47744374935666038192018-05-21T10:21:00.002-04:002018-05-21T10:21:15.745-04:00Edward Tufte’s Slopegraphs and political fortunes in Ontario<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: justify;">
<span style="text-align: left;">With fewer than three weeks left until the June 7 provincial elections in Ontario, Canada’s most populous province with 14.2 million persons, the outcome is far from certain.</span></div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The weekly opinion polls reflect the volatility in public opinion. The Progressive Conservatives (PC), one of the main opposition parties, are in the lead with the support of roughly 40 percent of the electorate. The incumbent Ontario Liberals are trailing, with their support hovering in the low 20 percent range.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The real story in these elections is the unexpected rise in the fortunes of the New Democratic Party (NDP), which has seen a sustained increase in its popularity from less than 20 percent a few weeks ago to the mid-30 percent range.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
As a data scientist/journalist, I have been concerned with how best to represent this information. A scatter plot of sorts would do. However, I would like to demonstrate the change in political fortunes over time with the x-axis representing time. Hence, a time series chart would be more appropriate.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Ideally, I would like to plot what Edward Tufte called a <a href="https://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0003nk">Slopegraph</a>. Tufte, in his 1983 book <i>The Visual Display of Quantitative Information</i>, explained that “Slopegraphs compare changes usually over time for a list of nouns located on an ordinal or interval scale”. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But here’s the problem. No software offers a readymade solution to draw a Slopegraph.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Luckily, I found a way, in fact, two ways, around the challenge with help from colleagues at <a href="http://www.stata.com/">Stata</a> and R (<a href="https://cran.r-project.org/web/packages/plotrix/index.html">plotrix</a>).<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, what follows in this blog is the story of the elections in Ontario described with data visualized as Slopegraphs. I tell the story first with Stata and then with the <a href="https://cran.r-project.org/web/packages/plotrix/index.html">plotrix</a> package in R.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
My interest in Slopegraphs grew when I wanted to demonstrate the steep increase in highly leveraged mortgage loans in Canada from 2014 to 2016. I generated the chart in Excel and sent it to Stata requesting help to recreate it. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Stata assigned my request to <a href="https://www.linkedin.com/in/derek-wagner-127b033/">Derek Wagner</a> whose excellent programming skills resulted in the following chart.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHhRsLVe04WwT2h_Oo2l0iJJtPRKZbYHXBgzeUtuFBZGay8dzMyrbW_6FIx0zYP9yjKKOT_edkDucx6FCr8exiZ0q6vXQ8jMm1cGir26HNlbhC-PTWdTnQIpOZi2kaFpwsJs8ulgTjVW0/s1600/mortgage.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="637" data-original-width="703" height="361" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiHhRsLVe04WwT2h_Oo2l0iJJtPRKZbYHXBgzeUtuFBZGay8dzMyrbW_6FIx0zYP9yjKKOT_edkDucx6FCr8exiZ0q6vXQ8jMm1cGir26HNlbhC-PTWdTnQIpOZi2kaFpwsJs8ulgTjVW0/s400/mortgage.png" width="400" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Derek built the chart on the <a href="https://www.stata.com/statalist/archive/2003-07/msg00194.html">linkplot</a> command written by the uber Stata guru, Professor <a href="https://www.dur.ac.uk/geography/staff/geogstaffhidden/?id=335">Nicholas J. Cox</a>. However, a straightforward application of linkplot still required a lot of tweaks that Derek very ably managed. For comparison, see the initial version of the chart generated by linkplot below.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZNBegoI0SFzwLESvNqylW3kKoZ0KrDWIjZnTr8HJRo_IZuP0SxlWQ7NpqQR_SEB0rqvVqHwj_v6P6z2EbVHAiS0m-mIaLBhTMw43CWujYBGlMQKAzEKJvkIN6wXC1AOUeOJnbIbBJT4c/s1600/linkplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="512" data-original-width="703" height="291" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiZNBegoI0SFzwLESvNqylW3kKoZ0KrDWIjZnTr8HJRo_IZuP0SxlWQ7NpqQR_SEB0rqvVqHwj_v6P6z2EbVHAiS0m-mIaLBhTMw43CWujYBGlMQKAzEKJvkIN6wXC1AOUeOJnbIbBJT4c/s400/linkplot.png" width="400" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div>
</div>
<div class="MsoNoSpacing">
We made the following modifications to the base linkplot:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->1.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Narrow the plot by reducing the space between the two time periods.<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->2.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Label the entities and their respective values at the primary and secondary y-axes. <o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->3.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Add a title and footnotes (if necessary). <o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->4.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Label time periods with custom names.<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->5.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Colour lines and symbols to match preferences.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Once we apply these tweaks, a Slopegraph with the latest poll data for Ontario’s election is drawn as follows.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU51KGGPojDowcc6QDDsWONRtzozZLnVZ0hcZwj7EOqxF5JHlAwAyS20PtLzOQ_govTPhDRA819Y7MO_AEkDa1EiAMk_CkZrKsNrPT10OCMxUlJfGjvFF-BtoR19xw5PaaRpSPp3WllDI/s1600/Abacus+stata.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="701" data-original-width="704" height="397" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjU51KGGPojDowcc6QDDsWONRtzozZLnVZ0hcZwj7EOqxF5JHlAwAyS20PtLzOQ_govTPhDRA819Y7MO_AEkDa1EiAMk_CkZrKsNrPT10OCMxUlJfGjvFF-BtoR19xw5PaaRpSPp3WllDI/s400/Abacus+stata.png" width="400" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Notice that in fewer than two weeks, NDP has jumped from 29 percent to 34 percent, almost tying up with the leading PC party whose support has remained steady at 35 percent. The incumbent Ontario Liberals appear to be in free fall from 29 percent to 24 percent.</div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I must admit that I have sort of cheated in the above chart. Note that both Liberals and NDP secured 29 percent of the support in the poll conducted on May 06. In the original chart drawn with Stata’s code, their labels overlapped resulting in unintelligible text. I fixed this manually by manipulating the image in PowerPoint.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I wanted to replicate the above chart in R. I tried a few packages, but nothing really worked until I landed on the plotrix package that carries the bumpchart command. In fact, Edward Tufte in <i>Beautiful Evidence</i> (2006) mentions that bumpcharts may be considered as slopegraphs.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
A straightforward application of bumpchart from the plotrix package labelled the party names but not the respective percentages of support each party commanded.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<a href="https://www.r-pkg.org/maint/drjimlemon@gmail.com">Dr. Jim Lemon</a> authored bumpchart. I turned to him for help. Jim was kind enough to write a custom function, bumpchart2, that I used to create a Slopegraph like the one I generated with Stata. For comparison, see the chart below.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOqt9Mkk01MTA3XIAY_R9xNexMXBtmLuY7wkNNcCnAIlnglFGPUQbKwwoaPcddTd_YdFNMauMVKrXPjo-E_gzJYbkPHBAMqbPwirC69kyuXYFDk_l9LG6luqInLN2yRktvYyy4T0ZgSKc/s1600/Abacus+plotrix.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="904" data-original-width="704" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhOqt9Mkk01MTA3XIAY_R9xNexMXBtmLuY7wkNNcCnAIlnglFGPUQbKwwoaPcddTd_YdFNMauMVKrXPjo-E_gzJYbkPHBAMqbPwirC69kyuXYFDk_l9LG6luqInLN2yRktvYyy4T0ZgSKc/s400/Abacus+plotrix.png" width="311" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
As with the Slopegraph generated with Stata, I manually manipulated the labels to prevent NDP and Liberal labels from overlapping.</div>
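<div class="MsoNoSpacing">
The manual label fix can also be approximated in code. What follows is a minimal base-R sketch, not the Stata or bumpchart2 code used for the charts above. It assumes only the poll shares quoted earlier (PC 35 to 35, NDP 29 to 34, Liberals 29 to 24). The helper names spread_labels and draw_slopegraph, the 1.5-point gap, and the period captions are my own placeholders.</div>

```r
# Poll shares quoted in the text (percent).
polls <- data.frame(
  party = c("PC", "NDP", "Liberal"),
  start = c(35, 29, 29),
  end   = c(35, 34, 24)
)

# Hypothetical helper: nudge label heights upward until neighbouring labels
# are at least `gap` units apart. Ties in `y` are ordered by `tie`.
spread_labels <- function(y, tie = y, gap = 1.5) {
  ord <- order(y, tie)
  pos <- y[ord]
  for (i in seq_along(pos)[-1]) {
    if (pos[i] - pos[i - 1] < gap) pos[i] <- pos[i - 1] + gap
  }
  pos[order(ord)]  # return positions in the original row order
}

# Hypothetical helper: draw a two-period Slopegraph with base graphics.
draw_slopegraph <- function(d, periods = c("Earlier poll", "Latest poll")) {
  y_left  <- spread_labels(d$start, tie = d$end)
  y_right <- spread_labels(d$end, tie = d$start)
  plot.new()
  plot.window(xlim = c(0.5, 2.5),
              ylim = range(c(d$start, d$end, y_left, y_right)) + c(-1, 1))
  segments(1, d$start, 2, d$end)                       # one line per party
  points(rep(1:2, each = nrow(d)), c(d$start, d$end), pch = 16)
  text(1, y_left,  labels = paste(d$party, d$start), pos = 2)
  text(2, y_right, labels = paste(d$end, d$party),   pos = 4)
  axis(1, at = 1:2, labels = periods, tick = FALSE)    # custom period names
}

draw_slopegraph(polls)
```

<div class="MsoNoSpacing">
The lines still start from the true values; only the text labels move. The tied 29 percent labels are separated by placing the party that ends higher on top, which is the same choice I made by hand.</div>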
<div class="MsoNoSpacing">
</div>
<h2 style="text-align: left;">
Data scientists must dig even deeper</h2>
The job of a data scientist, unlike a computer scientist or a statistician, is not done by estimating models and drawing figures. A data scientist must tell a story with all caveats that might apply. So, here’s the story about what can go wrong with polls.<br />
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The most important lesson about forecasting from Brexit and the last US Presidential elections is that one cannot rely on polls to determine the future electoral outcomes. Most polls in the UK predicted a NO vote for Brexit. In the US, most polls forecasted Hillary Clinton to be the winner. Both forecasts went horribly wrong. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
When it comes to polls, one must determine who sponsored the poll, what methods were used, and how representative the sample is of the underlying population. Asking the wrong question to the right people or posing the right question to the wrong people (a non-representative sample) can deliver problematic results.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Polling is as much an art as it is a science. The late <a href="https://www.nytimes.com/2006/09/04/obituaries/04mitofsky.html">Warren Mitofsky</a>, who pioneered exit polls and innovated political survey research, remains a legend in political polling. His painstakingly cautious approach to polling is why he remains a respected name in market research.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Today, advances in communication and information technologies have made survey research easier to conduct but harder to make precise. No longer can one rely on random digit dialling, a Mitofsky innovation, to reach a representative sample. Younger cohorts seldom subscribe to landlines. Attempts to catch them online pose the risk of fishing for opinions in echo chambers.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Add political polarization to technological challenges, and one realizes the true scope of the difficulties inherent in taking the political pulse of an electorate, where a motivated pollster may be after not the truth, but a convenient version of it.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Polls also differ by survey instrument, methodology, and sample size. The Abacus Data poll presented above is essentially an online poll of 2,326 respondents. In comparison, a poll by Mainstreet Research used an Interactive Voice Response (IVR) system with a sample size of 2,350 respondents. IVR uses automated, computerized prompts over the telephone to record responses.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Abacus Data and Mainstreet Research use quite different methods with similar sample sizes. Professor Dan Cassino of Fairleigh Dickinson University explained the challenges with polling techniques in a 2016 article in the <a href="https://hbr.org/2016/08/how-todays-political-polling-works">Harvard Business Review</a>. He favours live telephone interviewers who “are highly experienced and college educated and paying them is the main cost of political surveys.” <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Professor Cassino believes that techniques like IVR make “polling faster and cheaper,” but these systems are hardly foolproof with lower response rates. They cannot legally reach cellphones. “IVR may work for populations of older, whiter voters with landlines, such as in some Republican primary races, but they’re not generally useful,” explained Professor Cassino. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Similarly, online polls are limited in the sense that in the US alone 16 percent of Americans don’t use the Internet.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
With these caveats in mind, a plot of Mainstreet Research data reveals quite a different picture where the NDP doesn’t seem to pose an immediate and direct challenge to the PC party.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, here’s the summary. A Slopegraph is a useful tool to summarize change over time between distinct entities. Ontario is likely to have a new government on June 7. It is, though, far from certain whether the PC Party or the NDP will assume office. Nevertheless, Slopegraphs generate visuals that expose the uncertainty in the forthcoming elections.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqQAymfm69EzpODk6wkKsFkcdx7oJa0psCH07GvzfmHe6_WhuHt49G0XQnymfj_h-VSQ_JEapnnaqo7_daxvb9yE9OOw-OBkhMyFPJJ8lBFAMbZ5Ti7lBfV3_iNm8-k7c_1kANA1Km4G8/s1600/Mainstreet+Research.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="772" data-original-width="704" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiqQAymfm69EzpODk6wkKsFkcdx7oJa0psCH07GvzfmHe6_WhuHt49G0XQnymfj_h-VSQ_JEapnnaqo7_daxvb9yE9OOw-OBkhMyFPJJ8lBFAMbZ5Ti7lBfV3_iNm8-k7c_1kANA1Km4G8/s400/Mainstreet+Research.png" width="363" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Note: To generate the charts in this blog, you can download data and code for Stata and Plotrix (R) by <a href="https://sites.google.com/site/statsr4us/analytics/graphics/linkplot%20and%20bumpchart2%20codes.txt">clicking HERE</a>.<o:p></o:p></div>
<br />
<div class="MsoNoSpacing">
<br /></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-67342265808616241132018-03-10T20:33:00.000-05:002018-03-12T02:25:52.258-04:00R: simple for complex tasks, complex for simple tasks<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://www.r-project.org/Rlogo.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="155" data-original-width="200" src="https://www.r-project.org/Rlogo.png" /></a></div>
<div class="MsoNoSpacing">
When it comes to undertaking complex data science projects, R is the preferred choice for many. Why? Because handling complex tasks is simpler in R than other comparable platforms.</div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Regrettably, the same is not true of simpler tasks, which I would argue are rather complex in base R. Hence, the title -- R: simple for complex tasks, complex for simple tasks.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Consider a simple, yet mandatory, task of generating summary statistics for a research project involving tabular data. In most social science disciplines, summary statistics for continuous variables require generating the mean, standard deviation, number of observations, and perhaps the minimum and maximum. One would have hoped to see a function in base R to generate such a table. But there isn’t one.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Of course, several user-written packages, such as <b><span style="font-family: "courier new" , "courier" , monospace;">psych</span></b>, can generate descriptive statistics in a tabular format. However, this requires one to have advanced knowledge of R and the capabilities hidden in specialized packages whose number now exceeds 12,000 (as of March 2018). Keeping abreast of the functionality embedded in user-written packages is time-consuming.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Some would argue that the <b><span style="font-family: "courier new" , "courier" , monospace;">summary</span></b> command in base R is an option. I humbly disagree. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
First, the output from <b><span style="font-family: "courier new" , "courier" , monospace;">summary</span></b> is not in a tabular format that one could just copy and paste into a document. It would require significant processing before a simple table with summary statistics for more than one continuous variable could be generated. Second, the <b><span style="font-family: "courier new" , "courier" , monospace;">summary</span></b> command does not report the standard deviation.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I teach business analytics to <a href="https://www.ryerson.ca/tedrogersschool/bm/programs/real-estate-management/murtaza-haider/">undergraduate and MBA students</a>. While business students need to know statistics, they are not studying to become statisticians. Their goal in life is to be informed and proficient consumers of statistical analysis.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
So, imagine an undergraduate class with 150 students learning to generate a simple table that reports summary statistics for more than one continuous variable. The simple task requires knowledge of several R commands. By the time one teaches these commands to students, most have made up their mind to do the analysis in Microsoft Excel instead. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Had there been a simple command to generate descriptive statistics in base R, this would not be a challenge for instructors trying to bring thousands more into R’s fold.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
In the following paragraphs, I will illustrate the challenge with an example and identify an R package that generates a simple table of descriptive statistics.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
I use the <b><span class="" style="font-family: "courier new" , "courier" , monospace;">mtcars</span></b> dataset, which is available with R. The following commands load the dataset and display the first few observations with all the variables.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">data(mtcars)<o:p></o:p></span></b></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">head(mtcars)</span><o:p></o:p></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
As stated earlier, one can use the <b><span class="" style="font-family: "courier new" , "courier" , monospace;">summary</span></b> command to produce descriptive statistics.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">summary(mtcars)</span><o:p></o:p></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Let’s say one would like to generate descriptive statistics including the mean, standard deviation, and number of observations for the following continuous variables: <b>mpg, disp, and hp</b>. One can use the <b><span style="font-family: "courier new" , "courier" , monospace;">sapply</span></b> command to generate the three statistics separately and combine them later using the <b><span style="font-family: "courier new" , "courier" , monospace;">cbind</span></b> command.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The following command will create a vector of means.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">mean.cars = with(mtcars, sapply(mtcars[c("mpg", "disp", "hp")], mean))</span><o:p></o:p></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Note that the above syntax requires someone learning R to know the following:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->1.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Either to attach the dataset or to use the <b><span style="font-family: "courier new" , "courier" , monospace;">with</span></b> command so that variables can be referred to by name (strictly, the explicit mtcars[...] subset already tells <b><span style="font-family: "courier new" , "courier" , monospace;">sapply</span></b> where to find them, so <b><span style="font-family: "courier new" , "courier" , monospace;">with</span></b> is optional in this call).<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->2.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Knowledge of subsetting variables in R<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->3.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Familiarity with <b><span style="font-family: "courier new" , "courier" , monospace;">c</span></b> to combine variables <o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->4.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Being aware of enclosing variable names in quotes<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
We can use similar syntax to determine standard deviation and the number of observations.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">sd.cars = with(mtcars, sapply(mtcars[c("mpg", "disp", "hp")], sd)); sd.cars<o:p></o:p></span></b></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">n.cars = with(mtcars, sapply(mtcars[c("mpg", "disp", "hp")], length)); n.cars</span><o:p></o:p></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Note that the user needs to know that the command for number of observations is <b><span style="font-family: "courier new" , "courier" , monospace;">length</span></b> and for standard deviation is <b><span style="font-family: "courier new" , "courier" , monospace;">sd</span></b>.<o:p></o:p></div>
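<div class="MsoNoSpacing">
There is one more thing the learner has to know, shown here as a hedged sketch: <b>length</b> counts missing values, while <b>mean</b> and <b>sd</b> return NA if any value is missing. The mtcars data have no missing values, so the example below assumes a copy, cars_na (my own name), with one mpg value blanked out for illustration.</div>

```r
# Work on a copy so the original mtcars data stay untouched.
cars_na <- mtcars[c("mpg", "disp", "hp")]
cars_na$mpg[1] <- NA          # introduce one missing value for illustration

# length() still counts the missing row; count the non-missing values instead.
n.valid  <- sapply(cars_na, function(x) sum(!is.na(x)))

# mean() and sd() need na.rm = TRUE to skip missing values.
mean.valid <- sapply(cars_na, mean, na.rm = TRUE)
sd.valid   <- sapply(cars_na, sd, na.rm = TRUE)

round(cbind(n.valid, mean.valid, sd.valid), 2)
```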
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Once we have the three vectors, we can combine them using <b><span style="font-family: "courier new" , "courier" , monospace;">cbind</span></b> that generates the following table. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">cbind(n.cars, mean.cars, sd.cars)<o:p></o:p></span></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNormal" style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin-bottom: 0cm; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt; word-break: break-all;">
<span style="border: 1pt none windowtext; font-size: 10pt; padding: 0cm;"><span style="font-family: "courier new" , "courier" , monospace;"> n.cars mean.cars sd.cars<o:p></o:p></span></span></div>
<div class="MsoNormal" style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin-bottom: 0cm; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt; word-break: break-all;">
<span style="border: 1pt none windowtext; font-size: 10pt; padding: 0cm;"><span style="font-family: "courier new" , "courier" , monospace;">mpg 32 20.09062 6.026948<o:p></o:p></span></span></div>
<div class="MsoNormal" style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin-bottom: 0cm; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt; word-break: break-all;">
<span style="border: 1pt none windowtext; font-size: 10pt; padding: 0cm;"><span style="font-family: "courier new" , "courier" , monospace;">disp 32 230.72188 123.938694<o:p></o:p></span></span></div>
<div class="MsoNormal" style="background: white; line-height: 11.25pt; margin-bottom: .0001pt; margin-bottom: 0cm; tab-stops: 45.8pt 91.6pt 137.4pt 183.2pt 229.0pt 274.8pt 320.6pt 366.4pt 412.2pt 458.0pt 503.8pt 549.6pt 595.4pt 641.2pt 687.0pt 732.8pt; word-break: break-all;">
<span style="border: 1pt none windowtext; font-size: 10pt; padding: 0cm;"><span style="font-family: "courier new" , "courier" , monospace;">hp 32 146.68750 68.562868</span></span><span style="font-family: "lucida console"; font-size: 10pt;"><o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Again, one needs to know the <b><span style="font-family: "courier new" , "courier" , monospace;">round</span></b> command to restrict the output to a specific number of decimals. See below the output rounded to two decimal places.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="text-align: left;">
<b><span style="font-family: "courier new" , "courier" , monospace;">round(cbind(n.cars, mean.cars, sd.cars), 2)<o:p></o:p></span></b></div>
<div class="MsoNoSpacing" style="text-align: left;">
<br /></div>
<div style="background: white; line-height: 11.25pt; word-break: break-all;">
<span style="font-family: "courier new" , "courier" , monospace;"> n.cars mean.cars sd.cars</span></div>
<div style="background: white; line-height: 11.25pt; word-break: break-all;">
<span style="font-family: "courier new" , "courier" , monospace;">mpg 32 20.09 6.03</span></div>
<div style="background: white; line-height: 11.25pt; word-break: break-all;">
<span style="font-family: "courier new" , "courier" , monospace;">disp 32 230.72 123.94</span></div>
<div style="background: white; line-height: 11.25pt; word-break: break-all;">
<span style="font-family: "courier new" , "courier" , monospace;">hp 32 146.69 68.56</span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
One can indeed use a custom function to generate the same with one command. See below.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">round(with(mtcars, t(sapply(mtcars[c("mpg", "disp", "hp")], <o:p></o:p></span></b></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;"> function(x) c(n=length(x), avg=mean(x),</span></b></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;"> stdev=sd(x))</span></b><b><span style="font-family: "courier new" , "courier" , monospace;">))), 2)</span></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span style="font-family: "courier new" , "courier" , monospace;"> n avg stdev</span></div>
<div class="MsoNoSpacing">
<span style="font-family: "courier new" , "courier" , monospace;">mpg 32 20.09 6.03</span></div>
<div class="MsoNoSpacing">
<span style="font-family: "courier new" , "courier" , monospace;">disp 32 230.72 123.94</span></div>
<div class="MsoNoSpacing">
<span style="font-family: "courier new" , "courier" , monospace;">hp 32 146.69 68.56</span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
But the question I have for my fellow instructors is the following. How likely is an undergraduate student taking an introductory course in statistical analysis to be enthused about R if the simplest of tasks needs multiple lines of code? A simple function in base R could keep students focussed on interpreting data rather than worrying about missing a comma or a parenthesis.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">stargazer</span></b>* is an R package that simplifies this task. Here is the output from <b><span class="" style="font-family: "courier new" , "courier" , monospace;">stargazer</span></b>.<o:p></o:p></div>
<div class="MsoNoSpacing">
<b style="background-color: white;"><span style="border: 1pt none windowtext; padding: 0cm;"><span style="font-family: "courier new" , "courier" , monospace;"><br /></span></span></b></div>
<div class="MsoNoSpacing">
<b style="background-color: white;"><span style="border: 1pt none windowtext; padding: 0cm;"><span style="font-family: "courier new" , "courier" , monospace;">library(stargazer)</span></span></b></div>
<pre style="background: white; line-height: 11.25pt; word-break: break-all;"><span class="gnkrckgcgsb"><b><span style="border: 1pt none windowtext; padding: 0cm;"><span style="font-family: "courier new" , "courier" , monospace;">stargazer(mtcars[c("mpg", "disp", "hp")], type="text")<o:p></o:p></span></span></b></span></pre>
<pre style="background: white; line-height: 11.25pt; word-break: break-all;"></pre>
<pre style="background: white; line-height: 11.25pt; word-break: break-all;">============================================
Statistic N Mean St. Dev. Min Max
--------------------------------------------
mpg 32 20.091 6.027 10.400 33.900
disp 32 230.722 123.939 71.100 472.000
hp 32 146.688 68.563 52 335
--------------------------------------------</pre>
<pre style="background: white; line-height: 11.25pt; word-break: break-all;"></pre>
<div class="MsoNoSpacing">
A simple task, I argue, should be accomplished simply. My plea will be to include in base R a simple command that may generate the above table with a command as simple as the one below:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "courier new" , "courier" , monospace;">descriptives(mpg, disp, hp)</span><o:p></o:p></b></div>
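<div class="MsoNoSpacing">
No such command exists in base R. As a rough sketch of what I have in mind, an instructor could hand students a small wrapper built from the base R pieces used above. The name descriptives, its arguments, and the choice to pass the data frame first are all my assumptions for illustration; this is not an existing function.</div>

```r
# Hypothetical wrapper: descriptives(data, var1, var2, ...) with bare names.
descriptives <- function(data, ..., digits = 2) {
  # Capture the unquoted variable names as character strings.
  vars <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
  stats <- t(sapply(data[vars], function(x) {
    x <- x[!is.na(x)]                      # drop missing values first
    c(N = length(x), Mean = mean(x), SD = sd(x), Min = min(x), Max = max(x))
  }))
  round(stats, digits)                     # one row per variable
}

descriptives(mtcars, mpg, disp, hp)
```

<div class="MsoNoSpacing">
Students would type only the last line. The result is a numeric matrix with one row per variable, so it can be passed on to other table-formatting tools.</div>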
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<i><span style="font-size: x-small;">* Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables. </span></i><span lang="FR"><i><span style="font-size: x-small;">R package version 5.2. <a href="http://cran.r-project.org/package=stargazer">http://CRAN.R-project.org/package=stargazer</a></span></i><o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<br />
<div class="MsoNoSpacing">
<br /></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com6tag:blogger.com,1999:blog-1816488797213954066.post-51991352815757415342018-03-07T10:02:00.002-05:002018-03-07T10:02:52.670-05:00Is it time to ditch the Comparison of Means (T) Test?<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="MsoNoSpacing">
For over a century, academics have been teaching the Student’s T-test and practitioners have been running it to determine if the mean values of a variable for two groups were statistically different. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
It is time to ditch the Comparison of Means (T) Test and rely instead on the ordinary least squares (OLS) Regression. <o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
My motivation for this suggestion is to reduce the learning burden on non-statisticians whose goal is to find a reliable answer to their research question. The current practice is to devote a considerable amount of teaching and learning effort to statistical tests that are redundant in the presence of disaggregate data sets and readily available tools to estimate Regression models.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Before I proceed any further, I must confess that I remain a huge fan of William Sealy Gosset, who introduced the T-statistic in 1908. He excelled in intellect and academic generosity. Mr. Gosset published the very <a href="http://seismo.berkeley.edu/~kirchner/eps_120/Odds_n_ends/Students_original_paper.pdf">paper that introduced the T-statistic</a> under the pseudonym “Student”. To this day, the T-test is known as the Student’s T-test. </div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
My plea is to replace the Comparison of Means (T-test) with OLS Regression, which of course relies on the T-test. So, I am not necessarily asking for ditching the T-test but instead asking to replace the Comparison of Means Test with OLS Regression.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The following are my reasons:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->1.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Pedagogy related reasons:<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 72.0pt; mso-list: l0 level2 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->a.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Teaching Regression instead of other intermediary tests will save instructors considerable time that could be used to illustrate the same concepts with examples using Regression.<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 72.0pt; mso-list: l0 level2 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->b.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Given that there are fewer than 39 hours of lecture time available in a single semester introductory course on applied statistics, much of the course is consumed in covering statistical tests that would be redundant if one were to introduce Regression models sooner in the course.</div>
<div class="MsoNoSpacing" style="margin-left: 72.0pt; mso-list: l0 level2 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->c.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Academic textbooks for undergraduate students in business, geography, psychology dedicate huge sections to a battery of tests that are redundant in the presence of Regression models.<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 108.0pt; mso-list: l0 level3 lfo1; mso-text-indent-alt: -9.0pt; text-indent: -108.0pt;">
<!--[if !supportLists]--><span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span>i.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Consider the widely used textbook <i>Business Statistics</i> by Ken Black that requires students and instructors to leaf through 500 pages before OLS Regression is introduced.<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 108.0pt; mso-list: l0 level3 lfo1; mso-text-indent-alt: -9.0pt; text-indent: -108.0pt;">
<!--[if !supportLists]--><span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span>ii.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->The learning requirements of undergraduate and graduate students not enrolled in economics, mathematics, or statistics programs are quite different. Yet most textbooks and courses attempt to turn all students into professional statisticians.<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 36.0pt; mso-list: l0 level1 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->2.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->Applied Analytics reasons<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 72.0pt; mso-list: l0 level2 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->a.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->OLS Regression model with a continuous dependent variable and a dichotomous explanatory variable produces the exact same output as the standard Comparison of Means Test.<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 72.0pt; mso-list: l0 level2 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->b.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;">      </span><!--[endif]-->Extending the comparison to more than two groups is straightforward in Regression: the categorical explanatory variable simply comprises more than two groups. <o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 108.0pt; mso-list: l0 level3 lfo1; mso-text-indent-alt: -9.0pt; text-indent: -108.0pt;">
<!--[if !supportLists]--><span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;">                                                              </span>i.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;">      </span><!--[endif]-->In the traditional approach to teaching statistics, one advises students that the T-test is not valid for comparing the means of more than two groups and that we must switch to learning a new method, ANOVA.<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 108.0pt; mso-list: l0 level3 lfo1; mso-text-indent-alt: -9.0pt; text-indent: -108.0pt;">
<!--[if !supportLists]--><span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;">                                                             </span>ii.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;">      </span><!--[endif]-->You might have caught my drift: I am also proposing to replace the teaching of ANOVA in introductory statistics courses with OLS Regression.<o:p></o:p></div>
<div class="MsoNoSpacing" style="margin-left: 72.0pt; mso-list: l0 level2 lfo1; text-indent: -18.0pt;">
<!--[if !supportLists]-->c.<span style="font-size: 7pt; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal;"> </span><!--[endif]-->A Comparison of Means Test illustrated as a Regression model is much easier to explain than explaining the output from a conventional T-test. <o:p></o:p></div>
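<div class="MsoNoSpacing">
To make the point about more than two groups concrete, here is a minimal sketch in R. It uses the built-in <b>PlantGrowth</b> data (three groups), which is an assumption made for illustration and not the teaching-ratings data used in this post. The F-statistic from an OLS regression on the group factor should match the one from a classical one-way ANOVA.</div>

```r
# PlantGrowth ships with R: plant weight for three groups (ctrl, trt1, trt2)
fit <- lm(weight ~ group, data = PlantGrowth)

# Overall F-statistic reported by the regression
f_reg <- summary(fit)$fstatistic[["value"]]

# Classical one-way ANOVA assuming equal variances
anova_test <- oneway.test(weight ~ group, data = PlantGrowth, var.equal = TRUE)
f_anova <- anova_test$statistic[["F"]]

# The two F-statistics, side by side
c(regression = f_reg, anova = f_anova)
```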
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
After introducing Normal and T-distributions, I would, therefore, argue that instructors should jump straight to Regression models.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<h2 style="text-align: left;">
<o:p> </o:p><b>Is Regression a valid substitute for T-tests?</b></h2>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
In the following lines, I will illustrate that the output generated by OLS Regression models and a Comparison of Means test are identical. I will illustrate examples using Stata and R.</div>
<div class="MsoNoSpacing">
<br /></div>
<h3 style="text-align: left;">
<b>Dataset</b></h3>
<div class="MsoNoSpacing">
I will use Professor Daniel Hamermesh’s data on teaching ratings to illustrate the concepts. In a popular paper, <a href="https://sites.google.com/site/statsr4us/intro/software/rcmdr-1/Beautyandteacherratings.pdf?attredirects=0&d=1">Professor Hamermesh and Amy Parker</a> explore whether good looking professors receive higher teaching evaluations. The dataset comprises teaching evaluation score, beauty score, and instructor/course related metrics for 463 courses and is available for download in R, Stata, and Excel formats at: </div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<a href="https://sites.google.com/site/statsr4us/intro/software/rcmdr-1">https://sites.google.com/site/statsr4us/intro/software/rcmdr-1</a></div>
<div class="MsoNoSpacing">
<br /></div>
<h3 style="text-align: left;">
<b>Hypothetically Speaking</b></h3>
<div class="MsoNoSpacing">
Let us test the hypothesis that the average beauty scores of male and female instructors are statistically different. The average (normalized) beauty score was -0.084 for male instructors and 0.116 for female instructors. The following Box Plot illustrates the difference in beauty scores. The question we would like to answer is whether this apparent difference is statistically significant.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBsC9la-Ys4RYsJZnxZgktxhRzTwCFM25h5e926NRdwgeRE-JlNStQTXBsj2U2AKKM8WPiNB2oQc_iPCzJSBeMi8GfLl7setpjO8SvMK77hKrOrV7VnVbCnkTvQeJphxcS_YiCJ3bN58E/s1600/2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="657" data-original-width="743" height="353" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBsC9la-Ys4RYsJZnxZgktxhRzTwCFM25h5e926NRdwgeRE-JlNStQTXBsj2U2AKKM8WPiNB2oQc_iPCzJSBeMi8GfLl7setpjO8SvMK77hKrOrV7VnVbCnkTvQeJphxcS_YiCJ3bN58E/s400/2.png" width="400" /></a></div>
<div>
<br /></div>
<div>
I first illustrate the Comparison of Means Test and OLS Regression assuming Equal Variances in <a href="http://www.stata.com/">Stata</a>.</div>
<div>
<br /></div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<h3 style="text-align: left;">
<o:p> </o:p><b>Assuming Equal Variances (Stata)</b></h3>
<div>
<b><br /></b></div>
<div class="MsoNoSpacing">
<b>Download data</b></div>
<div class="MsoNoSpacing">
<b><o:p></o:p></b></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;"> use "https://sites.google.com/site/statsr4us/intro/software/rcmdr-1/teachingratings.dta" <o:p></o:p></span></b></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b style="text-indent: 36pt;"><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">encode gender, gen(sex) // To convert a character variable into a numeric variable.</span></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The T-test is conducted using:<br />
<!--[if !supportLineBreakNewLine]--><br />
<!--[endif]--><o:p></o:p></div>
<div class="MsoNoSpacing">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;"> ttest beauty, by(sex)<o:p></o:p></span></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The above generates the following output:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSqWBPDtQ_CmYll52JgCWYNE31oSepcHktaXQL8bmA52wdWAE0qNgLexv3Kz8FVcreW7NXgng5fkUJilCeNlj2u22LVG-BD2DwMenNKwlmBBpUGuxp9HH4qtW7W7hQGjf61YN9LePAUrU/s1600/3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="302" data-original-width="642" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSqWBPDtQ_CmYll52JgCWYNE31oSepcHktaXQL8bmA52wdWAE0qNgLexv3Kz8FVcreW7NXgng5fkUJilCeNlj2u22LVG-BD2DwMenNKwlmBBpUGuxp9HH4qtW7W7hQGjf61YN9LePAUrU/s1600/3.png" /></a></div>
<div>
<br /></div>
<div class="MsoNoSpacing">
As per the above estimates, the average beauty score of female instructors is 0.2 points higher and the t-statistic is 2.7209. We can generate the same output by running an OLS regression model using the following command:</div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">reg beauty i.sex<o:p></o:p></span></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The regression model output is presented below.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwK7qhothSiF7-fcM28aWORokO82d_kQWp258Aesy1xJja2a0k_s-uoge1LNMJPzaO26-hDkjohb6_lYsVTeJNs0uJ5mfAUiknVa1gqaxUU_aakxOjYufOU5tyC3yfgBZDDOLIAKxgRvk/s1600/4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="264" data-original-width="634" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgwK7qhothSiF7-fcM28aWORokO82d_kQWp258Aesy1xJja2a0k_s-uoge1LNMJPzaO26-hDkjohb6_lYsVTeJNs0uJ5mfAUiknVa1gqaxUU_aakxOjYufOU5tyC3yfgBZDDOLIAKxgRvk/s1600/4.png" /></a></div>
<div>
<br /></div>
<div>
Note that the average beauty score for male instructors is 0.2 points lower than that of females and the associated standard errors and t-values (highlighted in yellow) are identical to the ones reported in the Comparison of Means test. </div>
<div>
<br /></div>
<h3 style="text-align: left;">
<b>Unequal Variances</b></h3>
<div>
<b><br /></b></div>
<div class="MsoNoSpacing">
But what about unequal variances? Let us first conduct the t-test using the following syntax:</div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">ttest beauty, by(sex) unequal<o:p></o:p></span></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The output is presented below:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7OasOO_ZZzK9HLS5q7HKKgDXMTyvi4Qqo2hvrREmxR5Unf9053NaYV_w0sltSJcjhdwgmlowckZrW903VlVM-DaMXdPONTaH1EOm7gfs64-9mBEjhdwe_8q3lgvFXd456HzHsCJFtisw/s1600/5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="297" data-original-width="644" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi7OasOO_ZZzK9HLS5q7HKKgDXMTyvi4Qqo2hvrREmxR5Unf9053NaYV_w0sltSJcjhdwgmlowckZrW903VlVM-DaMXdPONTaH1EOm7gfs64-9mBEjhdwe_8q3lgvFXd456HzHsCJFtisw/s1600/5.png" /></a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Note the slight change in the standard error and the associated t-statistic.</div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
To replicate the same results with a Regression model, we need a different Stata command, <b><u>vwls</u></b>, which estimates a <a href="https://www.stata.com/manuals13/rvwls.pdf">variance weighted least squares regression</a>:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">vwls beauty i.sex<o:p></o:p></span></b></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;"><br /></span></b></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRR_YfxzsUyjiT4DAqXlAge7Jt_ZTrLjoqtnIalFEyw2nfwGjEiR4Rv7MBfbUaaMCVfDpRYWtffU6BIElRn_ZmxLJ-vUwVUoQt2ODyB6i0SS9_nlMFdRs0dtu7bN8mWYBqLp0NjrGqKi4/s1600/6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="231" data-original-width="637" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiRR_YfxzsUyjiT4DAqXlAge7Jt_ZTrLjoqtnIalFEyw2nfwGjEiR4Rv7MBfbUaaMCVfDpRYWtffU6BIElRn_ZmxLJ-vUwVUoQt2ODyB6i0SS9_nlMFdRs0dtu7bN8mWYBqLp0NjrGqKi4/s1600/6.png" /></a></div>
<div>
<br /></div>
<div>
Note that the last two outputs are identical. </div>
<div>
<br /></div>
<h2 style="text-align: left;">
<b>Repeating the same analysis in R</b></h2>
<div>
<b><br /></b></div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
To download data in R, use the following syntax:</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">url = "https://sites.google.com/site/statsr4us/intro/software/rcmdr-1/TeachingRatings.rda"<o:p></o:p></span></b></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b style="text-indent: 36pt;"><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">download.file(url,"TeachingRatings.rda")</span></b></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b style="text-indent: 36pt;"><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">load("TeachingRatings.rda")</span></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For equal variances, the following syntax is used for T-test and the OLS regression model.<o:p></o:p></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">t.test(beauty ~ gender, var.equal=TRUE)<o:p></o:p></span></b></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">summary(lm(beauty ~ gender))<o:p></o:p></span></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The above generates the following output, which is identical to Stata's.<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga8oL4RVq54G2fnPdiF-_HYxcb8gQ_5xXPgFepHeZnyfKPi1Jg-FmJnGC-9eD5TGU-BOMqdj1kqYhA2E1xEOiD-iBoa_r231gAqikajB8akcqqAuyEaMq-PqIiCZcnIjjJwsJyqZn6Q28/s1600/7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="491" data-original-width="551" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEga8oL4RVq54G2fnPdiF-_HYxcb8gQ_5xXPgFepHeZnyfKPi1Jg-FmJnGC-9eD5TGU-BOMqdj1kqYhA2E1xEOiD-iBoa_r231gAqikajB8akcqqAuyEaMq-PqIiCZcnIjjJwsJyqZn6Q28/s1600/7.png" /></a></div>
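<div class="MsoNoSpacing">
If the download link is unavailable, the same equivalence can be checked on data that ships with R. The sketch below is an illustration under that assumption: it uses the built-in <b>mtcars</b> data and compares fuel economy (<b>mpg</b>) across transmission types (<b>am</b>) instead of beauty scores across gender.</div>

```r
# Equal-variance Comparison of Means test on a built-in dataset
tt <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
t_ttest <- unname(tt$statistic)

# The same comparison as an OLS regression on a two-level factor
reg <- summary(lm(mpg ~ factor(am), data = mtcars))
t_reg <- reg$coefficients[2, "t value"]
p_reg <- reg$coefficients[2, "Pr(>|t|)"]

# t.test() reports (group 0 - group 1) and lm() reports (group 1 - group 0),
# so the two t-values differ only in sign.
c(t_test = abs(t_ttest), regression = abs(t_reg))
c(t_test = tt$p.value, regression = p_reg)
```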
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
For unequal variances, we need to install and load the <b>nlme </b>package to run a <b>gls</b> version of the variance weighted least squares Regression model.</div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">t.test(beauty ~ gender)<o:p></o:p></span></b></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">install.packages("nlme")<o:p></o:p></span></b></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">library(nlme)<o:p></o:p></span></b></div>
<div class="MsoNoSpacing" style="text-indent: 36.0pt;">
<b><span style="font-family: "Courier New"; font-size: 8.0pt; mso-bidi-font-size: 11.0pt;">summary(gls(beauty ~ gender, weights=varIdent(form = ~ 1 | gender)))<o:p></o:p></span></b></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The above generates the following output:<o:p></o:p></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCsVHxFe2HE3uwsXlejmAcLjWVKdDsmPVyg_bdgiJL2_JNLFZiV1M0HjlHXwcHjkuRDY3689C2DAQ6oEj2fKYImNTBXf_5-OpeJJU38p26PdU0RgwSEDccqvMs_m5yZIkWngF5L2gG4l4/s1600/8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="506" data-original-width="583" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCsVHxFe2HE3uwsXlejmAcLjWVKdDsmPVyg_bdgiJL2_JNLFZiV1M0HjlHXwcHjkuRDY3689C2DAQ6oEj2fKYImNTBXf_5-OpeJJU38p26PdU0RgwSEDccqvMs_m5yZIkWngF5L2gG4l4/s1600/8.png" /></a></div>
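<div class="MsoNoSpacing">
As an alternative to <b>gls</b>, the idea behind variance weighted least squares can be sketched in base R alone: weight each observation by the inverse of its own group's sample variance and run <b>lm</b>. This is a swapped-in technique for illustration, not the <b>nlme</b> approach shown above, and it again assumes the built-in <b>mtcars</b> data in place of the teaching-ratings data. The t-statistic should match the unequal-variance T-test; the degrees of freedom, and hence the p-values, are computed differently.</div>

```r
# Attach each observation's group-specific sample variance as an inverse weight
d <- mtcars
d$w <- 1 / ave(d$mpg, d$am, FUN = var)

# Unequal-variance (Welch) Comparison of Means test; var.equal = FALSE is the default
welch <- t.test(mpg ~ am, data = d)
t_welch <- unname(welch$statistic)

# Variance-weighted least squares via lm() with inverse-variance weights
vw <- summary(lm(mpg ~ factor(am), data = d, weights = w))
t_vw <- vw$coefficients[2, "t value"]

# Same magnitude, opposite sign (the two functions difference the groups in opposite order)
c(welch = abs(t_welch), weighted_ols = abs(t_vw))
```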
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
So there we have it: OLS Regression is an excellent substitute for the Comparison of Means test.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<o:p></o:p></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-44064019255356785692018-01-28T12:30:00.000-05:002018-01-28T12:30:42.791-05:00When it comes to Amazon's HQ2, you should be careful what you wish for<div dir="ltr" style="text-align: left;" trbidi="on">
By Murtaza Haider and Stephen Moranis<br />
<span style="font-size: x-small;"><i>Note: This article originally appeared in the <a href="http://business.financialpost.com/real-estate/haider-moranis-bulletin-when-it-comes-to-amazons-hq2-toronto-should-be-careful-what-it-wishes-for" target="_blank">Financial Post</a> on January 25, 2018</i></span><br />
<br /><div>
<div class="story-content" itemprop="articleBody" style="background-color: white; border: 0px; box-sizing: inherit; font-family: Helvetica, Arial, sans-serif; font-size: 15px; font-stretch: inherit; font-variant-east-asian: inherit; font-variant-numeric: inherit; line-height: inherit; margin: 0px; padding: 0px; vertical-align: baseline;">
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
Amazon.com Inc. has turned the search for a home for its second headquarters (HQ2) into an episode of The Bachelorette, with cities across North America trying to woo the online retailer.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
The Seattle-based tech giant has narrowed down the choice to 20 cities, with Toronto being the only Canadian location in the running.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
While many in Toronto, including its mayor, are hoping to be the ideal suitor for Amazon HQ2, one must be mindful of the challenges such a union may pose.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
Amazon announced in September last year that its second headquarters will employ 50,000 high-earners with an average salary of US$100,000. It will also require 8 million square feet (SFT) of office and commercial space.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
A capacity-constrained city with a perennial shortage of affordable housing and limited transport capacity, Toronto may be courting trouble by pursuing thousands of additional highly-paid workers. If you think housing prices and rents are unaffordable now, wait until the Amazon code warriors land to fight you for housing or a seat on the subway.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
The tech giants do command a much more favourable view in North America than they do in Europe. Still, their reception varies, especially in the cities where these firms are domiciled. Consider San Francisco, which is home to not one but many tech giants and ever mushrooming startups. The city attracts high-earning tech talent from across the globe to staff innovative labs and R&D departments.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
These highly paid workers routinely outbid locals and other workers in housing and other markets. No longer can one ask for a conditional sale offer that is subject to financing because a 20-something whiz kid will readily pay cash to push other bidders aside.</div>
<img height="637" src="http://wpmedia.business.financialpost.com/2018/01/fp0125_haidermoranis.png?w=640&h=637" style="border: 0px; box-sizing: inherit; font-family: inherit; font-size: inherit; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; height: auto !important; line-height: inherit; margin: 0px; max-width: 100%; padding: 0px; vertical-align: baseline;" width="640" /><div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
Will Toronto’s residents, or those of whichever city ultimately wins Amazon’s heart, face the same competition from Amazon employees as the residents of Seattle do? The answer lies in the relative affordability gap.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
Amazon employees with an average income of US$100,000 will compete against Toronto residents whose individual median income in 2015 was just $30,089. It is quite likely that the bidding wars that high-earning tech workers have won hands down in other cities will end in their favour in the city chosen for Amazon HQ2.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
While we are mindful of the challenges that Amazon HQ2 may pose for a capacity-constrained Toronto, we are also alive to the opportunities it will present. For starters, Toronto can use 50,000 high-paying jobs.</div>
<h3 style="border: 0px; box-sizing: inherit; color: #666666; font-family: MillerDisplayItalic, serif; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; letter-spacing: 2px; line-height: 20px; margin: 30px 0px 10px; padding: 0px; text-transform: uppercase; vertical-align: baseline;">
GIG ECONOMY</h3>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
The emergence of the gig economy has had an adverse impact on the City of Toronto, where employment growth has been concentrated largely in the part-time category. Between 2006 and 2016, full-time jobs grew by a mere 8.7 per cent in Toronto, while the number of part-time jobs grew at four times that rate.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
While Toronto is already the largest employment hub in Canada, with an inventory of roughly 180 million square feet, an influx of 8 million square feet of first-rate office space will improve the overall quality of commercial real estate in the city. It could also be a boon for office construction and a significant source of new property tax revenue for the city.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
But those hoping the city itself might make money should seriously consider the fate of cities lucky enough to host the Olympics, which more often than not end up costing cities billions more than they budgeted for.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
Toronto may still pursue Amazon HQ2, but it should do so with the full knowledge of its strengths and vulnerabilities. At the very least, it should create contingency plans to address the resulting infrastructure deficit (not just public transit) and housing affordability issues before it throws open its doors for Amazon.</div>
<div style="-webkit-font-smoothing: antialiased; border: 0px; box-sizing: inherit; font-size: 17px; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: 30px; margin-bottom: 20px; padding: 0px; vertical-align: baseline;">
<em style="border: 0px; box-sizing: inherit; font-family: inherit; font-size: inherit; font-stretch: inherit; font-variant: inherit; font-weight: inherit; line-height: inherit; margin: 0px; padding: 0px; vertical-align: baseline;">Murtaza Haider is an associate professor at Ryerson University. Stephen Moranis is a real estate industry veteran. They can be reached at <a href="https://mail.google.com/mail/?view=cm&fs=1&tf=1&to=info@hmbulletin.com" style="border-bottom-color: rgb(46, 78, 191); border-bottom-style: solid; border-image: initial; border-left-color: initial; border-left-style: initial; border-right-color: initial; border-right-style: initial; border-top-color: initial; border-top-style: initial; border-width: 0px 0px 2px; box-sizing: inherit; color: black; font-family: inherit; font-size: inherit; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: inherit; margin: 0px; padding: 0px; text-decoration-line: none; vertical-align: baseline; word-wrap: break-word;" target="_blank">info<span class="mentions-prefix" style="border: 0px; box-sizing: inherit; font-family: inherit; font-size: inherit; font-stretch: inherit; font-style: inherit; font-variant: inherit; font-weight: inherit; line-height: inherit; margin: 0px; padding: 0px; vertical-align: baseline;">@</span>hmbulletin.com</a>.</em></div>
<div>
<em style="border: 0px; box-sizing: inherit; font-family: inherit; font-size: inherit; font-stretch: inherit; font-variant: inherit; font-weight: inherit; line-height: inherit; margin: 0px; padding: 0px; vertical-align: baseline;"><br /></em></div>
</div>
</div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-67365799728945205852017-06-16T12:00:00.000-04:002017-06-16T12:00:45.896-04:00Did the cold weather put a chill on Toronto’s housing market?<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="MsoNormal">
<span lang="EN-CA">Toronto’s
housing market took a dive in May. After years of record highs in housing sales
and prices, the hype seems to have evaporated. While some link the slowdown to the
Ontario government’s legislation to tighten lending in housing markets, one
should also factor in the unusually cold, dark, and wet weather in May that
felt more like a ‘May-be.’</span><o:p></o:p></div>
<div class="MsoNormal">
<span lang="EN-CA"> </span><o:p></o:p></div>
<div class="MsoNormal">
<span lang="EN-CA">Housing
sales in the greater Toronto area (GTA) were down 23% last month from a year
earlier. However, the average sales price was 14.8% higher than the price in
May 2016. On a month-over-month basis, housing prices in May were 6% lower than
the prices in April.</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">The
declining numbers have alarmed homebuyers, sellers, brokerages, and
governments. Many are questioning whether the Ontario government’s intervention has had
a more adverse impact than was intended. Homebuyers who have not yet closed on
properties are wondering whether they have paid too much, while sellers are
rushing to list properties to benefit from high housing prices that appear to
be past their peak.</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">While it is
too early to determine the ‘causal impact’ of the legislative changes
introduced in April, which included a 15% tax on foreign home buyers, one must
also consider other factors that might have affected Toronto’s
housing market. We must even consider the influence of the weather. </span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">The
unusually cold weather in May might have had a chilling effect on housing
sales. Typically, housing markets start to heat up in April, in
sync with the rising temperatures. May 2017 was unusually wet. Toronto
received a total of 157 mm of precipitation last month compared to 25 mm a year
ago. The unusually high rainfall caused flooding all across Ontario. In
downtown Toronto, Lake Ontario </span><a href="https://www.theglobeandmail.com/news/toronto/lake-ontario-flooding-seeps-into-downtown-toronto-condo-buildings/article35270018/"><span lang="EN-CA">water rushed into lakefront condos</span></a><span lang="EN-CA">.
At the same time, May 2017 was noticeably cooler than a year earlier. The
average temperature last month was 12 degrees Celsius compared to 16 degrees in
May 2016. May was also unusually dark with much less sunshine. However, Toronto has seen this trend since January
2017, when it received a mere 50 hours of sunlight compared to the seasonal
average of 85 hours. <o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW19txyOuGRGby28oxveR_C8MwieCBf92w9ekxwX0Pe7wwrrz0e6mxC7ERjjGOP4n8KR79XTmnRX6q4BPSxCBXex8JI49NE-zSdxpA4X1oNiONOU5A5n1UwPjIq3oM1NvQQ4ahNe-0T4Q/s1600/rain.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="288" data-original-width="481" height="238" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW19txyOuGRGby28oxveR_C8MwieCBf92w9ekxwX0Pe7wwrrz0e6mxC7ERjjGOP4n8KR79XTmnRX6q4BPSxCBXex8JI49NE-zSdxpA4X1oNiONOU5A5n1UwPjIq3oM1NvQQ4ahNe-0T4Q/s400/rain.PNG" width="400" /></a></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA"> </span><span lang="EN-CA">So why
should unusually cold, dark, and wet weather have any impact on housing
markets? Research has shown that weather and atmosphere influence consumer behavior. Retail experts call this phenomenon ‘</span><a href="http://onlinelibrary.wiley.com/doi/10.1002/mar.20709/abstract" target="_blank"><span lang="EN-CA">store
atmospherics</span></a><span lang="EN-CA">’ where
a store’s environment is altered to enhance consumer behavior that may promote sales. It applies to housing markets as
well. </span><a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2598559" target="_blank"><span lang="EN-CA">Researchers
discovered</span></a><span lang="EN-CA"> that
adverse weather has a significant, yet short-term,
effect on economic activity. Writing in </span><a href="http://onlinelibrary.wiley.com/doi/10.1111/1540-6229.00408/full" target="_blank"><i><span lang="EN-CA">Real Estate
Economics</span></i></a><span lang="EN-CA">, John
Goodman Jr. found a slight adverse impact of unseasonable weather on housing
markets. In related work, researchers found that </span><a href="http://www.nber.org/papers/w18212" target="_blank"><span lang="EN-CA">sale prices of homes with central
air-conditioning</span></a><span lang="EN-CA"> and
swimming pools are higher for sales recorded in summer months.</span></div>
<div class="MsoNormal">
<o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">There are
other factors to consider in assessing the market dip. The Ontario government’s
regulations to tighten housing markets could have encouraged some homebuyers to
advance their purchase to avoid uncertainty. The government’s plans to impose
new restrictions on housing markets were known
in advance of their announcement in April. Investors are averse to risk and
uncertainty. Hence, some homebuyers could have advanced their purchase to March, when
the sales unexpectedly jumped by 50% over February 2017. As for those who could
not advance their purchase to March, they may have decided to sit through this
confusion and wait for calmer markets to prevail.</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">In </span><a href="http://munkschool.utoronto.ca/imfg/uploads/351/imfgpaper_no28_landtransfer_haider_anwar_holmes_july_28_2016.pdf" target="_blank"><span lang="EN-CA">earlier
research</span></a><span lang="EN-CA">, we
documented a similar trend for housing sales in Toronto, when sales escalated
in 2007 in advance of Toronto’s new land transfer tax, which was implemented in February 2008. The
additional sales recorded in 2007 meant that fewer sales were realized in 2008. The sales activity
returned to the long-term trends in a couple of years. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">And as if</span><span lang="EN-CA"> this were
not enough, financial troubles at the alternative mortgage lender, Home Capital,
spooked borrowers who were not deemed
mortgage worthy by the mainstream Canadian banks. Many real estate
professionals believe the cumulative effect of unseasonal weather, tightening
of mortgage regulations, and troubles at alternative lenders was likely the reason behind the declining
housing sales and prices.</span><o:p></o:p></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">The roof is
not collapsing on Toronto’s housing market. The decline in sales and prices is
a rational response by homebuyers and sellers who are reacting to the Ontario
government’s initiatives to tighten lending in housing markets. The cold, dark,
and wet weather certainly did not help either.</span></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-15942163531879178632016-09-14T21:09:00.000-04:002016-09-14T21:09:05.970-04:00Data Science 101, now online<div dir="ltr" style="text-align: left;" trbidi="on">
We are delighted to note that IBM's <a href="http://bigdatauniversity.com/">BigDataUniversity.com</a> has launched the quintessential introductory course on data science aptly named <a href="https://bigdatauniversity.com/courses/data-science-101/">Data Science 101.</a><br />
<br />
The target audience for the course is the uninitiated cohort that is curious about data science and would like to take the baby steps to a career in data and analytics. Needless to say, the course is for absolute beginners.<br />
<br />
To get a taste of the course, watch the following video, "<a href="https://www.youtube.com/watch?v=z1kPKBdYks4" target="_blank">What is Data Science?</a>"<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe width="320" height="266" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/z1kPKBdYks4/0.jpg" src="https://www.youtube.com/embed/z1kPKBdYks4?feature=player_embedded" frameborder="0" allowfullscreen></iframe></div>
<br />
<b>Here is the curriculum:</b><br />
<br />
<ul style="box-sizing: inherit; color: #373a3c; font-family: "Source Sans Pro", sans-serif; font-size: 16px; line-height: 24px; margin-bottom: 1rem; margin-top: 0px;">
<li style="box-sizing: inherit;"><span style="box-sizing: inherit; font-weight: 700;">Module 1 - Defining Data Science</span><ul style="box-sizing: inherit; margin-bottom: 0px; margin-top: 0px;">
<li class="ace-line" id="magicdomid21" style="box-sizing: inherit;"><span class="author-286633433 font-color-3c3c3c font-size-medium" style="box-sizing: inherit;">What is data science?</span></li>
<li class="ace-line" id="magicdomid22" style="box-sizing: inherit;"><span class="author-286633433 font-color-3c3c3c" style="box-sizing: inherit;">There are many paths to data science</span></li>
<li class="ace-line" id="magicdomid23" style="box-sizing: inherit;"><span class="author-286633433 font-color-3c3c3c font-size-medium" style="box-sizing: inherit;">Any advice for a new data scientist?</span></li>
<li class="ace-line" id="magicdomid24" style="box-sizing: inherit;"><span class="author-286633433 font-color-3c3c3c font-size-medium" style="box-sizing: inherit;">What is the cloud?</span></li>
<li class="ace-line" id="magicdomid25" style="box-sizing: inherit;"><span class="author-286633433 font-color-3c3c3c font-size-medium font-size-small" style="box-sizing: inherit;">"Data Science: The Sexiest Job in the 21st Century"</span></li>
</ul>
</li>
<li style="box-sizing: inherit;"><span style="box-sizing: inherit; font-weight: 700;">Module 2 - What do data science people do?</span><ul style="box-sizing: inherit; margin-bottom: 0px; margin-top: 0px;">
<li style="box-sizing: inherit;">A day in the life of a data science person</li>
<li style="box-sizing: inherit;">R versus Python?</li>
<li style="box-sizing: inherit;">Data science tools and technology</li>
<li style="box-sizing: inherit;">"Regression"</li>
</ul>
</li>
<li style="box-sizing: inherit;"><span style="box-sizing: inherit; font-weight: 700;">Module 3 - Data Science in Business</span><ul style="box-sizing: inherit; margin-bottom: 0px; margin-top: 0px;">
<li style="box-sizing: inherit;">How should companies get started in data science?</li>
<li style="box-sizing: inherit;">R versus Python</li>
<li style="box-sizing: inherit;"><div class="ace-line" id="magicdomid37" style="box-sizing: inherit;">
<span class="author-286633433 font-color-3c3c3c font-size-medium" style="box-sizing: inherit;">Tips for recruiting data science people</span></div>
</li>
<li style="box-sizing: inherit;"><div class="ace-line" id="magicdomid37" style="box-sizing: inherit;">
<span class="author-286633433 font-color-3c3c3c font-size-medium font-size-small" style="box-sizing: inherit;">"The Final Deliverable"</span></div>
</li>
</ul>
</li>
<li style="box-sizing: inherit;"><span style="box-sizing: inherit; font-weight: 700;">Module 4 - Use Cases for Data Science</span><ul style="box-sizing: inherit; margin-bottom: 0px; margin-top: 0px;">
<li style="box-sizing: inherit;">Applications for data science</li>
<li style="box-sizing: inherit;">"The Report Structure"</li>
</ul>
</li>
<li style="box-sizing: inherit;"><span style="box-sizing: inherit; font-weight: 700;">Module 5 - Data Science People</span><ul style="box-sizing: inherit; margin-bottom: 0px; margin-top: 0px;">
<li style="box-sizing: inherit;">Things data science people say</li>
<li style="box-sizing: inherit;">"What Makes Someone a Data Scientist?"</li>
</ul>
</li>
</ul>
Want to learn more about IBM's Big Data University? <a href="https://bigdatauniversity.com/about-us/" target="_blank">Click here</a>.<br />
<br /></div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-89250621128944766832016-09-01T16:30:00.006-04:002016-09-01T16:53:01.374-04:00The X-Factors: Where 0 means 1<div dir="ltr" style="text-align: left;" trbidi="on">
Hadley Wickham in <a href="https://www.r-bloggers.com/forcats-0-1-0-%F0%9F%90%88%F0%9F%90%88%F0%9F%90%88%F0%9F%90%88/">a recent blog post</a> mentioned that "Factors have a bad rap in R because they often turn up when you don’t want them." I believe Factors are an even bigger concern. They not only turn up where you don't want them, but they also turn things around when you don't want them to.<br />
<br />
Consider the following example where I present a data set with two variables: <b><span style="font-family: "courier new" , "courier" , monospace;">x </span></b>and <b><span style="font-family: "courier new" , "courier" , monospace;">y</span></b>. I represent age in years as '<b><span style="font-family: "courier new" , "courier" , monospace;">y</span></b>' and gender as a binary (0/1) variable as '<b><span style="font-family: "courier new" , "courier" , monospace;">x</span></b>' where 1 represents males.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnRsgaFXHlc55WPaqsbOkG6ydoa80XExSDcnQeKtKjkAgUu3l2wfZGaiEOjPgWy39yMaX2A81hxJ4ymAEccDivflTuY0FZWLrzUv-sBEBa59DP-1h8JTuqfcXVa-p4D6Wi24YqDVCKbS8/s1600/Fig.1.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhnRsgaFXHlc55WPaqsbOkG6ydoa80XExSDcnQeKtKjkAgUu3l2wfZGaiEOjPgWy39yMaX2A81hxJ4ymAEccDivflTuY0FZWLrzUv-sBEBa59DP-1h8JTuqfcXVa-p4D6Wi24YqDVCKbS8/s1600/Fig.1.PNG" /></a></div>
I compute the means for the two variables as follows:<br />
<div>
<br /></div>
<div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8ZUStb3zbmCy8P7eXK-tJyT9QsIc6DA21fhZRnfRFzCrvxpQUxoaPZ8YI04A4YFII4jL2cyFr6DWLB9KhsgT9-7sZ-GckmxL3NBQXZt7w4cO9dtruf4ghWW4bSsVB-6k2fvyznyZQyjc/s1600/Fig.2.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8ZUStb3zbmCy8P7eXK-tJyT9QsIc6DA21fhZRnfRFzCrvxpQUxoaPZ8YI04A4YFII4jL2cyFr6DWLB9KhsgT9-7sZ-GckmxL3NBQXZt7w4cO9dtruf4ghWW4bSsVB-6k2fvyznyZQyjc/s1600/Fig.2.PNG" /></a></div>
<br /></div>
<div>
The average age is 43.6 years, and 0.454 suggests that 45.4% of the sample comprises males. So far so good. </div>
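<div>
A minimal sketch of this step in R, assuming a small made-up data set. The values below are invented for illustration and are not the data shown in the figures, so the means will differ from those reported here.</div>

```r
# Hypothetical sample of 11 respondents (invented values, not the post's data).
# x: gender coded 0/1 (1 = male); y: age in years.
df <- data.frame(
  x = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0),
  y = c(25, 32, 41, 56, 38, 47, 29, 60, 52, 44, 35)
)

# The mean of a 0/1 variable equals the share of ones, here the share of males.
means <- colMeans(df)
print(means)
```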
<div>
<br /></div>
<div>
Now let's see what happens when I convert <b><span style="font-family: "courier new" , "courier" , monospace;">x</span></b> into a factor variable using the following syntax:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_YWj_cpM5iunhPKqIycWdXCFTszRSDMfpNskhrX_HNnzOiPv6jt48v4nz3PM4XZHPiRD56fQwWuTDsHGJkH0CNU816Q_KWPcYApCiXcqrqgVF_uEDxjEmcyKhQ-4V1OdNnB0TiIX3Ekg/s1600/fig3.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_YWj_cpM5iunhPKqIycWdXCFTszRSDMfpNskhrX_HNnzOiPv6jt48v4nz3PM4XZHPiRD56fQwWuTDsHGJkH0CNU816Q_KWPcYApCiXcqrqgVF_uEDxjEmcyKhQ-4V1OdNnB0TiIX3Ekg/s1600/fig3.PNG" /></a></div>
<br />
The above code adds a new variable <b><span style="font-family: "courier new" , "courier" , monospace;">male </span></b>to the data set, and assigns labels <b>female </b>and <b>male </b>to the categories 0 and 1 respectively.</div>
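<div>
For readers who cannot see the figure, the conversion might look like the sketch below. The data frame is the same invented sample as before, and the exact call in the figure may differ; <b>factor()</b> with its <b>levels</b> and <b>labels</b> arguments is one common way to do it.</div>

```r
# Hypothetical sample (invented values, not the post's data).
df <- data.frame(
  x = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0),
  y = c(25, 32, 41, 56, 38, 47, 29, 60, 52, 44, 35)
)

# Map the numeric codes 0 and 1 to the labels "female" and "male".
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# Inspect the result: the level order follows the order given in `levels`.
print(levels(df$male))
print(table(df$male))
```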
<div>
<div>
<br /></div>
<div>
I compute the average age for males and females as follows:</div>
<div>
<br /></div>
<div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb5CKo2vvnWNaQlmtgxggtf7W4K5FcO6zzDEJonMTB6TWomrUYgfhrYH88vQDtLnlAX68Srl8nwQzv9U2eCIWnHGyihr-9QVMSf4rOYUyD_eqeCdlrSfKdupqiKuEpOPZqhLxI5xWnZ3s/s1600/fig4.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhb5CKo2vvnWNaQlmtgxggtf7W4K5FcO6zzDEJonMTB6TWomrUYgfhrYH88vQDtLnlAX68Srl8nwQzv9U2eCIWnHGyihr-9QVMSf4rOYUyD_eqeCdlrSfKdupqiKuEpOPZqhLxI5xWnZ3s/s1600/fig4.PNG" /></a></div>
<br /></div>
</div>
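<div>
One way to compute group means of this kind is <b>tapply()</b>. The sketch below again uses the invented sample, so the figures it prints are not those of the post.</div>

```r
# Hypothetical sample (invented values, not the post's data).
df <- data.frame(
  x = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0),
  y = c(25, 32, 41, 56, 38, 47, 29, 60, 52, 44, 35)
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))

# Average age within each level of the factor.
age_by_gender <- tapply(df$y, df$male, mean)
print(age_by_gender)
```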
<div>
See what happens when I try to compute the mean for the variable '<span style="font-family: "courier new" , "courier" , monospace;"><b>male</b></span>'.<br />
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglFi0xQC9ZOVx1Yqzi6GJNAvVS__eETHSec7vgnqjEQwpmah1J_Pn4PFwoMmh-KmXszOBMz9l8O-lRjQvTPL-Kibv3EeR-tTM3SDHALa-_YrwFHLwwotco8Zz8yxXPH3GHFnU0sV1s15k/s1600/fig5.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEglFi0xQC9ZOVx1Yqzi6GJNAvVS__eETHSec7vgnqjEQwpmah1J_Pn4PFwoMmh-KmXszOBMz9l8O-lRjQvTPL-Kibv3EeR-tTM3SDHALa-_YrwFHLwwotco8Zz8yxXPH3GHFnU0sV1s15k/s1600/fig5.PNG" /></a></div>
<br /></div>
<div>
<div>
Once you factor a variable, you can't compute statistics such as <b>mean </b>or <b>standard deviation</b>. To do so, you need to declare the factor variable as numeric. I create a new variable <span style="font-family: "courier new" , "courier" , monospace;"><b>gender</b> </span>that converts the <span style="font-family: "courier new" , "courier" , monospace;"><b>male</b> </span>variable to a numeric one.</div>
</div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5j68YLVWIQp0PJJ8RJRIsFqLVA4520UuhwuxUfbtYLkScFtESaOYRsRdLP4AjY8Fyi7FvPebJBxTVc4GwfdFzBFuKL5wRBq-KQ4Xvp34Zz0U-rlWL11AH7grCk5oN4kgV-u5PsQ3d1jA/s1600/fig6.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5j68YLVWIQp0PJJ8RJRIsFqLVA4520UuhwuxUfbtYLkScFtESaOYRsRdLP4AjY8Fyi7FvPebJBxTVc4GwfdFzBFuKL5wRBq-KQ4Xvp34Zz0U-rlWL11AH7grCk5oN4kgV-u5PsQ3d1jA/s1600/fig6.PNG" /></a></div>
</div>
<div>
I recompute the means below. </div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3Q-jtvbGR2uSdhHYSAl80mribnUJ1LSlET0pJRqDLaMJvuZ6HblrK5yykdpabj4nzsMZsq9mTUW0dZ14YbsBb7DYorZhdioBN_kUBIpUEYiw7A_-LQbSredsCg4iVhtN8ThjhqXGGwVE/s1600/fig7.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi3Q-jtvbGR2uSdhHYSAl80mribnUJ1LSlET0pJRqDLaMJvuZ6HblrK5yykdpabj4nzsMZsq9mTUW0dZ14YbsBb7DYorZhdioBN_kUBIpUEYiw7A_-LQbSredsCg4iVhtN8ThjhqXGGwVE/s1600/fig7.PNG" /></a></div>
<br /></div>
<div>
Note that the average for <b>males </b>is 1.45 and not 0.45. Why? R stores a factor as integer codes that start at 1, so converting the factor to numeric turned zeros into ones and ones into twos. Let's look at the data set below:</div>
<div>
<br /></div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhumTQn1O_RXcTYNiFzJDUM1k6IWYKoA_2q27vx_p11S8PU4k4NTZ2awboLNzGKzbk9jYW8qB4QhbnFMhmFxt8bXrNPmtE_oFFd46-BeRtwGnVkSsRXL2h2XfIbplripJ-CTd7zgmM9eZo/s1600/fig8.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhumTQn1O_RXcTYNiFzJDUM1k6IWYKoA_2q27vx_p11S8PU4k4NTZ2awboLNzGKzbk9jYW8qB4QhbnFMhmFxt8bXrNPmtE_oFFd46-BeRtwGnVkSsRXL2h2XfIbplripJ-CTd7zgmM9eZo/s1600/fig8.PNG" /></a></div>
</div>
</div>
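<div>
The shift from 0/1 to 1/2 can be reproduced with a few lines. This is a sketch on an invented vector, not the post's data. It also shows two ways of getting the original 0/1 coding back, which are suggestions rather than anything used in the figures.</div>

```r
# Hypothetical 0/1 gender vector (invented values), with 5 males out of 11.
x <- c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0)
male <- factor(x, levels = c(0, 1), labels = c("female", "male"))

# as.numeric() on a factor returns its internal integer codes (1 and 2),
# not the original 0/1 values.
codes <- as.numeric(male)
print(mean(codes))  # 1 + 5/11, about 1.45 for this sample

# Two ways to recover a 0/1 variable:
gender01 <- as.integer(male == "male")  # compare against the label
gender01_alt <- as.numeric(male) - 1    # shift the codes down by one
print(mean(gender01))
```

<div>
Comparing against the label is the safer of the two, because it does not depend on the order in which the levels were declared.</div>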
<div>
<br /></div>
<div>
Several algorithms in R expect a binary variable to be coded 0/1. If this condition is not satisfied, the command returns an error. For instance, when I try to estimate the logit model with <span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;"><b>gender</b> </span>as the dependent variable and <b><span style="font-family: &quot;courier new&quot; , &quot;courier&quot; , monospace;">y </span></b>as the explanatory variable, R generates the following error:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><b><br /></b></span></div>
<div>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWQLvzKeYgIacObOK36eqOuW46dedmCPhOe9RaTgE_iOD75wpCAMKo0o0K14cX6ptHsWUqqMRKosmJRJd30M2PnBRD_6Hwwjix_EpCLXA4qoHgUl0V8oGwKxzh6UhITutRywf-2Fej0So/s1600/fig9.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiWQLvzKeYgIacObOK36eqOuW46dedmCPhOe9RaTgE_iOD75wpCAMKo0o0K14cX6ptHsWUqqMRKosmJRJd30M2PnBRD_6Hwwjix_EpCLXA4qoHgUl0V8oGwKxzh6UhITutRywf-2Fej0So/s1600/fig9.PNG" /></a></div>
Factor or no factor, I would prefer my zeros to stay as zeros!</div>
</div>
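<div>
A sketch of the failure and one possible fix, again on the invented sample. With the binomial family, <b>glm()</b> rejects a numeric response that falls outside the 0 to 1 range, so the 1/2 codes fail while the recoded 0/1 variable fits. The variable name <b>gender01</b> is hypothetical.</div>

```r
# Hypothetical sample (invented values, not the post's data).
df <- data.frame(
  x = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0),
  y = c(25, 32, 41, 56, 38, 47, 29, 60, 52, 44, 35)
)
df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))
df$gender <- as.numeric(df$male)  # internal codes: 1 = female, 2 = male

# Logit on the 1/2 codes: capture the error message instead of stopping.
fit_bad <- tryCatch(
  glm(gender ~ y, data = df, family = binomial),
  error = function(e) conditionMessage(e)
)
print(fit_bad)

# Recode to 0/1 and the same model can be estimated.
df$gender01 <- df$gender - 1
fit_ok <- glm(gender01 ~ y, data = df, family = binomial)
print(coef(fit_ok))
```

<div>
Alternatively, the factor itself can be used as the response, as in <b>glm(male ~ y, data = df, family = binomial)</b>. In that case R treats the first level as the failure category.</div>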
<div>
<br />
<br />
<br /></div>
</div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com3tag:blogger.com,1999:blog-1816488797213954066.post-9171539010234826502016-08-22T01:57:00.000-04:002016-08-22T01:57:19.545-04:00Five Questions about Data Science<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
From Safari Books Online (<a href="https://www.safaribooksonline.com/blog/2016/02/10/data-science-qa/">https://www.safaribooksonline.com/blog/2016/02/10/data-science-qa/</a>) </div>
<div style="text-align: left;">
---</div>
<div style="text-align: left;">
<br /></div>
Recently, we were able to ask five questions of<b><i> <a href="https://twitter.com/regionomics">Murtaza Haider</a>,</i></b> about the new book from <a href="http://www.redbooks.ibm.com/ibmpress/">IBM Press</a> called <a href="https://www.safaribooksonline.com/library/view/getting-started-with/9780133991246/">“Getting Started with Data Science: Making Sense of Data with Analytics.”</a> Below, the author talks about the benefits of data science in today’s professional world.<div>
<br /><div style="text-align: left;">
<a href="https://www.safaribooksonline.com/library/view/getting-started-with/9780133991246/" style="background: rgb(255, 255, 255); box-sizing: border-box; clear: right; color: #e98300; float: right; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px; margin-bottom: 1em; margin-left: 1em; text-decoration: none;"><img alt="Getting Started with Data Science" class="alignnone wp-image-22033" height="152" src="https://www.safaribooksonline.com/blog/wp-content/uploads/2016/02/download.jpg" style="border: 0px; box-sizing: border-box; height: auto; margin-bottom: 10px; margin-top: 10px; max-width: 100%; vertical-align: middle;" width="102" /></a><span style="background-color: white; color: #333333; font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; line-height: 20px;"></span></div>
<h3 style="text-align: left;">
1. What are some examples of data science altering or impacting traditional professional roles already?</h3>
<div style="box-sizing: border-box; margin-bottom: 1.5em; text-align: left;">
Only a few years ago, the job title of Chief Data Scientist did not exist. But that was then. Small and large corporations, and increasingly government agencies, are putting together teams of data scientists and analysts under the leadership of chief data scientists. Even the White House has a Chief Data Scientist position, currently held by Dr. DJ Patil.<br /><br />The traditional role for those who analyzed data was that of a computer programmer or a statistician. In the past, firms collected large amounts of data to archive rather than to subject it to analytics to assist with smart decision-making. Companies did not see value in turning data into insights and instead relied on the gut feeling of managers and anecdotal evidence to make decisions.<br /><br />Big data and analytics have alerted businesses and governments to the latent potential of turning bits and bytes into profits. To enable this transformation, hundreds of thousands of data scientists and analysts are needed. Recent reports suggest that the shortage of such professionals will be in the millions. No wonder we see hundreds of postings for data scientists on LinkedIn.<br /><br />As businesses increasingly depend upon analytics-driven decision making, data scientists and analysts are becoming front-office superstars, which is quite a change from their being back-office workers in the past.</div>
<h3 style="text-align: left;">
2. What steps can a professional take today to learn how and why to implement data science into their current role?</h3>
Sooner rather than later, workers will find their managers asking them to assume additional responsibilities that involve dealing with data, and either generating or consuming analytics. Smart professionals who are uninitiated in data science should therefore proactively address this shortcoming in their portfolio by acquiring skills in data science and analytics. Fortunately, in a world awash with data, the opportunities to acquire analytic skills are also ubiquitous.<br /><br />For starters, professionals should consider enrolling in open online courses offered by the likes of Coursera and BigDataUniversity.com. These platforms offer a wide variety of training opportunities for beginners and advanced users of data and analytics. At the same time, most of these offerings are free.<br /><br />For those professionals who would like to pursue a more structured approach, I suggest that they consider continuing education programs in data and analytics offered by local universities. While working full-time, professionals can take part-time courses in data science to fill the gap in their learning and be ready to embrace the impending change in their roles.<br />
<div style="box-sizing: border-box; margin-bottom: 1.5em; text-align: left;">
<br /></div>
<h3 style="text-align: left;">
3. Do you need programming experience to get started in data? What kind of methods and techniques can you utilize in a program more commonly used, such as Excel?</h3>
Computer programming skills are a definite plus for data scientists, but they are certainly not a limiting factor that would prevent those trained in other disciplines from joining the world of data scientists. In my book, Getting Started with Data Science, I mentioned examples of individuals who took short courses in data science and programming after graduating from non-empirical disciplines, and subsequently were hired in data scientist roles that paid lucrative salaries.<br /><br />The choice of analytics tools depends largely on the discipline and the type of organization you are currently working for or intend to work for in the future. If you intend to work for corporations that generate real big data, such as telecom and Internet-based establishments, you need to be proficient in big data tools, such as Spark and Hadoop. If you would like to be employed in the industry that tracks social media, you would require skills in natural language processing and proficiency in languages such as Python. If you happen to be interested in a traditional market research firm, you need proficiency in analytics software, such as SPSS and R.<br /><br />If your focus is on small and medium-sized enterprises, proficiency in Excel could be a great asset, which would allow you to deploy its analytics capabilities, such as Pivot Tables, to work with small data sets.<br /><br />A successful data scientist is one who knows some programming, has a basic understanding of statistical principles, possesses a curious mind, and is capable of telling great stories. I argue that without storytelling capabilities, a data scientist will be limited in his or her ability to become a leader in the field.<br />
<div style="box-sizing: border-box; margin-bottom: 1.5em; text-align: left;">
<br /></div>
<h3 style="text-align: left;">
4. How do you see data science affecting education and training moving forward? What benefits will it bring to learning at all levels?</h3>
Schools, colleges, universities and others involved in education and learning are putting big data and analytics to good use. Universities are crunching large amounts of data to determine what gaps in learning at the high school level act as impediments to success in the future. Schools are improving not just curriculum, but also other strategies to improve learning outcomes. For instance, research in India using large amounts of data showed that when children in low-income communities were offered free meals at school, their dropout rates declined and their academic achievements improved.<br /><br />Big data and analytics provide instructors and administrators the opportunity to test their hypotheses about what works and what doesn’t in learning, and to replace anecdotes with hard evidence to improve pedagogy and learning. Learning has taken a new shape and form with open online courses in all disciplines. These transformative changes in learning have been enabled by advances in information and communication technologies, and the ability to store massive amounts of data.<br />
<div style="box-sizing: border-box; margin-bottom: 1.5em; text-align: left;">
</div>
<h3 style="text-align: left;">
5. Do you think that modern governments and societies are prepared for what changes that big data and data science might bring to the world?</h3>
Change is inevitable. Whether or not modern governments and societies like it, they will have to embrace change. Fortunately, smart governments and societies have already embraced data-driven decision-making and evidence-based planning. Governments in developing countries are already using data and analytics to devise effective poverty-reducing strategies. Municipal governments in developed economies are using data and advanced analytics to find solutions to traffic congestion. Research in health and well-being is leveraging big data to discover new medicines and cures for illnesses that challenge us all.<br /><br />As societies embrace data and analytics as tools to engineer prosperity and well-being, our collective abilities to achieve a better tomorrow will be further enhanced.</div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-90263035384710586232016-08-10T17:40:00.002-04:002016-08-10T17:40:53.061-04:00So you want to be a data scientist<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
From <a href="http://www.huffingtonpost.ca/murtaza-haider/data-scientist-career_b_11426570.html" target="_blank">HuffingtonPost</a></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
The <i>New York Times</i> made it look so easy. Take a few courses in data science and a web-based startup will readily pay top dollar for your newly acquired skills.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">Since the <a href="http://www.mckinsey.com/business-functions/business-technology/our-insights/big-data-the-next-frontier-for-innovation" target="_hplink"><span lang="EN-US">McKinsey Global Institute</span></a> reported on the impending shortage of data crunchers, wannabe data scientists have been searching for learning opportunities in big data analytics. Newspaper coverage suggests that even with limited previous exposure to empirics, one may enroll in <a href="https://en.wikipedia.org/wiki/Massive_open_online_course" target="_hplink"><span lang="EN-US">MOOCs </span></a>or join programming boot camps to establish one's bona fides.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">In a recent blog on <a href="http://www.forbes.com/sites/metabrown/2016/07/29/4-reasons-not-to-get-that-masters-in-data-science/#5a484391694f" target="_hplink"><span lang="EN-US">Forbes.com</span></a>, Meta S. Brown, the author of <a href="https://www.amazon.ca/Data-Mining-Dummies-Meta-Brown/dp/1118893174" target="_hplink"><i><span lang="EN-US">Data Mining for Dummies</span></i></a>, gave four reasons not to get an advanced degree in data science. I, on the other hand, believe that a structured learning environment is exactly what many need to enable the career change they have contemplated for years but have not acted on.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">It all depends on what kind of a learner you are. If you are a disciplined, self-motivated, self-actuated individual, you can pick up the skills by attending MOOCs or participating in coding boot camps.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">But if you are like the rest of us, who once enrolled in a free online course, but didn't complete it, you need some structure. A degree or a certificate in data science or business analytics is exactly what you need to upgrade your skills and be part of the network that will help you reorient your career.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">In my book, <a href="http://www.ibmpressbooks.com/store/getting-started-with-data-science-making-sense-of-data-9780133991024" target="_hplink"><i><span lang="EN-US">Getting Started with Data Science</span></i></a>, I mentioned <a href="http://www.nytimes.com/2015/07/29/technology/code-academy-as-career-game-changer.html?_r=0" target="_hplink"><span lang="EN-US">Paul Minton</span></a>, who was making $20,000 serving tables in New York. However, a three-month programming course at the <a href="http://www.zipfianacademy.com/" target="_hplink"><span lang="EN-US">Zipfian Academy</span></a> turned his life around. He earned over $100,000 in 2014 as a data scientist for a web startup in San Francisco. "Six figures, right off the bat ... To me, it was astonishing," he told <i>The New York Times</i>.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOT72-f8qG1zc6B-qy4XtPuJzKlzCLPHj2cVspy04BNkFBoQNeNwYYPDmbPoaMj8SnTaMuKTwervM8kXYBlSjuDbmBYr4I5XZBfWhJIqQwUY7rhGcMc3C89jgEiGF8wDFhIXfA3sNXQts/s1600/Murtaza+Haider+Getting+started+cover.jpg" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjOT72-f8qG1zc6B-qy4XtPuJzKlzCLPHj2cVspy04BNkFBoQNeNwYYPDmbPoaMj8SnTaMuKTwervM8kXYBlSjuDbmBYr4I5XZBfWhJIqQwUY7rhGcMc3C89jgEiGF8wDFhIXfA3sNXQts/s320/Murtaza+Haider+Getting+started+cover.jpg" width="213" /></a></div>
<div class="MsoNormal">
<span lang="EN-CA">When aspiring data scientists think of a career in the 'glamorous' world of big data and analytics, they think of Mr. Minton. His story, though a bit Cinderella-ish, is true, but rare. He works for <a href="http://www.change.org/" target="_hplink"><span lang="EN-US">Change.org</span></a>! However, not everyone should expect a similar outcome. In addition to good fortune, Mr. Minton had majored in math in his undergraduate training, and we all know that math helps. It would be unwise, however, to assume that with almost no empirical background, one can master the complex world of data and algorithms in a matter of a few weeks and be gainfully employed.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">While speaking at meet-ups organized by IBM's <a href="https://bigdatauniversity.com/" target="_hplink"><span lang="EN-US">BigDataUniversity</span></a>, I encounter dozens of enthusiasts who are keen to start training in data science but do not know where to begin. I advise them to build on their core competencies and domain knowledge. For instance, if you studied journalism or creative writing as an undergraduate, you might want to learn how to analyze socioeconomic data instead of trying to set up <a href="http://www.ibm.com/analytics/ca/en/technology/hadoop/?S_TACT=M1610LNW&S_PKG=-&campaign=Unbranded|Search|Hadoop%20HEAD%20TERM|Canada|N/A|&group=Hadoop_EXACT&mkwid=c6c2fcb6-3098-452f-9d1a-b1ce0f3044ff|509|&ct=M1610LNW&iio=BANAL&cmp=M1610&ck=hadoop&cs=e&ccy=US&cr=bing&cm=k&cn=Hadoop_EXACT" target="_hplink"><span lang="EN-US">Hadoop clusters</span></a>, a big data task best left to computer scientists and engineers.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">If you are a disciplined learner, you can explore data science training offered as MOOCs. <a href="https://www.coursera.org/" target="_hplink"><span lang="EN-US">Coursera</span></a>, one of the largest MOOC platforms, listed several data science courses among the top <a href="http://www.businessinsider.com/most-popular-coursera-courses-of-2015-2015-12" target="_hplink"><span lang="EN-US">10 most popular courses</span></a> in 2015. IBM's Big Data University (BDU) is another platform dedicated to promoting training in data science and analytics. Not only does BDU offer online learning resources similar to those of other platforms, it also offers cloud-based resources for hands-on training through the <a href="https://datascientistworkbench.com/" target="_hplink"><span lang="EN-US">Data Scientist Workbench</span></a>.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">The Workbench provides state-of-the-art computing solutions for regular-sized data. These include <a href="https://www.r-project.org/" target="_hplink"><span lang="EN-US">R</span></a>, <a href="https://www.python.org/" target="_hplink"><span lang="EN-US">Python</span></a>, and <a href="http://openrefine.org/" target="_hplink"><span lang="EN-US">OpenRefine</span></a>. To wrangle big data, the Workbench offers Hadoop and <a href="http://www.ibm.com/analytics/us/en/technology/cloud-data-services/data-scientist/?S_PKG=&S_TACT=M1610NHW&campaign=Branded|Search|Cloud%20Data%20Services%20For%20Data%20Scientists|NA|N/A|&group=Data_Science-COG%20(Broad)&mkwid=c6c2fcb6-3098-452f-9d1a-b1ce0f3044ff|509|109165367546&ct=M1610NHW&iio=BANAL&cmp=M1610&ck=ibm%20data%20science&cs=b&ccy=US&cr=google&cm=k&cn=Data_Science-COG%20(Broad)" target="_hplink"><span lang="EN-US">Spark</span></a>-based solutions. Such coupling of computing infrastructure with online learning resources frees new learners from concerns about installing and maintaining software and clustering hardware.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">Learners who prefer a structured learning environment also have several options. They can register for courses or certificates offered by universities' <a href="http://learn.utoronto.ca/courses-programs/business-professionals/certificates/management-of-enterprise-data-analytics" target="_hplink"><span lang="EN-US">continuing education faculties</span></a>, enroll in an online graduate degree in data science, or take a more traditional approach of enrolling in a full- or part-time <a href="http://www.ryerson.ca/graduate/datascience/index.html" target="_hplink"><span lang="EN-US">Master's program</span></a>.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">A good place to search for learning opportunities is the <a href="http://www.kdnuggets.com/education/usa-canada.html" target="_hplink"><span lang="EN-US">KDNuggets </span></a>website, which maintains detailed lists of post-graduate programs in data science, including full-time, part-time, and online master's degrees and other certifications.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">Once you have earned some credentials, you still have to prove your worth to future employers. If you are making a switch from another career, your experience may not be of much use in your pursuits in the data-centric world. My advice to the novice data scientists lacking experience is to ask the potential employer not necessarily for a job, but instead for a data set and a puzzle. If you can solve a data-oriented problem for a firm as part of the vetting process, you can overcome the shortcomings in your résumé.<o:p></o:p></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><br /></span></div>
<div class="MsoNormal">
<span lang="EN-CA">Those who are still on the fence about whether to take the plunge into the world of big data and analytics should know that the demand for data scientists far exceeds the capacity of universities and colleges to produce them. This is unlikely to change anytime soon. Act now and embrace data.</span> </div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-26096473478806785602016-07-27T14:56:00.000-04:002016-07-27T14:57:57.739-04:00Book Review: Getting Started With Data Science<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://www.i-programmer.info/bookreviews/218-data-science/9938-getting-started-with-data-science.html" target="_blank">I PROGRAMMER's</a> Kay Ewbank reviews <i><a href="http://www.ibmpressbooks.com/store/getting-started-with-data-science-making-sense-of-data-9780133991024" target="_blank">Getting Started with Data Science: Making Sense of Data with Analytics</a></i>.<br />
<br />
By <a href="http://www.i-programmer.info/bookreviews/218-data-science/9938-getting-started-with-data-science.html" target="_blank">Kay Ewbank</a><br />
<br />
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
If you've enjoyed books such as <i>Freakonomics </i>or <i>Outliers</i>, you'll feel at home reading this book as it uses a similar approach; take an interesting question such as <em>'Does the higher price of cigarettes deter smoking?'</em>, and use that as the basis for some data analysis.</div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br /></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
The aim is to teach you how to do your own analyses. <a href="https://sites.google.com/site/statsr4us/intro/4-instructor" target="_blank">Haider </a>works through the examples in R, Stata, SPSS and SAS. Within the book the examples are worked mainly in R, and one of the other languages. The code for the other languages is available for download from the IBM Press website, along with details of how to use it. </div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br /></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
</div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
The book opens with a chapter called 'the bazaar of storytellers' that discusses what data science is and gives the author's definition of a data scientist. The next chapter, data in the 24/7 connected world, identifies sources of data that you can analyse, and also introduces the concept of big data. Chapter three looks at how data becomes meaningful when it is used as the basis for 'stories'. Haider's view is that the strength of data science lies in the power of the narrative, and that is what underpins most of the book.</div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br /></div>
<h2 style="color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px; text-align: left;">
<span style="background-color: white;">"</span><span style="line-height: 18.2px;"><i style="background-color: #fce5cd;">Overall, this is a book that is accessible, interesting and still manages to introduce the statistical techniques you need to use for real data analytical work. A good way to get into data analysis</i><span style="background-color: white;">."</span></span></h2>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br /></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
From a practical perspective, the book begins to get useful in chapter four, which looks at how you can generate summary tables, including multi-dimensional tables. Next is a chapter on graphics and how to generate them. If you're thinking that it seems a bit odd to concentrate on the 'end result' first, you have to remember that the author's view is that data analysis is only useful if your audience actually looks at the results and understands them.</div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br /></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<iframe allowfullscreen="true" frameborder="0" height="550" src="https://read.amazon.co.uk/kp/card?asin=B019D322UU&asin=B019D322UU&preview=inline&linkCode=kpe&ref_=cm_sw_r_kb_dp_2f6LxbX70C6DK&tag=iprog-20&hideShare=true" style="display: block; margin-left: auto; margin-right: auto; max-width: 100%;" type="text/html" width="336"></iframe></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br /></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
The next chapter gets more into the workings of data analysis with an examination of hypothesis testing using techniques such as t-tests and correlation analysis. Regression analysis is looked at next, based on the notion of "why tall parents don't have even taller children". This is a fun chapter, with examples including consumer spending on food and alcohol, housing markets, and whether the appearance of teachers affects their evaluations by students.<br />
<br /></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
A chapter on analysis of binary variables considers logit and probit models using data from New York transit use. Categorical data and multinomial variables are the topic of the next chapter, which expands on the ideas of logit models.</div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br />
Spatial data analysis is covered next, taking us into the use of GIS systems and how these have expanded the options for data analysis. There's a good chapter on time series analysis looking at how regression models can be used with time series data, using the examples of forecasting housing markets.</div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br />
The final chapter introduces the field of data mining. It's more of a taster discussing some of the techniques that can be used, but fun anyway.</div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br />
Overall, this is a book that is accessible, interesting and still manages to introduce the statistical techniques you need to use for real data analytical work. A good way to get into data analysis. </div>
<h4 style="background-color: white; color: #333333; font-family: Arial, Helvetica, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 1px;">
Related Reviews</h4>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<a href="http://www.i-programmer.info/bookreviews/218-data-science/8696-data-science-and-big-data-analytics.html" style="color: #135cae; text-decoration: none;">Data Science and Big Data Analytics</a></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<a href="http://www.i-programmer.info/bookreviews/218-data-science/7865-doing-data-science.html" style="color: #135cae; text-decoration: none;">Doing Data Science</a></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<a href="http://www.i-programmer.info/bookreviews/218-data-science/8952-r-in-action-data-analysis-and-graphics-with-r-2e.html" style="color: #135cae; text-decoration: none;">R in Action: Data Analysis and Graphics with R (2e)</a></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<a href="http://www.i-programmer.info/bookreviews/218-data-science/9380-learning-to-love-data-science.html" style="color: #135cae; text-decoration: none;">Learning To Love Data Science</a> </div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
<br /></div>
<div style="background-color: white; color: #333333; font-family: Helvetica, Arial, sans-serif; font-size: 14px; line-height: 18.2px; margin-bottom: 5px;">
To keep up with our coverage of books for programmers, follow <a href="http://twitter.com/bookwatchiprog" style="color: #135cae; text-decoration: none;">@bookwatchiprog on Twitter</a> or subscribe to I Programmer's <a href="http://www.i-programmer.info/component/ninjarsssyndicator/?feed_id=2&format=raw" style="color: #135cae; text-decoration: none;" target="_blank">Books RSS feed </a>for each day's new addition to Book Watch and for new reviews.</div>
<div>
<br /></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-8047123382293196242016-07-24T19:41:00.000-04:002016-07-27T14:58:19.127-04:00The collaborative innovation landscape in data science<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
Computing platforms should be like Lego. That is, they should provide the fundamental building blocks and enable the users' imagination to innovate. The <a href="http://www.stata-journal.com/current-issue/">latest issue of Stata Journal</a> exemplifies how <a href="http://www.stata.com/">Stata</a> and, by the same token, <a href="https://www.r-project.org/">R</a> provide the platform for the users to innovate beyond the innate capacity of the core group responsible for software development.</div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-CA">Earlier in July, I received by email the table of contents for the <i>Stata Journal</i>’s latest issue. I was expecting to see one or maybe two articles of interest. What I found was quite surprising. I was intrigued by almost every article, which made me wonder whether I had lost my academic focus to the point that almost anything is now of interest. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-CA">As I browsed through the journal, I noticed that the authors contributing to the journal were truly international. From academic colleagues in Germany and the United States to colleagues working for central banks in Europe, the diversity was hard to ignore. And that’s where I spotted the apparent similarity between R and Stata. Even though Stata is a proprietary computing platform, the innovation landscape is not restricted to the core team at Stata. This is similar to the R environment where literally thousands of packages (algorithms) for R are contributed by independent researchers.<o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-CA">For R, such a collaborative ecosystem comes naturally because R is free software. Stata, on the other hand, follows a more traditional market approach of charging for the use of the software. Yet, both Stata and R are able to attract leading data scientists (my preferred term for statisticians, econometricians, and others) who volunteer their expertise and readily share their innovations with the larger community. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-CA">Returning to the latest issue, I was first attracted to the article on assessing inequality using percentile shares. As the author, <a href="http://www.soz.unibe.ch/about_us/personen/prof_dr_jann_ben/index_eng.html">Ben Jann from the University of Bern</a>, noted, income inequality has come to the forefront of academic and social discourse since the publication of Thomas Piketty’s <i>Capital in the Twenty-First Century</i>. I have been intrigued by the topic for years, primarily influenced by the incredible works of <a href="http://www8.gsb.columbia.edu/faculty/jstiglitz/sites/jstiglitz/files/Inequality%20and%20Economic%20Growth.pdf">Joseph Stiglitz</a>, <a href="http://www.wsj.com/articles/angus-deaton-discusses-income-inequality-1448302780">Angus Deaton</a>, and others. Piketty’s <i>Capital</i>, despite the criticism (watch <a href="https://www.youtube.com/watch?v=D4mE-X140aA">Deirdre McCloskey’s</a> careful, yet blunt, review of <i>Capital</i>), has made percentile shares a familiar tool for analyzing distributional inequalities. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-CA">Ben Jann has contributed <a href="http://www.stata.com/meeting/germany15/abstracts/materials/de15_jann.pdf">pshare</a> to Stata, a command that readily estimates inequalities with the convenience of a single-line syntax. Using the data from the 1988 US National Study of Young Women, the command easily computes the income distribution, showing that the top 10 percent of women received 27% of the wages.<o:p></o:p></span></div>
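<div class="MsoNoSpacing">
<span lang="EN-CA">To make the idea of a percentile share concrete, here is a minimal sketch in Python. It is not how pshare is implemented; the function name <code>percentile_shares</code> and the wage figures are made up for illustration, and the sketch simply splits the sorted values at whole observations rather than interpolating at the group boundaries.<o:p></o:p></span></div>

```python
def percentile_shares(values, cutoffs=(0.5, 0.9)):
    """Share of the total held by each group between consecutive cutoffs.

    Hypothetical helper for illustration only: values are sorted in
    ascending order and split at whole observations (no interpolation).
    """
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    bounds = [0.0, *cutoffs, 1.0]
    shares = []
    for lower, upper in zip(bounds, bounds[1:]):
        start = round(lower * n)   # index of first observation in the group
        stop = round(upper * n)    # index one past the last observation
        shares.append(sum(ordered[start:stop]) / total)
    return shares


# Made-up hourly wages for ten workers (not the data used in the article).
wages = [4, 5, 5, 6, 7, 8, 9, 11, 15, 30]
bottom_half, middle, top_decile = percentile_shares(wages)
print(f"Top 10 percent share of wages: {top_decile:.0%}")
```

<div class="MsoNoSpacing">
<span lang="EN-CA">The three shares add up to one, so the output reads directly as "who gets what fraction of the pie". For real survey data one would also need sampling weights and standard errors, which is where a purpose-built command earns its keep.<o:p></o:p></span></div>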
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-CA">For R users, I would recommend the <a href="https://cran.r-project.org/web/packages/ineq/ineq.pdf">ineq package by Achim Zeileis and Christian Kleiber </a>for generating inequality and poverty indices. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-CA">My primary area of interest lies at the intersection of real estate and transportation in urban settings. I am always struggling with how location impacts rents, growth, and other socio-economic outcomes. Determining the location or, for that matter, distances between entities is usually a struggle. Thanks to GIS software, such as <a href="http://www.qgis.org/en/site/">QGIS</a>, <a href="http://www.pitneybowes.com/us/location-intelligence.html?products-tab">MapInfo</a>, and <a href="http://www.caliper.com/maptovu.htm">Maptitude</a>, the task of spatial computation has become a lot easier. Still, one has to become proficient on several computing platforms to accomplish the necessary tasks of getting distances or travel times to and from locations. Stata offers two interesting solutions for these tasks. The more recent one is reported in the latest issue. <a href="http://www.uni-regensburg.de/wirtschaftswissenschaften/vwl-moeller/team/index.html">Stephan Huber</a> and Christoph Rust from the University of Regensburg have contributed a new command that <a href="http://www.stata-journal.com/article.html?article=dm0088">computes network distances</a> (not just the straight-line Euclidean distances) and network travel times for the shortest paths that rely on Open Source Routing Machine and OpenStreetMap. <o:p></o:p></span></div>
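<div class="MsoNoSpacing">
<span lang="EN-CA">For contrast with network distances, the straight-line baseline is easy to compute by hand. The sketch below is my own illustration in Python, not part of the Stata command: it uses the standard haversine formula to get the great-circle ("as the crow flies") distance between two latitude/longitude points, assuming a spherical Earth with a mean radius of 6,371 km. The function name <code>great_circle_km</code> is hypothetical.<o:p></o:p></span></div>

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator, as a simple sanity check.
one_degree = great_circle_km(0.0, 0.0, 0.0, 1.0)
print(f"{one_degree:.1f} km")
```

<div class="MsoNoSpacing">
<span lang="EN-CA">A routed distance along actual streets can never be shorter than this figure and is usually longer, because roads bend around rivers, rail corridors, and one-way systems. That gap is exactly what a network-distance command is designed to capture.<o:p></o:p></span></div>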
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-CA">Earlier in 2011, <a href="http://www.stata-journal.com/sjpdf.html?articlenum=dm0053">Adam Ozimek and Daniel Miles</a> contributed commands to geocode and compute travel times between origins and destinations for different modes of travel, including public transit. <o:p></o:p></span></div>
<div class="MsoNoSpacing">
<br /></div>
<div class="MsoNoSpacing">
<span lang="EN-CA">R is equally equipped for similar tasks. Timothée Giraud, Robin Cura, and Matthieu Viry programmed an R package <a href="https://cran.r-project.org/web/packages/osrm/osrm.pdf">osrm</a> to determine travel time and distances. Other R packages include <a href="https://cran.r-project.org/web/packages/gdistance/vignettes/gdistance1.pdf">gdistance</a> and <a href="https://cran.r-project.org/web/packages/gmapsdistance/gmapsdistance.pdf">gmapsdistance</a>, to name a few. <o:p></o:p></span></div>
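<div class="MsoNoSpacing">
<span lang="EN-CA">To see why network distances matter, it helps to know what the straight-line alternative looks like. The Python sketch below computes the great-circle distance between two points with the haversine formula, which is the geographic counterpart of a straight-line distance. It does not call any of the Stata or R tools named above; the function name and the approximate city coordinates are assumptions made for illustration.</span></div>

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Approximate coordinates for Toronto and Montreal (illustrative only).
toronto = (43.65, -79.38)
montreal = (45.50, -73.57)
print(f"{great_circle_km(*toronto, *montreal):.0f} km as the crow flies")
```

<div class="MsoNoSpacing">
<span lang="EN-CA">A routed distance over the road network will generally be longer than this figure, which is exactly the gap that routing-based commands are meant to close.</span></div>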
<div class="MsoNoSpacing">
<br /></div>
<br />
<div class="MsoNoSpacing">
<span lang="EN-CA">In summary, I remain delightedly optimistic about the future of both open source and proprietary computing platforms. Altruism is the name of the game: thousands of innovators are making their generous contributions available for the larger benefit of society, making it easier for applied data scientists to satisfy their curiosities by applying readily available algorithms to solve riddles. <sub><o:p></o:p></sub></span></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-71544425628597243662016-07-18T04:10:00.001-04:002016-07-18T04:10:41.608-04:00Data Science Boot Camp completed at Ryerson University<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://sites.google.com/site/statsr4us/workshops/datascience/week-5/Picture2.png?attredirects=0" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="245" src="https://sites.google.com/site/statsr4us/workshops/datascience/week-5/Picture2.png?attredirects=0" width="400" /></a></div>
<div class="MsoNormal">
<span lang="EN-CA"><span style="font-family: inherit;">I am pleased to update you on the <a href="https://sites.google.com/site/statsr4us/workshops/datascience">Data Science Boot Camp</a> we ran at the <a href="http://www.ryerson.ca/tedrogersschool/" target="_blank">Ted Rogers School of Management</a> at Ryerson University in Toronto in collaboration with IBM’s <a href="http://www.bigdatauniversity.com/">www.BigDataUniversity.com</a>. The 9-week-long Boot Camp concluded on July 15. </span></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><span style="font-family: inherit;"><br /></span></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><span style="font-family: inherit;">We received a total of 1,137 registrations, and attendance ranged between 100 and 150 participants each week. <o:p></o:p></span></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA"><span style="font-family: inherit;">I have made the resources (software code, PowerPoints, etc.) available online at <a href="https://sites.google.com/site/statsr4us/workshops/datascience">https://sites.google.com/site/statsr4us/workshops/datascience</a>. We recorded 24 hours of video, which will be online soon.<o:p></o:p></span></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA"><span style="font-family: inherit;">I restricted the hands-on training to R, hence the Boot Camp serves as an introduction to analytics with R. You are welcome to share these resources.<o:p></o:p></span></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><span style="font-family: inherit;"><br /></span></span></div>
<div class="MsoNormal">
<span lang="EN-CA"><span style="font-family: inherit;">A breakdown of the weekly schedule is provided in the following hyperlinked list:</span></span></div>
<div class="MsoNormal">
<br /></div>
<br />
<div class="MsoNormal">
<span lang="EN-CA" style="background: white; font-family: "Lucida Sans Unicode", sans-serif; font-size: 10pt;"><a href="https://sites.google.com/site/statsr4us/workshops/datascience/week-0"><span style="color: #5e7ab2;">Week 0: Introduction & logistics</span></a><span class="apple-converted-space"> </span></span></div>
<div class="MsoNormal">
<span lang="EN-CA" style="background: white; font-family: "Lucida Sans Unicode", sans-serif; font-size: 10pt;"><a href="https://sites.google.com/site/statsr4us/workshops/datascience/week-1"><span style="color: #5e7ab2;">Week-1: Head first into data</span></a><span class="apple-converted-space"> </span></span></div>
<div class="MsoNormal">
<span lang="EN-CA" style="background: white; font-family: "Lucida Sans Unicode", sans-serif; font-size: 10pt;"><a href="https://sites.google.com/site/statsr4us/workshops/datascience/week2"><span style="color: #5e7ab2;">Week-2: Data, data, & data</span></a><span class="apple-converted-space"> </span></span></div>
<div style="text-align: left;">
<a href="https://sites.google.com/site/statsr4us/workshops/datascience/week-3"><span style="font-family: Arial, Helvetica, sans-serif; font-size: x-small;">Week-3: The Grandfather of All Analytics</span></a></div>
<div class="MsoNormal">
<span lang="EN-CA" style="background: white; font-family: "Lucida Sans Unicode", sans-serif; font-size: 10pt;"><a href="https://sites.google.com/site/statsr4us/workshops/datascience/week-4"><span style="color: #5e7ab2;">Week-4: Rinse & Repeat</span></a><span class="apple-converted-space"> </span></span></div>
<div class="MsoNormal">
<span lang="EN-CA" style="background: white; font-family: "Lucida Sans Unicode", sans-serif; font-size: 10pt;"><a href="https://sites.google.com/site/statsr4us/workshops/datascience/week-5"><span style="color: #5e7ab2;">Week-5: The Economics of Infidelity</span></a><span class="apple-converted-space"> </span></span></div>
<div class="MsoNormal">
<span lang="EN-CA" style="background: white; font-family: "Lucida Sans Unicode", sans-serif; font-size: 10pt;"><a href="https://sites.google.com/site/statsr4us/workshops/datascience/week-6"><span style="color: #5e7ab2;">Week-6: Time Series is Money</span></a><span class="apple-converted-space"> </span></span></div>
<div class="MsoNormal">
<span lang="EN-CA" style="background: white; font-family: "Lucida Sans Unicode", sans-serif; font-size: 10pt;"><a href="https://sites.google.com/site/statsr4us/workshops/datascience/bubble"><span style="color: #5e7ab2;">Week-7: Housing Bubbles</span></a><span class="apple-converted-space"> </span></span></div>
<div class="MsoNormal">
<span lang="EN-CA" style="background: white; font-family: "Lucida Sans Unicode", sans-serif; font-size: 10pt;"><a href="https://sites.google.com/site/statsr4us/workshops/datascience/week-8"><span style="color: #5e7ab2;">Week-8: Choice Metrics</span></a></span><span lang="EN-CA"><o:p></o:p></span></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-91526634783890193412016-05-12T13:03:00.000-04:002016-05-12T13:03:06.842-04:00Data Science Boot Camp<div dir="ltr" style="text-align: left;" trbidi="on">
If you live in or near Toronto, are interested in learning about data science, and can spare Friday afternoons, then you are in luck. I am offering a <a href="https://sites.google.com/site/statsr4us/workshops/datascience" target="_blank">Data Science Boot Camp</a> at <a href="http://www.ryerson.ca/tedrogersschool/bm/programs/real-estate-management.html" target="_blank">Ryerson University</a> in collaboration with IBM's <a href="http://bigdatauniversity.com/">BigDataUniversity.com</a>. <br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://bigdatauniversity.com/bdu-wp/wp-content/uploads/2016/05/Screen-Shot-2016-05-12-at-11.13.47-AM-1024x626.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="http://bigdatauniversity.com/bdu-wp/wp-content/uploads/2016/05/Screen-Shot-2016-05-12-at-11.13.47-AM-1024x626.png" height="244" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.ibmpressbooks.com/ShowCover.asp?isbn=9780133991024&type=f" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://www.ibmpressbooks.com/ShowCover.asp?isbn=9780133991024&type=f" /></a></div>
The Boot Camp is largely based on the contents of my recently published book, <i><a href="http://www.ibmpressbooks.com/store/getting-started-with-data-science-making-sense-of-data-9780133991024" target="_blank">Getting Started with Data Science</a>: Making Sense of Data with Analytics</i>. You can read more about the book by <a href="http://ekonometrics.blogspot.ca/2016/01/getting-started-with-data-science.html" target="_blank">Clicking HERE</a>.<br />
<h3 style="text-align: left;">
Logistical details:</h3>
<b>When</b>: Fridays (2:00 - 5:00 pm)<br />
<b>Where</b>: 55 Dundas Street West, Toronto, 9th floor, Room 3-109<br />
Ted Rogers School of Management, Ryerson University<br />
<b>Cost: </b>Free (Courtesy Ryerson University & BigDataUniversity)<br />
<b>Starting on</b>: May 13 for introductions. Actual launch is on May 20.<br />
<b>Spaces</b>: I'd like to cap enrollment at 15.<br />
<b>Registration: </b><a href="mailto:murtaza.haider@ryerson.ca" target="_blank">Email us</a> or use <a href="http://bigdatauniversity.com/events/data-science-bootcamp-summer-data-analytics-insight/" target="_blank">Registration Form</a> at BigDataUniversity.<br />
<b>Prerequisites</b>: Curiosity, high-school math, prescribed book, a laptop computer, and willingness to learn R.<br />
<br />
BigDataUniversity will live stream the sessions for those who are unable to attend, but interested in the topic.<br />
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<h3 style="text-align: left;">
Tentative Schedule</h3>
<div>
<b>May 13, 2016</b>- Introductions, software details, and logistical details.</div>
<div>
<b>Week 1</b> - Taking the first step</div>
<div>
<div style="text-align: left;">
<ul style="text-align: left;">
<li>Detailed hands-on examples of analytics to understand what you will be able to accomplish by the end of the boot camp.</li>
</ul>
</div>
<b>Week 2</b> - Data: Its shapes, sizes, and formats<br /><b>Week 3</b> - Regression: The tool that fixes everything, or almost everything.<br /><ul style="text-align: left;">
<li>Applied analytics with teaching evaluations. </li>
<li>Do good-looking instructors get higher teaching evaluations?</li>
</ul>
<b>Week 4 </b>- Correlations, causations, and manufactured facts<br /><b>Week 5 </b>- Aerobics with data: Taming your data to meet your needs.<br /><b>Week 6 </b>- Time is money: Analytics with time series data.<br /><b>Week 7</b> - Case Study 1: </div>
<div>
<ul style="text-align: left;">
<li>Are women who lack health insurance from their spouse’s employer more likely to work full-time?</li>
</ul>
<b>Week 8</b> - Case Study 2: </div>
<div>
<ul style="text-align: left;">
<li>Do higher taxes result in lower cigarette sales? Did Land Transfer Tax impact housing sales in Toronto?</li>
</ul>
<b>Week 9</b> - Case Study 3: </div>
<div>
<ul style="text-align: left;">
<li>To smoke or not to smoke: that is the question.</li>
</ul>
<b>Week 10</b> - Case Study 4: </div>
<div>
<ul style="text-align: left;">
<li>Is space the new frontier? Map it to know it.</li>
</ul>
</div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-47891997267216030342016-01-13T14:01:00.001-05:002016-01-13T14:01:22.060-05:00Getting Started with Data Science: Storytelling with Data<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="http://www.ibmpressbooks.com/ShowCover.asp?isbn=0133991024" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" src="http://www.ibmpressbooks.com/ShowCover.asp?isbn=0133991024" height="320" width="213" /></a></div>
Earlier this month, IBM Press and Pearson published my book, <a href="http://www.ibmpressbooks.com/store/getting-started-with-data-science-making-sense-of-data-9780133991024"><b><i>Getting Started with Data Science: Making Sense of Data with Analytics</i></b></a>. You can download <a href="http://ptgmedia.pearsoncmg.com/images/9780133991024/samplepages/9780133991024.pdf">sample pages, including a complete chapter</a>. There are 104 pages in the sample. You can also <a href="https://www.youtube.com/watch?v=V-8rvFFvR1s&feature=youtu.be&cm_mc_uid=59325714899914405365497&cm_mc_sid_50200000=1451404530">watch a brief interview about</a> the book recorded earlier at the IBM Insight2015 Conference.<br /><br />The very purpose of authoring this book was to rethink the way we have been teaching statistics and analytics to students and practitioners. It is no secret that most students required to take the mandatory stats course dislike it. I believe this has more to do with the way we have been teaching the subject than with the aptitude of our students. Furthermore, I believe there is a greater opportunity to equip students with the skills needed in a world awash with data, where competing on analytics defines the real competitive advantage.<br /><br />No wonder the latest issue of the leading publication on the subject, <a href="http://www.tandfonline.com/doi/full/10.1080/00031305.2015.1094283"><i>The American Statistician</i></a>, is dedicated to reimagining how statistics should be taught in the undergraduate curriculum. The editors noted:<br />
<blockquote class="tr_bq">
“We hope that this collection of articles as well as the online discussion provide useful fodder for further review, assessment, and continuous improvement of the undergraduate statistics curriculum that will allow the next generation to take a leadership role by making decisions using data in the increasingly complex world that they will inhabit.”</blockquote>
I am confident that my book will do its small part in equipping the next generation of students with the kind of skills needed to succeed in a data-centric world. For one, I have taken a storytelling approach to statistics. This book reinforces the point that data science and analytics training should be applied rather than theoretical, and the ultimate purpose of producing or consuming statistical analysis is to tell fascinating stories from it. Therefore, the book opens with the chapter titled, <b><i>The Bazaar of Storytellers</i></b>.<br /><h3 style="text-align: left;">
Who is this book for?</h3>
While the world is awash with large volumes of data, inexpensive computing power, and vast amounts of digital storage, the skilled workforce capable of analyzing data and interpreting it is in short supply. A 2011 <a href="http://www.mckinsey.com/features/big_data">McKinsey Global Institute</a> report suggests that “the United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.”<br /><br /><br />
<i><b>Getting Started with Data Science</b></i><b><i> (GSDS)</i></b> is a purpose-written book targeted at professionals who are tasked with analytics but do not have the comfort level needed to be proficient in data-driven analytics. GSDS appeals to students who are frustrated with the impractical nature of the prescribed textbooks and are looking for an affordable text to serve as a long-term reference. GSDS embraces the 24-7 streaming of data and is structured for users who have access to data and software of their choice, but do not know what methods to use, how to interpret the results, and, most importantly, how to communicate findings as reports and presentations in print or online.<br /><br />GSDS is a resource for millions employed in knowledge-driven industries where workers are increasingly expected to facilitate smart decision-making using up-to-date information that sometimes takes the form of continuously updating data.<br /><br />At the same time, the learning-by-doing approach in the book is equally suited for independent study by senior undergraduate and graduate students who are expected to conduct independent research for their coursework or dissertations. <br /><h3 style="text-align: left;">
Praise for the book</h3>
I am also pleased to share with you the praise for my book by <b><i>Dr. Munir Sheikh, Canada’s former chief statistician</i></b>:<br /><blockquote class="tr_bq">
“The power of data, evidence, and analytics in improving decision-making for individuals, businesses, and governments is well known and well documented. However, there is a huge gap in the availability of material for those who should use data, evidence, and analytics but do not know how. This fascinating book plugs this gap, and I highly recommend it to those who know this field and those who want to learn.”</blockquote>
— Munir A. Sheikh, Ph.D., Distinguished Fellow and Adjunct Professor at Queen’s University<br /><br /><b><a href="http://www.tomdavenport.com/">Tom Davenport</a>,</b> author of the bestselling books <i><a href="http://www.amazon.ca/Competing-Analytics-The-Science-Winning/dp/1422103323">Competing on Analytics</a> </i>and <i><a href="http://www.amazon.ca/Big-Data-Work-Dispelling-Opportunities/dp/1422168166" target="_blank">Big Data @ Work</a></i>, has the following to say about my book:<br /><blockquote class="tr_bq">
“A coauthor and I once wrote that data scientists held ‘the sexiest job of the 21st century.’ This was not because of their inherent sex appeal, but because of their scarcity and value to organizations. This book may reduce the scarcity of data scientists, but it will certainly increase their value. It teaches many things, but most importantly it teaches how to tell a story with data.”</blockquote>
—Thomas H. Davenport, Distinguished Professor, <b>Babson College</b>; Research Fellow, <b>MIT.</b><div>
<b><br />Dr. Patrick Surry</b>, <b><i>Chief Data Scientist</i></b> at <a href="http://www.hopper.com/">www.Hopper.com</a> had the following to say:<br /><blockquote class="tr_bq">
“This book addresses the key challenge facing data science today, that of bridging the gap between analytics and business value. Too many writers dive immediately into the details of specific statistical methods or technologies, without focusing on this bigger picture. In contrast, Haider identifies the central role of narrative in delivering real value from big data.<br /> <br />“The successful data scientist has the ability to translate between business goals and statistical approaches, identify appropriate deliverables, and communicate them in a compelling and comprehensible way that drives meaningful action. To paraphrase Tukey, ‘Far better an approximate answer to the right question, than an exact answer to a wrong one.’ Haider’s book never loses sight of this central tenet and uses many real-world examples to guide the reader through the broad range of skills, techniques, and tools needed to succeed in practical data-science.<br /> <br />“Highly recommended to anyone looking to get started or broaden their skillset in this fast-growing field.”</blockquote>
And finally, <a href="http://scholar.princeton.edu/atif/home"><b>Professor Atif Mian</b></a>, author of the best-selling book <i><a href="http://www.amazon.ca/House-Debt-Recession-Prevent-Happening/dp/022608194X/ref=sr_1_1?s=books&ie=UTF8&qid=1452710193&sr=1-1&keywords=house+of+debt">House of Debt</a></i>, offered the following assessment:<br /><blockquote class="tr_bq">
“We have produced more data in the last two years than all of human history combined. Whether you are in business, government, academia, or journalism, the future belongs to those who can analyze these data intelligently. This book is a superb introduction to data analytics, a must-read for anyone contemplating how to integrate big data into their everyday decision making.”</blockquote>
— Professor Atif Mian, Theodore A. Wells ’29 Professor of Economics and Public Affairs, <br /><b>Princeton University</b>; Director of the Julis-Rabinowitz Center for Public Policy and Finance at the <b>Woodrow Wilson School</b>.</div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-31292654853160633172015-12-06T17:19:00.000-05:002015-12-06T17:19:11.078-05:00Not so sweet sixteen!<div dir="ltr" style="text-align: left;" trbidi="on">
In the world of big data and real-time analytics, Microsoft users are still living with the constraints of the bygone days of little data and basic numeracy.<br />
<br />
If you happen to use Microsoft Excel for running regressions, you will soon realize your limits: The Windows version of Excel 2013 permits no more than 16 explanatory variables.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYB-QxCBvmyuuJwBGB-KpZZ8cGMQzsIgJHO4Z3EKDT-hG71xVb1PVtGyzG7cBpZ9_ool3rnQBKL5e-6vKyAtMqYKA6Azuw05jiRRTn-rqDSjB6hW9r8iuYfszABfcOKUnqdrvdXt20d9Q/s1600/excel+sixten.PNG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="96" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYB-QxCBvmyuuJwBGB-KpZZ8cGMQzsIgJHO4Z3EKDT-hG71xVb1PVtGyzG7cBpZ9_ool3rnQBKL5e-6vKyAtMqYKA6Azuw05jiRRTn-rqDSjB6hW9r8iuYfszABfcOKUnqdrvdXt20d9Q/s400/excel+sixten.PNG" width="400" /></a></div>
<br />
Excel has made great progress in expanding its capabilities in the recent past. Unlike the few thousand rows in the past, the current version permits about a million rows per Sheet (a single data set). But when it comes to regression, even if you have several thousand observations in the data set, you are still limited by a hard constraint of sixteen explanatory variables.<br />
<br />
Some would argue that for parsimony, we should be content with the restriction. True, but with categorical variables, each of which enters the model as a set of dummy variables, the number of explanatory variables quickly stretches beyond the artificial constraint set by Microsoft Excel.<br />
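<br />
The point about categorical variables can be checked with simple arithmetic. The Python sketch below counts the columns a regression needs once each categorical variable is expanded into dummy variables (one fewer than its number of categories). The variable names and category counts are hypothetical; only the limit of 16 comes from the discussion above.<br />

```python
EXCEL_MAX_PREDICTORS = 16  # limit cited above for Excel 2013 on Windows


def design_columns(predictors):
    """Count regression columns once categoricals are expanded into dummies.

    `predictors` maps a variable name to its number of categories,
    or to None when the variable is numeric.
    """
    total = 0
    for levels in predictors.values():
        # A numeric variable uses one column; a categorical one uses levels - 1.
        total += 1 if levels is None else levels - 1
    return total


# Hypothetical model specification (illustrative only).
model = {
    "age": None,
    "income": None,
    "province": 10,
    "education": 5,
    "occupation": 8,
}
columns = design_columns(model)
print(columns, "columns; exceeds Excel limit:", columns > EXCEL_MAX_PREDICTORS)
```

Five substantive variables turn into 22 columns here, so a model that looks parsimonious on paper can still be out of reach.<br />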
<br />
Others might ask why one would do statistical analysis in Excel in the first place. Despite the inherent limitations of Microsoft Excel, business schools in particular, and other social science undergraduate programs in general, are increasingly turning to Excel to teach courses in statistics. If you were to take a quick look at the curriculum of undergraduate business and numerous MBA programs, you would realize how widespread the use of Excel is for courses in statistics and analytics.<br />
<br />
At <a href="https://sites.google.com/site/statsr4us/intro/software" target="_blank">Ryerson University</a>, I switched to R years ago for my MBA courses. Thanks to John Fox’s <a href="http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/" target="_blank">R Commander</a>, the transition to R was without much hassle. The students were told in the very beginning that they were now part of the big league, and hiding behind spreadsheets was no longer an option.<br />
<br />
I must mention that Microsoft Excel continues to be my platform of choice for a variety of tasks. I use Excel several times a day, but not for statistical analysis. I am not suggesting that Excel cannot do statistics; I am arguing that it could do a much better job of it.<br />
<br />
As I see it, Microsoft has several options. The first is to do nothing. After all, Microsoft Excel has no real competition in the Windows environment. Second, it could turn to the team that programmed the <a href="https://support.office.com/en-us/article/LINEST-function-84d7d0d9-6e50-4101-977a-fa7abf772b6d" target="_blank">linest</a> function in Excel and ask them to add some muscle to it. That would be the wrong approach.<br />
<br />
Instead, Microsoft should explore ways to integrate R or other free software with Excel to add a complete analytics menu. Microsoft should learn from what the leaders in analytics are already doing. SPSS, an industry leader in the analytics category, has already integrated R, allowing SPSS users to merge the robust data management strengths of SPSS with the state-of-the-art analytics bundled with R. SAS, another big name in analytics, is about to do the same.<br />
<br />
And since Microsoft has recently acquired Revolution Analytics, the maker of Revolution R, it makes even more sense to build a bridge between Excel and <a href="http://www.revolutionanalytics.com/revolution-r-open" target="_blank">Revolution R Open</a> (RRO).<br />
<br />
<a href="http://www.springer.com/us/book/9781441900517" target="_blank">R Through Excel</a> is one example of integrating R with Excel. If Microsoft were to put its weight behind the initiative, it could build a seamless coupling with R, expanding the analytic capabilities for hundreds of millions of Excel users. <br />
<br />
As for SPSS, I recommend that its makers also consider another option. If Microsoft were to integrate RRO with Excel, SPSS could respond by acquiring an advanced analytics package and integrating it with SPSS. For this option, I would recommend <a href="http://www.limdep.com/" target="_blank">Limdep</a>, which I have found to be the most diverse software for statistical analysis and econometrics. Even though R is a collective effort of thousands of software developers, Limdep offers numerous routines and post-estimation options that are not available in the thousands of R packages. SPSS integrated with Limdep could become the most diversely capable commercial software in the market, as it would bridge the gap with SAS and Stata.<br />
<br />
As for colleagues in business faculties pondering what platform to adopt for analytics and software courses, I would say: know your limits, especially with Microsoft Excel, while deciding upon the curriculum.</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com1tag:blogger.com,1999:blog-1816488797213954066.post-68865703857502968992015-10-30T11:10:00.000-04:002015-10-30T11:10:04.724-04:00Curious about big data in Montreal?<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyVZZLAAlHelHw8aFni8EYpdXr93N9hdzrnz7xhCaov_nQs_2q8OO4LlnlxtsOVnkjsIbbXuEh4z0LNLVSioRlE1Ilp3ekbuHHz3hURQxbQvSOntPAo4SWJjekHgClyHcVPIVNDzkxKIs/s1600/bdu.PNG" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="162" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhyVZZLAAlHelHw8aFni8EYpdXr93N9hdzrnz7xhCaov_nQs_2q8OO4LlnlxtsOVnkjsIbbXuEh4z0LNLVSioRlE1Ilp3ekbuHHz3hURQxbQvSOntPAo4SWJjekHgClyHcVPIVNDzkxKIs/s320/bdu.PNG" width="320" /></a></div>
Are you in Montreal and curious about big data? Well, here is your chance to attend a session on the topic at Concordia University on Tuesday, Nov. 03 at 6:00 pm.<br />
<br />
<a href="http://www.bigdatauniversity.com/">www.BigDataUniversity.com</a>, which is an IBM-led initiative, is running meetups across North America to create awareness about, and training in, big data analytics.<br />
<br />
BigDataUniversity runs MOOCs and, through its online data scientist workbench, provides access to Python, R, and even Spark. Also, you can learn about Watson Analytics and see how you can work with the state of the art in computing.<br />
<br />
Further details are available at:<br />
<h4 style="text-align: left;">
Getting started with Data Science and Introduction to Watson Analytics</h4>
<a href="http://www.meetup.com/YUL-Social-Mobile-Analytics-Cloud-Meetup/">http://www.meetup.com/YUL-Social-Mobile-Analytics-Cloud-Meetup/</a><br />
<br />
<div class="MsoPlainText">
When: Tuesday, November 3rd at 6-9 PM<o:p></o:p></div>
<br />
<div class="MsoPlainText">
Where: H1269, 12th floor of the Hall Bldg </div>
<div class="MsoPlainText">
(1455, blvd. De Maisonneuve ouest - Metro Guy-Concordia)<o:p></o:p></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-69167569061802614242015-05-20T15:35:00.000-04:002015-05-20T15:35:30.215-04:00Are Canadian newspapers painting false pictures with data?<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="MsoNormal">
The Canadian newspaper, <i>Globe and Mail</i>, is a leader in diction and style, but it may need improvement in the ‘<a href="http://www.springer.com/gp/book/9780387245447">grammar of graphics</a>’.</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">The Globe’s recent depiction of metropolitan economic growth in the series <i>Off the Charts</i> was way off the mark. The chart plotted the current and forecasted GDP growth rates for select cities in Canada. The red-coloured upward sloping lines depicted cities with increasing economic growth rates and the grey-coloured downward sloping lines highlighted those with slowing economic growth. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">There is, however, a small problem. The chart erroneously showed some slowing economies as growing and vice versa. Furthermore, the trajectory of the sloping lines would mislead readers into assuming that cities with parallel lines enjoyed a similar increase in the growth rate, which, of course, is not true. The graphical faux pas could certainly have been avoided had a bar chart been used.</span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://images.huffingtonpost.com/2015-05-20-1432095788-3018485-Globegrowingeconomiesgdp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://images.huffingtonpost.com/2015-05-20-1432095788-3018485-Globegrowingeconomiesgdp.png" /></a></div>
<div class="MsoNormal">
<span lang="EN-CA">Source: <i>The Globe and Mail</i>, Page B6, May 15.<o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">Of course, the <i>Globe and Mail</i> is not alone in coming up with math that simply doesn’t add up. While covering the Scottish independence vote in September 2014, CNN reported figures implying that <a href="http://www.dailymail.co.uk/news/article-2761778/Something-doesn-t-add-CNN-Reports-110-turnout-Scottish-independence-vote.html">Scots voted at a rate of 110% in the referendum</a>, with 58% voting yes and another 52% voting no.<o:p></o:p></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://images.huffingtonpost.com/2015-05-20-1432095850-8947903-Shouldscotlandbeindependent.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://images.huffingtonpost.com/2015-05-20-1432095850-8947903-Shouldscotlandbeindependent.png" height="208" width="400" /></a></div>
<div class="MsoNormal">
Source: <a href="http://www.dailymail.co.uk/news/article-2761778/Something-doesn-t-add-CNN-Reports-110-turnout-Scottish-independence-vote.html">Mail Online</a>. September 19, 2014</div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">The recent rise of data journalism has witnessed the emergence of data visualization where the editors increasingly reinforce narrative with creative info-graphics. While major news outlets such as <i>The Economist</i>, <i>The New York Times</i>, and the <i>Wall Street Journal</i> retained experts in data science and visualization, most newspapers have entrusted the task to the graphics departments that rely on tools that are not specifically designed for data visualization. At times, the outcome is math- and logic-defying graphics that present a false picture.<o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">Even when charts depict data correctly, the visualizations are at times too complex for the ordinary newsreader to grasp. Powerful data visualization tools, such as D3 (a JavaScript library), are often abused to create graphics too rich in detail to comprehend. The use of <a href="http://bl.ocks.org/mbostock/1044242">Hierarchical Edge Bundling</a>, for instance, is becoming increasingly popular in the news media, resulting in complex graphics that are visually impressive but conceptually confusing.<o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA"><a href="http://www.edwardtufte.com/tufte/index">Edward Tufte</a> and Leland Wilkinson have spent a lifetime advising data enthusiasts on how to present data-driven information. Wilkinson is the author of <a href="http://www.springer.com/gp/book/9780387245447"><i>The Grammar of Graphics</i></a>, which sets out the fundamentals for presenting data. Wilkinson’s writings inspired Hadley Wickham to develop <a href="http://ggplot2.org/">ggplot2</a>, a graphing engine for R, which is increasingly becoming the tool of choice for data scientists. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">Tufte inspired Dona M. Wong, who was the graphics director at the <i>Wall Street Journal</i>. Ms. Wong authored <a href="http://www.amazon.com/Street-Journal-Guide-Information-Graphics/dp/0393072959"><i>The Wall Street Journal Guide to Information Graphics</i></a>. Her book is a quintessential guide for those who work with data and would like to present information as charts. She uses examples from the Journal to illustrate the dos and don’ts of presenting data as info-graphics. <o:p></o:p></span></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
<span lang="EN-CA">Let us return to the forecasted metropolitan growth rates in Canada. I prefer the horizontal bar chart instead. The bar chart offers me several options to highlight the main argument in the story. If I were interested in highlighting cities with the highest gains in growth since 2014, I would sort the cities accordingly, as is illustrated in the graphic on the left (see below). If I were interested in highlighting cities with the highest forecasted growth rate, I would sort them accordingly to result in the graphic on the right.</span></div>
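<div class="MsoNormal">
The two sort orders described above can be sketched in base R. This is a minimal illustration, not the code behind the charts shown here: the city names and growth values are made-up placeholders, and <code>sort_for_bars</code> is a hypothetical helper name.</div>

```r
# Hypothetical placeholder values for illustration only; these are NOT the
# forecasts reported by the Globe and Mail.
growth <- data.frame(
  city          = c("City A", "City B", "City C", "City D"),
  growth_2014   = c(1.0, 2.5, 3.0, 0.5),    # percent, made up
  forecast_2015 = c(2.0, 2.25, 3.5, 1.25),  # percent, made up
  stringsAsFactors = FALSE
)
growth$gain <- growth$forecast_2015 - growth$growth_2014

# Sort ascending by one column. barplot(horiz = TRUE) draws the first row at
# the bottom, so the largest value ends up at the top of the chart.
sort_for_bars <- function(df, by) df[order(df[[by]]), , drop = FALSE]

by_gain     <- sort_for_bars(growth, "gain")           # story: biggest gains
by_forecast <- sort_for_bars(growth, "forecast_2015")  # story: fastest growth

# Two horizontal bar charts side by side, one per story line
op <- par(mfrow = c(1, 2), mar = c(4, 6, 3, 1))
barplot(by_gain$gain, names.arg = by_gain$city, horiz = TRUE, las = 1,
        main = "Gain since 2014", xlab = "Percentage points")
barplot(by_forecast$forecast_2015, names.arg = by_forecast$city,
        horiz = TRUE, las = 1, main = "Forecast growth", xlab = "Percent")
par(op)
```

<div class="MsoNormal">
Sorting the data before plotting, rather than leaving the bars in alphabetical or arbitrary order, is what lets each chart carry a single argument.</div>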
<div class="separator" style="clear: both; text-align: center;">
<a href="http://images.huffingtonpost.com/2015-05-20-1432096033-621065-Barchartgrowingeconomies.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://images.huffingtonpost.com/2015-05-20-1432096033-621065-Barchartgrowingeconomies.png" height="243" width="400" /></a></div>
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
Dona Wong insists on simplicity in rendering. She concludes her book with a simple message for data visualization: simplify, simplify, simplify. The two bar charts simplify the same information presented by the Globe. The results are obvious: I avoid misrepresenting data. One can readily see that Halifax’s economy is forecasted to grow and Vancouver’s to shrink. The Globe’s rendering depicted exactly the opposite.</div>
<div class="MsoNormal">
<br /></div>
<br />
<div class="MsoNormal">
<br /></div>
</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com2tag:blogger.com,1999:blog-1816488797213954066.post-8389352086038289252015-04-23T13:24:00.001-04:002015-04-23T13:24:07.357-04:00UP Express in Toronto: A train less ridden<div dir="ltr" style="text-align: left;" trbidi="on">
What does a billion dollars' worth of transit investment get in Toronto? A piddly 5,000 daily riders. To put things in perspective, dozens of bus routes in Toronto carry more passengers every day than the trips forecasted for the Union-Pearson rail link (UP Express).<br />
<br />
The rail link will connect Canada's two busiest transport hubs: Union Station and Pearson Airport. Despite the high-speed connector between the two hubs, transport authorities expect only 5,000 daily riders on the UP Express. The King Streetcar, in comparison, carries in excess of 65,000 daily riders.<br />
<br />
The UP Express and the Sheppard subway extension are examples of transit money well wasted. A 2009 communiqué by Metrolinx estimated that the Georgetown Expansion (including the UP Express) will cost over a billion dollars. The <i>Globe and Mail</i> reported that the <a href="http://globe2go.newspaperdirect.com/epaper/viewer.aspx?noredirect=true" target="_blank">Ontario government alone had invested $456 million</a> in the UP Express. Instead of investing the scarce transit dollars in projects likely to deliver the highest increase in transit ridership, billions are being spent on projects that will have a marginal impact on traffic congestion in the GTA.<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://www.upexpress.com/images/ArrivaloftheTrain.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" src="http://www.upexpress.com/images/ArrivaloftheTrain.jpg" height="225" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Source: www.upexpress.com</td></tr>
</tbody></table>
With $29 billion in planned transport infrastructure investments, some of which will be publicised Thursday in the Ontario budget, the Province and the City need to get their priorities right. The very least they could do is stop investing in projects that do not generate sufficient transit ridership.<br />
<br />
One may argue that 5,000 fewer trips by automobile to and from the Airport should help ease congestion in the GTA. However, with over 12 million daily trips in the GTA, 5,000 fewer trips are unlikely to make any meaningful difference in traffic congestion. At the same time, taxpayers should focus on the cost-benefit trade-offs for transit investments. Notice the cost-benefit efficiency of the existing TTC bus service to Pearson Airport (192 Airport Rocket), which carries over 4,000 daily passengers. A billion dollars later, the UP Express will move only one thousand additional riders.<br />
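<br />
The arithmetic behind this comparison can be laid out in a few lines of R. This is a back-of-the-envelope sketch that uses only the round figures quoted above ("over" is dropped from each); the variable names are mine, and it ignores operating costs and fares.<br />

```r
# Round figures quoted in the text
up_riders    <- 5000   # forecast daily riders on the UP Express
bus_riders   <- 4000   # daily riders on the 192 Airport Rocket ("over 4,000")
gta_trips    <- 12e6   # daily trips in the GTA ("over 12 million")
capital_cost <- 1e9    # dollars ("over a billion")

share_of_trips <- up_riders / gta_trips     # UP Express share of GTA trips
additional     <- up_riders - bus_riders    # riders beyond the existing bus
cost_per_rider <- capital_cost / up_riders  # capital cost per daily rider
cost_per_added <- capital_cost / additional # capital cost per additional rider

cat(sprintf("Share of daily GTA trips:       %.3f%%\n", 100 * share_of_trips))
cat(sprintf("Additional daily riders:        %s\n",
            format(additional, big.mark = ",")))
cat(sprintf("Capital cost per daily rider:   $%s\n",
            format(cost_per_rider, big.mark = ",", scientific = FALSE)))
cat(sprintf("Capital cost per added rider:   $%s\n",
            format(cost_per_added, big.mark = ",", scientific = FALSE)))
```

On these figures the rail link accounts for roughly 0.04% of daily trips, at about $200,000 of capital per daily rider and about $1 million per additional daily rider.<br />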
<br />
In North America, fewer than 10 airports are connected to local subway or regional rail transit. With the exception of Ronald Reagan Washington National Airport in Washington, DC, most airports accessible by rail report that only about 5% of trips to and from the airport are made by transit. The European experience, though, has been better. Almost 35% of the trips to and from Zurich airport were made on rail-based transit. Munich airport reported 40% of the trips by rail and bus.<br />
<br />
Certain transit network attributes, which are missing for the UP Express, contribute to strong transit ridership to and from airports. For instance, the rail-based service to high transit ridership airports does not terminate at the airport but instead continues further to serve the communities along the corridor. In addition, the airport lines at the successful airports are integrated with the rest of the rail-based transit system, instead of being a standalone line. The UP Express is a standalone rail line that connects to only one terminal at Pearson Airport. The prohibitive fare makes the ride uneconomical for commuters travelling in groups of two or more, who would find a cab ride cheaper and more convenient from most parts of suburban Toronto.<br />
<br />
Two other key factors limit the ridership potential of the UP Express. First, the Billy Bishop Airport near downtown Toronto caters to the short-haul business travel market. It has been argued in the past that business travellers originating in downtown Toronto would rather take the train than a cab to Pearson Airport. Given the frequency of service and choice of destinations served by the Billy Bishop Airport, business travellers increasingly favour the downtown airport, which eats into the UP Express's potential market share.<br />
<br />
In addition, the peak operations at Pearson Airport coincide with the morning and afternoon peak commuting times in Toronto. This implies that one would have to commute to Union Station in the morning and afternoon peak travel periods to ride the UP Express. The extra time and money required to travel to downtown Toronto from the inner suburbs alone will deter riders from using the Union-Pearson rail link.<br />
<br />
The UP Express is yet another monument dedicated to public transit misadventures while the region continues to suffer from gridlock. Getting the transit priorities right is necessary before Ontario doles out $29 billion.</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-27625685843673933872015-04-09T13:50:00.000-04:002015-04-09T13:50:17.712-04:00Stata embraces Bayesian statistics<div dir="ltr" style="text-align: left;" trbidi="on">
<a href="http://www.stata.com/stata14/" target="_blank">Stata 14</a> has just been released. The new and big thing with version 14 is the introduction of <a href="http://www.stata.com/stata14/bayesian-analysis/" target="_blank">Bayesian Statistics</a>. A wide variety of new models can now be estimated with Stata by combining 10 likelihood models, 18 prior distributions, different types of outcomes, and multiple equation models. Stata has also made available a <a href="http://www.stata.com/manuals14/bayes.pdf" target="_blank">255-page reference manual</a> for free to illustrate Bayesian statistical analysis.<br />
<br />
Of course, R already offers numerous options for <a href="http://cran.r-project.org/web/views/Bayesian.html" target="_blank">Bayesian Inference</a>. It will be interesting to hear from colleagues proficient in Bayesian statistics how Stata’s newly added functionality compares with what has already been available in R.<br />
<br />
Given the hype around big data and the newly generated demand for data mining and advanced analytics, it would have been timely for Stata to also add data mining and machine learning algorithms. My two cents: data mining algorithms are in greater demand than Bayesian statistics. Stata users will have to wait for a year or more to see such capabilities. In the meantime, R offers several options for data mining and <a href="http://cran.r-project.org/web/views/MachineLearning.html" target="_blank">machine learning algorithms</a>.</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com3tag:blogger.com,1999:blog-1816488797213954066.post-40220855919523219842014-11-03T13:56:00.000-05:002014-11-03T13:57:15.167-05:00R: What to do when help doesn't arrive<div dir="ltr" style="text-align: left;" trbidi="on">
R is great when it works. Not so much when it doesn’t. This becomes a particular concern when a package is not fully illustrated in the accompanying help documentation and the package's authors don’t respond (in time).<br />
<br />
I am not suggesting that the package authors should respond to every email that they receive. My request is that the documentation should be complete enough so that the authors’ help is no longer required on a day-to-day basis.<br />
<br />
Recently, a colleague in the US and I became interested in the <a href="http://cran.r-project.org/web/packages/mlogit/index.html" target="_blank">mlogit package</a>. We wanted to use the weights option in the package. Just like most other packages, mlogit does not illustrate how to use weights, but advises that the option is available. We assumed that the <a href="http://facweb.knowlton.ohio-state.edu/pviton/courses2/crp5700/5700-mlogit.pdf" target="_blank">weights would work in a certain way</a> (see page 26 of the hyperlinked document). However, when I estimated the model with weights, mlogit did not replicate the results from a popular textbook on econometrics. Here are the details.<br />
<br />
We wanted to see whether the weights option could be used in an <a href="http://facweb.knowlton.ohio-state.edu/pviton/courses2/crp5700/5700-mlogit.pdf" target="_blank">alternative specific logit formulation</a> when the sampled data do not conform to the market shares observed in the underlying population. For instance, in a travel choice model, one may be tempted to oversample train commuters and undersample car commuters because car commuters often far outnumber train commuters for inter-city travel in the underlying population. This is true for most of Canada and the US. In such circumstances, we would weight the data set so that the estimated model reproduces the population market shares rather than the sample shares.<br />
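<br />
To make the idea concrete, here is a minimal sketch in R of how choice-based weights are usually constructed: each observation is weighted by the population share of its chosen alternative divided by that alternative's share in the sample. The shares and the toy sample are hypothetical, and the sketch stops short of passing the weights to mlogit, since how that package treats them is the open question here.<br />

```r
# Hypothetical shares for illustration; not taken from any data set.
population_share <- c(car = 0.90, train = 0.10)  # shares in the population
sample_share     <- c(car = 0.50, train = 0.50)  # shares in the sample

# Weight for each chosen alternative: population share / sample share
w <- population_share / sample_share

# Toy sample of chosen modes that matches the sample shares above
chosen     <- rep(c("car", "train"), times = c(50, 50))
obs_weight <- w[chosen]   # one weight per observation

# The weighted shares reproduce the population shares
weighted_share <- tapply(obs_weight, chosen, sum) / sum(obs_weight)
print(weighted_share)
```

Oversampled train commuters receive a weight below one and undersampled car commuters a weight above one, so the weighted sample mirrors the population.<br />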
<br />
The commercially available software <a href="https://www.limdep.com/products/nlogit/" target="_blank">NLogit/LimDep</a> can do this with ease. I wanted to replicate the results for choice-based weights for the conditional logit model in Professor Bill Greene's book, <em><a href="http://pages.stern.nyu.edu/~wgreene/Text/econometricanalysis.htm" target="_blank">Econometric Analysis</a></em>. This is illustrated on page 853 of the <strong>6th edition</strong> of the text, where Table 23.24 presents the parameter estimates for a conditional (McFadden) logit model for the unweighted and the choice-based weighted versions. I replicated the results using NLogit with a simple addition of population market shares in the two-line syntax. However, the results generated by the mlogit package bear no resemblance to the ones listed in <em>Econometric Analysis</em>.<br />
<br />
It turns out that <a href="http://www.stata.com/features/binary-limited-outcomes/" target="_blank">Stata</a> is also limited in the way it handles weights for the two estimation options: <a href="http://www.stata.com/manuals13/rasclogit.pdf" target="_blank">asclogit</a> and <a href="http://www.stata.com/help.cgi?clogit" target="_blank">clogit</a>. I know this because colleagues at Stata were quite diligent in responding to my requests. It’s not the same with mlogit, which may or may not be able to handle the weights. We will only know when the author responds.<br />
<br />
I recommend that individual authors should not bear the sole responsibility for supporting R packages. An individual could be ill, busy, or unavailable for a variety of reasons. This limitation could be dealt with proactively if the R community collectively generated help documentation with detailed worked-out examples of all available options (including weights), and not just the few frequently used ones.<br />
<br />
Improving documentation will be key to helping R branch out to the everyday users of statistical analysis. The tech-savvy can iron out the kinks. They have the curiosity, patience, and time on their hands. The rest of the world is not that fortunate.<br />
<br />
I propose that users of the packages, and not just the authors, should collaborate to generate help documentation as vignettes and YouTube videos. This will do more to popularize R than another 6,000 new packages that few may know how to work with.</div>
Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-36968854668839399202013-12-08T22:35:00.001-05:002013-12-08T22:44:17.375-05:00Summarize statistics by Groups in R & R Commander<p>R is great at accomplishing complex tasks. Doing simple things with R though takes some effort. Consider the simple task of producing summary statistics for continuous variables over some factor variables. Using Stata, I’d write a brief one-liner to get the mean for one or more variables using another variable as a factor. For instance, <strong><font face="Courier New">tabstat Horsepower RPM, by(Type)</font></strong>  in Stata produces the following:</p> <p><a href="http://lh3.ggpht.com/-LuekJE2D_yo/UqU6ejVL3xI/AAAAAAAABY0/XWzB2qeeizE/s1600-h/image%25255B3%25255D.png"><img title="image" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="image" src="http://lh6.ggpht.com/-B_S3dicYzDA/UqU6fwIlrgI/AAAAAAAABY8/mGyY1Mu93hY/image_thumb%25255B1%25255D.png?imgmax=800" width="395" height="374" /></a></p> <p>The <a href="http://cran.r-project.org/web/packages/doBy/vignettes/doBy.pdf">doBy</a><strong></strong> package in R offers similar functionality and more. Of particular interest for those who teach R based statistics courses in the undergraduate programs is the <strong>doBy plugin</strong> for R Commander. The plugin was developed by <a href="http://www.compmath.com/blog/projects/rcmdrplugin-doby/"><strong>Jonathan Lee</strong></a> and it is a great tool for teaching and for quick data analysis. 
To get the same output as the one listed above, I’d click on the <strong>doBy plugin</strong> to get the following dialogue box:</p> <p><a href="http://lh3.ggpht.com/-MJir5L8WvRE/UqU6hJSxZXI/AAAAAAAABZE/nVs1f9SgHTw/s1600-h/image%25255B7%25255D.png"><img title="image" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="image" src="http://lh4.ggpht.com/-4WocpO2WhBw/UqU6iC12UmI/AAAAAAAABZM/FmPLf6xQNPE/image_thumb%25255B3%25255D.png?imgmax=800" width="408" height="324" /></a></p> <p>The dialogue box results in the following simple syntax:</p> <p><font face="Courier New"><strong>summaryBy(Horsepower+RPM~Type, data=Cars93, FUN=c(mean))</strong></font></p> <p>You may first have to load the data set: <br /><font face="Courier New"><strong>data(Cars93, package="MASS")</strong></font></p> <p>And the results are presented below:</p> <p><a href="http://lh3.ggpht.com/-MH5I8ZhVKGQ/UqU6jaYwnuI/AAAAAAAABZU/MaNZ00-q3ck/s1600-h/image%25255B12%25255D.png"><img title="image" style="border-left-width: 0px; border-right-width: 0px; background-image: none; border-bottom-width: 0px; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-top-width: 0px" border="0" alt="image" src="http://lh3.ggpht.com/-JHITMufspII/UqU6kOzxnBI/AAAAAAAABZc/3Uu8fG0LB5k/image_thumb%25255B6%25255D.png?imgmax=800" width="411" height="191" /></a></p> <p>Jonathan has also created GUIs for order by, sample by, and split by within the same plug-in. A must use plug-in for data scientists.</p> Murtaza Haiderhttp://www.blogger.com/profile/11315309304368143831noreply@blogger.com0tag:blogger.com,1999:blog-1816488797213954066.post-48470399061470627462013-12-08T12:07:00.001-05:002013-12-08T18:06:04.699-05:00Comparing mnlogit and mlogit for discrete choice models<p>Earlier this weekend (Dec. 
7, 2013),<span class="Apple-converted-space"> </span>mnlogit<span class="Apple-converted-space"> </span>was released on CRAN by Wang Zhiyu and Asad Hasan (<a style="color: " href="mailto:asad.hasan@sentrana.com">asad.hasan@sentrana.com</a>) claiming that mnlogit uses “parallel C++ library to achieve fast computation of Hessian matrices”.</p> <p style="font-family: ; white-space: normal; word-spacing: 0px; text-transform: none; color: ; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px">Here is a comparison of<span class="Apple-converted-space"> </span>mnlogit<span class="Apple-converted-space"> </span>with<span class="Apple-converted-space"> </span>mlogit<span class="Apple-converted-space"> </span>by Yves Croissant whose package seems to be the inspiration for mnlogit.</p> <p style="font-family: ; white-space: normal; word-spacing: 0px; text-transform: none; color: ; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px">I will estimate the same model using the same data set and will compare the two packages for execution speed, specification flexibility, and ease of use.</p> <h3>Data set</h3> <p style="font-family: ; white-space: normal; word-spacing: 0px; text-transform: none; color: ; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px">I use the Fish data set to estimate mnlogit and mlogit. 
mnlogit defines the data set as follows:</p> <p style="font-family: ; white-space: normal; word-spacing: 0px; text-transform: none; color: ; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px">A data frame containing :</p> <ul> <li>mode - The choice set: beach, pier, boat, and charter </li> <li>price - price for a mode for an individual </li> <li>catch - fish catch rate for a mode for an individual </li> <li>income - monthly income of the individual decision-maker </li> <li>chid - decision maker ID </li> </ul> <p style="font-family: ; white-space: normal; word-spacing: 0px; text-transform: none; color: ; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px">The authors mention that they have sourced the data from R package mlogit by Yves Croissant, which lists the source as:</p> <ul> <li>Herriges, J. A. and C. L. Kling (1999) “Nonlinear Income Effects in Random Utility Models”,<span class="Apple-converted-space"> </span>Review of Economics and Statistics, 81, 62-72. 
</li> </ul> <h3>Estimation with mnlogit</h3> <pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code class="r" style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><span class="keyword" style="color: "><font color="#0000ff"><font style="font-size: 9pt; background-color: #f8f8f8">library</font></font></span><font style="font-size: 9pt"><font style="background-color: #f8f8f8"><span class="paren" style="color: "><font color="#687687">(</font></span><span class="identifier" style="color: "><font color="#000000">mnlogit</font></span><span class="paren" style="color: "><font color="#687687">)</font></span><br /></font></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><font style="font-size: 9pt" color="#000000">## Warning: package 'mnlogit' was built under R version 3.0.2<br /></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; 
text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><font style="font-size: 9pt" color="#000000">## Package: mnlogit Version: 1.0 Multinomial Logit Choice Models. Scientific<br />## Computing Group, Sentrana Inc, 2013.<br /></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code class="r" style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font style="font-size: 9pt"><br /></font><font face="Lucida Console"><font style="font-size: 9pt"><font style="background-color: #f8f8f8"><span class="identifier" style="color: "><font color="#000000">data</font></span><span class="paren" style="color: "><font color="#687687">(</font></span><font color="#000000"><span class="identifier" style="color: ">Fish</span>, <span class="identifier" style="color: ">package</span> </font><span class="operator" style="color: "><font color="#687687">=</font></span><font color="#000000"> </font><span class="string" style="color: "><font color="#036a07">"mnlogit"</font></span><span class="paren" style="color: "><font color="#687687">)</font></span><br /><font color="#000000"><span class="identifier" style="color: ">fm</span> </font><span class="operator" style="color: "><font color="#687687"><-</font></span><font color="#000000"> <span class="identifier" style="color: 
">formula</span></font><span class="paren" style="color: "><font color="#687687">(</font></span><font color="#000000"><span class="identifier" style="color: ">mode</span> </font><span class="operator" style="color: "><font color="#687687">~</font></span><font color="#000000"> </font><span class="number" style="color: "><font color="#0000cd">1</font></span><font color="#000000"> </font><span class="operator" style="color: "><font color="#687687">|</font></span><font color="#000000"> <span class="identifier" style="color: ">income</span> </font><span class="operator" style="color: "><font color="#687687">|</font></span><font color="#000000"> <span class="identifier" style="color: ">price</span> </font><span class="operator" style="color: "><font color="#687687">+</font></span><font color="#000000"> <span class="identifier" style="color: ">catch</span></font><span class="paren" style="color: "><font color="#687687">)</font></span><br /><span class="identifier" style="color: "><font color="#000000">summary</font></span><span class="paren" style="color: "><font color="#687687">(</font></span><span class="identifier" style="color: "><font color="#000000">mnlogit</font></span><span class="paren" style="color: "><font color="#687687">(</font></span><font color="#000000"><span class="identifier" style="color: ">fm</span>, <span class="identifier" style="color: ">Fish</span>, </font><span class="string" style="color: "><font color="#036a07">"alt"</font></span><font color="#687687"><span class="paren" style="color: ">)</span><span class="paren" style="color: ">)</span></font><br /></font></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 
0px; -webkit-text-stroke-width: 0px"><code style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><font style="font-size: 9pt" color="#000000">## <br />## Call:<br />## mnlogit(formula = fm, data = Fish, choiceVar = "alt")<br />## <br />## Frequencies of alternatives in input data:<br />## beach boat charter pier <br />## 0.113 0.354 0.382 0.151 <br />## <br />## Number of observations in training data = 1182<br />## Number of alternatives = 4<br />## Intercept turned: ON.<br />## Number of parameters in model = 14<br />## # individual specific variables = 2<br />## # choice specific coeff variables = 2<br />## # generic coeff variables = 0<br />## <br />## Maximum likelihood estimation using Newton-Raphson iterations.<br />## Number of iterations: 7<br />## Number of linesearch iterations: 7<br />## At termination: <br />## Gradient 2-norm = 8.19688645245397e-05<br />## Diff between last 2 loglik values = 4.15543581766542e-08<br />## Stopping reason: Succesive loglik difference < ftol (1e-06).<br />## Total estimation time (sec): 0.04<br />## Time for Hessian calculations (sec): 0.04 using 1 processors.<br />## <br />## Coefficients : <br />## Estimate Std.Error t-value Pr(>|t|) <br />## (Intercept):boat 8.64e-01 3.15e-01 2.74 0.00607 ** <br />## (Intercept):charter 1.85e+00 3.10e-01 5.97 2.4e-09 ***<br />## (Intercept):pier 1.13e+00 3.05e-01 3.71 0.00021 ***<br />## income:boat -1.10e-04 6.02e-05 -1.84 0.06636 . 
<br />## income:charter -2.78e-04 6.03e-05 -4.61 4.0e-06 ***<br />## income:pier -1.28e-04 5.33e-05 -2.41 0.01605 * <br />## price:beach -3.80e-02 3.33e-03 -11.41 < 2e-16 ***<br />## price:boat -2.09e-02 2.23e-03 -9.33 < 2e-16 ***<br />## price:charter -1.60e-02 2.02e-03 -7.94 2.0e-15 ***<br />## price:pier -3.92e-02 3.26e-03 -12.02 < 2e-16 ***<br />## catch:beach 4.95e+00 8.20e-01 6.04 1.5e-09 ***<br />## catch:boat 2.47e+00 5.19e-01 4.76 1.9e-06 ***<br />## catch:charter 7.61e-01 1.52e-01 4.99 6.0e-07 ***<br />## catch:pier 4.88e+00 8.99e-01 5.43 5.5e-08 ***<br />## ---<br />## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1<br />## <br />## Log-Likelihood: -1160, df = 14<br />## AIC: 13.8875709217033<br /></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code class="r" style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><span class="identifier" style="color: "><font color="#000000"><font style="font-size: 9pt; background-color: #f8f8f8">system.time</font></font></span><font style="font-size: 9pt"><font style="background-color: #f8f8f8"><font color="#687687"><span class="paren" style="color: ">(</span><span class="paren" style="color: ">(</span></font><span class="identifier" style="color: "><font color="#000000">mnlogit</font></span><span class="paren" style="color: "><font color="#687687">(</font></span><font color="#000000"><span class="identifier" style="color: ">fm</span>, <span class="identifier" style="color: ">Fish</span>, </font><span class="string" style="color: "><font 
color="#036a07">"alt"</font></span><font color="#687687"><span class="paren" style="color: ">)</span><span class="paren" style="color: ">)</span><span class="paren" style="color: ">)</span></font><br /></font></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><font style="font-size: 9pt" color="#000000">## user system elapsed <br />## 0.11 0.00 0.11<br /></font></font></code></pre><br /><br /><h3>Estimation with mlogit</h3><br /><br /><p style="font-family: ; white-space: normal; word-spacing: 0px; text-transform: none; color: ; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px">I estimate the same model using mlogit.</p><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code class="r" style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><span class="keyword" style="color: "><font color="#0000ff"><font style="font-size: 9pt; background-color: #f8f8f8">library</font></font></span><font style="font-size: 9pt"><font style="background-color: #f8f8f8"><span 
class="paren" style="color: "><font color="#687687">(</font></span><span class="identifier" style="color: "><font color="#000000">mlogit</font></span><span class="paren" style="color: "><font color="#687687">)</font></span><br /></font></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><font style="font-size: 9pt" color="#000000">## Loading required package: Formula Loading required package: statmod<br />## Loading required package: lmtest Loading required package: zoo<br />## <br />## Attaching package: 'zoo'<br />## <br />## The following object is masked from 'package:base':<br />## <br />## as.Date, as.Date.numeric<br />## <br />## Loading required package: maxLik Loading required package: miscTools<br />## Loading required package: MASS<br /></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code class="r" style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font style="font-size: 9pt"><br /><font face="Lucida Console"><font style="background-color: #f8f8f8"><font color="#000000"><span 
class="identifier" style="color: ">data2</span> </font><span class="operator" style="color: "><font color="#687687"><-</font></span><font color="#000000"> <span class="identifier" style="color: ">mlogit.data</span></font><span class="paren" style="color: "><font color="#687687">(</font></span><font color="#000000"><span class="identifier" style="color: ">Fish</span>, <span class="identifier" style="color: ">choice</span> </font><span class="operator" style="color: "><font color="#687687">=</font></span><font color="#000000"> </font><span class="string" style="color: "><font color="#036a07">"alt"</font></span><font color="#000000">, <span class="identifier" style="color: ">shape</span> </font><span class="operator" style="color: "><font color="#687687">=</font></span><font color="#000000"> </font><span class="string" style="color: "><font color="#036a07">"long"</font></span><font color="#000000">, <span class="identifier" style="color: ">id.var</span> </font><span class="operator" style="color: "><font color="#687687">=</font></span><font color="#000000"> </font><span class="string" style="color: "><font color="#036a07">"chid"</font></span></font></font></font><font face="Lucida Console"><font style="font-size: 9pt"><font style="background-color: #f8f8f8"><font color="#000000">, <br /> <span class="identifier" style="color: ">alt.levels</span> </font><span class="operator" style="color: "><font color="#687687">=</font></span><font color="#000000"> <span class="identifier" style="color: ">c</span></font><span class="paren" style="color: "><font color="#687687">(</font></span><span class="string" style="color: "><font color="#036a07">"beach"</font></span><font color="#000000">, </font><span class="string" style="color: "><font color="#036a07">"boat"</font></span><font color="#000000">, </font><span class="string" style="color: "><font color="#036a07">"charter"</font></span><font color="#000000">, </font><span class="string" style="color: "><font 
color="#036a07">"pier"</font></span><font color="#687687"><span class="paren" style="color: ">)</span><span class="paren" style="color: ">)</span></font><br /><span class="identifier" style="color: "><font color="#000000">summary</font></span><span class="paren" style="color: "><font color="#687687">(</font></span><font color="#000000"><span class="identifier" style="color: ">mod1</span> </font><span class="operator" style="color: "><font color="#687687"><-</font></span><font color="#000000"> <span class="identifier" style="color: ">mlogit</span></font><span class="paren" style="color: "><font color="#687687">(</font></span><font color="#000000"><span class="identifier" style="color: ">mode</span> </font><span class="operator" style="color: "><font color="#687687">~</font></span><font color="#000000"> </font><span class="number" style="color: "><font color="#0000cd">1</font></span><font color="#000000"> </font><span class="operator" style="color: "><font color="#687687">|</font></span><font color="#000000"> <span class="identifier" style="color: ">income</span> </font><span class="operator" style="color: "><font color="#687687">|</font></span><font color="#000000"> <span class="identifier" style="color: ">price</span> </font><span class="operator" style="color: "><font color="#687687">+</font></span><font color="#000000"> <span class="identifier" style="color: ">catch</span>, <span class="identifier" style="color: ">data2</span>, <span class="identifier" style="color: ">reflevel</span> </font><span class="operator" style="color: "><font color="#687687">=</font></span><font color="#000000"> </font><span class="string" style="color: "><font color="#036a07">"beach"</font></span><font color="#687687"><span class="paren" style="color: ">)</span><span class="paren" style="color: ">)</span></font><br /></font></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; 
white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><font style="font-size: 9pt" color="#000000">## <br />## Call:<br />## mlogit(formula = mode ~ 1 | income | price + catch, data = data2, <br />## reflevel = "beach", method = "nr", print.level = 0)<br />## <br />## Frequencies of alternatives:<br />## beach boat charter pier <br />## 0.113 0.354 0.382 0.151 <br />## <br />## nr method<br />## 7 iterations, 0h:0m:0s <br />## g'(-H)^-1g = 8.31E-08 <br />## gradient close to zero <br />## <br />## Coefficients :<br />## Estimate Std. Error t-value Pr(>|t|) <br />## boat:(intercept) 8.64e-01 3.15e-01 2.74 0.00607 ** <br />## charter:(intercept) 1.85e+00 3.10e-01 5.97 2.4e-09 ***<br />## pier:(intercept) 1.13e+00 3.05e-01 3.71 0.00021 ***<br />## boat:income -1.10e-04 6.02e-05 -1.84 0.06636 . <br />## charter:income -2.78e-04 6.03e-05 -4.61 4.0e-06 ***<br />## pier:income -1.28e-04 5.33e-05 -2.41 0.01605 * <br />## beach:price -3.80e-02 3.33e-03 -11.41 < 2e-16 ***<br />## boat:price -2.09e-02 2.23e-03 -9.33 < 2e-16 ***<br />## charter:price -1.60e-02 2.02e-03 -7.94 2.0e-15 ***<br />## pier:price -3.92e-02 3.26e-03 -12.02 < 2e-16 ***<br />## beach:catch 4.95e+00 8.20e-01 6.04 1.5e-09 ***<br />## boat:catch 2.47e+00 5.19e-01 4.76 1.9e-06 ***<br />## charter:catch 7.61e-01 1.52e-01 4.99 6.0e-07 ***<br />## pier:catch 4.88e+00 8.99e-01 5.43 5.5e-08 ***<br />## ---<br />## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1<br />## <br />## Log-Likelihood: -1160<br />## McFadden R^2: 0.225 <br />## Likelihood ratio test : chisq = 675 (p.value = <2e-16)<br /></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code class="r" style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><span class="identifier" style="color: "><font color="#000000"><font style="font-size: 9pt; background-color: #f8f8f8">system.time</font></font></span><font style="font-size: 9pt"><font style="background-color: #f8f8f8"><span class="paren" style="color: "><font color="#687687">(</font></span><span class="identifier" style="color: "><font color="#000000">mlogit</font></span><span class="paren" style="color: "><font color="#687687">(</font></span><font color="#000000"><span class="identifier" style="color: ">mode</span> </font><span class="operator" style="color: "><font color="#687687">~</font></span><font color="#000000"> </font><span class="number" style="color: "><font color="#0000cd">1</font></span><font color="#000000"> </font><span class="operator" style="color: "><font color="#687687">|</font></span><font color="#000000"> <span class="identifier" style="color: ">income</span> </font><span class="operator" style="color: "><font color="#687687">|</font></span><font color="#000000"> <span class="identifier" style="color: ">price</span> </font><span class="operator" style="color: "><font color="#687687">+</font></span><font color="#000000"> <span class="identifier" style="color: ">catch</span>, <span class="identifier" style="color: 
">data2</span>, <span class="identifier" style="color: ">reflevel</span> </font><span class="operator" style="color: "><font color="#687687">=</font></span><font color="#000000"> </font><span class="string" style="color: "><font color="#036a07">"beach"</font></span><font color="#687687"><span class="paren" style="color: ">)</span><span class="paren" style="color: ">)</span></font><br /></font></font></font></code></pre><br /><br /><pre style="max-width: 95%; border-top: rgb(204,204,204) 1px solid; font-family: ; border-right: rgb(204,204,204) 1px solid; white-space: pre-wrap; word-spacing: 0px; border-bottom: rgb(204,204,204) 1px solid; text-transform: none; color: ; border-left: rgb(204,204,204) 1px solid; margin-top: 0px; letter-spacing: normal; line-height: normal; text-indent: 0px; -webkit-text-stroke-width: 0px"><code style="font-family: ; padding-bottom: 6px; padding-top: 6px; padding-left: 6px; display: block; padding-right: 6px"><font face="Lucida Console"><font style="font-size: 9pt" color="#000000">## user system elapsed <br />## 0.27 0.02 0.28<br /></font></font></code></pre><br /><br /><h3>Speed Tests</h3><br /><br /><p>I conducted a simple test for execution times with the command<span class="Apple-converted-space"> </span>system.time. The results are reported above after each model summary.</p><br /><br /><h3>Findings</h3><br /><br /><li>I obtain identical results for the models estimated with<span class="Apple-converted-space"> </span>mnlogit<span class="Apple-converted-space"> </span>and<span class="Apple-converted-space"> </span>mlogit. </li><br /><br /><li>Estimation speeds appear faster for mnlogit. </li><br /><br /><li><strong>caveat</strong>: The same command (system.time) when run independently of the R Markdown environment shows mlogit to be faster than mnlogit! </li><br /><br /><li><strong>Additional Comments</strong>: I restarted RStudio and estimated the two models again outside of R Markdown. 
mnlogit took 0.12 seconds versus 0.31 seconds for mlogit. </li><br /><br /><li><strong>Verdict</strong>: mnlogit reports shorter execution times than mlogit. </li><br /><br /><li>Also, estimation speeds may differ with larger and more complex data sets. </li><br /><br /><li>The model specification is simpler in mnlogit. </li><br /><br /><li>The commands for specifying the data set and the model seem easier in mnlogit. </li><br /><br /><li>mlogit syntax seems relatively complex, but it offers more choices in model specification. </li><br /><br /><h3>Final comment</h3><br /><br /><p>I am delighted to see yet another package for discrete choice models. As the field evolves and more advanced problems are subjected to discrete choice models, the need for robust and computationally efficient packages will only grow. Both mlogit and mnlogit are indeed valuable to those interested in how humans make good and bad choices!</p><br /><br />
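<p>A note on the speed tests: a single system.time call can be noisy, which may explain why the timings changed between sessions. The following is a minimal sketch, not the method used in this post, of repeating a timing and summarising the elapsed seconds. The helper time_repeated and the stand-in workload dummy_fit are hypothetical names introduced only for illustration; the commented lines at the end assume the objects fm, Fish and data2 created earlier.</p>

```r
# Hypothetical helper (not part of mnlogit or mlogit): run a fitting
# function several times and summarise the elapsed wall-clock seconds.
time_repeated <- function(fit_fun, reps = 20) {
  elapsed <- vapply(
    seq_len(reps),
    function(i) system.time(fit_fun())[["elapsed"]],
    numeric(1)
  )
  c(median = median(elapsed), min = min(elapsed), max = max(elapsed))
}

# Stand-in workload so the sketch runs without either package installed.
dummy_fit <- function() {
  x <- matrix(rnorm(200 * 5), ncol = 5)
  y <- rbinom(200, 1, 0.5)
  glm(y ~ x, family = binomial())
}

timings <- time_repeated(dummy_fit, reps = 10)
print(timings)

# Assumed usage with the objects from this post:
# time_repeated(function() mnlogit(fm, Fish, "alt"))
# time_repeated(function() mlogit(mode ~ 1 | income | price + catch,
#                                 data2, reflevel = "beach"))
```

<p>Comparing medians over repeated runs, rather than one run each, should make the comparison less sensitive to whatever else the session happens to be doing.</p>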