Saturday, March 10, 2018

R: simple for complex tasks, complex for simple tasks

When it comes to undertaking complex data science projects, R is the preferred choice for many. Why? Because handling complex tasks is simpler in R than in other comparable platforms.

Regrettably, the same is not true for simpler tasks, which I would argue are rather complex to perform in base R. Hence the title -- R: simple for complex tasks, complex for simple tasks.

Consider the simple, yet mandatory, task of generating summary statistics for a research project involving tabular data. In most social science disciplines, summary statistics for continuous variables require the mean, standard deviation, number of observations, and perhaps the minimum and maximum. One would have hoped to see a function in base R to generate such a table. But there isn’t one.

Of course, several user-written packages, such as psych, can generate descriptive statistics in a tabular format. However, this requires one to have advanced knowledge of R and of the capabilities hidden in specialized packages, whose number now exceeds 12,000 (as of March 2018). Keeping abreast of the functionality embedded in user-written packages is time-consuming.
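For instance, a minimal sketch using the psych package (assuming it has been installed) looks like this:

library(psych)                              # user-written package; install.packages("psych") if needed
describe(mtcars[c("mpg", "disp", "hp")])    # returns n, mean, sd, min, max, and more in one table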

Some would argue that the summary command in base R is an option. I humbly disagree.

First, the output from summary is not in a tabular format that one could just copy and paste into a document. It would require significant processing before a simple table with summary statistics for more than one continuous variable could be generated. Second, the summary command does not report the standard deviation.

I teach business analytics to undergraduate and MBA students. While business students need to know statistics, they are not studying to become statisticians. Their goal in life is to be informed and proficient consumers of statistical analysis.

So, imagine an undergraduate class of 150 students learning to generate a simple table that reports summary statistics for more than one continuous variable. This simple task requires knowledge of several R commands. By the time these commands have been taught, most students have made up their minds to do the analysis in Microsoft Excel instead.

Had there been a simple command to generate descriptive statistics in base R, this would not be a challenge for instructors trying to bring thousands more into R’s fold.

In the following paragraphs, I will illustrate the challenge with an example and identify an R package that generates a simple table of descriptive statistics.

I use the mtcars dataset, which ships with R. The following commands load the dataset and display the first few observations with all the variables.

data(mtcars)
head(mtcars)

As stated earlier, one can use the summary command to produce descriptive statistics.

summary(mtcars)

Let’s say one would like to generate descriptive statistics, including the mean, standard deviation, and number of observations, for the following continuous variables: mpg, disp, and hp. One can use the sapply command to generate the three statistics separately and combine them later using the cbind command.

The following command will create a vector of means.

mean.cars = with(mtcars, sapply(mtcars[c("mpg", "disp",  "hp")], mean))

Note that the above syntax requires someone learning R to know the following:

1.    Either attach the dataset or use the with command so that sapply can recognize the variables.
2.    Knowledge of how to subset variables in R.
3.    Familiarity with c() to combine variables.
4.    Awareness that variable names must be enclosed in quotes.

We can use similar syntax to determine standard deviation and the number of observations.

sd.cars = with(mtcars, sapply(mtcars[c("mpg", "disp",  "hp")], sd)); sd.cars
n.cars = with(mtcars, sapply(mtcars[c("mpg", "disp",  "hp")], length)); n.cars

Note that the user needs to know that the command for the number of observations is length and the command for the standard deviation is sd.

Once we have the three vectors, we can combine them using cbind, which generates the following table.

cbind(n.cars, mean.cars, sd.cars)

     n.cars mean.cars    sd.cars
mpg      32  20.09062   6.026948
disp     32 230.72188 123.938694
hp       32 146.68750  68.562868

Again, one needs to know the round command to restrict the output to a specific number of decimal places. Below is the output with two decimal places.

round(cbind(n.cars, mean.cars, sd.cars),2)

     n.cars mean.cars sd.cars
mpg      32     20.09    6.03
disp     32    230.72  123.94
hp       32    146.69   68.56

One can indeed use a custom function to generate the same table with a single command. See below.

round(with(mtcars, t(sapply(mtcars[c("mpg", "disp",  "hp")],
                    function(x) c(n=length(x), avg=mean(x),
                    stdev=sd(x))))), 2)

      n    avg  stdev
mpg  32  20.09   6.03
disp 32 230.72 123.94
hp   32 146.69  68.56

But the question I have for my fellow instructors is the following: how likely is an undergraduate student taking an introductory course in statistical analysis to be enthused about R if the simplest of tasks needs multiple lines of code? A simple function in base R would keep students focused on interpreting data rather than worrying about a missing comma or parenthesis.

stargazer* is an R package that simplifies this task. Here is the output from stargazer.

library(stargazer)
stargazer(mtcars[c("mpg", "disp",  "hp")], type="text")

============================================
Statistic N   Mean   St. Dev.  Min     Max  
--------------------------------------------
mpg       32 20.091   6.027   10.400 33.900 
disp      32 230.722 123.939  71.100 472.000
hp        32 146.688  68.563    52     335  
--------------------------------------------

A simple task, I argue, should be accomplished simply. My plea is for base R to include a simple command that generates the above table with a call as simple as the one below:

descriptives(mpg, disp, hp)
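In the meantime, here is a minimal sketch of what such a helper might look like. descriptives() is a hypothetical function, not part of base R, written here to take a data frame and quoted variable names for simplicity:

descriptives <- function(data, ...) {
  # select the requested columns
  vars <- data[, c(...), drop = FALSE]
  # compute n, mean, sd, min, and max for each column
  out <- sapply(vars, function(x) c(n = sum(!is.na(x)),
                                    mean = mean(x, na.rm = TRUE),
                                    sd = sd(x, na.rm = TRUE),
                                    min = min(x, na.rm = TRUE),
                                    max = max(x, na.rm = TRUE)))
  round(t(out), 2)
}

descriptives(mtcars, "mpg", "disp", "hp")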


*  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.  R package version 5.2. http://CRAN.R-project.org/package=stargazer



Wednesday, March 7, 2018

Is it time to ditch the Comparison of Means (T) Test?

For over a century, academics have been teaching the Student’s T-test and practitioners have been running it to determine if the mean values of a variable for two groups were statistically different.

It is time to ditch the Comparison of Means (T) Test and rely instead on the ordinary least squares (OLS) Regression.

My motivation for this suggestion is to reduce the learning burden on non-statisticians whose goal is to find a reliable answer to their research question. The current practice is to devote a considerable amount of teaching and learning effort on statistical tests that are redundant in the presence of disaggregate data sets and readily available tools to estimate Regression models.

Before I proceed any further, I must confess that I remain a huge fan of William Sealy Gosset, who introduced the T-statistic in 1908. He excelled in both intellect and academic generosity: Mr. Gosset published the very paper that introduced the t-statistic under the pseudonym “Student.” To this day, the T-test is known as the Student’s T-test.

My plea is to replace the Comparison of Means (T-test) with OLS Regression, which of course relies on the T-test. So I am not asking to ditch the T-test itself; I am asking to replace the Comparison of Means Test with OLS Regression.

The following are my reasons:

1.       Pedagogy-related reasons:
a.       Teaching Regression in place of the intermediary tests saves instructors considerable time, which can then be used to illustrate the same concepts with Regression-based examples.
b.       Given that there are fewer than 39 hours of lecture time in a single-semester introductory course on applied statistics, much of the course is consumed by statistical tests that would be redundant if Regression models were introduced sooner.
c.        Academic textbooks for undergraduate students in business, geography, and psychology dedicate huge sections to a battery of tests that are redundant in the presence of Regression models.
         i.      Consider the widely used textbook Business Statistics by Ken Black, which requires students and instructors to leaf through 500 pages before OLS Regression is introduced.
         ii.     The learning requirements of undergraduate and graduate students not enrolled in economics, mathematics, or statistics programs are quite different. Yet most textbooks and courses attempt to turn all students into professional statisticians.
2.       Applied analytics reasons:
a.       An OLS Regression model with a continuous dependent variable and a dichotomous explanatory variable produces the exact same output as the standard Comparison of Means Test (a quick simulated check in R appears after this list).
b.       Extending the comparison to more than two groups is straightforward in Regression, where the grouping (explanatory) variable simply comprises more than two categories.
         i.      In the traditional statistics teaching approach, one advises students that the T-test is not valid for comparing the means of more than two groups and that they must switch to learning a new method, ANOVA.
         ii.     You might have caught my drift that I am also proposing to replace teaching ANOVA in introductory statistics courses with OLS Regression.
c.        A Comparison of Means Test illustrated as a Regression model is much easier to explain than the output from a conventional T-test.
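To make point 2.a concrete, here is a quick check, not from the original post, using simulated data in R; the group labels and outcome are made up purely for illustration:

set.seed(123)
group <- factor(rep(c("control", "treatment"), each = 50))  # two groups of 50
y <- 5 + 0.4 * (group == "treatment") + rnorm(100)          # simulated outcome
t.test(y ~ group, var.equal = TRUE)   # pooled-variance comparison of means
summary(lm(y ~ group))                # the t-value on the group dummy matches (up to sign)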

After introducing Normal and T-distributions, I would, therefore, argue that instructors should jump straight to Regression models.

Is Regression a valid substitute for T-tests?


In the following lines, I will illustrate that the output generated by OLS Regression models and a Comparison of Means test are identical. I will illustrate examples using Stata and R.

Dataset

I will use Professor Daniel Hamermesh’s data on teaching ratings to illustrate the concepts. In a popular paper, Professor Hamermesh and Amy Parker explore whether good-looking professors receive higher teaching evaluations. The dataset comprises the teaching evaluation score, beauty score, and instructor- and course-related metrics for 463 courses and is available for download in R, Stata, and Excel formats at:


Hypothetically Speaking

Let us test the hypothesis that the average beauty score for male and female instructors is statistically different. The average (normalized) beauty score was -0.084 for male instructors and 0.116 for female instructors. The following Box Plot illustrates the difference in beauty scores. The question we would like to answer is whether this apparent difference is statistically significant.


I first illustrate the Comparison of Means Test and OLS Regression assuming Equal Variances in Stata.

Assuming Equal Variances (Stata)


Download data
       use "https://sites.google.com/site/statsr4us/intro/software/rcmdr-1/teachingratings.dta"
encode gender, gen(sex) // To convert a character variable into a numeric variable.

The T-test is conducted using:

       ttest beauty, by(sex)

The above generates the following output:


As per the above estimates, the average beauty score of female instructors is 0.2 points higher than that of male instructors, and the t-statistic is 2.7209. We can generate the same output by running an OLS regression model using the following command:

reg beauty i.sex

The regression model output is presented below.



Note that the average beauty score for male instructors is 0.2 points lower than that of female instructors (hence the coefficient of -0.2), and the associated standard errors and t-values (highlighted in yellow) are identical to those reported in the Comparison of Means test.

Unequal Variances


But what about unequal variances? Let us first conduct the t-test using the following syntax:

ttest beauty, by(sex) unequal

The output is presented below:



Note the slight change in the standard error and the associated t-statistic.

To replicate the same results with a Regression model, we need to run a different Stata command, vwls, which estimates a variance-weighted least squares regression:

vwls beauty i.sex


Note that the last two outputs are identical. 

Repeating the same analysis in R


To download data in R, use the following syntax:

url = "https://sites.google.com/site/statsr4us/intro/software/rcmdr-1/TeachingRatings.rda"
download.file(url, "TeachingRatings.rda", mode = "wb")  # mode = "wb" keeps the binary .rda file intact on Windows
load("TeachingRatings.rda")

For equal variances, the following syntax is used for the T-test and the OLS regression model (assuming the .rda file loads a data frame named TeachingRatings).
t.test(beauty ~ gender, data = TeachingRatings, var.equal = TRUE)
summary(lm(beauty ~ gender, data = TeachingRatings))

The above generates output identical to Stata’s.


For unequal variances, we need to install and load the nlme package to run a gls version of the variance-weighted least squares regression model (again assuming the TeachingRatings data frame).

t.test(beauty ~ gender, data = TeachingRatings)
install.packages("nlme")
library(nlme)
summary(gls(beauty ~ gender, data = TeachingRatings,
            weights = varIdent(form = ~ 1 | gender)))

The above generates the following output:


So there we have it: OLS Regression is an excellent substitute for the Comparison of Means test.



Sunday, January 28, 2018

When it comes to Amazon's HQ2, you should be careful what you wish for

By Murtaza Haider and Stephen Moranis
Note: This article originally appeared in the Financial Post on January 25, 2018

Amazon.com Inc. has turned the search for a home for its second headquarters (HQ2) into an episode of The Bachelorette, with cities across North America trying to woo the online retailer.
The Seattle-based tech giant has narrowed down the choice to 20 cities, with Toronto being the only Canadian location in the running.
While many in Toronto, including its mayor, are hoping to be the ideal suitor for Amazon HQ2, one must be mindful of the challenges such a union may pose.
Amazon announced in September last year that its second headquarters will employ 50,000 high-earners with an average salary of US$100,000. It will also require 8 million square feet (SFT) of office and commercial space.
A capacity-constrained city with a perennial shortage of affordable housing and limited transport capacity, Toronto may be courting trouble by pursuing thousands of additional highly-paid workers. If you think housing prices and rents are unaffordable now, wait until the Amazon code warriors land to fight you for housing or a seat on the subway.
The tech giants do command a much more favourable view in North America than they do in Europe. Still, their reception varies, especially in the cities where these firms are domiciled. Consider San Francisco, which is home to not one but many tech giants and ever mushrooming startups. The city attracts high-earning tech talent from across the globe to staff innovative labs and R&D departments.
These highly paid workers routinely outbid locals and other workers in housing and other markets. No longer can one ask for a conditional sale offer that is subject to financing because a 20-something whiz kid will readily pay cash to push other bidders aside.
We wonder whether Toronto’s residents, or those of whichever city ultimately wins Amazon’s heart, will face the same competition from Amazon employees as the residents of Seattle do. The answer lies in the relative affordability gap.
Amazon employees with an average income of US$100,000 will compete against Toronto residents whose individual median income in 2015 was just $30,089. It is quite likely that the bidding wars that high-earning tech workers have won hands down in other cities will end in their favour in the city chosen for Amazon HQ2.
While we are mindful of the challenges that Amazon HQ2 may pose for a capacity-constrained Toronto, we are also alive to the opportunities it will present. For starters, Toronto can use 50,000 high-paying jobs.

GIG ECONOMY

The emergence of the gig economy has had an adverse impact on the City of Toronto, where employment growth has been largely concentrated in part-time work. Between 2006 and 2016, full-time jobs grew by a mere 8.7 per cent in Toronto, while the number of part-time jobs grew at four times that rate.
Toronto is already the largest employment hub in Canada, with an office inventory of roughly 180 million square feet, and an influx of 8 million square feet of first-rate office space would improve the overall quality of its commercial real estate. It could also be a boon for office construction and a significant source of new property tax revenue for the city.
But those hoping the city itself might make money should seriously consider the fate of cities lucky enough to host the Olympics, which more often than not end up costing cities billions more than they budgeted for.
Toronto may still pursue Amazon HQ2, but it should do so with the full knowledge of its strengths and vulnerabilities. At the very least, it should create contingency plans to address the resulting infrastructure deficit (not just public transit) and housing affordability issues before it throws open its doors for Amazon.
Murtaza Haider is an associate professor at Ryerson University. Stephen Moranis is a real estate industry veteran. They can be reached at info@hmbulletin.com.

Friday, June 16, 2017

Did the cold weather put a chill on Toronto’s housing market?

Toronto’s housing market took a dive in May. After years of record highs in housing sales and prices, the hype seems to have evaporated. While some link the slowdown to the Ontario government’s legislation to tighten lending in housing markets, one should also factor in the unusually cold, dark, and wet weather in May that felt more like a “May-be.”
          
Housing sales in the Greater Toronto Area (GTA) were down 23% last month from a year earlier. However, the average sale price was 14.8% higher than in May 2016. On a month-over-month basis, housing prices in May were down 6% from April.

The declining numbers have alarmed homebuyers, sellers, brokerages, and governments. Many are questioning whether the Ontario government’s intervention has had a more adverse impact than was intended. Homebuyers who have not yet closed on properties are wondering whether they have paid too much, while sellers are rushing to list properties to benefit from high housing prices that appear to be past their peak.

While it is too early to determine the ‘causal impact’ of the legislative changes introduced in April, which included a 15% tax on foreign home buyers, one must also consider other contributing factors that might have affected Toronto’s housing market. We must even consider the influence of the weather.

The unusually cold weather in May might have had a chilling effect on housing sales. Typically, housing markets start to heat up in April, in sync with rising temperatures. May 2017 was unusually wet: Toronto received a total of 157 mm of precipitation last month compared to 25 mm a year ago. The unusually high rainfall caused flooding all across Ontario; in downtown Toronto, Lake Ontario water rushed into lakefront condos. At the same time, May 2017 was unusually cool. The average temperature last month was 12 degrees Celsius compared to 16 degrees in May 2016. May was also unusually dark, with much less sunshine. Toronto has seen this trend since January 2017, when it received a mere 50 hours of sunshine compared to the seasonal average of 85 hours.


So why should unusually cold, dark, and wet weather have any impact on housing markets? Research has shown that weather and atmosphere influence consumer behavior. Retail experts call this phenomenon ‘store atmospherics’, whereby a store’s environment is altered to encourage behavior that promotes sales. It applies to housing markets as well. Researchers have found that adverse weather has a significant, yet short-term, effect on economic activity. Writing in Real Estate Economics, John Goodman Jr. found a slight adverse impact of unseasonable weather on housing markets. In related work, researchers found that the sale prices of homes with central air-conditioning and swimming pools are higher for sales recorded in summer months.

There are other factors to consider in assessing the market dip. The Ontario government’s regulations to tighten housing markets could have encouraged some homebuyers to advance their purchase to avoid uncertainty. The government’s plans to impose new restrictions on housing markets were known in advance of their announcement in April, and investors are averse to risk and uncertainty. Hence, some homebuyers could have advanced their purchase to March, when sales unexpectedly jumped by 50% over February 2017. Those who could not advance their purchase to March may have decided to sit out the confusion and wait for calmer markets to prevail.

In earlier research, we documented a similar trend for housing sales in Toronto, when sales escalated in 2007 in advance of Toronto’s new land transfer tax, which was implemented in February 2008. The additional sales recorded in 2007 meant that fewer sales were realized in 2008. The sales activity returned to the long-term trends in a couple of years. 

And if this was not enough, financial troubles at the alternative mortgage lender, Home Capital, spooked borrowers who were not deemed mortgage worthy by the mainstream Canadian banks. Many real estate professionals believe the cumulative effect of unseasonal weather, tightening of mortgage regulations, and troubles at alternative lenders were likely the reason behind the declining housing sales and prices.

The roof is not collapsing on Toronto’s housing market. The decline in sales and prices is a rational response by homebuyers and sellers who are reacting to the Ontario government’s initiatives to tighten lending in housing markets. The cold, dark, and wet weather certainly did not help either.

Wednesday, September 14, 2016

Data Science 101, now online

We are delighted to note that IBM's BigDataUniversity.com has launched the quintessential introductory course on data science aptly named Data Science 101.

The target audience for the course is the uninitiated cohort that is curious about data science and would like to take the baby steps to a career in data and analytics. Needless to say, the course is for absolute beginners.

To get a taste of the course, watch the following video, "What is Data Science?"


Here is the curriculum:

  • Module 1 - Defining Data Science
    • What is data science?
    • There are many paths to data science
    • Any advice for a new data scientist?
    • What is the cloud?
    • "Data Science: The Sexiest Job in the 21st Century"
  • Module 2 - What do data science people do?
    • A day in the life of a data science person
    • R versus Python?
    • Data science tools and technology
    • "Regression"
  • Module 3 - Data Science in Business
    • How should companies get started in data science?
    • R versus Python
    • Tips for recruiting data science people
    • "The Final Deliverable"
  • Module 4 - Use Cases for Data Science
    • Applications for data science
    • "The Report Structure"
  • Module 5 - Data Science People
    • Things data science people say
    • "What Makes Someone a Data Scientist?"
Want to learn more about IBM's Big Data University? Click HERE.

Thursday, September 1, 2016

The X-Factors: Where 0 means 1

Hadley Wickham in a recent blog post mentioned that "Factors have a bad rap in R because they often turn up when you don’t want them." I believe Factors are an even bigger concern. They not only turn up where you don't want them, but they also turn things around when you don't want them to.

Consider the following example, where I present a data set with two variables: x and y. I represent age in years as 'y' and gender as a binary (0/1) variable 'x', where 1 represents males.

I compute the means for the two variables as follows:
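The code in the original post appears as an image; a minimal sketch of the idea, assuming the data sit in a hypothetical data frame df with columns y (age) and x (the 0/1 gender indicator), would be:

mean(df$y)   # average age, reported in the text as 43.6 years
mean(df$x)   # share of males, reported in the text as 0.454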


The average age is 43.6 years, and 0.454 suggests that 45.4% of the sample comprises males. So far so good. 

Now let's see what happens when I convert x into a factor variable using the following syntax:
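Again, the original code is an image; a sketch of the conversion, using the same hypothetical df, might look like this:

df$male <- factor(df$x, levels = c(0, 1), labels = c("female", "male"))  # label 0/1 as female/male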


The above code adds a new variable male to the data set, and assigns labels female and male to the categories 0 and 1 respectively.

I compute the average age for males and females as follows:
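A sketch of that computation with the hypothetical df:

tapply(df$y, df$male, mean)   # average age for females and males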


See what happens when I try to compute the mean for the variable 'male'.
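A sketch of the attempt with the hypothetical df, along with the result R typically returns:

mean(df$male)
# [1] NA
# Warning message: argument is not numeric or logical: returning NA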


Once you convert a variable to a factor, you can't compute statistics such as the mean or standard deviation on it. To do so, you need to convert the factor back to a numeric variable. I create a new variable gender that converts the male variable to a numeric one.

I recompute the means below. 
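A sketch of the conversion and the recomputed means (hypothetical df):

df$gender <- as.numeric(df$male)   # underlying factor codes are 1 and 2, not 0 and 1
tapply(df$y, df$male, mean)        # average age by group is unchanged
mean(df$gender)                    # now 1.45 rather than the original 0.45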


Note that the average for males is 1.45 and not 0.45. Why? When we created the factor variable, it turned zeros into ones and ones into twos. Let's look at the data set below:


Several algorithms in R expect a binary variable to be in 0/1 form. If this condition is not satisfied, the command returns an error. For instance, when I try to estimate the logit model with gender as the dependent variable and x as the explanatory variable, R generates the following error:
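A sketch of the offending call with the hypothetical df, along with the error base R typically raises:

glm(gender ~ x, data = df, family = binomial)   # gender is coded 1/2, not 0/1
# Error in eval(family$initialize) : y values must be 0 <= y <= 1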

Factor or no factor, I would prefer my zeros to stay as zeros!



Monday, August 22, 2016

Five Questions about Data Science

---

Recently, we were able to ask five questions of Murtaza Haider, about the new book from IBM Press called “Getting Started with Data Science: Making Sense of Data with Analytics.” Below, the author talks about the benefits of data science in today’s professional world.

Getting Started with Data Science

1. What are some examples of data science altering or impacting traditional professional roles already?

Only a few years ago, a job with the title Chief Data Scientist did not exist. But that was then. Small and large corporations, and increasingly government agencies, are putting together teams of data scientists and analysts under the leadership of Chief Data Scientists. Even the White House has a Chief Data Scientist position, currently held by Dr. DJ Patil.

The traditional role for those who analyzed data was that of a computer programmer or a statistician. In the past, firms collected large amounts of data to archive rather than to subject it to analytics to assist with smart decision-making. Companies did not see value in turning data into insights and instead relied on the gut feeling of managers and anecdotal evidence to make decisions.

Big data and analytics have alerted businesses and governments to the latent potential of turning bits and bytes into profits. To enable this transformation, hundreds of thousands of data scientists and analysts are needed. Recent reports suggest that the shortage of such professionals will be in the millions. No wonder we see hundreds of postings for data scientists on LinkedIn.

As businesses increasingly depend upon analytics-driven decision-making, data scientists and analysts are becoming front-office superstars, quite a change from their back-office roles in the past.

2. What steps can a professional take today to learn how and why to implement data science into their current role?

Sooner rather than later, workers will find their managers asking them to assume additional responsibilities that involve dealing with data and either generating or consuming analytics. Smart professionals who are uninitiated in data science would therefore do well to proactively address this gap in their portfolio by acquiring skills in data science and analytics. Fortunately, in a world awash with data, the opportunities to acquire analytic skills are also ubiquitous.

For starters, professionals should consider enrolling in open online courses offered by the likes of Coursera and BigDataUniversity.com. These platforms offer a wide variety of training opportunities for beginners and advanced users of data and analytics. At the same time, most of these offerings are free.

For those professionals who would like to pursue a more structured approach, I suggest that they consider continuing education programs offered by the local universities focusing on data and analytics. While working full-time, the professionals can take part-time courses in data science to fill the gap in their learning and be ready to embrace impending change in their roles.

3. Do you need programming experience to get started in data? What kind of methods and techniques can you utilize in a program more commonly used, such as Excel?

Computer programming skills are a definite plus for data scientists, but they are certainly not a limiting factor that would prevent those trained in other disciplines from joining the world of data scientists. In my book, Getting Started with Data Science, I mentioned examples of individuals who took short courses in data science and programming after graduating from non-empirical disciplines, and subsequently were hired in data scientist roles that paid lucrative salaries.

The choice of analytics tools depends largely on the discipline and the type of organization you are currently working for or intend to work for in the future. If you intend to work for corporations that generate truly big data, such as telecom and Internet-based establishments, you need to be proficient in big data tools, such as Spark and Hadoop. If you would like to be employed in the industry that tracks social media, you will require skills in natural language processing and proficiency in languages such as Python. If you happen to be interested in a traditional market research firm, you need proficiency in analytics software, such as SPSS and R.

If your focus is on small and medium-sized enterprises, proficiency in Excel could be a great asset, allowing you to deploy its analytics capabilities, such as Pivot Tables, to work with small data sets.

A successful data scientist is one who knows some programming, has a basic understanding of statistical principles, possesses a curious mind, and is capable of telling great stories. I argue that without storytelling capabilities, a data scientist will be limited in his or her ability to become a leader in the field.

4. How do you see data science affecting education and training moving forward? What benefits will it bring to learning at all levels?

Schools, colleges, universities and others involved in education and learning are putting big data and analytics to good use. Universities are crunching large amounts of data to determine what gaps in learning at the high school level act as impediments to success in the future. Schools are improving not just curriculum, but also other strategies to improve learning outcomes. For instance, research in India using large amounts of data showed that when children in low-income communities were offered free meals at school, their dropout rates declined and their academic achievements improved.

Big data and analytics provide instructors and administrators the opportunity to test their hypotheses about what works and what doesn’t in learning, and to replace anecdotes with hard evidence to improve pedagogy and learning. Learning has taken a new shape and form with open online courses in all disciplines. These transformative changes in learning have been enabled by advances in information and communication technologies and the ability to store massive amounts of data.

5. Do you think that modern governments and societies are prepared for what changes that big data and data science might bring to the world?

Change is inevitable. Whether modern governments and societies like it or not, they will have to embrace change. Fortunately, smart governments and societies have already embraced data-driven decision-making and evidence-based planning. Governments in developing countries are already using data and analytics to devise effective poverty-reduction strategies. Municipal governments in developed economies are using data and advanced analytics to find solutions to traffic congestion. Research in health and well-being is leveraging big data to discover new medicines and cures for illnesses that challenge us all.

As societies embrace data and analytics as tools to engineer prosperity and well-being, our collective abilities to achieve a better tomorrow will be further enhanced.