Two Way ANOVA in R

Introduction

The more advanced methods in statistics have generally been developed to answer real-world questions, and ANOVA is no different.

  • How would you decide which route on your daily commute from home to work is easier?
  • How would you know which air-conditioner to choose from a set you’re evaluating across various climates?
  • If you were dealing with several suppliers and wanted to compare their process results all at once, how would you do it?
  • If you had three competing designs for a system or an algorithm, and wanted to understand whether one of them was significantly better than the others, how would you do that statistically?

ANOVA answers these kinds of questions – it helps us discover whether we have clear reasons to choose a particular alternative over many others, or determine whether there is exceptional performance (good or bad) that is dependent on a factor.

We discussed linear models earlier, and ANOVA is indeed a kind of linear model. The difference is that in ANOVA the predictors are discrete factors whose effect on a continuous response you want to understand.

The ANOVA Hypothesis

The ANOVA hypothesis is usually explained as a comparison of population means, estimated from the sample means. What we’re trying to understand here is the effect of a change in the level of one factor on the response. The term “Analysis of Variance” is therefore a misnomer to many new to inferential statistics, since the test actually compares means.

A simplified version of the One-Way ANOVA hypothesis for three samples of data (the effect of a factor with three possible values, or levels) is below:

H_0 : \mu_1 = \mu_2 = \mu_3

while H_a could be any of:

H_a : \mu_1 \neq \mu_2 = \mu_3, or
H_a : \mu_1 = \mu_2 \neq \mu_3, or
H_a : \mu_2 \neq \mu_1 = \mu_3, or all three means differing.

In general, H_a is simply that at least one mean differs from the others.

It is possible to understand the Two-Way ANOVA problem, therefore, as a study of the impact of two different factors (and their associated levels) on the response.
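In R’s model-formula notation, the one-way and two-way hypotheses differ only in the right-hand side of the formula. A minimal sketch on simulated data (the `Day` and `Route` names here are hypothetical stand-ins mirroring the example that follows):

```r
set.seed(1)
# Hypothetical stand-in for travel data: 3 days x 2 routes, 10 trips per cell
df <- expand.grid(Day = factor(c("Mon", "Tue", "Wed")),
                  Route = factor(c("A", "B")),
                  rep = 1:10)
df$Time <- 50 + rnorm(nrow(df), sd = 5)

# One-way ANOVA: does the Day factor alone shift mean travel time?
one_way <- aov(Time ~ Day, data = df)

# Two-way ANOVA: Day, Route, and their interaction together
two_way <- aov(Time ~ Day * Route, data = df)

summary(one_way)
summary(two_way)
```

The `Day * Route` shorthand expands to both main effects plus the `Day:Route` interaction term.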

Travel Time Problem

Let’s look at a simple data set which has travel time data organized by day of the week and route. Assume you’ve been monitoring data from many people travelling a certain route, between two points, and you’re trying to understand whether the time taken for the trip is more dependent on the day of the week, or on the route taken. A Two-Way ANOVA is a great way to solve this kind of a problem.

The first few rows of our dataset

We see above how the data for this problem is organized. We’re essentially constructing a linear model that explains the relationship between the “response” or “output” variable Time, and the factors Day and Route.

Two Way ANOVA in R

ANOVA is a hypothesis test that requires the continuous response, within each factor level, to be normally distributed. Additionally, ANOVA results are contingent on an equal-variance assumption across the samples being compared. I’ve demonstrated in an earlier post how normality and variance tests can be run prior to a hypothesis test on variable data.

The code below first pulls the data set into R, runs the normality and variance tests, and then constructs a linear ANOVA model. For normality testing we use the Shapiro-Wilk test, and for variance testing the bartlett.test() command, which compares multiple variances at once.

#Reading the dataset
Dataset<-read.csv("Traveldata.csv")
str(Dataset)

#Shapiro-Wilk normality tests by Day
cat("Normality p-values by Factor Day: ")
for (i in unique(factor(Dataset$Day))){
  cat(shapiro.test(Dataset[Dataset$Day==i, ]$Time)$p.value," ")
}
cat("Normality p-values by Factor Route: ")

#Shapiro-Wilk normality tests by Route
for (i in unique(factor(Dataset$Route))){
  cat(shapiro.test(Dataset[Dataset$Route==i, ]$Time)$p.value," ")
}

#Variance tests for Day and Route factors
bartlett.test(Time~Day,data = Dataset )
bartlett.test(Time~Route,data = Dataset )

#Creating a linear model
#Note: Day*Route already expands to the main effects plus the interaction,
#so Time ~ Day * Route would fit the same model
l <- lm(Time ~ Day + Route + Day*Route, data = Dataset)
summary(l)

#Running a general ANOVA on the linear model
#(printing la displays the ANOVA table; summary() would only summarize its columns)
la <- anova(l)
la

#Diagnostic plots of the linear model, and Cook's Distance
plot(cooks.distance(l),
     main = "Cook's Distance for linear model",
     xlab = "Travel Time (observations)", ylab = "Cook's Distance")
plot(l)

Results for the Bartlett test are below:

> bartlett.test(Time~Day,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Day
Bartlett's K-squared = 3.2082, df = 4, p-value = 0.5236

> bartlett.test(Time~Route,data = Dataset )

	Bartlett test of homogeneity of variances

data:  Time by Route
Bartlett's K-squared = 0.8399, df = 2, p-value = 0.6571

The code also calculates Cook’s distance, an important diagnostic for linear models. When trying to understand anomalous observations, we can refer to their Cook’s distance to see whether they have high leverage in the model or not. Removing a point with high leverage can noticeably change the model results, so equally, if your model isn’t performing well, it is worth looking at Cook’s distance.
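A common (though not universal) rule of thumb flags observations whose Cook’s distance exceeds 4/n. A sketch on simulated data, since the original CSV isn’t reproduced here:

```r
set.seed(42)
# Simulated stand-in for the travel-time data
n <- 100
x <- rnorm(n)
y <- 5 + 2 * x + rnorm(n)
y[7] <- y[7] + 20            # plant one gross outlier at observation 7
fit <- lm(y ~ x)

d <- cooks.distance(fit)
flagged <- which(d > 4 / n)  # rule-of-thumb cutoff
flagged
```

The planted outlier shows up in `flagged`; on real data, each flagged point deserves a look before deciding whether to keep, correct, or remove it.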

Cook's Distance for our data set, visualized
Cook's distance, explained by its importance to leverage in the model.

Looking at the plots produced for the linear model, we can see how various points have different values of Cook’s distance and different relative leverages. This is also illustrated in the Normal Quantile-Quantile plot below, where observations #413 and #415, among others, show large values.

Normal QQ plot of data set showing high leverage points (large Cook's Distance)

ANOVA Results

A summary of the lm command’s result is shown below.


Call:
lm(formula = Time ~ Day + Route + Day * Route, data = Dataset)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.333  -4.646   0.516   4.963  19.655 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   54.34483    1.39067  39.078   <2e-16 ***
DayMon        -3.34483    1.95025  -1.715   0.0870 .  
DayThu         2.69221    2.00280   1.344   0.1795    
DayTue        -0.43574    1.90618  -0.229   0.8193    
DayWed        -0.01149    2.00280  -0.006   0.9954    
RouteB        -1.02130    1.89302  -0.540   0.5898    
RouteC        -1.83131    1.85736  -0.986   0.3246    
DayMon:RouteB  2.91785    2.71791   1.074   0.2836    
DayThu:RouteB  0.39335    2.63352   0.149   0.8813    
DayTue:RouteB  3.44554    2.64247   1.304   0.1929    
DayWed:RouteB  1.23796    2.65761   0.466   0.6416    
DayMon:RouteC  5.27034    2.58597   2.038   0.0421 *  
DayThu:RouteC  0.24255    2.73148   0.089   0.9293    
DayTue:RouteC  4.48105    2.60747   1.719   0.0863 .  
DayWed:RouteC  1.95253    2.68823   0.726   0.4680    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 7.489 on 485 degrees of freedom
Multiple R-squared:  0.04438,	Adjusted R-squared:  0.01679 
F-statistic: 1.609 on 14 and 485 DF,  p-value: 0.07291

While the model summary above indicates the effect of each factor level on the response, it doesn’t compute per-factor F-statistics, which take into consideration the “within” and “between” variation in the samples of data. The ANOVA sum-of-squares and mean-squares approach does exactly this, which is why its results are more relevant here. The summary below is the ANOVA model itself, in the ANOVA table:

Analysis of Variance Table

Response: Time
           Df  Sum Sq Mean Sq F value  Pr(>F)   
Day         4   823.3 205.830  3.6700 0.00588 **
Route       2    46.0  23.005  0.4102 0.66376   
Day:Route   8   393.9  49.237  0.8779 0.53492   
Residuals 485 27201.3  56.085                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
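Each F value in the table is simply that factor’s mean square divided by the residual mean square, and its p-value follows from the F distribution. A quick check using the Day row’s numbers from the table above:

```r
# Day row: MS = 205.830 on 4 df; Residuals: MS = 56.085 on 485 df
ms_day   <- 205.830
ms_resid <- 56.085

f_day <- ms_day / ms_resid
p_day <- pf(f_day, df1 = 4, df2 = 485, lower.tail = FALSE)

round(f_day, 2)  # matches the F value reported for Day
round(p_day, 5)  # matches the Pr(>F) reported for Day
```

This is the same calculation anova() performs for every row of the table.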

Looking at the results above, it is clear that the Day factor has a statistically significant impact on travel time. (On closer observation, you’ll see that the mean that differed from the others was the mean for Monday! This is clearly an example inspired by Monday rush-hour traffic!)
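To pin down which day’s mean differs, a post-hoc comparison such as Tukey’s HSD on the Day factor is the usual next step. A sketch, illustrated with simulated data since the CSV isn’t bundled with the post (the 6-minute Monday shift is an assumption for demonstration):

```r
set.seed(7)
# Simulated stand-in: travel on one day ("Mon") shifted by ~6 minutes
days <- factor(rep(c("Fri", "Mon", "Thu", "Tue", "Wed"), each = 30))
time <- 54 + ifelse(days == "Mon", 6, 0) + rnorm(150, sd = 4)
fit  <- aov(time ~ days)

# All pairwise day-vs-day comparisons, with family-wise error control
TukeyHSD(fit)
```

On the real data set, the same call on an aov() fit of Time ~ Day would show which day-to-day differences survive the multiple-comparison adjustment.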

When reading results from the lm and anova commands, it is important to note that R flags results using significance codes. A small p-value indicates a significant result, and the codes rank how significant: for instance, two asterisks (**) mark a p-value below 0.01, and three asterisks (***) a p-value below 0.001. Depending on the nature of your experiment, you can choose your significance level and compare these p-values against it. In this specific case, the Route factor has an insignificant impact on the response.

Interactions

When a model has two or more factors as inputs to a response variable, we also need to evaluate whether changing both together produces a different effect on the response than fixing one and changing the other. This is referred to as an interaction, and interaction effects (the Day:Route terms) are taken into account in our model. Once again, the p-values for the interactions inform us about their relative importance.
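R’s interaction.plot() gives a quick visual check: roughly parallel lines suggest no interaction, while crossing or diverging lines suggest one. A sketch on simulated data, since the original CSV isn’t included (the Monday-only Route B slowdown is an assumption planted for illustration):

```r
set.seed(11)
day   <- factor(rep(c("Mon", "Tue", "Wed"), times = 40))
route <- factor(rep(c("A", "B"), each = 60))
# Route B is slower, but only on Monday: a genuine interaction
time <- 50 + ifelse(route == "B" & day == "Mon", 6, 0) + rnorm(120, sd = 3)

# Lines for the two routes will diverge at Monday
interaction.plot(x.factor = day, trace.factor = route, response = time,
                 xlab = "Day", ylab = "Mean travel time")

# The day:route interaction row should be significant here
anova(lm(time ~ day * route))
```

In the travel-time data set itself, none of the Day:Route interactions reached significance, so there the lines would look roughly parallel.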

Concluding Remarks

We’ve seen in this example how the Analysis of Variance (ANOVA) approach can be used to compare the impact of two factors on a response variable. Cook’s distance and its importance were also explained. It is important to make each data point in our data set count, and at the same time to evaluate how well the model answers the question we care about. In this context, understanding the significance of our results, both statistically and practically, is necessary. Fortunately, the linear modeling packages in R make such modeling convenient and provide plenty of added functionality, such as producing diagnostic plots by simply calling plot on a saved model. This really comes into its own when you build and test many different models and want to compare results.
