Chapter 17 Analysis of Variance

The ANOVA (ANalysis Of VAriance) test is a classical statistical test associated with R. Fisher. The test is suited to situations where we have observations grouped by a categorical variable and we want to understand whether or not there are differences in the group means. The mathematical notation for ANOVA is horrible, but the intuition is relatively easy to understand.

The ANOVA test statistic follows the F-distribution, and large values are evidence against the null hypothesis that the group means are all the same. In practice we can think of the test as dividing the observed variation in the data into:

  1. The variation between the group means and the grand mean (the average across all the groups); and
  2. The variation within each group.

If the variation between the group means and the grand mean is high, and the variation within groups is small, this suggests there are true differences in the group means. On the other hand, if the variation between the group means and the grand mean is small, and the variation within groups is large, this suggests there are no real differences in the group means, i.e. the variation we observe is just sampling variation.

For the ANOVA test, if we observe a p-value of less than 0.05 we say we have sufficient evidence to reject the null hypothesis that all the group means are the same. If we reject the null we then move on to pair-wise t-tests to try and determine which specific group means are different. For the pair-wise t-tests we use an adjustment method to control the Type I error rate. The need to adjust p-values to control the Type I error rate is one of the reasons we need specialist software.

If we observe a p-value that is greater than 0.05 we say we have insufficient evidence to reject the null hypothesis that all the group means are the same. If we do not reject the null hypothesis we do not proceed to pair-wise t-tests; we simply stop our analysis at the ANOVA test, having concluded that there are no significant differences between the group means.

Similar to the case for the t-test, for the ANOVA test there are two possible formulas that we can use: one formula for when the group variances are the same, and one formula for when the group variances are not the same. So, before we conduct an ANOVA test we have to conduct a pre-test to establish whether the group variances are the same or not. The test we use for this is the Bartlett test. If we reject the null for the ANOVA test we then go on to look for the specific differences between groups. This means that when we conduct an ANOVA test we usually have three tests to conduct:

  1. A Bartlett test to check whether the group variances are the same
  2. The actual ANOVA test (with either equal or unequal variance)
  3. Pair-wise t-tests, with an adjustment for multiple comparisons (if we reject the null for the ANOVA test).

We will work through these steps below. As with all previous statistical tests, we will set the alpha level at 0.05.

17.1 ANOVA - One Factor

For this example we will use the base R data set chickwts, which has the weights of chickens fed six different types of feed. The data is in long format. This means we have one continuous variable (weight) and one factor variable (feed) that has six levels. When working with multiple groups it is easiest to work with data in long format.

Remark. This is the same data set used in the Equal Variance Testing in R reference guide. What we are trying to do is help a farmer identify which feed(s) promote the greatest weight gain in chickens.

Let’s start by looking at our data and creating a box plot:

# get structure & summary and create a boxplot of weight by feed type
str(chickwts)
#> 'data.frame':    71 obs. of  2 variables:
#>  $ weight: num  179 160 136 227 217 168 108 124 143 140 ...
#>  $ feed  : Factor w/ 6 levels "casein","horsebean",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(chickwts)
#>      weight           feed   
#>  Min.   :108   casein   :12  
#>  1st Qu.:204   horsebean:10  
#>  Median :258   linseed  :12  
#>  Mean   :261   meatmeal :11  
#>  3rd Qu.:324   soybean  :14  
#>  Max.   :423   sunflower:12
with(chickwts, boxplot(weight ~ feed,
     col = "lightgray",
     main = "",
     xlab = "Feed type", ylab = "Weight (g)", ylim = c(100, 450), las = 1))

Figure 17.1: Chicken weight distributions for six different feed types

Based on the boxplot there look to be clear differences between at least some of the means, but it is not easy to see what the actual values are. Let’s use a new function, tapply(), to calculate the mean and standard deviation for each of our groups. Here we apply the functions mean() and sd() to the vector weight, grouped by feed. The tapply() function works when we have data in long format, and it generates results similar to a pivot table in MS Excel. Personally, I prefer to use MS Excel to work through data management issues, but you can do similar things with R.

with(chickwts, tapply(weight, feed, mean))
#>    casein horsebean   linseed  meatmeal   soybean sunflower 
#>       324       160       219       277       246       329
with(chickwts, tapply(weight, feed, sd))
#>    casein horsebean   linseed  meatmeal   soybean sunflower 
#>      64.4      38.6      52.2      64.9      54.1      48.8

As we saw in our boxplot, there is quite a difference between some of our group means, particularly horsebean on the lower end, and sunflower and casein on the higher end. What about our standard deviations? Remember, the standard deviation is the square root of the variance, so the group standard deviations give us an idea of whether the group variances are equal or not. The units of measurement for the standard deviation are just a bit easier to interpret than the units of measurement for the variance. Look carefully at the reported standard deviation values and also at the boxplot. Can you see the link between the visual display of the data and the group standard deviation values?
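If you want to see the group variances directly, the same tapply() pattern works; each value is just the square of the corresponding standard deviation (output omitted):

with(chickwts, tapply(weight, feed, var))  # group variances (sd squared)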

Now let’s formally test for equal variance across the groups. Because we have multiple groups the test we use is the Bartlett Test of Homogeneity of Variances. The Null hypothesis for the test is that the group variances are all equal; the Alternate hypothesis is that the group variances are not equal. The implementation of the test follows the standard format we use for most tests. First we use the with() function to direct R to the dataset we want to use. Next we use the bartlett.test() function to tell R what function to apply. Finally we specify the column names that contain the data we are working with. Although this explanation is quite brief, there is a separate reference guide on the Bartlett test that explains the steps in greater detail.

with(chickwts, bartlett.test(weight ~ feed))  # long data format, use ~ not ,
#> 
#>  Bartlett test of homogeneity of variances
#> 
#> data:  weight by feed
#> Bartlett's K-squared = 3.2597, df = 5, p-value = 0.66

As the test p-value (0.66) is greater than 0.05, we do not reject the null hypothesis that the group variances are all equal: we have insufficient evidence to reject the null.

As such we set the parameter var.equal to TRUE in our ANOVA test of whether the groups means are the same.

Note: If we reject the null for the Bartlett test (i.e. we have a test p-value that is less than 0.05), then when we conduct the ANOVA test we set the parameter var.equal to FALSE.
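For reference, the unequal-variance version of the call (using the oneway.test() function introduced below) simply flips this parameter. We do not need it here, as we did not reject the Bartlett null:

with(chickwts, oneway.test(weight ~ feed, var.equal = FALSE))  # Welch-type ANOVA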

To implement the ANOVA test we use the function oneway.test(), and so for our formal test we have:

  • Step 1: Set the Null and Alternate Hypotheses

    • Null hypothesis: The group means are all equal
    • Alternate hypothesis: At least one group mean is different from the other groups
  • Step 2: Implement Analysis of Variance Test

    with(chickwts, oneway.test(weight ~ feed, var.equal = TRUE))
    #> 
    #>    One-way analysis of means
    #> 
    #> data:  weight and feed
    #> F = 15.365, num df = 5, denom df = 65, p-value = 5.936e-10
  • Step 3: Interpret the Results

From the output we see the p-value = 5.936e-10, which we report as \(<0.001\). As the test p-value is smaller than 0.05 we:
Reject the null hypothesis that all the means are equal.

In our output we can also see the F-statistic (15.365). This is the statistic that our p-value is based on. For large values of the F-statistic we reject the null.

So what now? If you were a farmer wanting to know which feed to give your chickens, which feed would you choose? You could decide from your boxplot, likely choosing casein or sunflower, but the figure doesn’t tell you if there is a feed which is statistically the highest, and the ANOVA test only tells you that the feeds are not ALL the same. This is where pair-wise comparisons can be helpful.

17.1.1 Pair-wise Comparisons

Let’s come back to our research question: which feed(s) should the farmer consider, based on the weights of chickens eating that feed? Here we will use the function pairwise.t.test() to compare all our individual group means. With this function the parameter pool.sd is the equivalent of var.equal for t.test(). We use the Bartlett test result to decide whether we set this parameter to TRUE or FALSE.

If we do not reject the null for the Bartlett test we set the parameter to TRUE. If we reject the null for the Bartlett test we set the parameter to FALSE. So, for this example we will set it to TRUE.

The pairwise.t.test() function does two things that make our life easy. First, it calculates all the different possible pair-wise combinations for us at once. Rather than having to type the t.test() command 15 times (one for each pair-wise combination) we just need to write one command. The second thing that the pairwise.t.test() command does for us is to automatically adjust the reported p-values to control the Type I error rate. The automatic adjustment makes life easy for us as we can still apply our standard p-value decision rule to the results, where we know that the values have been adjusted to address the multiple comparison issue. This is something that only specialist software such as R is able to do, and it really saves a lot of time compared to the alternative of working through the adjustments manually.

It is possible to use a variety of different adjustments for multiple comparisons by changing the parameter p.adjust.method: for example, providing “none” for no adjustment (lower p-values), or “bonferroni” (which may be abbreviated to “bon”) for a more restrictive adjustment (higher p-values). The default adjustment used by R is the holm adjustment, and this is a good choice, so we will not be changing this parameter value. Conducting all the required comparisons is then quite easy, as the example call below and our actual analysis show.
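For illustration only, a Bonferroni-adjusted call would look like the following (output not shown; we keep the holm default for our actual analysis):

with(chickwts, pairwise.t.test(weight, feed, pool.sd = TRUE,
                               p.adjust.method = "bonferroni"))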

with(chickwts, pairwise.t.test(weight, feed, pool.sd = TRUE))
#> 
#>  Pairwise comparisons using t tests with pooled SD 
#> 
#> data:  weight and feed 
#> 
#>           casein horsebean linseed meatmeal soybean
#> horsebean 3e-08  -         -       -        -      
#> linseed   2e-04  0.094     -       -        -      
#> meatmeal  0.182  9e-05     0.094   -        -      
#> soybean   0.005  0.003     0.518   0.518    -      
#> sunflower 0.812  1e-08     8e-05   0.132    0.003  
#> 
#> P value adjustment method: holm

As you probably suspected based on the boxplot, we have some significant differences between groups (e.g. horsebean and casein) and some non-significant differences (e.g. meatmeal and linseed). The feeds with the two highest means, casein and sunflower, are not significantly different. We also see that the feed with the third highest mean, meatmeal, is not significantly different from either casein or sunflower. So, it looks like the farmer has some options in terms of the feed they select.

To make a final decision the farmer might now look to other factors, such as cost or availability, to decide what feed to provide her chickens. It is this final step that is important. The statistical test is just one part of the process of working out what to do. You then typically need to look at other information to work out what you actually need to do.

17.1.2 Reporting Test Results

How you present numerical results in a table is just as important as how you present figures. You must include information in a clear and concise way, without causing distraction with unnecessary information. Table 17.1 presents the results of our pair-wise comparisons. This is a good format for you to follow.

Table 17.1: Pair-wise t-test results (Holm adjusted p-values): effect of feed on chicken weights

          Casein  Horsebean  Linseed  Meatmeal  Soybean
Horsebean <0.001  -          -        -         -
Linseed   <0.001  0.094      -        -         -
Meatmeal  0.182   <0.001     0.094    -         -
Soybean   0.005   0.003      0.518    0.518     -
Sunflower 0.812   <0.001     <0.001   0.132     0.003

17.2 The notation of the ANOVA model

The ANOVA test can be motivated by saying that when looking at grouped data the observed variation can be separated into two parts:

  1. The variation due to differences in the group means; and
  2. The variation of individual observations around each individual group mean.

The variation between group means is called the between group variation. The variation of individual observations around each individual group mean is called the within group variation.

The ANOVA test can be summarised with two propositions:

  1. If the variation due to differences in the group means is relatively large, while the variation of observations around each group mean is relatively small, it is likely there are true differences in the group means. For this scenario the test will generate a small p-value.
  2. If the variation due to differences in the group means is relatively small, while the variation of observations around the group mean is relatively large, it is unlikely there are true differences in the group means. For this scenario the test will generate a relatively large p-value.

Figure 17.2: A comparison of where differences are likely and unlikely

If you understand the interpretation of the ANOVA test presented above, you understand the fundamental element of the ANOVA test. Unfortunately, the formal notation used to set out the ANOVA model is relatively intimidating. The complexity of the notation is one of the reasons a great many people refuse to come to terms with ANOVA. For the vast majority of people formal notation serves to obscure rather than highlight what is going on, and the ANOVA model is especially bad in this respect. The type of notation used for ANOVA analysis will appear again and again throughout your studies. Because of this, a relatively formal introduction to ANOVA notation is presented below.

17.2.1 Introduction to notation

Consider the case where we have three groups. In group 1 we have 4 observations, in group 2 we have 5 observations, and in group 3 we have 4 observations. That means we have something like the following:

For group 1 we have four observations: \(X_{1},X_{2},X_{3},\text{ and }X_{4}\), which we summarise as \(X_{j}\), \(j=1,\ldots,4\). For group 2 we have five observations: \(X_{1},X_{2},X_{3},X_{4},\text{ and }X_{5}\), which we summarise as \(X_{j}\), \(j=1,\ldots,5\). For group 3 we have four observations: \(X_{1},X_{2},X_{3},\text{ and }X_{4}\), which we summarise as \(X_{j}\), \(j=1,\ldots,4\). What you can see is that when we have groups of observations things get confusing if we use a single subscript to index observations.

In the above expressions \(j\) was used to keep track of the number of observations in each group. To keep track of which group an individual observation belongs to we need to introduce a second subscript. This looks complex, but if we take things slowly it makes sense. When we have groups we use the notation \(X_{ij}\) to keep track of all the observations. The \(i\) tells us which group the observation belongs to and the \(j\) keeps track of the individual observations within each group.

So for the specific case we have:

  1. For group 1 we have: \(X_{11}\), \(X_{12}\), \(X_{13}\), and \(X_{14}\);
  2. For group 2 we have: \(X_{21}\), \(X_{22}\), \(X_{23}\), \(X_{24}\), and \(X_{25}\);
  3. For group 3 we have: \(X_{31}\), \(X_{32}\), \(X_{33}\), and \(X_{34}\).

The number of observations in each group can be different. In our example there are 4 observations in group 1 and group 3, but 5 observations in group 2. The indexation for \(j\) (the number of observations within each group) must reflect this, and so we write \(j=1,\ldots,n_{i}\). This says that the number of observations in group \(i\) runs from 1 through to the maximum number for group \(i\), and so reflects the fact that the number of observations in each group can vary. In our example we have \(n_{1}=4\), \(n_{2}=5\), and \(n_{3}=4\).

We can denote the individual group means as \(\overline{X}_{1}\), \(\overline{X}_{2}\), and \(\overline{X}_{3}\), and the average across all observations as \(\overline{X}_{G}\), which we call the grand mean. We know how to calculate these values. Formally, we write the group mean for group \(i\) as: \[\overline{X}_{i}=\frac{1}{n_{i}}\sum_{j}X_{ij}.\] This expression says, to find the mean for any given group, take all the observations for that group, add them up, and divide through by the total number of observations for that group. The expression \(\sum_{j}\) is read as “sum over \(j\)”. So, if we take group 1 to illustrate we would have:

\[ \begin{aligned} \overline{X}_{1}&=\frac{1}{n_{1}}\sum_{j}X_{1j}\nonumber\\ &=\frac{1}{4}\left(X_{11}+X_{12}+X_{13}+X_{14}\right)\nonumber\\ &=\textrm{average for group 1.}\nonumber \end{aligned} \]

To find the overall grand mean for the data we know that we simply have to add up all the observations and divide through by the total number of observations. While in practice this is a simple operation in Excel or R, the formal notation is a little involved. First we have to implement a rule that will give us the total number of observations, which we will denote with \(N\). The formal notation we use to denote the total number of observations is as follows:

\[ \begin{aligned} N&=\sum_{i}n_{i}\nonumber\\ &=n_{1}+n_{2}+n_{3}\nonumber\\ &=4+5+4\nonumber\\ &=13.\nonumber \end{aligned} \]

Now, we need a way to say add up in a sequence all the observations in group 1, group 2, and group 3. If we had a single group we could use the expression \(\sum_{j}X_{j}\), which says add up all the observations indexed by \(j\). However, we need to say add up all the observations in group 1, group 2, and group 3. To see how this works let us start by writing out in full what we want to do. We want to write something that will say add up all 13 observations:

\(X_{11}+X_{12}+X_{13}+X_{14}+X_{21}+X_{22}+X_{23}+X_{24}+X_{25}+X_{31}+X_{32}+X_{33}+X_{34}\)

We know how to say add up all the values within a group. For group 1 we write this as \(\sum_{j}X_{1j}\), for group 2 we write \(\sum_{j}X_{2j}\), and for group 3 we write \(\sum_{j}X_{3j}\). Now we want to add all these values together so we have something that looks like:
\[\sum_{j}X_{1j} + \sum_{j}X_{2j} + \sum_{j}X_{3j}.\] What we need is a second summation sign that says add up across the groups. Formally, if we want to say add up all the observations from all groups, we write \(\sum_{i}\sum_{j}X_{ij}\), where the groups are indexed by \(i\) and the individual observations within a group are indexed by \(j\). For the three groups we have considered the term \(\sum_{i}\sum_{j}X_{ij}\) translates as follows: \[\sum_{i}\sum_{j}X_{ij}=X_{11}+X_{12}+X_{13}+X_{14}+X_{21}+X_{22}+X_{23}+X_{24}+X_{25}+X_{31}+X_{32}+X_{33}+X_{34}.\]

To find the average of the observations we need to divide through by the total number of observations, which is \(N=\sum_{i}n_{i}\). So, formally we denote the grand mean as: \[ \begin{aligned} \overline{X}_{G}&=\frac{1}{N}\sum_{i}\sum_{j}X_{ij}\nonumber\\ &=\frac{1}{13}\left(X_{11}+X_{12}+X_{13}+X_{14}+X_{21}+X_{22}+X_{23}+X_{24}+X_{25}+X_{31}+X_{32}+X_{33}+X_{34}\right).\nonumber \end{aligned} \]

In terms of notation this is as complicated as things get, and we now have all the elements we need to consider a formal ANOVA model.
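To make the notation concrete, here is a minimal R sketch. The numbers are made up; only the group sizes of 4, 5, and 4 match the example above:

# Hypothetical data for the three-group example: n_1 = 4, n_2 = 5, n_3 = 4
x <- list(group1 = c(2, 4, 6, 8),
          group2 = c(1, 3, 5, 7, 9),
          group3 = c(3, 6, 9, 12))
N <- sum(lengths(x))                    # N = n_1 + n_2 + n_3 = 13
grand_mean  <- sum(sapply(x, sum)) / N  # (1/N) * sum_i sum_j X_ij
group_means <- sapply(x, mean)          # the group means X-bar_i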

17.2.2 The ANOVA model

The proposition presented regarding the ANOVA test is that if the variation due to differences in the group means is relatively large, while the variation of observations around each group mean is relatively small, it is likely there are true differences in the group means. The contrasting scenario is one where the variation due to differences in the group means is relatively small, while the variation of observations around each group mean is relatively large; in that case it is unlikely there are true differences in the group means.

The ANOVA test statistic is: \[\frac{\textrm{standardised measure of between group variation}}{\textrm{standardised measure of within group variation}}.\] This test statistic follows the \(F\)-distribution, and large values for \(F\) (p-value \(<\) 0.05) lead us to reject the null hypothesis of no differences between the groups. When adding up the differences between individual observations and the group mean, or differences between the group mean and the grand mean, positive and negative values will cancel out. To ensure that we have a positive value for the variation we therefore square each difference calculation.

17.2.3 Between group variation

We find the between group variation for each group as the squared difference between the group mean and the grand mean, multiplied by the number of observations in the group. Adding these terms across the groups gives the total between group variation, which can be written as:

\(n_{1}\left(\overline{X}_{1}-\overline{X}_{G}\right)^{2}+n_{2}\left(\overline{X}_{2}-\overline{X}_{G}\right)^{2}+n_{3}\left(\overline{X}_{3}-\overline{X}_{G}\right)^{2}\).

In our example we have 4 observations in group 1, 5 observations in group 2, and 4 observations in group 3, so we have \(n_{1}=4\), \(n_{2}=5\), \(n_{3}=4\) or

\(4\times\left(\overline{X}_{1}-\overline{X}_{G}\right)^{2}+5\times\left(\overline{X}_{2}-\overline{X}_{G}\right)^{2}+4\times\left(\overline{X}_{3}-\overline{X}_{G}\right)^{2}\).

We can however also use the summation operator to write this in a formal way: \(\sum_{i}n_{i}\left(\overline{X}_{i}-\overline{X}_{G}\right)^{2}\) is the total between group variation, and we call this measure the between group sum of squares. Note that the indexation subscript uses \(i\) as we are comparing group means to the grand mean. The standardisation we use is to divide through by the degrees of freedom. In a statistics context the idea of degrees of freedom works as follows. When we consider the sum of squares calculations there is a constraint which means that if we have all but the last piece of information we are able to retrieve the final missing value. The final value is not free to take any value but rather is constrained to the value required by the formula we are using.

For our between group variation, if we know any \(k-1\) of the group terms, where \(k\) is the number of groups, we can retrieve the final value. The standardisation we use for the between group sum of squares is therefore \(k-1\). So, to summarise we have:

\[\sum_{i}n_{i}\left(\overline{X}_{i}-\overline{X}_{G}\right)^{2} = \textrm{the between group sum of squares.}\]

We then have as our standardised measure of the between group variation: \[ \begin{aligned} \frac{\sum_{i}n_{i}\left(\overline{X}_{i}-\overline{X}_{G}\right)^{2}}{k-1} &=& \textrm{the standardised between group sum of squares}\nonumber\\ &=& \textrm{the between group mean squares.}\nonumber \end{aligned} \] Technically this value is a variance estimate, and it follows a chi-squared distribution.
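Continuing the toy sketch from the previous subsection (with x, grand_mean, and group_means as defined there), the between group sum of squares and mean squares can be computed as:

n_i        <- lengths(x)                               # group sizes n_i
k          <- length(x)                                # number of groups, k = 3
between_ss <- sum(n_i * (group_means - grand_mean)^2)  # between group SS
between_ms <- between_ss / (k - 1)                     # between group MS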

17.2.4 Within group variation

For the within group variation we have the same problem with positive and negative values that would cancel out, so again we square each deviation around the individual group means. The variation we want to add up is the variation of the individual observations around their respective group means. Specifically, the values that we want to add up are the following values:

\[ \begin{aligned} \textrm{Sum of squared deviations}&&\nonumber\\ \textrm{from group means}&=&\left(X_{11}-\overline{X}_{1}\right)^{2}+\cdots+\left(X_{14}-\overline{X}_{1}\right)^{2}+\nonumber\\ &&\left(X_{21}-\overline{X}_{2}\right)^{2}+\cdots+\left(X_{25}-\overline{X}_{2}\right)^{2}+\nonumber\\ &&\left(X_{31}-\overline{X}_{3}\right)^{2}+\cdots+\left(X_{34}-\overline{X}_{3}\right)^{2}\nonumber \end{aligned} \]

We know that we can write the sum of the squared deviations from the mean for the groups individually as:

  • within group variation for Group 1 = \(\sum_{j}\left(X_{1j}-\overline{X}_{1}\right)^{2}\)
  • within group variation for Group 2 = \(\sum_{j}\left(X_{2j}-\overline{X}_{2}\right)^{2}\)
  • within group variation for Group 3 = \(\sum_{j}\left(X_{3j}-\overline{X}_{3}\right)^{2}\)

This means that we can, in the first instance, pull together the information that describes the within group variation as: \[ \begin{aligned} \sum_{i}\sum_{j}\left(X_{ij}-\overline{X}_{i}\right)^{2}&=& \left(X_{11}-\overline{X}_{1}\right)^{2}+\cdots+\left(X_{14}-\overline{X}_{1}\right)^{2}+\nonumber\\ &&\left(X_{21}-\overline{X}_{2}\right)^{2}+\cdots+\left(X_{25}-\overline{X}_{2}\right)^{2}+\nonumber\\ &&\left(X_{31}-\overline{X}_{3}\right)^{2}+\cdots+\left(X_{34}-\overline{X}_{3}\right)^{2}\nonumber \end{aligned} \]

We call this the within group sum of squares. Within each group there is one value that is not free to vary, and so the degrees of freedom is equal to \(N-k\): the total number of observations minus the number of groups. In the example we have been working with there are 13 observations and 3 groups, so we have 10 degrees of freedom. We then have as our standardised measure of the within group variation:
\[ \begin{aligned} \frac{\sum_{i}\sum_{j}\left(X_{ij}-\overline{X}_{i}\right)^{2}}{N-k} &=& \textrm{the standardised within group sum of squares}\nonumber \\ &=& \textrm{within group mean squares.}\nonumber \end{aligned} \] Again, technically this value is a variance estimate which follows a chi-squared distribution.
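Continuing the same toy sketch (x, N, and k as defined above), the within group counterpart is:

# squared deviations of each observation from its own group mean
within_ss <- sum(sapply(x, function(g) sum((g - mean(g))^2)))
within_ms <- within_ss / (N - k)  # within group MS, df = N - k = 10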

The ANOVA test is then conducted as: \[F=\frac{\textrm{Between Group MS}}{\textrm{Within Group MS}},\] and we reject the null hypothesis of no significant difference in the group means if this value is large, where large is defined by the \(F\)-distribution (we use the fact that the \(F\)-statistic, as the ratio of two chi-squared variables, follows the \(F\)-distribution).

In practice we need to know how to implement the test, and we can implement the test in R using the oneway.test() function.
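As a check on our understanding, here is a minimal sketch that rebuilds the F-statistic by hand for the chickwts data using the formulas above, and which should reproduce the oneway.test() results reported in Section 17.1:

k <- nlevels(chickwts$feed)                  # number of groups (6)
N <- nrow(chickwts)                          # total number of observations (71)
grand_mean  <- mean(chickwts$weight)
group_means <- with(chickwts, tapply(weight, feed, mean))
group_n     <- with(chickwts, tapply(weight, feed, length))

between_ms <- sum(group_n * (group_means - grand_mean)^2) / (k - 1)
within_ms  <- with(chickwts, sum((weight - group_means[feed])^2)) / (N - k)

F_stat <- between_ms / within_ms             # should match F = 15.365
pf(F_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)  # p-value = 5.936e-10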