Chapter 15 Power

In Section 9.3.2 we displayed four scenarios that cover how our conclusions from hypothesis testing can be correct and incorrect. This table is reproduced here as Table 15.1. The notion of a p-value has also been introduced: recall that we reject the null hypothesis when the p-value falls below the significance level \(\alpha\), and that \(\alpha\) is the probability of rejecting the null hypothesis when the null hypothesis is true. That is, \(\alpha\) is the probability of making a Type 1 error.

What about the probability of making a Type 2 error? We are also interested in this: when the alternative hypothesis is true, we want our hypothesis test to pick this up and lead us to reject the null hypothesis in favour of the alternative. The probability of making a Type 2 error is denoted by the Greek letter \(\beta\) (beta). A hypothesis test that has a high probability of rejecting the null hypothesis when it should be rejected is called a test of high power. We quantify this by defining the power of a hypothesis test as: \[ \begin{aligned} \textrm{Power}&=1-P(\textrm{Type 2 Error})\\ &=1-\beta. \end{aligned} \]

Table 15.1: Four different scenarios for hypothesis tests.

|               | Conclusion: do not reject \(H_0\) | Conclusion: reject \(H_0\) in favour of \(H_A\) |
|---------------|-----------------------------------|-------------------------------------------------|
| \(H_0\) true  | okay \((1-\alpha)\)               | Type 1 error \((\alpha)\)                       |
| \(H_A\) true  | Type 2 error \((\beta)\)          | okay \((1-\beta)\)                              |
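To make these probabilities concrete, here is a minimal simulation sketch for a one-sample \(t\)-test; the sample size of 30, the means of 0 and 0.5, and the standard deviation of 1 are purely illustrative.

# Estimate the Type 1 error rate: generate data under H0 (true mean 0)
# and record how often the test rejects at the 0.05 level
set.seed(1)
p_h0 <- replicate(5000, t.test(rnorm(30, mean = 0, sd = 1), mu = 0)$p.value)
mean(p_h0 < 0.05)    # should be close to alpha = 0.05
# Estimate the Type 2 error rate: generate data under HA (true mean 0.5)
p_ha <- replicate(5000, t.test(rnorm(30, mean = 0.5, sd = 1), mu = 0)$p.value)
mean(p_ha >= 0.05)   # estimate of beta; the power is 1 minus this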

When we say that power is the probability of rejecting \(H_0\) when \(H_A\) is true, that might (hopefully!) lead you to wonder:

What is the distribution of the variable of interest under the alternative hypothesis?

The null hypothesis in our earlier examples specified a single value for the population mean, e.g. \(\mu=7\) in the example in Guided Practice 9.11 on hours of sleep per night. The alternative hypothesis was just that the true average was greater than 7.

\[ \begin{aligned} &\mathbf{H_0:}\ \mu = 7.\\ &\mathbf{H_A:}\ \mu > 7. \end{aligned} \]

The alternative hypothesis doesn’t have a single reference parameter value, so we need to define one. We do this by specifying the size of the shift from the null value that we wish to detect: the effect size.

Rather than the original one-sided alternative of “greater than”, let us change this to “not equal to 7”, which is the more common setting that we consider.

\[ \begin{aligned} &\mathbf{H_0:}\ \mu = 7.\\ &\mathbf{H_A:}\ \mu \neq 7. \end{aligned} \]

Thinking of the sleep study, the alternative hypothesis of \(\mu \neq 7\) would be true if the actual mean was 7.01 or 7.25 hours of sleep. However, if the true average was 7.01 hours of sleep it is much less likely that our test would pick up this difference than if the true average was 7.25 hours.
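We can quantify this using the R function power.t.test, which is covered in detail in the next section. Purely for illustration, suppose the standard deviation of hours slept is 1 hour and we have a sample of 100 people; both numbers are assumptions, not values from the study.

# Hypothetical illustration: sd = 1 hour, n = 100 sleepers
power.t.test(n = 100, delta = 0.25, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")$power
# roughly 0.70: a true mean of 7.25 hours is usually detected
power.t.test(n = 100, delta = 0.01, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")$power
# barely above the 0.05 significance level: a true mean of 7.01 hours
# is almost never detected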

Defining this effect size requires an understanding of the subject matter to which the research relates. For example, in a clinical sense, 7.01 hours of sleep is not meaningfully more than 7 hours, but 7.25 hours might be. This starts to get at the difference between something that is statistically significant and something that is practically significant.

Remark (Statistical versus practical significance). This is another good place to remind ourselves of the difference between statistical significance and scientific/clinical/practical significance. As a thought experiment, consider that if you have a large enough sample size and test

\[ \begin{aligned} &\mathbf{H_0:}\ \mu = 0.\\ &\mathbf{H_A:}\ \mu \neq 0. \end{aligned} \]

you will reject the null hypothesis essentially every time, because very few natural phenomena have a true mean of exactly zero. Even if the true mean is 0.00001, with a large enough sample you will reject the hypothesis that \(\mu = 0\). To put it bluntly, there is growing frustration in the scientific and statistical community that an over-emphasis on the p-value in presenting scientific results has distracted some researchers from the relevance of the magnitude of the effect that they are observing. For the viewpoint of an international professional body on this matter, see https://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf
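A quick simulation sketch makes the point; the true mean of 0.01 and the sample size of one million are arbitrary choices for illustration.

# With a huge sample, even a practically negligible true mean (0.01)
# produces an overwhelming rejection of H0: mu = 0
set.seed(1)
x <- rnorm(1e6, mean = 0.01, sd = 1)
t.test(x, mu = 0)$p.value
# the p-value is vanishingly small, yet an effect of 0.01 may be of
# no practical importance whatsoever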

15.1 Power calculation in R

Example 15.1 (Power calculation to determine sample size for a single-sample t-test) Using data from an aquaculture farm rearing rainbow trout, we will illustrate the power calculation in a \(t\)-test. It is important to the farmer to ascertain whether the mean weight of the trout is 500g, as this is the agreed average delivery weight. The ability to detect a difference depends on several factors:

  1. the size of the difference between 500g and the true mean weight
  2. the size of the sample
  3. the power of the test, i.e. the probability we reject the null hypothesis when the alternative is true
  4. how variable the weight of trout is
  5. the significance level used for the statistical test.

To start with, let’s see how large a sample needs to be if we want to be able to reject the null hypothesis, at the 0.05 significance level, with probability 0.9 when the true mean is 510g. We estimate the variation of the trout weights using an existing sample. The R function power.t.test gives the sample size required.

# read in our data and get it in the correct format
Trout <- c(508, 479, 545, 531, 559, 422, 547, 525, 420, 491, 508, 511, 569, 
               453, 533, 460, 523, 540, 463, 502)
sd(Trout)
#> [1] 43.1
power.t.test(n = NULL,       # want function to solve for n
             delta = 10,       # effect: a 10g difference from the null value of 500g
             sd = sd(Trout),   # estimate std dev from existing data 
             sig.level = 0.05, # the usual significance level
             power = 0.9,      # prob reject null if true mean 510g
             type = "one.sample",
             alternative = "two.sided")
#> 
#>      One-sample t test power calculation 
#> 
#>               n = 197
#>           delta = 10
#>              sd = 43.1
#>       sig.level = 0.05
#>           power = 0.9
#>     alternative = two.sided

It turns out that we would need a sample of 198 trout in order to have a two-sided \(t\)-test with the required power. If the true difference were larger, or the standard deviation were smaller, then a smaller sample would be sufficient. We won’t show this mathematically, but it makes sense intuitively that detecting a difference is easier if the true difference (delta) is greater or if the parameter (the mean) can be estimated more accurately (smaller standard deviation).

To see this for the current example, first increase the effect size so that we are interested in detecting a 20g deviation from the mean under the null hypothesis.

power.t.test(n = NULL,
             delta = 20,       # effect: a 20g difference from the null hypothesis
             sd = sd(Trout),
             sig.level = 0.05,
             power = 0.9,
             type = "one.sample",
             alternative = "two.sided")
#> 
#>      One-sample t test power calculation 
#> 
#>               n = 50.8
#>           delta = 20
#>              sd = 43.1
#>       sig.level = 0.05
#>           power = 0.9
#>     alternative = two.sided

This increase in the effect size from 10g to 20g decreases the required sample size from 198 trout to 51. Now let’s look at the effect of reduced variability on the required sample size. Here we will decrease the standard deviation in the example by 20%.

power.t.test(n = NULL,
             delta = 10,
             sd = sd(Trout)*0.8,  # sd 20% less than previously
             sig.level = 0.05,
             power = 0.9,
             type = "one.sample",
             alternative = "two.sided")
#> 
#>      One-sample t test power calculation 
#> 
#>               n = 127
#>           delta = 10
#>              sd = 34.5
#>       sig.level = 0.05
#>           power = 0.9
#>     alternative = two.sided

The decrease in the standard deviation from 43.12 to 34.49 decreases the required sample size from 198 trout to 127.

This example shows the most common type of power calculation performed as part of experimental design. That is, given what we know (or can estimate) about the variation in the population, and given the magnitude of the effect that we wish to be able to detect, what size sample is required for the experiment. Another way to look at the power calculation is: based on a fixed sample size, what magnitude effect could be detected with given power?

Example 15.2 (Power calculation to determine effect size for one-sample t-test) The current sample consists of 20 measurements of trout weight. For this sample size, and the estimate of variation from it, what size effect will we be able to detect with 90% power and 5% significance level?
power.t.test(n = 20,           # current sample size is 20 
             delta = NULL,     # want to calculate effect we can detect
             sd = sd(Trout),   # estimate std dev from existing data 
             sig.level = 0.05, # the usual significance level
             power = 0.9,      # the required power
             type = "one.sample",
             alternative = "two.sided")
#> 
#>      One-sample t test power calculation 
#> 
#>               n = 20
#>           delta = 33
#>              sd = 43.1
#>       sig.level = 0.05
#>           power = 0.9
#>     alternative = two.sided

With the current sample we would be able to detect a difference in mean weight of about 33g (for example, a true mean of 533g) with the required power and significance level of the two-sided \(t\)-test.
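Note that power.t.test will solve for whichever of its main arguments is left as NULL. As a sketch of a third variant, we can fix both the sample size and the effect size and ask what power the test achieves; the 20g effect here is chosen purely for illustration.

power.t.test(n = 20,             # current sample size
             delta = 20,         # illustrative effect of 20g
             sd = sd(Trout),     # estimate std dev from existing data
             sig.level = 0.05,
             power = NULL,       # want function to solve for power
             type = "one.sample",
             alternative = "two.sided")$power
# roughly 0.5 with these inputs: detecting a 20g effect from a sample
# of 20 trout is close to a coin flip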

Any statistical hypothesis test has an associated power. We will stick to \(t\)-tests in this chapter; however, power calculations can be applied to more complicated hypothesis tests, such as whether the slope parameter in a linear regression is different from zero, or whether a time series of monthly data displays a seasonal effect.
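For tests without a convenient built-in power function, power can be estimated by simulation: generate many datasets from an assumed model, run the test on each, and record the proportion of rejections. The following sketch does this for the regression slope example; the true slope of 0.5, residual standard deviation of 2, and sample size of 100 are entirely hypothetical.

# Simulate datasets from the assumed model, fit the regression, and
# record how often the slope's p-value falls below 0.05
set.seed(42)
reject <- replicate(2000, {
  x <- rnorm(100)                    # hypothetical predictor
  y <- 0.5 * x + rnorm(100, sd = 2)  # true slope 0.5, noisy response
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < 0.05
})
mean(reject)  # the proportion of rejections estimates the power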

The next variant of the \(t\)-test we will examine is a two-sample \(t\)-test. Suppose that we wish to test whether a particular treatment given to the farmed trout, let’s say a new type of fish feed, results in an increased mean weight of the fish. We could test this by randomly assigning fish to two separate groups, one group to be given the usual feed and another group to be given the new type. After a given period of time the fish can be weighed and we can test whether there is a statistically significant increase in mean weight using a two-sample \(t\)-test. One question we need to ask before conducting such an experiment is: how many trout do we need to have in each group to make the experiment worthwhile and cost-effective?

Although we are looking for an increase, we will again use a two-sided test, just in case the new feed has a negative effect on the fish. If we relied on a one-sided test we would not be able to detect an unexpected result such as a negative effect.

As in the single-sample \(t\)-test, we need to specify the magnitude of the effect that we consider to be of practical significance. In the one-sample scenario, this was the difference from the mean under the null hypothesis. In the two-sample scenario, we need to specify the difference between the two groups that would be of practical significance. Let’s say for now that a difference in mean weight of 20g would be of interest to us from a commercial point of view.

Example 15.3 (Power calculation to determine sample size for two-sample \(t\)-test)
power.t.test(n = NULL,         # the bit we want a solution for, so it is NULL
             delta = 20,       # the effect we want to detect
             sd = sd(Trout),   # estimate std dev from existing data 
             sig.level = 0.05, # the usual significance level
             power = 0.9,      # prob reject null if means differ by 20g
             type = "two.sample",
             alternative = "two.sided")
#> 
#>      Two-sample t test power calculation 
#> 
#>               n = 98.6
#>           delta = 20
#>              sd = 43.1
#>       sig.level = 0.05
#>           power = 0.9
#>     alternative = two.sided
#> 
#> NOTE: n is number in *each* group

The sample size required for a two-sided \(t\)-test with the required significance level and power is 99 per group for a difference of 20g. As the helpful R output reminds us, this is the required sample size for each group, so the total sample we need is 198. In practice you might round up a little further, since something always goes a bit wrong, and it is good to make an allowance for that.
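Since power.t.test returns its result as a list, the per-group size can also be extracted and rounded up programmatically; a small sketch:

# Store the result, then round the per-group n up to whole fish
res <- power.t.test(delta = 20, sd = sd(Trout), sig.level = 0.05,
                    power = 0.9, type = "two.sample",
                    alternative = "two.sided")
ceiling(res$n)      # fish needed per group (99)
2 * ceiling(res$n)  # total across both groups (198)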

15.2 Power calculations in experimental design

Power calculations form an important part of experimental design. These calculations are done before any data are collected. Imagine completing a time-consuming and costly experiment, only to examine the data and discover that

  1. given the variability in the data, your sample size would only be able to detect an effect if it was enormous, or
  2. you collected a sample that was much bigger than was actually necessary and wasted a lot of money/time/plants/animals/good-will etc.

Power calculations are therefore typically required as part of the ethics approval process for experiments involving human or animal subjects.

We have seen in the examples above that the power calculations depend on the type of hypothesis that is to be tested with the data. We have chosen simple examples above, but there are corresponding power calculations for more complex hypotheses. For example, these could be testing hypotheses of longitudinal effects in repeated measures data, or testing hypotheses about the parameters in regression models.

If power calculations are smart practice to avoid waste, and power calculations require the hypotheses to be defined, then researchers need to have the analyses and hypothesis tests determined before data are collected. It’s tempting to jump right in and collect data; however, without proper experimental design a lot of painstakingly collected data can be worthless. The need for careful experimental design might seem obvious (hopefully!) but it does still happen that major problems in a study are only picked up during the analysis phase, when it can be too late to rectify them.