When testing multiple hypotheses at the same time, we must be careful when analyzing the results. A naive analysis of the data will greatly increase our risk of making a type 1 error (i.e., reaching a false positive conclusion on one of the hypotheses). This issue is known as the Multiple Comparisons Problem (MCP) and it is important in many areas including clinical drug trials. Because it is so important, the MCP has been studied in depth, and in this post we’ll look at one method that can be used to correct for this problem.
We’ll start with a brief sketch of the problem. Suppose you are testing \(m\) hypotheses and you test each hypothesis using the same significance level \(\alpha\) (e.g., \(\alpha = 0.05\) for the usual 5% significance level). If you were only testing one hypothesis, then this would guarantee that your probability of rejecting your null hypothesis when it is true is bounded above by \(\alpha\). However, this is no longer the case when testing multiple hypotheses. In the simplest case, when all of your null hypotheses are true and they are all independent of each other (a strong assumption that we will drop later), your probability of rejecting at least one true null hypothesis is now \(1 - (1-\alpha)^m\), and this can be much larger than \(\alpha\) as \(m\) increases. For example, with \(\alpha = 0.05\) and \(m = 3\) we have \(1 - (1-\alpha)^m = 0.1426\ldots\)
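To see how quickly this error rate inflates, here is a quick sketch that evaluates \(1 - (1-\alpha)^m\) for a few values of \(m\) (the choices of \(m\) beyond the \(m = 3\) example are ours):

```python
# FWER for m independent tests when all m null hypotheses are true,
# each tested at significance level alpha: 1 - (1 - alpha)^m.
alpha = 0.05
for m in (1, 3, 10, 50):
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:2d}: FWER = {fwer:.4f}")
# m = 3 gives 0.1426, matching the example above; by m = 50 the
# probability of at least one false positive is over 90%.
```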
This example makes it clear that we must adjust our significance levels to properly control for type 1 errors when testing multiple hypotheses. In this post we are going to discuss one method for doing that, namely, the Holm-Bonferroni (HB) method.[1] We chose this method because it has two nice features: (i) it’s simple to implement, and (ii) it’s valid even if the multiple hypotheses are not independent of each other. As a bonus, this method always has more statistical power than the well-known Bonferroni correction (just testing all hypotheses at the modified significance level \(\alpha/m\)), and it is valid under the same general conditions.
Before we go over the HB method we need to mention that there are multiple ways to think about the control of type 1 errors when testing multiple hypotheses. In our example above we were concerned with the probability of rejecting at least one true null hypothesis. This probability is known as the family-wise error rate (FWER), and the HB method is designed to control it. If reaching even one false positive conclusion will be detrimental to your work, then you should use a method that controls the FWER.
In other problems, a single false positive conclusion is not a major concern, and one is more concerned with detecting as many true effects (false null hypotheses) as possible. In these cases it is better to control another quantity, the false discovery rate (FDR), which is equal to the fraction of your rejected null hypotheses that are actually true. [More precisely, it is equal to the expected value of that fraction, since the fraction is a random variable that could take on different values if you repeated your experiment multiple times.] Put another way, the FDR is the expected fraction of your “discoveries” (i.e., your rejected null hypotheses) that are false positives. Methods that control the FDR at a level \(\alpha\) generally have more power than methods that control the FWER at the same level. We may discuss methods for controlling the FDR in a future blog post.
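To make the FDR concrete, here is a small simulation with made-up numbers (100 hypotheses, 90 true nulls, and 10 real effects modeled as a mean shift of 3 in a standard-normal test statistic; all of these choices are illustrative). It estimates the expected false-discovery proportion of naive testing, i.e., rejecting whenever \(p \leq \alpha\) with no correction:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
m, m0, alpha = 100, 90, 0.05   # 90 true nulls, 10 real effects (all made up)
fdp = []                       # false-discovery proportion, one entry per trial
for _ in range(2000):
    # p-values of true nulls are Uniform(0, 1); for the false nulls we
    # compute one-sided p-values of normal statistics shifted by 3
    p_null = rng.uniform(size=m0)
    x_alt = rng.standard_normal(m - m0) + 3.0
    p_alt = np.array([0.5 * (1 - erf(x / sqrt(2))) for x in x_alt])
    v = np.sum(p_null <= alpha)   # false discoveries (rejected true nulls)
    s = np.sum(p_alt <= alpha)    # true discoveries
    if v + s > 0:
        fdp.append(v / (v + s))
print(f"estimated FDR of naive testing: {np.mean(fdp):.3f}")
```

Under these assumptions roughly a third of the “discoveries” are false positives, even though each individual test was run at the 5% level.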
With this background in place, we’ll now present the HB method and then prove that it works as advertised. We assume that we are testing \(m > 1\) hypotheses and we want to control the FWER at a level \(\alpha\). Denote by \(H_1,\dots,H_m\) the \(m\) null hypotheses. We test hypothesis \(H_i\), for \(i\in\{1,\dots,m\}\), by measuring the value of a test statistic \(X_i\), which we assume is a scalar-valued random variable. We do not assume the different hypotheses are independent of each other, which means that we do not assume the \(X_i\) are independent of each other.
To simplify the notation in the proof below we’ll assume that we are performing a one-sided test of each hypothesis, meaning that we will reject hypothesis \(H_i\) if \(X_i\) is found to be greater than some critical value \(c_i\). The generalization to other kinds of tests (e.g., two-sided) is not difficult. When applying the standard Bonferroni correction the critical values \(c_i\) are all defined by the relation \(P_0(X_i \geq c_i) = \alpha/m\), where \(P_0(\cdots)\) denotes the probability of the event in parentheses under the null hypothesis.[2] As we will see below, the definition of \(c_i\) is more complex for the HB method, and each \(c_i\) will actually depend on the values of all the test statistics, so that
\[c_i = c_i(X_1,\dots,X_m)\ .\]
When we test each hypothesis \(H_i\) we compute a p-value \(p_i\) that is equal to the probability, under the null hypothesis, of observing a value of \(X_i\) that is equal to or more extreme than the value we found in the test. If \(x_i\) is the value we observe for \(X_i\), then \(p_i = P_0(X_i \geq x_i)\). We then order the p-values from least to greatest. Let \(p_{(i)}\) denote the p-values arranged in ascending order (i.e., \(p_{(1)}\leq p_{(2)}\leq\cdots\leq p_{(m)}\)), and let \(H_{(i)}\) be the hypothesis whose p-value is \(p_{(i)}\). With this notation in place, the HB method works as follows.
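As a concrete (hypothetical) example, suppose each \(X_i\) is standard normal under its null hypothesis. Then the one-sided p-values and their ordering can be computed as follows; the observed values of the test statistics here are made up:

```python
import numpy as np
from math import erf, sqrt

def one_sided_p(x):
    # p_i = P_0(X_i >= x_i) for a standard-normal test statistic.
    # The post keeps the X_i general; this is just one concrete choice.
    return 0.5 * (1 - erf(x / sqrt(2)))

x = [2.5, 1.0, 3.1]                # observed test statistics (made up)
p = [one_sided_p(xi) for xi in x]  # approximately [0.0062, 0.1587, 0.0010]
order = np.argsort(p)              # order[0] is the index of p_(1), etc.
print(order)                       # [2 0 1]: H_3 has the smallest p-value
```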
Holm-Bonferroni method:

- Starting with \(i = 1\), if \(p_{(i)} \leq \frac{\alpha}{m - i + 1}\), then reject \(H_{(i)}\) and proceed to check this condition for \(i + 1\). In other words, starting with \(i = 1\), compare \(p_{(1)},p_{(2)},\dots,p_{(m)}\) with the adjusted significance levels \(\frac{\alpha}{m},\frac{\alpha}{m-1},\dots,\alpha\).
- As soon as you encounter an \(i\) with \(p_{(i)} > \frac{\alpha}{m - i + 1}\), stop and do not reject any \(H_{(k)}\) with \(k\geq i\).
Here is some code for a Python function that will perform the HB method:

```python
import numpy as np

def holm_bonferroni(p, alpha=0.05):
    """Perform the Holm-Bonferroni method.

    Inputs:
      p     - a list of the p-values from the tests of the multiple
              hypotheses. It does not need to be sorted.
      alpha - the desired significance level (0.05 by default).

    Output:
      sig_hypotheses - a list containing the indices of the p-values in p
      whose corresponding null hypotheses should be rejected according to
      the Holm-Bonferroni method. Recall that Python indexing starts at 0,
      not 1.
    """
    m = len(p)
    sig_hypotheses = []
    p_sorted = np.sort(p)
    indices = np.argsort(p)
    i = 0
    keep_going = True
    while i < m and keep_going:
        # With 0-based i, alpha / (m - i) equals the threshold
        # alpha / (m - i + 1) in the 1-based notation used above.
        if p_sorted[i] <= alpha / (m - i):
            sig_hypotheses.append(indices[i])
            i += 1
        else:
            keep_going = False
    return sig_hypotheses
```
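For example, calling the function on a hypothetical set of four p-values (the function is repeated in compact form so this snippet runs on its own):

```python
import numpy as np

def holm_bonferroni(p, alpha=0.05):
    # compact version of the function above
    m = len(p)
    p_sorted, order = np.sort(p), np.argsort(p)
    sig = []
    for i in range(m):
        if p_sorted[i] <= alpha / (m - i):
            sig.append(int(order[i]))
        else:
            break
    return sig

p = [0.01, 0.015, 0.03, 0.2]          # hypothetical p-values
print(holm_bonferroni(p))             # [0, 1]
# The plain Bonferroni correction at alpha/m = 0.0125 would reject
# only hypothesis 0, illustrating the extra power of the HB method:
print([i for i, pi in enumerate(p) if pi <= 0.05 / len(p)])  # [0]
```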
Since the HB method starts out by sorting the p-values from smallest to largest, the critical value \(c_i\) for rejecting hypothesis \(H_i\) depends on the values of all the test statistics (not just \(X_i\)), as we indicated above. This makes it difficult to compute the \(c_i\) explicitly, but it turns out that we can still use the HB method (and prove that it works) without knowing the exact values of the \(c_i\).
We’ll now prove that this method does indeed control the FWER at a level \(\alpha\). Our proof combines elements of Holm’s original proof and the proof on the Wikipedia page about the HB method. To start, assume that the set of \(m\) null hypotheses contains \(m_0 \geq 1\) hypotheses that are actually true (we don’t know which ones they are), and let \(j \in\mathcal{J}_0\) index these true null hypotheses. Here, the set \(\mathcal{J}_0\) is of size \(m_0\) and contains the indices of the true null hypotheses. Note that we can restrict our attention to the case where \(m_0\geq 1\) since the FWER is zero if \(m_0=0\).
Let \(A\) be the event in which we reject at least one true null hypothesis, so that \(\text{FWER} = P(A)\). We can write \(A = \cup_{j\in\mathcal{J}_0}A_j\), where \(A_j\) is the event that we reject the true null hypothesis \(H_j\), but this is not very helpful since the probabilities of these events are hard to quantify directly. Instead, we’ll proceed as follows. First, we’ll show that the occurrence of event \(A\) implies the occurrence of a certain other event \(B\) (i.e., \(A \subset B\)), so that \(P(A) \leq P(B)\). Next, we’ll see that \(B\) also has a decomposition \(B = \cup_{j\in\mathcal{J}_0}B_j\), where we can show that \(P(B_j) \leq \alpha/m_0\) for each individual event \(B_j\). Finally, we’ll put all these ingredients together and use Boole’s inequality (second to third line below) to obtain
$$\begin{aligned} P(A) &\leq P(B) \\ &= P(\cup_{j\in\mathcal{J}_0}B_j) \\ &\leq \sum_{j\in\mathcal{J}_0}P(B_j) \\ &\leq \alpha\ ,\end{aligned}$$
where the last inequality follows from \(P(B_j) \leq \alpha/m_0\) and \(|\mathcal{J}_0|=m_0\).
All that remains is to identify the events \(B\) and \(B_j\), and to show that \(P(B_j) \leq \alpha/m_0\). To accomplish this, consider the first true null hypothesis that we reject when carrying out the HB method (recall that we assume \(m_0\geq 1\)). This will be the true null hypothesis with the smallest p-value among all the true null hypotheses. Let that hypothesis have p-value \(p_{(h)}\) in the ordering of the p-values from smallest to largest. Since we rejected this hypothesis, we must have
\[p_{(h)} \leq \frac{\alpha}{m - h + 1}\ .\]
In addition, since it is the first true null hypothesis that we encountered in the HB method, the \(h - 1\) hypotheses with smaller p-values are all false null hypotheses, so \(h - 1 \leq m - m_0\), or \(m_0 \leq m - h + 1\). Combining this with the previous inequality, we also have
\[p_{(h)} \leq \frac{\alpha}{m_0}\ .\]
Now, for each test statistic \(X_i\) let us define a modified critical value \(\tilde{c}_i\) by the relation[3]
\[P_0(X_i \geq \tilde{c}_i) = \alpha/m_0\ .\]
Then if we define \(B\) to be the event that \(X_j \geq \tilde{c}_j\) for at least one \(j\in\mathcal{J}_0\), the discussion above shows that \(A\) implies \(B\): if \(H_j\) is the first true null hypothesis that we reject, then its p-value satisfies \(P_0(X_j \geq x_j) = p_{(h)} \leq \alpha/m_0 = P_0(X_j \geq \tilde{c}_j)\), which forces \(x_j \geq \tilde{c}_j\), so \(A\subset B\). We also have \(B = \cup_{j\in\mathcal{J}_0}B_j\), where \(B_j\) is the event that \(X_j \geq \tilde{c}_j\). Finally, by the definition of the \(\tilde{c}_j\) we have \(P(B_j) \leq \alpha/m_0\) (this is actually an equality when the \(X_i\) are continuous), which completes the proof.
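We can also check the FWER control numerically. The sketch below simulates the all-nulls-true case, using independent standard-normal test statistics purely for convenience (the proof above does not require independence), and estimates the FWER of the HB procedure:

```python
import numpy as np
from math import erf, sqrt

def holm_bonferroni(p, alpha=0.05):
    # compact version of the function given earlier in the post
    m = len(p)
    p_sorted, order = np.sort(p), np.argsort(p)
    sig = []
    for i in range(m):
        if p_sorted[i] <= alpha / (m - i):
            sig.append(int(order[i]))
        else:
            break
    return sig

rng = np.random.default_rng(1)
m, alpha, trials = 5, 0.05, 20000
false_rejections = 0
for _ in range(trials):
    x = rng.standard_normal(m)                       # all m nulls are true
    p = [0.5 * (1 - erf(xi / sqrt(2))) for xi in x]  # one-sided p-values
    if holm_bonferroni(p, alpha):                    # any rejection is a type 1 error
        false_rejections += 1
print(f"estimated FWER: {false_rejections / trials:.4f}")  # at or below ~0.05
```

In this independent, all-nulls-true setting the exact FWER is \(1 - (1 - \alpha/m)^m \approx 0.049\), so the estimate should land just under \(\alpha = 0.05\), consistent with the bound we just proved.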
Footnotes
1. See here for Holm’s original paper.
2. This is the correct definition of the critical value for a continuous random variable. For a random variable \(X\) that can only take on a set of discrete values \(x_k\), for \(k\) in some index set, one possible definition of the critical value \(c\) is to take \(c\) to be equal to the smallest \(x_k\) that satisfies \(P_0(X \geq x_k) \leq \alpha/m\). Then we always have \(P_0(X \geq c) \leq \alpha/m\). In this post we focus on the continuous case to make the proof a bit shorter.
3. Here, to keep things simple we are again assuming the \(X_i\) are continuous.