Multiple hypothesis testing part 2: the false discovery rate and the Benjamini-Hochberg method

In the first post on this blog we introduced the Multiple Comparisons Problem (MCP), which is the increased risk of false positives (a.k.a. type 1 errors) that we face when testing multiple hypotheses at the same time. In that post we mainly discussed the idea of solving this problem by controlling the family-wise error rate (FWER), which is the probability of rejecting at least one true null hypothesis. We introduced the Holm-Bonferroni method for FWER control, and we proved that it works. If you are unfamiliar with the MCP and the setup of this problem, then we recommend taking a look at that first post before reading this one.

In that first post we also briefly mentioned the fact that there are multiple ways to think about type 1 error control when testing multiple hypotheses. The FWER is appropriate when a single false positive could invalidate your results. However, there are many other kinds of situations where multiple false positives are acceptable as long as those false positives make up a small fraction of the null hypotheses that we decide to reject. In those situations controlling the FWER would be unnecessarily restrictive, and one should instead control a quantity that measures the fraction of the rejected null hypotheses that are false positives.

We will discuss this alternative view of type 1 error control in this post and the next one. In this post we introduce the false discovery rate (FDR), which is a precise measure of the fraction of rejected null hypotheses that are false positives. We also discuss some properties of the FDR, including the fact that a procedure that controls the FDR at a certain level generally has more statistical power than a procedure that controls the FWER at the same level. We then introduce two methods, the Benjamini-Hochberg1 and Benjamini-Yekutieli2 methods, which can be used to control the FDR under various scenarios of dependence among the hypotheses. We also include code for a Python function that implements these two methods. In the next blog post we’ll present the mathematical proof that these methods control the FDR.

Notation for multiple hypothesis testing

To properly explain the FDR and contrast it with the FWER, we’ll first need to introduce some notation for multiple hypothesis testing. We assume that we are testing \(m > 1\) null hypotheses \(H_1,\dots,H_m\) and that a certain number \(m_0 \leq m\) of these hypotheses are actually true (but we don’t know which). For each hypothesis \(H_i\) (with \(i\in\{1,\dots,m\}\)) the result of our test will be a p-value \(p_i\), and these p-values will be used to decide whether to reject each hypothesis. We also use the notation \(p_{(i)}\) to denote the p-values sorted in ascending order, so that \(p_{(1)}\leq p_{(2)}\leq\cdots\leq p_{(m)}\), and we use \(H_{(i)}\) to denote the hypothesis whose p-value is \(p_{(i)}\).
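As a small illustration of the sorted notation: if \(m = 3\) and \((p_1, p_2, p_3) = (0.03, 0.004, 0.2)\), then \(p_{(1)} = 0.004\), \(p_{(2)} = 0.03\), and \(p_{(3)} = 0.2\), so that \(H_{(1)} = H_2\), \(H_{(2)} = H_1\), and \(H_{(3)} = H_3\).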

To make this scenario concrete, it is useful to think about an example where each hypothesis \(H_i\) is evaluated using a one-sided test that involves measuring the value of a test statistic \(X_i\) (a real-valued random variable). In this example, if \(x_i\) is the value of \(X_i\) that we observe in the test, then the p-value \(p_i\) is given by \(p_i = P(X_i \geq x_i)\).
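As a minimal sketch of how such a p-value might be computed (assuming, purely for illustration, that each test statistic \(X_i\) is standard normal under its null hypothesis), one could use the survival function from scipy.stats:

from scipy.stats import norm

# Hypothetical observed values x_i of the test statistics X_i
# (these numbers are made up for illustration).
x_observed = [2.3, 0.8, 3.1]

# One-sided p-values p_i = P(X_i >= x_i), under the illustrative assumption
# that each X_i is standard normal when its null hypothesis is true.
# norm.sf is the survival function, i.e., 1 - CDF.
p_values = [norm.sf(x) for x in x_observed]

print(p_values)  # roughly [0.011, 0.212, 0.001]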

Based on the p-values \(p_i\), we will decide to reject a certain number of hypotheses. We can characterize the decisions we make about all of the hypotheses using the following four random variables:

  • \(U =\) the number of true null hypotheses that we do not reject
  • \(V =\) the number of true null hypotheses that we reject
  • \(T =\) the number of false null hypotheses that we do not reject
  • \(S =\) the number of false null hypotheses that we reject

A very important point is that, since we do not know which of the null hypotheses are actually true, we cannot directly measure any of these variables. The only variable that we have access to is

\[R = S + V\ ,\]

which is equal to the total number of null hypotheses that we decide to reject based on the observed p-values and our chosen method of analysis. We’ll need to use our knowledge of the p-values, and of \(R\), to make inferences about the values of these other variables. And since we are concerned with false positives, our main focus will be on the variable \(V\).

The definition of the FWER is very straightforward using this notation — we just have

\[\text{FWER} = P(V \geq 1)\ , \]

which is the probability of rejecting at least one true null hypothesis. Recall that the Holm-Bonferroni method, which we discussed in the first blog post, controls the FWER at a specified level \(\alpha\) (a common choice is \(\alpha = 0.05\)) for any set of null hypotheses, even if those hypotheses are not independent of each other.3

We’ll now move on to the definition of the FDR.

The false discovery rate (FDR) and its properties

To define the FDR, we first need to define one more random variable \(Q\). When \(R = 0\) (i.e., when we do not reject any null hypotheses), we define \(Q\) to be \(0\), while for \(R >0\) we define \(Q\) to be the ratio

\[Q = \frac{V}{R} = \frac{V}{S+V}\ .\]

This is equal to the number of true null hypotheses that we rejected divided by the total number of null hypotheses that we rejected. The FDR is equal to the expectation value of this quantity:

\[\text{FDR} = E[Q]\ .\]

As a side note, the use of the word “discovery” here comes from the fact that a rejected null hypothesis in a study represents a finding that is considered significant, i.e., a scientific discovery. Then \(V/(S+V)\) is equal to the fraction of discoveries that are false.

In a moment we will explain what FDR vs. FWER control implies in practice, but before we do that we want to state an important relation between the FDR and FWER: the FDR is always less than or equal to the FWER,

\[\text{FDR} \leq \text{FWER}\ . \tag{1}\]

This means that any procedure that controls the FWER at a level \(\alpha\) will automatically control the FDR at the same level. It also means that FWER control is more restrictive than FDR control, which is why methods that control the FDR can have more statistical power than methods that control the FWER. If a lack of statistical power is a concern (e.g., because of a small sample size), then you may choose an FDR-controlling method over an FWER-controlling method to gain back some power.

To prove the inequality in Eq. (1) one notes that

\[\frac{V}{S+V} \leq \mathbb{I}[V\geq 1]\ , \tag{2}\] 

where \(\mathbb{I}[\cdots]\) is the indicator function for the event in square brackets (it is equal to 1 when the event is true and 0 when the event is false). This inequality holds because when \(V=0\) both sides are zero, and when \(V \geq 1\) the left-hand side is between 0 and 1 while the right-hand side is equal to 1. The inequality in Eq. (1) then follows after taking an expectation of both sides of the inequality in Eq. (2). 
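Written out explicitly, this chain of steps is

\[\text{FDR} = E[Q] \leq E\big[\,\mathbb{I}[V\geq 1]\,\big] = P(V \geq 1) = \text{FWER}\ ,\]

where we used the fact that the expectation of an indicator function is equal to the probability of the corresponding event.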

To see what FDR vs. FWER control implies in practice, it helps to imagine a hypothetical scenario in which we perform the same experiment many times, say 100 times. Each time we do the experiment we test the same \(m\) hypotheses, and we evaluate the significance of our results using either an FDR-controlling method or an FWER-controlling method, in both cases at a level \(\alpha\). Since our experiments are subject to randomness, the results will not be the same every time, but our chosen method for dealing with false positives will allow us to say something about the overall properties of our results across the 100 experiments. As we will see, the key difference in the implications of FDR control vs. FWER control comes from the fact that the FWER is a probability, while the FDR is an expectation value (i.e., an average).

If we do these 100 experiments and interpret them using a method that controls the FWER, then we will find that we rejected at least one true null hypothesis in about \(100\cdot\alpha\) experiments (so about 5 cases if \(\alpha = 0.05\)).4 This is because the FWER is the probability of rejecting at least one true null hypothesis.

On the other hand, if we do these 100 experiments and interpret them using a method that controls the FDR, then the conclusion that we can draw is a little different. Let \(v_t\) and \(s_t\) be the realized values of the random variables \(V\) and \(S\) in experiment \(t\), where \(t\in\{1,\dots,100\}\) (but note that we will never have direct knowledge of \(v_t\) or \(s_t\) alone). Then the conclusion that we can draw based on control of the FDR is that

\[\frac{1}{100}\sum_{t=1}^{100}\frac{v_t}{s_t+v_t} \approx \alpha\ .\]

In other words, we find that the average (over the 100 experiments) of the fraction of rejected null hypotheses that are actually true is approximately equal to \(\alpha\).
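To make this hypothetical scenario a bit more tangible, here is a small simulation sketch of the repeated-experiments picture. All of the specific numbers in it are made-up illustrative assumptions: each experiment tests \(m = 20\) hypotheses, of which the first \(m_0 = 15\) null hypotheses are true (their p-values are uniform on \([0,1]\)) and the remaining 5 are false (their p-values are drawn from a Beta(0.1, 1) distribution, which piles up near 0). Each experiment is analyzed with the Benjamini-Hochberg rule that we describe in the next section.

import numpy as np

rng = np.random.default_rng(seed=0)
m, m0, alpha = 20, 15, 0.05
n_experiments = 100
q_values = []

for _ in range(n_experiments):
    # p-values for the true nulls (uniform) and the false nulls (Beta(0.1, 1)).
    p = np.concatenate([rng.uniform(size=m0), rng.beta(0.1, 1.0, size=m - m0)])
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passing = np.nonzero(p[order] <= thresholds)[0]

    if passing.size == 0:
        q_values.append(0.0)               # R = 0, so Q is defined to be 0
        continue

    rejected = order[:passing.max() + 1]   # reject H_(1), ..., H_(i*)
    v = np.sum(rejected < m0)              # rejected true nulls, i.e., V
    q_values.append(v / rejected.size)     # Q = V / R for this experiment

# The sample average of Q over the experiments; it should land at
# (or below) roughly alpha.
print(np.mean(q_values))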

The Benjamini-Hochberg and Benjamini-Yekutieli methods

Now that we know what the FDR is and when we might want to use it instead of the FWER, it is time to look at the methods that we can use to control the FDR at a specified rate \(\alpha\). The first method that we’ll look at is the Benjamini-Hochberg method. This method controls the FDR in two different cases: (i) when the hypotheses are independent of each other, and (ii) when the hypotheses have a certain kind of positive dependence/correlation with each other. The kind of positive dependence that is allowed under case (ii) takes a bit of work to explain, so we will not go into it here. We refer readers to the paper of Benjamini and Yekutieli (cited below in footnote 2) for more details on that case.

The Benjamini-Hochberg method works as follows:

  • Let \(i_{*}\) be the largest \(i\) that satisfies \(p_{(i)} \leq \frac{i}{m}\alpha\). If such an \(i_{*}\) exists, then reject all hypotheses \(H_{(k)}\) with \(k\leq i_{*}\).
  • If there is no \(i\) that satisfies \(p_{(i)} \leq \frac{i}{m}\alpha\), then do not reject any hypotheses.

This method, as well as the Benjamini-Yekutieli method that we discuss below, actually controls the FDR at a rate that is smaller than \(\alpha\) by a factor of \(m_0/m\),

\[\text{FDR} \leq \frac{m_0}{m}\alpha\ .\]

Since we don’t know \(m_0\) (the number of true null hypotheses), we don’t know the exact value of this upper bound, but we do know that it is always less than or equal to \(\alpha\).

This type of method is usually called a “step-up” procedure (the name is a little confusing, because algorithmically we step down through the sorted p-values), and it can be implemented in the following manner. Starting with \(i=m\), we check whether \(p_{(i)} \leq \frac{i}{m}\alpha\) is satisfied. If it is not, then we move on to \(i = m-1\) and check this condition again, and so on. Since we are starting from the top, the first \(i\) that we encounter that satisfies the condition will be the \(i_{*}\) that we are looking for, so we can stop at that point without having to check whether \(p_{(i)} \leq \frac{i}{m}\alpha\) for any \(i < i_{*}\). The code that we present below works in exactly this manner.

Now there is one non-intuitive property of this method that we need to mention. It is entirely possible that there will be some \(k < i_{*}\) that satisfies \(p_{(k)} > \frac{k}{m}\alpha\), even though the Benjamini-Hochberg method tells us to reject \(H_{(k)}\) since \(k < i_{*}\). This scenario is consistent with the method and is completely fine — it just looks a little strange. It is allowed to happen because the Benjamini-Hochberg method only requires that \(p_{(i)} \leq \frac{i}{m}\alpha\) for \(i = i_{*}\).
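Here is a small made-up numerical example that illustrates both the rejection rule and this property. Suppose \(m = 5\), \(\alpha = 0.05\), and the sorted p-values are

\[p_{(1)} = 0.004,\quad p_{(2)} = 0.030,\quad p_{(3)} = 0.035,\quad p_{(4)} = 0.038,\quad p_{(5)} = 0.200\ ,\]

so that the comparison values \(\frac{i}{m}\alpha\) are \(0.01, 0.02, 0.03, 0.04,\) and \(0.05\). Starting from the top, \(p_{(5)} = 0.200 > 0.05\), but \(p_{(4)} = 0.038 \leq 0.04\), so \(i_{*} = 4\) and we reject \(H_{(1)},\dots,H_{(4)}\), even though \(p_{(2)} = 0.030 > 0.02\) and \(p_{(3)} = 0.035 > 0.03\).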

We mentioned above that the Benjamini-Hochberg method is valid for independent hypotheses and under a certain kind of positive dependence among the hypotheses. It is of course possible that you will have a set of hypotheses with a dependence that lies outside of those two possibilities. Luckily, there is a simple modification to this procedure that will control the FDR for any type of dependence among the hypotheses. This method was introduced by Benjamini and Yekutieli and is sometimes referred to as the Benjamini-Yekutieli method.

To state the Benjamini-Yekutieli method, we’ll need to define the \(m\)th harmonic number \(h_m\),

\[h_m = \sum_{i=1}^m\frac{1}{i}\ .\]

With this definition, the Benjamini-Yekutieli method works as follows:

  • Let \(i_{*}\) be the largest \(i\) that satisfies \(p_{(i)} \leq \frac{i}{m\cdot h_m}\alpha\). If such an \(i_{*}\) exists, then reject all hypotheses \(H_{(k)}\) for \(k\leq i_{*}\).
  • If there is no \(i\) that satisfies \(p_{(i)} \leq \frac{i}{m\cdot h_m}\alpha\), then do not reject any hypotheses.

The only difference between this method and the Benjamini-Hochberg method is the replacement of \(m\) with the product \(m\cdot h_m\) in the denominator of the right-hand side of the inequality \(p_{(i)} \leq \frac{i}{m}\alpha\) that we check for each \(i\). This adjustment makes the method slightly less powerful, but since the \(m\)th harmonic number \(h_m\) only grows like \(\ln(m)\) at large \(m\), the difference is not very significant and we don’t lose too much power.
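For a sense of scale, \(h_{100} \approx 5.19\) and \(h_{10{,}000} \approx 9.79\), so the Benjamini-Yekutieli comparison values are roughly 5 to 10 times smaller than the Benjamini-Hochberg ones even for fairly large numbers of hypotheses.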

Here is some code for a Python function that will perform both of these methods depending on the user’s input. Stay tuned for the next post, where we’ll present the proof that these methods work as advertised.

Note: beware that in your browser some lines of this code may get split into multiple lines. The cases where this occurs should be obvious, but if you copy and paste this code you may have to fix the formatting before it will run properly.

import numpy as np

# This function computes the m^th harmonic number h_m.
def harmonic_number(m):
    h_m = 0
    
    for i in range(1, m + 1):
        h_m += 1 / i
        
    return h_m   

# The next function carries out the Benjamini-Hochberg
# or Benjamini-Yekutieli method depending on the inputs.

# Inputs:
# (1) p = a Python list of the p-values from the tests of
# the multiple hypotheses. It does not need to be sorted. 
# (2) alpha = the desired significance level. It is set to 
# 0.05 by default.
# (3) method = 'BH' (for Benjamini-Hochberg) or 'BY' (for
# Benjamini-Yekutieli). It is set to 'BH' by default.

# Output: 
# (1) sig_hypotheses = a list containing the indices of the 
# p-values in p whose corresponding null hypotheses should 
# be rejected according to the method used. Recall that
# Python indexing starts at 0, not 1.

def multiple_hypothesis_tester(p, alpha = 0.05, method = 'BH'):
        
    m = len(p)
    
    p_sorted = np.sort(p)
    indices = np.argsort(p)
    
    if method == 'BH':
        comparison_values = np.array([i * alpha / m for i in range(1, m + 1)])
    elif method == 'BY':
        h_m = harmonic_number(m)
        comparison_values = np.array([i * alpha / (m * h_m) for i in range(1, m + 1)])
    else:
        print('Please enter a valid method: BH or BY.')

        return []
    
    # This variable will be updated to the largest index k for which
    # p_sorted[k] <= comparison_values[k] (if any such index exists).
    k = None 
    
    # Initialize a variable that will count down to 0 from m - 1.
    i = m - 1
    
    while (k is None) and (i >= 0):
        if p_sorted[i] <= comparison_values[i]:
            k = i

        i -= 1
    
    if k is not None:
        sig_hypotheses = list(indices[:k + 1])
    else:
        sig_hypotheses = []
    
    return sig_hypotheses
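As a quick usage example, here is the function applied to the same made-up p-values as the numerical example in the previous section (passed in unsorted order):

p_values = [0.030, 0.004, 0.200, 0.035, 0.038]

# Benjamini-Hochberg at alpha = 0.05: this should reject the hypotheses at
# (0-based) positions 1, 0, 3, and 4 of the input list, matching the worked
# example above (only the p-value of 0.200 survives).
print(multiple_hypothesis_tester(p_values, alpha = 0.05, method = 'BH'))

# Benjamini-Yekutieli is more conservative: here h_5 is about 2.28, the
# comparison values shrink by that factor, and only the smallest p-value
# (0.004, at position 1) leads to a rejection.
print(multiple_hypothesis_tester(p_values, alpha = 0.05, method = 'BY'))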

Footnotes

  1. Yoav Benjamini and Yosef Hochberg, “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B (Methodological), 57:1, 289–300 (1995). https://www.jstor.org/stable/2346101
  2. Yoav Benjamini and Daniel Yekutieli, “The Control of the False Discovery Rate in Multiple Testing under Dependency.” Annals of Statistics, 29:4, 1165–1188 (2001). https://doi.org/10.1214/aos/1013699998
  3. If we test the hypotheses using a set of test statistics \(X_i\), then the hypotheses \(H_i\) will only be independent of each other if the random variables \(X_i\) are independent of each other.
  4. The reason that the number of cases will be about \(100\cdot\alpha\)  instead of exactly that number is that we have a finite sample of experiments, so the observed proportion of these experiments where \(V \geq 1\) will approach, but not be exactly equal to, the true probability \(P(V\geq 1)\). The same reasoning holds for the sample average that we discuss in the next paragraph for the FDR case.
