Note: the code used to generate the numbers and figures in this post can be found here in the “rare-event-probability” repository on my GitHub page.
In this post we’ll continue exploring the theme of the last post, which was about the probability that a new data point will differ significantly from a set of data points that were previously generated by the same process or system. Our main example in the last post was the weekly demand for some item at a supermarket. If we know how many of this item were sold each week for the last \(n\) weeks, can we say anything about the probability that next week’s demand will differ significantly from that of the last \(n\) weeks?
In the last post we specifically looked at an upper bound on the probability that the new data point will differ significantly from the sample mean of the previously collected data points. This time we’ll instead look at the probability that the new data point is larger than the maximum of all the previously collected data points. Schematically, we want to know
\[P\left(\text{new data point} \geq \lambda\cdot(\text{max of previous data points})\right)\ ,\]
where \(\lambda \geq 1\) is a multiplicative factor that tells us how big the new data point is compared to the maximum of the previous data points. Because of our focus on the maximum of the previously collected data, this blog post is really about rare events, i.e., new data points that are more extreme (in this case, much larger) than any data point seen so far.1
As in the last post, we want to be able to say something meaningful about this probability without making overly restrictive assumptions about the process that is generating the data. This is for the simple reason that we usually don’t know the exact details of the process that is generating the data that we are measuring. If we can obtain strong results without making restrictive assumptions, then our analysis will obviously be much more valuable.
We’ll start by setting up the notation and context for this problem. Then we’ll look at an upper bound on this probability for data drawn from a large class of distributions with thin tails. This class includes many familiar distributions with rapid decay, including the normal, exponential, and gamma distributions. Next, we’ll look at the same probability for an example of a distribution with a slow power law decay, namely the Pareto distribution, and we’ll compare the probability of a rare event in this distribution with our result for the family of thin-tailed distributions. Finally, we’ll return to the thin-tailed case and present a proof of our probability bound for that case.
Setup of the problem
We’ll consider making \(n+1\) independent observations of a system described by the probability distribution \(f(x)\). We’ll denote these observations by \(X_j\) for \(j = 1, 2, 3,\dots, n, n+1\). Here, the \(X_j\) are independent and identically distributed (i.i.d.) random variables, so \(X_j\) has the probability density \(f(x_j)\), and the joint probability density for all the variables is \(\prod_j f(x_j)\). We’ll also assume that the \(X_j\) are non-negative, which makes sense in our context.
We are interested in the probability that the last data point \(X_{n+1}\) is larger than the maximum of \(X_1,\dots,X_n\) by a factor of \(\lambda \geq 1\),
\[P(X_{n+1} \geq \lambda Y_n)\ ,\]
where
\[Y_n = \max(X_1,\dots,X_n)\]
is the maximum, or largest order statistic, of the first \(n\) observations.
In terms of the probability density \(f(x)\) that characterizes the \(X_j\), the probability density \(g(y)\) for \(Y_n\) is given by
\[g(y) = nf(y)F(y)^{n-1}\ ,\]
where \(F(x) = \int_0^x dx' f(x')\) is the cumulative distribution function for \(f(x)\). For those who are unfamiliar with order statistics, a simple argument for deriving \(g(y)\) is as follows. By definition, \(g(y)dy\) is the probability that \(Y_n\) lies in the infinitesimal interval \((y,y+dy)\). In order for that to happen, \(n-1\) of the \(X_j\) must be less than \(y\), and exactly one of the \(X_j\) should be within the interval \((y,y+dy)\). The probability that one of the \(X_j\) is less than \(y\) is \(F(y)\), while the probability that one of the \(X_j\) lies within \((y,y+dy)\) is \(f(y)dy\). Finally, since there are \({n \choose 1} = n\) ways to choose which of the \(X_j\) to put in the interval \((y,y+dy)\), we need to multiply \(f(y)F(y)^{n-1}\) by \(n\) to get the expression for \(g(y)\) shown above.
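As a quick sanity check of this formula, we can simulate the maximum of \(n\) i.i.d. draws and compare against the cumulative distribution \(F(y)^n\) of \(Y_n\) (which is what we get by integrating \(g(y)\)). The exponential distribution used here is just an arbitrary illustrative choice:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 25

# Draw many independent samples of Y_n, the maximum of n i.i.d. exponential(1) variables.
maxima = rng.exponential(scale=1.0, size=(100_000, n)).max(axis=1)

# The CDF of Y_n implied by g(y) is F(y)^n; compare it with the empirical CDF.
for y in [2.0, 3.0, 4.0, 5.0]:
    print(f"y = {y}: empirical {np.mean(maxima <= y):.4f} vs F(y)^n = {stats.expon.cdf(y)**n:.4f}")
```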
Now that we know the probability distribution for \(Y_n\), we can use the fact that \(Y_n\) and \(X_{n+1}\) are independent to immediately write down a formula for the probability that we are interested in:
\[\begin{align} P(X_{n+1} \geq \lambda Y_n) &= \int_0^{\infty}dy\ g(y)\int_{\lambda y}^{\infty} dx\ f(x) \\ &= \int_0^{\infty}dy\ g(y)[1-F(\lambda y)] \\ &= n\int_0^{\infty}dy\ f(y)F(y)^{n-1}[1-F(\lambda y)]\ . \end{align}\]
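Although we'll proceed analytically below, this integral is easy to evaluate numerically for any concrete choice of \(f(x)\). Here is a minimal sketch using scipy's quadrature routine; the gamma distribution is just an illustrative choice and is not tied to anything in the derivation:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def rare_event_prob(dist, n, lam):
    """Evaluate n * int_0^inf f(y) F(y)^(n-1) [1 - F(lam*y)] dy by quadrature."""
    integrand = lambda y: dist.pdf(y) * dist.cdf(y) ** (n - 1) * dist.sf(lam * y)
    value, _ = quad(integrand, 0, np.inf)
    return n * value

# Example: a gamma distribution with shape 2 and unit scale.
print(rare_event_prob(stats.gamma(2), n=25, lam=2))
```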
At this point it is difficult to make progress without knowing a little more information about the distribution \(f(x)\). However, we can make at least one interesting observation. For the case where \(\lambda =1\), we have
\[\begin{align} P(X_{n+1} \geq Y_n) &= n\int_0^{\infty}dy\ f(y)F(y)^{n-1}[1-F(y)] \\ &= n\int_0^1 du\ u^{n-1}(1-u) \\ &= \frac{1}{n+1}\ ,\end{align}\]
where we evaluated the integral via the u-substitution \(u = F(y)\) (so that \(du = f(y)\,dy\)). Thus we find that, for any distribution whatsoever, the probability that the \((n+1)\)th data point will be larger than any of the previous \(n\) data points is exactly \(1/(n+1)\).
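Since this \(1/(n+1)\) result is distribution-free, it makes for a nice simulation check. Here is a small Monte Carlo experiment with a few arbitrarily chosen distributions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 25, 200_000

samplers = {
    "gamma(2)": lambda size: rng.gamma(shape=2.0, size=size),
    "lognormal": lambda size: rng.lognormal(size=size),
    # numpy's pareto() samples a Lomax variable; adding 1 gives a Pareto with x_m = 1.
    "Pareto (x_m=1, alpha=1.5)": lambda size: 1.0 + rng.pareto(1.5, size=size),
}

for name, sampler in samplers.items():
    x = sampler((trials, n + 1))
    # Fraction of trials where the last draw exceeds the maximum of the first n draws.
    prob = np.mean(x[:, -1] >= x[:, :-1].max(axis=1))
    print(f"{name}: {prob:.4f} (exact value 1/(n+1) = {1/(n+1):.4f})")
```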
Upper bound for a family of thin-tailed distributions
We’ll now derive an upper bound for \(P(X_{n+1} \geq \lambda Y_n)\) that holds for a large family of distributions with thin tails, namely the log-concave distributions.2 This family includes many familiar distributions, such as the normal, exponential, and gamma distributions. As we mentioned in the introduction, we are interested in deriving useful and general results that do not require us to make overly restrictive assumptions about the form of the distribution \(f(x)\). The result that we will derive in this section holds for all of the distributions that we just mentioned, and so it is exactly the kind of result we are looking for.
The log-concave distributions have the property that \(\ln(f(x))\) is a concave function (here, \(\ln(\cdots)\) denotes the natural logarithm). This property has many interesting consequences, but for our purposes here we will only need one key result:
If \(f(x)\) is log-concave and continuously differentiable, then the tail probability function \(\overline{F}(x) = 1 - F(x)\)3 obeys the inequality
\[\overline{F}(\lambda x) \leq (\overline{F}(x))^{\lambda}\ . \tag{1}\]
To gain some intuition for this result, note that this inequality actually becomes an equality for any exponential distribution
\[f(x) = \frac{1}{a}e^{-x/a}\ ,\]
where \(a > 0\) is the scale factor for the exponential distribution. This is the case because the tail function of the exponential distribution takes the simple form
\[\overline{F}(x) = e^{-x/a}\ .\]
Then the result in Eq. (1) can be interpreted as saying that the tail probability function for any log-concave distribution exhibits a decay that is at least as fast as that of the exponential distribution.
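Before using this result, it is worth spot-checking it numerically. The sketch below checks Eq. (1) on a grid of points for two log-concave distributions supported on \([0,\infty)\); the gamma and half-normal distributions are just convenient illustrative choices:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.01, 10, 500)
distributions = {"gamma(2)": stats.gamma(2), "half-normal": stats.halfnorm()}

for lam in [2, 5]:
    for name, dist in distributions.items():
        # Check the tail inequality F_bar(lam*x) <= F_bar(x)^lam pointwise on the grid.
        holds = np.all(dist.sf(lam * x) <= dist.sf(x) ** lam)
        print(f"lambda = {lam}, {name}: inequality holds on the grid? {holds}")
```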
Using the result from Eq. (1), we find that
\[\begin{align} P(X_{n+1} \geq \lambda Y_n) &= n\int_0^{\infty}dy\ f(y)F(y)^{n-1}\overline{F}(\lambda y) \\ &\leq n\int_0^{\infty}dy\ f(y)F(y)^{n-1}(\overline{F}(y))^{\lambda} \\ &= n\int_0^1 du\ u^{n-1}(1-u)^{\lambda}\ , \end{align}\]
where we used log-concavity to go from the first line to the second, and we then applied the same u-substitution as before. If we now recall the definition of the beta function \(B(a, b)\),
\[B(a,b) = \int_0^1 du\ u^{a-1}(1-u)^{b-1}\ ,\]
then we obtain our final result that
\[P(X_{n+1} \geq \lambda Y_n) \leq n B(n, \lambda+1)\ . \tag{2}\]
We note again that this inequality also becomes an equality for the case of the exponential distribution.
To gain some intuition for this upper bound, and to see that it is actually quite useful, let’s look at a few numerical examples:
- For \(n = 25\) and \(\lambda = 2\), \(n B(n, \lambda+1) = 0.00285\)
- For \(n = 25\) and \(\lambda = 5\), \(n B(n, \lambda+1) = 7.02\cdot 10^{-6}\)
These numbers are already quite small, even at \(n=25\), and so this bound seems to do a good job of tightly constraining the probability of a rare event.
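These values are straightforward to reproduce with scipy's beta function:

```python
from scipy.special import beta

n = 25
for lam in [2, 5]:
    # The bound from Eq. (2): n * B(n, lambda + 1).
    print(f"lambda = {lam}: n*B(n, lambda+1) = {n * beta(n, lam + 1):.3g}")
```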
Comparison with the Pareto distribution
We’ll now contrast the upper bound from the previous section with the exact result for a typical family of distributions with fat tails, namely the Pareto distributions. The fat tails of these distributions come from the power law decay of their probability density function \(f(x)\), which decays as \(x^{-(\alpha + 1)}\) for some power \(\alpha > 0\).
Each Pareto distribution is actually characterized by two parameters, a minimum value \(x_m > 0\) and a power \(\alpha > 0\). A random variable \(X\) with this Pareto distribution can only take on values greater than or equal to \(x_m\), and its probability density takes the form
\[f(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}\ .\]
For this distribution we also have \(F(x) = 1 - \left(\frac{x_m}{x}\right)^{\alpha}\) and \(\overline{F}(x) = \left(\frac{x_m}{x}\right)^{\alpha}\).
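For reference, this parametrization matches scipy's `stats.pareto` when we set its shape parameter `b` to \(\alpha\) and its `scale` to \(x_m\) (the specific values below are arbitrary):

```python
import numpy as np
from scipy import stats

alpha, xm = 2.0, 3.0
dist = stats.pareto(b=alpha, scale=xm)

x = np.linspace(xm, 30, 5)
print(np.allclose(dist.cdf(x), 1 - (xm / x) ** alpha))  # True
print(np.allclose(dist.sf(x), (xm / x) ** alpha))        # True
```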
The tail function of the Pareto distribution has the scaling property that
\[\overline{F}(\lambda y) = \frac{1}{\lambda^{\alpha}}\overline{F}(y)\ ,\]
which allows us to immediately evaluate the expression for \(P(X_{n+1} \geq \lambda Y_n)\),
\[\begin{align} P(X_{n+1} \geq \lambda Y_n) &= n\int_0^{\infty}dy\ f(y)F(y)^{n-1}\overline{F}(\lambda y) \\ &= \frac{n}{\lambda^{\alpha}}\int_0^{\infty}dy\ f(y)F(y)^{n-1}\overline{F}(y)\\ &= \frac{1}{n+1}\cdot\frac{1}{\lambda^{\alpha}}\ .\end{align}\]
Therefore we find that for the Pareto distribution, the probability of a rare event decays with the same power \(\alpha\) as the tail function \(\overline{F}(x)\).
As you can see in the figure below, which looks at the case where \(n=25\), the probability of a rare event decays much more slowly with \(\lambda\) for the Pareto distributions as compared with our result above for the log-concave distributions.
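A figure along these lines takes only a few lines of code to generate. The sketch below plots the log-concave bound from Eq. (2) against the exact Pareto result for \(n=25\); the \(\alpha\) values are illustrative choices and not necessarily the ones used in the figure above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import beta

n = 25
lam = np.linspace(1, 5, 200)

# Upper bound for log-concave distributions vs. the exact Pareto probability.
plt.semilogy(lam, n * beta(n, lam + 1), label="log-concave bound")
for alpha in [1, 2, 3]:
    plt.semilogy(lam, 1 / ((n + 1) * lam**alpha), label=rf"Pareto, $\alpha = {alpha}$")

plt.xlabel(r"$\lambda$")
plt.ylabel(r"$P(X_{n+1} \geq \lambda Y_n)$")
plt.legend()
plt.show()
```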
Proof of the upper bound for log-concave distributions
In this final section we’ll prove the bound in Eq. (1), which was the key ingredient in deriving the bound on \(P(X_{n+1} \geq \lambda Y_n)\) from Eq. (2). We’ll also make a few comments on the log-concave condition and note that it is actually a much stricter condition than we really need to derive Eq. (2). This will imply that the bound in Eq. (2) actually holds for a family of distributions that is larger than the family of log-concave distributions.
To start the proof of Eq. (1), we’ll first use a general property4 of log-concave distributions:
If \(f(x)\) is log-concave and continuously differentiable, then the tail probability function \(\overline{F}(x)\) is also log-concave.
Let us write \(\overline{F}(x)\) in the form
\[\overline{F}(x) = e^{-\phi(x)}\ , \]
where we have introduced a new function \(\phi(x)\) whose properties we will explain shortly. In general, \(\overline{F}(x)\) is less than or equal to 1, is a decreasing (or at least non-increasing) function of \(x\), and satisfies \(\overline{F}(0) = 1\). This means that \(\phi(x)\) is greater than or equal to zero, is an increasing (or at least non-decreasing) function of \(x\), and satisfies \(\phi(0) = 0\). Finally, using the result on log-concavity that we quoted above, we find that \(\phi(x)\) is convex if \(f(x)\) is log-concave and continuously differentiable.
Since \(\phi(x)\) is convex, we can apply one of the standard properties of convex functions, which is that these functions always lie above their tangents. In particular, for any \(x\) and \(y\) we have (with \(\phi'(x) = d\phi(x)/dx\))
\[\phi(y) \geq \phi(x) + \phi'(x)(y-x)\ . \tag{3}\]
If we consider this condition for the specific case where \(y = 0\), then we find (using \(\phi(0) = 0\)) that
\[x\phi'(x) \geq \phi(x)\ , \tag{4}\]
which in turn implies that
\[\frac{\phi(x)}{x} \tag{5}\]
is a non-decreasing function (to see this note that the inequality in Eq. (4) implies that the first derivative of \(\phi(x)/x\) is greater than or equal to zero).
Finally, the fact that the function in Eq. (5) is non-decreasing implies that, for \(\lambda \geq 1\), \(\phi(\lambda x)/(\lambda x) \geq \phi(x)/x\), or
\[\phi(\lambda x) \geq \lambda\phi(x)\ .\]
This in turn implies Eq. (1), which was the key to our proof of Eq. (2).
We’ll close out this section by reviewing these results and thinking critically about what conditions we actually needed to derive Eq. (1). In fact, the only condition we really needed was that \(\phi(x)/x\) is a non-decreasing function. We did not actually need to assume that \(\phi(x)\) was convex (indeed, Eq. (3) is a much more restrictive condition than Eq. (4)), and we definitely did not need to assume that \(f(x)\) itself was log-concave. The log-concave condition was merely a convenient tool for deriving the condition in Eq. (4) (and log-concave distributions are a well-known subject area in statistics, so they make sense as a starting point for a result like this).
What this means is that our results in Eqs. (1) and (2) will actually hold for any probability distribution whose tail probability function, when written in the form \(\overline{F}(x) = e^{-\phi(x)}\), has the property that \(\phi(x)/x\) is a non-decreasing function. It would be interesting to find some specific examples of such distributions where \(f(x)\) is not log-concave, but we don’t currently know of any examples where the random variable can also take any value from \(0\) to \(\infty\).
Footnotes
- For a very interesting discussion of rare events that are also highly consequential, I strongly recommend the book The Black Swan: The Impact of the Highly Improbable by Nassim Nicholas Taleb.
- An excellent reference on these distributions can be found here: https://escholarship.org/uc/item/62c3d5c4.
- Recall that \(\overline{F}(x) = P(X \geq x)\) is just the probability that the random variable \(X\) is larger than \(x\).
- This property appears in Theorem 3 of the reference that we mentioned in the second footnote.