In this post we’ll look at an interesting and fundamental result in statistics that deals with the following situation. Suppose we are studying a system that produces data according to some unknown process, and we have already observed \(n\) data points. We are about to observe the next data point, and we’d like to know the probability that it will differ significantly from the mean of the \(n\) data points we have already observed. We might wonder: can we say anything about this probability without making a lot of assumptions about the process that is generating the data?
It is easy to think of concrete problems where this information would be useful. For example, imagine you run a grocery store and are trying to estimate the weekly demand for a certain item so that you can order enough (but not too much) of that item for next week. If you have demand data for the last \(n\) weeks, a simple strategy for week \(n+1\) would be to order an amount equal to the mean demand for that item over those \(n\) weeks. In that scenario it would also be nice to know the probability that next week’s demand will differ significantly from that average, since this information could help mitigate the risk of ordering too much product.
Remarkably, it is possible to put an upper bound on this probability while only making a small number of assumptions about the process that is generating the data. In particular, we do not need to assume that each data point is drawn from any particular probability distribution (e.g., a normal distribution). In fact, we will not even need to assume that the \(n+1\) data points are independent and identically distributed (we will only assume a weaker condition that includes this as a special case).1
The result that we are going to discuss was derived in a 1984 paper by John G. Saw, Mark C. K. Yang, and Tse Chin Mo in the journal The American Statistician (referred to as Saw et al below).2 It is a bit surprising that such a fundamental result was only derived in 1984, since most of the fundamental mathematical results in statistics are much older than that.
Since this result is so interesting, we’ll take an in-depth look at it in this blog post. In particular, we’ll state and then derive a simplified form of the result. For the original result (which is slightly stronger but more complicated to state), see the paper cited above. There is also a more recent paper that extends this result to the multivariate case.3
Statement of the result
We’ll first need to introduce some notation in order to state the result. Let \(X_1,\dots,X_n\) be the first \(n\) observations, and let \(X_{n+1}\) be the next observation after those. Each \(X_j\), for \(j=1,\dots,n+1\), is a random variable taking values in the real numbers. As we mentioned above, we do not need to assume that the \(X_j\) are independent and identically distributed (iid). Instead, we will only assume that these \(n+1\) random variables are exchangeable. For the purposes of this blog post we can define the exchangeability property as follows. Let \(f(x_1,\dots,x_n,x_{n+1})\) be the joint probability density for all the \(X_j\). Then exchangeability means that this probability density is invariant under any permutation \(\tau\) of these random variables,
\[f(x_{\tau(1)},\dots,x_{\tau(n)},x_{\tau(n+1)}) = f(x_1,\dots,x_n,x_{n+1})\ .\]
The exchangeability condition is more general than the iid condition, but iid is a special case of exchangeability. Indeed, in the iid case we have
\[f(x_1,\dots,x_n,x_{n+1}) = \prod_{j=1}^{n+1}f(x_j)\ ,\]
where \(f(x)\) is the probability density common to all of the variables in the iid case. Because of the product form, this joint probability density is clearly invariant under any permutation of the variables.
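To get a feel for an exchangeable sequence that is not iid, consider drawing values from a finite population without replacement. The short simulation below is a quick illustrative sketch of my own (not from the paper): it checks numerically that swapping the first two draws leaves their joint probabilities unchanged, even though the draws are not independent.

```python
# Illustrative sketch (not from the paper): draws without replacement from a
# finite population are exchangeable but not independent. We estimate
# P(X1 = a, X2 = b) and P(X1 = b, X2 = a) by simulation and check that they
# agree, while the product of the marginals does not.
import numpy as np

rng = np.random.default_rng(0)
population = np.array([1, 2, 3, 4, 5])
a, b = 1, 5
n_trials = 100_000

draws = np.array([rng.choice(population, size=2, replace=False)
                  for _ in range(n_trials)])

p_ab = np.mean((draws[:, 0] == a) & (draws[:, 1] == b))
p_ba = np.mean((draws[:, 0] == b) & (draws[:, 1] == a))
p_a_times_p_b = np.mean(draws[:, 0] == a) * np.mean(draws[:, 1] == b)

print(p_ab, p_ba)        # both close to 1/20 = 0.05, as exchangeability requires
print(p_a_times_p_b)     # close to 1/25 = 0.04, so the draws are not independent
```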
Given these random variables, let \(\overline{X}_{(n)}\) and \(S_{(n)}^2\) be the sample mean and sample variance of the first \(n\) variables,
\[\begin{align} \overline{X}_{(n)} &= \frac{1}{n}\sum_{j=1}^n X_j \\ S_{(n)}^2 &= \frac{1}{n-1}\sum_{j=1}^n (X_j - \overline{X}_{(n)})^2\ . \end{align}\]
Note that the sample variance is defined with \(n-1\) in the denominator (Bessel’s correction).
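In code, these two quantities are just the usual sample statistics; for instance, with NumPy the Bessel correction corresponds to passing ddof=1. A small sketch with hypothetical data:

```python
# Sample mean and Bessel-corrected sample variance of the first n observations.
# The data values here are hypothetical; ddof=1 gives the 1/(n-1) denominator.
import numpy as np

x_n = np.array([12.1, 9.8, 11.4, 10.7, 13.0, 11.9])  # X_1, ..., X_n (made up)
xbar_n = x_n.mean()              # sample mean
s2_n = x_n.var(ddof=1)           # sample variance S^2_(n)
s_n = np.sqrt(s2_n)              # sample standard deviation S_(n)
print(xbar_n, s2_n, s_n)
```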
The result from Saw et al concerns the probability that the (\(n+1\))th data point \(X_{n+1}\) will differ from the sample mean \(\overline{X}_{(n)}\) by some multiple \(\lambda > 0\) of the sample standard deviation \(S_{(n)}\),
\[ P[|X_{n+1} - \overline{X}_{(n)}| \geq \lambda S_{(n)}]\ .\]
Their result is an upper bound on this probability that decays quadratically with the multiplicative factor \(\lambda\),
\[ P\left[|X_{n+1} - \overline{X}_{(n)}| \geq \sqrt{\tfrac{n+1}{n}} \lambda S_{(n)}\right] \leq \left(\frac{n-1}{n}\right)\frac{1}{\lambda^2} + \frac{1}{n}\ . \tag{1}\]
Note here that within the square brackets we have rescaled \(\lambda\) by a factor of \(\sqrt{\frac{n+1}{n}}\) in order to make the final bound look a bit neater.4 As an example, for a sample of size \(n = 50\), the probability that the next data point will be more than three (\(\lambda = 3\)) sample standard deviations from the sample mean is less than or equal to
\[\frac{49}{50}\cdot\frac{1}{9} + \frac{1}{50} = 0.128\dots\ ,\]
or just under 13%.
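Eqn. (1) is simple enough to wrap in a tiny helper function; the sketch below (my own, with a hypothetical function name) reproduces the numerical example above.

```python
# Right-hand side of Eqn. (1): an upper bound on the probability that the next
# observation is more than sqrt((n+1)/n) * lam sample standard deviations from
# the sample mean of the first n observations. (Function name is my own.)
def saw_chebyshev_bound(n: int, lam: float) -> float:
    return ((n - 1) / n) / lam**2 + 1 / n

print(saw_chebyshev_bound(50, 3.0))  # 0.1288..., i.e. just under 13%
```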
Relation to the classical Chebyshev inequality
The reason that Saw et al called this bound the “Chebyshev inequality with estimated mean and variance” is that it takes almost exactly the same form as the classical Chebyshev inequality. Recall that if \(X\) is a random variable with well-defined mean \(\mu\) and variance \(\sigma^2\), then Chebyshev’s inequality for \(X\) can be written in the form
\[P[|X-\mu| \geq \lambda\sigma] \leq \frac{1}{\lambda^2}\ .\]
Comparing to the result above, we see that \(\mu\) and \(\sigma\) have been replaced by their finite sample counterparts \(\overline{X}_{(n)}\) and \(S_{(n)}\), and that the right-hand side of the inequality in Eqn. (1) features additional corrections that depend on the sample size \(n\). Note also that as \(n\to\infty\) we have
\[\left(\frac{n-1}{n}\right)\frac{1}{\lambda^2} + \frac{1}{n} \to \frac{1}{\lambda^2}\ ,\]
and so in this large sample limit the right-hand side of Eqn. (1) becomes identical to the right-hand side of the classical Chebyshev inequality. Indeed, in this limit we also expect (in the iid case) that \(\overline{X}_{(n)}\) and \(S_{(n)}\) will approach \(\mu\) and \(\sigma\), provided that \(\mu\) and \(\sigma\) are well-defined for the distribution that the sample is being drawn from.
The most fascinating thing about the result of Saw et al is that their result does not actually require the \(n+1\) variables to be iid. And even in the case where the variables are iid, this result does not require their distribution to have a well-defined mean \(\mu\) or variance \(\sigma^2\), unlike the classical Chebyshev inequality. For example, the result of Saw et al applies to the case where the \(X_j\) are iid and drawn from a Pareto distribution with exponent \(\alpha \leq 1\), which is a classic example of a probability distribution that does not have a finite mean or variance. The classical Chebyshev inequality clearly does not hold in that case, but Eqn. (1) still applies.
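As a rough empirical check of this claim (my own sketch, not from the paper), one can simulate heavy-tailed Pareto data and estimate the left-hand side of Eqn. (1) directly; the observed frequency should stay below the bound even though the underlying distribution has no finite mean or variance.

```python
# Monte Carlo sanity check (my own sketch): iid draws from a classical Pareto
# distribution with shape alpha = 1 (infinite mean and variance), n = 50,
# lambda = 3. The empirical frequency should not exceed the bound 0.1288...
import numpy as np

rng = np.random.default_rng(1)
n, lam, alpha = 50, 3.0, 1.0
n_trials = 50_000

hits = 0
for _ in range(n_trials):
    x = 1.0 + rng.pareto(alpha, size=n + 1)  # Pareto with x_m = 1, shape alpha
    xbar_n = x[:n].mean()
    s_n = x[:n].std(ddof=1)
    if abs(x[n] - xbar_n) >= np.sqrt((n + 1) / n) * lam * s_n:
        hits += 1

print("empirical frequency:", hits / n_trials)
print("bound from Eqn. (1):", (n - 1) / n / lam**2 + 1 / n)
```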
Proof of the result
In the last part of this post we will prove the result shown in Eqn. (1). We start by defining the sample mean \(\overline{X}_{(n+1)}\) and sample variance \(S_{(n+1)}^2\) for all \(n+1\) variables,
\[\begin{align} \overline{X}_{(n+1)} &= \frac{1}{n+1}\sum_{j=1}^{n+1} X_j \\ S_{(n+1)}^2 &= \frac{1}{n}\sum_{j=1}^{n+1} (X_j - \overline{X}_{(n+1)})^2\ . \end{align}\]
Again, we have \(n\) and not \(n+1\) in the denominator for the variance because of Bessel’s correction. We’ll now rewrite all of the terms in the probability from Eqn. (1) in terms of \(\overline{X}_{(n+1)}\), \(S_{(n+1)}^2\), and \(X_{n+1}\). In particular, we’ll use the identities
\[X_{n+1} - \overline{X}_{(n)} = \left(\frac{n+1}{n}\right)(X_{n+1} - \overline{X}_{(n+1)})\]
and
\[nS^2_{(n+1)} = (n-1)S^2_{(n)} + \left(\frac{n+1}{n}\right)(X_{n+1} - \overline{X}_{(n+1)})^2\ .\]
These identities can be easily checked if one is willing to work through a bit of algebra.
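They can also be sanity-checked numerically; here is a quick sketch of my own (not from the paper) that verifies both identities on randomly generated data.

```python
# Numerical check of the two identities on random data (illustrative only).
import numpy as np

rng = np.random.default_rng(2)
n = 7
x = rng.normal(size=n + 1)                     # X_1, ..., X_{n+1}

xbar_n, xbar_np1 = x[:n].mean(), x.mean()
s2_n, s2_np1 = x[:n].var(ddof=1), x.var(ddof=1)

# First identity: X_{n+1} - Xbar_(n) = ((n+1)/n) * (X_{n+1} - Xbar_(n+1))
print(np.isclose(x[n] - xbar_n, (n + 1) / n * (x[n] - xbar_np1)))         # True

# Second identity: n*S^2_(n+1) = (n-1)*S^2_(n) + ((n+1)/n)*(X_{n+1} - Xbar_(n+1))^2
print(np.isclose(n * s2_np1,
                 (n - 1) * s2_n + (n + 1) / n * (x[n] - xbar_np1) ** 2))  # True
```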
Next, we note that
\[P\left[|X_{n+1} - \overline{X}_{(n)}| \geq \sqrt{\tfrac{n+1}{n}} \lambda S_{(n)}\right] = P\left[(X_{n+1} - \overline{X}_{(n)})^2\geq \tfrac{n+1}{n} \lambda^2 S_{(n)}^2\right]\ ,\]
which holds because both sides of the inequality inside the square brackets are nonnegative, so squaring them does not change the event whose probability we are computing. Then, using the identities from above to replace \(\overline{X}_{(n)}\) and \(S_{(n)}\) with \(\overline{X}_{(n+1)}\) and \(S_{(n+1)}\) on the right-hand side of this equation, we find that
\[P\left[|X_{n+1} - \overline{X}_{(n)}| \geq \sqrt{\tfrac{n+1}{n}} \lambda S_{(n)}\right] = P\left[(X_{n+1} - \overline{X}_{(n+1)})^2\geq \tilde{\lambda}^2 S_{(n+1)}^2\right]\ ,\]
where we defined a new quantity \(\tilde{\lambda}\) that is related to \(\lambda\) by
\[\tilde{\lambda}^2 = \left(\frac{n}{n-1}\right)\frac{\lambda^2}{\left(\frac{n+1}{n}\right)\left(1 + \frac{\lambda^2}{n-1}\right)}\ . \tag{2}\]
Now we come to the key idea of the proof. The reason that we wanted to rewrite everything in terms of \(\overline{X}_{(n+1)}\) and \(S_{(n+1)}^2\) is that these quantities are invariant under any permutation of the \(n+1\) variables \(X_j\). Then, using the exchangeability property of these variables, we find that the original probability that we were studying can be expressed as an average over \(n+1\) different probabilities,
\[P\left[(X_{n+1} - \overline{X}_{(n+1)})^2\geq \tilde{\lambda}^2 S_{(n+1)}^2\right]= \frac{1}{n+1}\sum_{j=1}^{n+1}P\left[(X_j - \overline{X}_{(n+1)})^2\geq \tilde{\lambda}^2 S_{(n+1)}^2\right] \ .\]
Here, each probability in the sum on the right arises from exchanging \(X_{n+1}\) with \(X_j\) in the probability on the left. Since \(\overline{X}_{(n+1)}\) and \(S_{(n+1)}^2\) are unchanged by this swap, the exchangeability property guarantees that each of these probabilities equals the one on the left, and so their average does too.
The next step is to note that
\[P\left[(X_j - \overline{X}_{(n+1)})^2 \geq \tilde{\lambda}^2S_{(n+1)}^2\right] = E\left[\mathbb{I}\big[(X_j - \overline{X}_{(n+1)})^2 \geq \tilde{\lambda}^2S_{(n+1)}^2\big]\right]\ ,\]
where \(E[\cdots]\) denotes an expectation value and \(\mathbb{I}[\cdots]\) is the indicator function that equals one if its argument is true and zero if it is not. Then, to obtain an upper bound on our original probability, we need to bound the expectation value of this average of indicator functions:
\[\frac{1}{n+1}\sum_{j=1}^{n+1}\mathbb{I}\big[(X_j - \overline{X}_{(n+1)})^2 \geq \tilde{\lambda}^2S_{(n+1)}^2\big]\ .\]
To bound this average, we’ll return to the definition of the sample variance for all \(n+1\) variables,
\[\sum_{j=1}^{n+1} (X_j - \overline{X}_{(n+1)})^2 = n S_{(n+1)}^2\ .\]
For each term in this sum we note that
\[\begin{align} (X_j - \overline{X}_{(n+1)})^2 &\geq (X_j - \overline{X}_{(n+1)})^2\mathbb{I}\big[(X_j - \overline{X}_{(n+1)})^2 \geq \tilde{\lambda}^2S_{(n+1)}^2\big] \\ &\geq \tilde{\lambda}^2S_{(n+1)}^2\mathbb{I}\big[(X_j - \overline{X}_{(n+1)})^2 \geq \tilde{\lambda}^2S_{(n+1)}^2\big]\ ,\end{align}\]
and so
\[\sum_{j=1}^{n+1} \tilde{\lambda}^2S_{(n+1)}^2\mathbb{I}\big[(X_j - \overline{X}_{(n+1)})^2 \geq \tilde{\lambda}^2S_{(n+1)}^2\big] \leq n S_{(n+1)}^2\ ,\]
or
\[\sum_{j=1}^{n+1} \mathbb{I}\big[(X_j - \overline{X}_{(n+1)})^2 \geq \tilde{\lambda}^2S_{(n+1)}^2\big] \leq \frac{n}{\tilde{\lambda}^2}\ .\]
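This counting argument is worth pausing on: it says that for any fixed set of \(n+1\) numbers, at most \(n/\tilde{\lambda}^2\) of them can lie \(\tilde{\lambda}\) or more sample standard deviations from their common mean. A quick numerical illustration (my own sketch):

```python
# For any n+1 numbers, the count of points whose squared deviation from the
# mean is at least t^2 * S^2_(n+1) can never exceed n / t^2. Illustrative check.
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_cauchy(size=51)     # any 51 numbers will do
n, t = len(x) - 1, 2.5

s2 = x.var(ddof=1)                   # S^2_(n+1), Bessel-corrected
count = np.sum((x - x.mean()) ** 2 >= t**2 * s2)
print(count, "<=", n / t**2)         # the count never exceeds n / t^2 = 8 here
```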
Putting all of our results together then gives
\[P\left[|X_{n+1} - \overline{X}_{(n)}| \geq \sqrt{\tfrac{n+1}{n}} \lambda S_{(n)}\right] \leq \frac{n}{n+1}\cdot\frac{1}{\tilde{\lambda}^2}\ .\]
Finally, substituting in for \(\tilde{\lambda}\) from Eqn. (2) gives the final result from Eqn. (1) above.
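The algebra in this last step is routine but a bit tedious; a symbolic check with SymPy (my own sketch) confirms that substituting Eqn. (2) into the bound above reproduces the right-hand side of Eqn. (1).

```python
# Symbolic check: n / ((n+1) * lambda_tilde^2) equals (n-1)/(n*lambda^2) + 1/n.
import sympy as sp

n, lam = sp.symbols("n lambda", positive=True)
lam_tilde_sq = (n / (n - 1)) * lam**2 / (((n + 1) / n) * (1 + lam**2 / (n - 1)))

bound = n / ((n + 1) * lam_tilde_sq)
target = (n - 1) / (n * lam**2) + 1 / n
print(sp.simplify(bound - target))   # 0
```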
Footnotes
- Because of these minimal assumptions, the bound that we will discuss is usually not the tightest bound possible, but it is still remarkable that such a bound even exists under these circumstances.
- John G. Saw, Mark C. K. Yang, and Tse Chin Mo, “Chebyshev Inequality with Estimated Mean and Variance,” The American Statistician, 38:2, 130-132 (1984). https://www.jstor.org/stable/2683249
- Bartolomeo Stellato, Bart P. G. Van Parys, and Paul J. Goulart, “Multivariate Chebyshev Inequality with Estimated Mean and Variance,” The American Statistician, 71:2, 123-127 (2017). https://www.tandfonline.com/doi/full/10.1080/00031305.2016.1186559
- One can also motivate the inclusion of the factor of \(\sqrt{\frac{n+1}{n}}\) here by noting that \(\sqrt{\frac{n}{n+1}}(X_{n+1} - \overline{X}_{(n)})/S_{(n)}\) is the t-statistic that one would use for a two-sample t-test comparing the single sample \(X_{n+1}\) to the sample of size \(n\) consisting of \(X_1,\dots,X_n\).
Comments
3 responses to “Chebyshev’s inequality with sample mean and sample variance”
Pretty cool that anything can be said without knowing the pdf the sample is drawn from. I don’t have a clear feel for the exchangeability condition. Can you provide any intuition?
I have to say though that the upper bound is numerically a little disappointing.
For a lambda of 3 (which is a lot imo) and n=50 (also not a small number), to have only 13% confidence is a bit weak.
Is there a more practical result out there for anomaly detection with comparably few assumptions?
The other standard example of exchangeability (besides the iid case) is the case of simple random sampling without replacement from a finite population. But in terms of getting a feel for the condition, one of the main results about exchangeable random variables and how their joint distribution can differ from the iid case is called de Finetti’s theorem. This article gives a nice intro to it in the context of making inferences from sequences of coin flips: https://www.jstor.org/stable/2683855
And I definitely agree with you that numerically this result is a bit disappointing. I’m sure that’s because it tries to assume as little as possible about the distribution. I suspect that the result can be improved if you make some mild assumptions about the distribution, for example assuming that it belongs to a family of distributions that decay faster than any power law. I don’t know how to improve this specific result in that case, but in my next post I am going to explain a related result where you can make a lot of progress using an assumption like that.
[…] this post we’ll continue exploring the theme of the last post, which was about the probability that a new data point will differ significantly from a set of data […]