Conformal Inference
Arguably, the two major paradigms of statistics are inference and prediction. Although the machine learning community sometimes refers to prediction as predictive inference, it often lacks the ability to quantify uncertainty, an essential element of any statistical inference, because no distributional assumption is made. Even when we do know how to produce a prediction interval, say for a multiple linear regression model, it can suffer from undercoverage because the error standard deviation is underestimated by $\widehat{\sigma} =\sqrt{\frac{1}{n-p}\sum_{i=1}^n(y_i - \hat{y}_i)^2}$. This is a simple application of Jensen's inequality: $\mathrm{E}(\widehat{\sigma})\leq \sigma$.
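To spell out the Jensen step: $\widehat{\sigma}^2$ (with the $1/(n-p)$ divisor) is unbiased for $\sigma^2$, and the square root is concave, so $$ \mathrm{E}(\widehat{\sigma}) = \mathrm{E}\left(\sqrt{\widehat{\sigma}^2}\right) \leq \sqrt{\mathrm{E}(\widehat{\sigma}^2)} = \sigma, $$ with equality only in degenerate cases.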
Conformal inference, or conformal prediction, is a way to overcome these issues altogether and produce a prediction interval that neither undercovers nor overcovers. It relies heavily on the ranks of continuous random variables and their stochastic properties.
Ranks are distribution-free!
Order statistics and rank statistics together provide the foundation for distribution-free inference. The $i$-th order statistic is commonly denoted by $X_{(i)}$, where $X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}$. Then, the rank of $X_i$, denoted by $R_i \in \{1, \ldots,n \}$, is the index such that $X_i = X_{(R_i)}$. The rank $R_i$ is uniquely defined almost surely (i.e., ties can't arise) if the $X_i$'s are continuous.
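As a quick illustration (a minimal sketch in Python; the sample size and distribution are arbitrary choices), order statistics and ranks can be computed by sorting and double-argsorting:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=5)                  # continuous draws, so ties occur with probability zero

order_stats = np.sort(x)                # X_(1) <= ... <= X_(n)
ranks = np.argsort(np.argsort(x)) + 1   # R_i such that x[i] == order_stats[ranks[i] - 1]

print(order_stats)
print(ranks)
assert np.allclose(x, order_stats[ranks - 1])
```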
An interesting result about rank statistics is that they are distribution-free under the following two assumptions:
- $X_1,\ldots,X_n$ are a random sample (i.e., IID)
- $X_i$ follows a continuous distribution (this prevents the occurrence of ties)
Then, each rank $R_i$ follows a discrete uniform distribution over $\{1,\ldots,n\}$. That is, $\mathrm{P}(R_i = r) = 1/n$ for every $r \in \{1,\ldots,n\}$. Proving this is simple. Rewrite $R_n:=R(X_n)$ as follows (the $j=n$ term is identically 1): \begin{align} R(X_n) &= \sum_{j=1}^n I(X_j \leq X_n) = 1+\sum_{j=1}^{n-1}I(X_j \leq X_n) \\ &= 1 + \sum_{j=1}^{n-1}I(F(X_j)\leq F(X_n)) \\ &= 1 + \sum_{j=1}^{n-1}I(U_j \leq U_n). \end{align} Here $I(\cdot)$ is the indicator function, and $I(X_j \leq X_n) = I(F(X_j) \leq F(X_n))$ almost surely, where $F$ is the common distribution function of the $X_i$'s (identical by the IID assumption). By the probability integral transform, $F(X_j) \overset{d}{\equiv} U_j \sim \mathrm{Uniform}(0,1)$. Thus, $$ R_n = 1 + \sum_{j=1}^{n-1}I(U_j \leq U_n). $$ Now, to show that $R_n$ follows a discrete uniform distribution on $\{1,\ldots,n\}$, we must show $\mathrm{P}(R_n=r)=1/n$, i.e., derive the probability mass function of $R_n$. Note that $I(U_i \leq U_n)$ and $I(U_j \leq U_n)$ for $i\neq j$ are not independent because both involve $U_n$. Therefore, even though each $I(U_j \leq U_n)$ follows a Bernoulli distribution with probability $1/2$, we can't use that directly.
Instead, we condition on $U_n$ and use the law of total expectation as follows: \begin{align} \mathrm{P}(R_n=r) &= \mathrm{E}\left\{\mathrm{P}\left(\sum_{j=1}^{n-1}I(U_j \leq U_n) =r-1\mid U_n\right)\right\}. \end{align} Given $U_n$, $I(U_j \leq U_n) \sim \mathrm{Bernoulli}(U_n)$ independently across $j$, so $\sum_{j=1}^{n-1}I(U_j \leq U_n) \sim \mathrm{Binomial}(n-1, U_n)$. Therefore, \begin{align} \mathrm{P}(R_n = r) &= \mathrm{E}\left(\binom{n-1}{r-1}U_n^{r-1}(1-U_n)^{n-r} \right)\\ &= \binom{n-1}{r-1}\int_0^1 u^{r-1}(1-u)^{(n-r+1)-1} \,\mathrm{d}u\\ &= \binom{n-1}{r-1} \mathrm{B}(r,n-r+1) \\ &= \frac{(n-1)!}{(r-1)!(n-r)!}\frac{(r-1)!(n-r)!}{n!}\\ &= \frac{1}{n}. \end{align}
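A quick Monte Carlo check of this result (a minimal sketch in Python; the sample size, distribution, and number of replications are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 5, 200_000

# Rank of the last observation X_n within an IID continuous sample of size n.
x = rng.exponential(size=(reps, n))              # any continuous distribution works
r_n = 1 + (x[:, :-1] <= x[:, [-1]]).sum(axis=1)  # R_n = 1 + #{j < n : X_j <= X_n}

# Empirical pmf of R_n: each entry should be close to 1/n = 0.2.
print(np.bincount(r_n, minlength=n + 1)[1:] / reps)
```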
What’s more, this property still holds if we relax the IID assumption to exchangeability.
Exchangeability
$X_1,\ldots,X_n$ are exchangeable if $[X_1,\ldots,X_n] \overset{d}{\equiv} [X_{\pi(1)}, \ldots, X_{\pi(n)}]$ for every permutation $\pi:\{1,\ldots,n\}\to \{1,\ldots,n\}$. Exchangeability essentially means the indices of a set of random variables don't carry information: their joint distribution is unaffected by swapping variables. Every IID sample is exchangeable, but not conversely; for example, draws made without replacement from a finite urn are exchangeable yet not independent.
Unfortunately, we can't reuse the nifty trick from the IID case: exchangeable variables still share a common marginal distribution $F$, but they are no longer independent, so conditioning on $X_n$ does not turn the indicators into independent Bernoulli draws, and the Binomial argument breaks down. Instead, we must make our case combinatorially.
Exchangeability tells us that the event $R_1=r$ has the same probability as $R_{\pi(1)}=r$, and $\pi(1)$ can be any index in $\{1,\ldots,n \}$. Thus, $$ \mathrm{P}(R_1=r)=\mathrm{P}(R_2=r)=\cdots=\mathrm{P}(R_n=r). $$ Since there are no ties (almost surely), exactly one observation has rank $r$, so the events $\{R_1=r\},\ldots,\{R_n=r\}$ are mutually exclusive and exhaustive, and their probabilities must add up to 1. Combined with the equality above, each of them must be $1/n$. Thus, $\mathrm{P}(R_i=r)=1/n$.
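To see this in action under exchangeability without independence, here is a small simulation (a minimal sketch in Python; the equicorrelated Gaussian construction and all parameter values are just one convenient choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, rho = 4, 200_000, 0.7

# Exchangeable but dependent: all X_i share a common latent factor z0.
z0 = rng.normal(size=(reps, 1))
z = rng.normal(size=(reps, n))
x = np.sqrt(rho) * z0 + np.sqrt(1 - rho) * z   # equicorrelated Gaussians

# Rank of the first observation: R_1 = #{j : X_j <= X_1} (the j = 1 term counts itself).
r_1 = (x <= x[:, [0]]).sum(axis=1)
print(np.bincount(r_1, minlength=n + 1)[1:] / reps)   # each entry should be near 1/n
```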
Conformal prediction set
Let's apply the results above to the prediction problem. Say $Y_1,\ldots,Y_n,Y_{n+1}$ are a random sample from $P$ (or, more generally, exchangeable), where $Y_{n+1}$ is the value we wish to predict, and let $\sigma_i = \sigma(\{Y_1,\ldots,Y_{n+1}\}, Y_i)$ be the conformity score measuring how conforming $Y_i$ is to $\{Y_1,\ldots,Y_{n+1}\}$. Then, the normalized rank $r_n(Y_{n+1})$ is defined as follows: $$ r_n(Y_{n+1}) = \frac{1}{n+1}\sum_{i=1}^{n+1}I(\sigma_i \leq \sigma_{n+1}), $$ which is distributed uniformly over $\{1/(n+1),2/(n+1),\ldots,1\}$ by the results proved in the previous sections, since the scores inherit exchangeability from the $Y_i$'s ($\sigma$ treats the augmented data set symmetrically).

We want a probability statement of the form $\mathrm{P}(r_n(Y_{n+1})\leq \alpha) \leq \alpha$, since then $r_n$ behaves like a $p$-value and we can invert the statement in terms of $Y_{n+1}$. The discreteness of $r_n$ requires some care: because $r_n(Y_{n+1})$ is uniform on the grid $\{k/(n+1)\}_{k=1}^{n+1}$, we have $\mathrm{P}(r_n(Y_{n+1}) \leq k/(n+1)) = k/(n+1)$, and taking $k = \lfloor (n+1)\alpha\rfloor$ gives $$ \mathrm{P}\left(r_n(Y_{n+1}) \leq \frac{\left\lfloor (n+1)\alpha\right\rfloor}{n+1} \right) = \frac{\left\lfloor (n+1)\alpha\right\rfloor}{n+1} \leq \alpha, $$ where $\lfloor \cdot \rfloor$ is the floor function. Thus, by inverting the probability statement, we get the conformal prediction set $$ \Gamma^{(\alpha)} = \left\{y: r_n(y) > \frac{\left\lfloor (n+1)\alpha\right\rfloor}{n+1} \right\}, $$ which satisfies $\mathrm{P}(Y_{n+1} \in \Gamma^{(\alpha)}) \geq 1-\alpha$.

There is a degree of freedom in defining the conformity score: it can be anything, as long as it represents how well $Y_i$ blends into the data set, or how similar it is to the rest of the data.
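To make this concrete, here is a minimal full-conformal sketch in Python. The conformity score (negative absolute deviation from the mean of the augmented sample), the candidate grid, and all parameter values are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def conformity(augmented, y_i):
    # Conformity score: larger means y_i blends into the augmented sample better.
    # Negative absolute deviation from the augmented-sample mean is one simple choice.
    return -abs(y_i - augmented.mean())

def conformal_set(y_obs, alpha, grid):
    """Full conformal prediction set, computed by brute force over a grid of candidate y."""
    n = len(y_obs)
    threshold = np.floor((n + 1) * alpha) / (n + 1)
    keep = []
    for y in grid:
        augmented = np.append(y_obs, y)
        scores = np.array([conformity(augmented, yi) for yi in augmented])
        r = (scores <= scores[-1]).mean()   # normalized rank r_n(y) of the candidate's score
        if r > threshold:
            keep.append(y)
    return np.array(keep)

rng = np.random.default_rng(2)
y_obs = rng.normal(loc=3.0, scale=1.0, size=50)
grid = np.linspace(-2, 8, 1001)
gamma = conformal_set(y_obs, alpha=0.1, grid=grid)
print(gamma.min(), gamma.max())   # with this score the kept set is an interval, roughly 90% coverage
```

The brute-force loop over the grid is the price of full conformal prediction here: every candidate $y$ enters the augmented data set and changes all the scores, so they must be recomputed for each candidate.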