Given an input feature vector $x$, a traditional regression algorithm outputs a point prediction $\hat{y}$ that represents the algorithm's best estimate of the true label $y$. However, traditional regression algorithms typically lack a measure of confidence in their predictions: they do not indicate how far $\hat{y}$ is likely to deviate from the true label $y$.
Conformal prediction (CP) addresses this by supplementing a traditional regression algorithm with a measure of confidence. CP takes the point predictions produced by a traditional regression algorithm and outputs a prediction region that is valid at a user-specified confidence level $(1 - \epsilon) \times 100\%$.
Training Examples: $T^* = \{(X_i, Y_i)\}_{i=1}^{l}$, where $X_i \in \mathbb{R}^d$ and $Y_i \in \mathbb{R}$.
Point Prediction: $\hat{Y}_{l+1} := f_{T^*}(X_{l+1})$, where $f_{T^*}: \mathbb{R}^d \rightarrow \mathbb{R}$ is the regression function fitted on $T^*$.
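To make the setup concrete, here is a minimal sketch in Python. The synthetic data, the least-squares model, and the variable names (`X_train`, `w_hat`, and so on) are illustrative assumptions, not part of the framework; any regression algorithm could play the role of $f_{T^*}$.

```python
import numpy as np

# Synthetic training examples (X_i, Y_i), i = 1, ..., l, with X_i in R^d and Y_i in R.
rng = np.random.default_rng(0)
l, d = 200, 3
X_train = rng.normal(size=(l, d))
Y_train = X_train @ rng.normal(size=d) + rng.normal(scale=0.5, size=l)

# Fit a point predictor f_{T*} on the training set; here, ordinary least squares.
w_hat, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

# Point prediction for a new input X_{l+1}.
X_new = rng.normal(size=d)
y_hat = X_new @ w_hat
```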
In conformal prediction, we make set predictions instead of point predictions. Given a significance level $\epsilon \in (0,1)$, a conformal predictor outputs a prediction set $\Gamma_\epsilon(T^*, X_{l+1}) \subseteq \mathbb{R}$ such that
$$ P \left( Y_{l+1} \in \Gamma_\epsilon(T^*, X_{l+1}) \right) \geq 1 - \epsilon $$
This coverage guarantee holds under a remarkably weak assumption: the conformal prediction framework only requires the data sequence to be exchangeable, of which i.i.d. sampling is a special case.
Consider the residuals on the training set, $R_i = |Y_i - f_{T^*}(X_i)|$. Individually, each residual gives the magnitude of the prediction error on a training example when $f_{T^*}$ is used as the prediction function. Together, they form an empirical distribution of errors, from which we can estimate quantities such as empirical quantiles. Hence, it seems reasonable to construct a prediction set as
$$ \Gamma_\epsilon (T^*, X_{l+1}) = \left[ f_{T^*} (X_{l+1}) - q_\epsilon,\; f_{T^*} (X_{l+1}) + q_\epsilon \right] $$
where $q_\epsilon$ is the empirical $(1-\epsilon)$-quantile of the residuals $R_1, \ldots, R_l$.
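Continuing the sketch above (and reusing `X_train`, `Y_train`, `w_hat`, `X_new`, and `y_hat` from it), the residual-quantile construction might look as follows. The choice `eps = 0.1` is arbitrary, and note that these are in-sample residuals, i.e., the construction exactly as written above, not yet a full conformal predictor.

```python
eps = 0.1

# Absolute residuals R_i = |Y_i - f_{T*}(X_i)| on the training set.
residuals = np.abs(Y_train - X_train @ w_hat)

# Empirical (1 - eps)-quantile of the residuals.
q_eps = np.quantile(residuals, 1 - eps)

# Residual-based prediction interval Gamma_eps(T*, X_{l+1}).
interval = (y_hat - q_eps, y_hat + q_eps)
print(f"{(1 - eps) * 100:.0f}% interval: [{interval[0]:.3f}, {interval[1]:.3f}]")
```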