---
title: "Prediction and Prediction Variability"
output: pdf_document
---

We have seen in lectures that when considering the variability of predictions for new $x$ values, we may wish to account for future residual errors in the calculation. Specifically, our model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i
\]
is regarded as relating to \textbf{all} outcome data we will observe, whether they be part of the original sample based on predictor values $x_{i1}, i=1,\ldots,n$, or part of the `new' sample with predictor values $x_{i1}^\text{new}, i=1,\ldots,m$.

We noted the relationship between a predicted value at $x_{i1}^\text{new}$ \textit{\textbf{without}} residual error, $\widehat Y_{i}^\text{new}$, and the prediction \textit{\textbf{with}} residual error, $\widehat Y_{\text{O}i}^\text{new}$, as
\[
\widehat Y_{i}^\text{new} = \widehat \beta_0 + \widehat \beta_1 x_{i1}^\text{new}
\]
with estimators $(\widehat \beta_0, \widehat \beta_1)$, whereas
\begin{align*}
\widehat Y_{\text{O}i}^\text{new} & = \widehat Y_{i}^\text{new} + \epsilon_i^\text{new} \\[6pt]
& = \widehat \beta_0 + \widehat \beta_1 x_{i1}^\text{new} + \epsilon_i^\text{new}.
\end{align*}
In both cases, the point prediction is simply $\widehat y_i^\text{new} = \textbf{x}_i^\text{new} \widehat \beta$, but the associated uncertainty intervals are different: for a $(1-\alpha) 100 \%$ interval, we have
\[
\begin{array}{rcl}
\textrm{Confidence interval} & : & \widehat y_i^\text{new} \pm t_{\alpha/2,n-2} \sqrt{\left(\widehat \sigma^2 \textbf{H}^\text{new}\right)_{ii}} \\[6pt]
\textrm{Prediction interval} & : & \widehat y_i^\text{new} \pm t_{\alpha/2,n-2} \sqrt{\widehat \sigma^2 \left(\textbf{I}_m + \textbf{H}^\text{new}\right)_{ii}}
\end{array}
\]
where
\[
\textbf{H}^\text{new} = \textbf{X}^\text{new} (\textbf{X}^\top \textbf{X})^{-1} \left\{ \textbf{X}^\text{new} \right\}^\top.
\]

To illustrate the difference between a confidence interval and a prediction interval, consider the following experiment. We observe $n=100$ data points in an initial data batch, and then $m=20$ data points subsequently, with all observations independent. Thus we have a total of 120 observations. The data are simulated using the model
\[
Y_i = 2 + 3 x_{i1} + \epsilon_i
\]
where $\epsilon_i \sim \mathcal{N}(0,20^2)$.

```{r}
n<-100                              #Batch 1 size
m<-20                               #Batch 2 size
set.seed(237)
x1<-runif(n,0,100)                  #Batch 1 predictors and responses
y<-2.0+3.0*x1+rnorm(n)*20
x1new<-runif(m,0,100)               #Batch 2 predictors and responses
ynew<-2.0+3.0*x1new+rnorm(m)*20
par(mar=c(4,4,0,0))
plot(x1,y,pch=19,cex=0.75,xlim=range(c(x1,x1new)),ylim=range(c(y,ynew)))
points(x1new,ynew,pch=1,cex=0.75)
legend(0,max(c(y,ynew)),c('Batch 1','Batch 2'),pch=c(19,1))
```

However, suppose that we fit the model only on the first batch: we may compute the line of best fit, and then predict using both a confidence interval and a prediction interval calculation based on the first 100 data points.
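Before turning to R's built-in \texttt{predict} function, it may help to see the interval formulas above written out directly in code. The chunk below is a minimal illustrative sketch, not part of the original analysis: it builds the Batch 1 design matrix by hand, evaluates the relevant diagonal entry of $\textbf{H}^\text{new}$ at a single, arbitrarily chosen new value $x_{1}^\text{new}=50$, and assembles the two 95\% intervals. The object names (\texttt{X}, \texttt{beta.hat}, \texttt{sigma2.hat}, \texttt{h.new}) are introduced only for this sketch.

```{r}
#Illustrative sketch: intervals computed directly from the formulas above
X<-cbind(1,x1)                                    #Design matrix for Batch 1
XtXinv<-solve(t(X)%*%X)
beta.hat<-XtXinv%*%t(X)%*%y                       #Least squares estimates
sigma2.hat<-sum((y-X%*%beta.hat)^2)/(n-2)         #Residual variance estimate
x.new<-c(1,50)                                    #One new predictor value, x1 = 50
h.new<-as.numeric(t(x.new)%*%XtXinv%*%x.new)      #Diagonal entry of H^new
y.hat<-as.numeric(t(x.new)%*%beta.hat)            #Point prediction
tval<-qt(0.975,n-2)                               #Critical value for alpha = 0.05
y.hat+c(-1,1)*tval*sqrt(sigma2.hat*h.new)         #Confidence interval
y.hat+c(-1,1)*tval*sqrt(sigma2.hat*(1+h.new))     #Prediction interval
```

These two intervals can be compared with the rows of the \texttt{predict} output below at $x_{1}^\text{new}=50$, which carries out the same calculation across a whole grid of new values at once.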
Here we use $\alpha=0.05$, and construct 95\% intervals:

```{r}
fit1<-lm(y~x1)                      #Fit using Batch 1 only
x1new.vec<-seq(0,100,by=0.1)        #Grid of new predictor values
conf.interval<-predict(fit1,newdata=data.frame(x1=x1new.vec),interval='confidence')
pred.interval<-predict(fit1,newdata=data.frame(x1=x1new.vec),interval='prediction')
par(mar=c(4,4,0,0))
plot(x1,y,pch=19,cex=0.75,xlim=range(c(x1,x1new)),ylim=range(c(y,ynew)))
lines(x1new.vec,conf.interval[,1],col='red')
lines(x1new.vec,conf.interval[,2],col='red',lty=2)
lines(x1new.vec,conf.interval[,3],col='red',lty=2)
lines(x1new.vec,pred.interval[,2],col='blue',lty=2)
lines(x1new.vec,pred.interval[,3],col='blue',lty=2)
legend(0,max(c(y,ynew)),
       c('Point prediction','Confidence interval','Prediction interval'),
       lty=c(1,2,2),col=c('red','red','blue'))
conf.interval[1:5,]                 #Confidence interval
pred.interval[1:5,]                 #Prediction interval
```

The red dashed lines reflect the uncertainty in where the `true' straight line lies, whereas the blue dashed lines indicate the uncertainty in where future observed responses would lie if a collection of new observations were made. Here, however, we can compare the intervals with the second batch of data, which was observed but not used in the fit.

```{r}
par(mar=c(4,4,0,0))
plot(x1new,ynew,pch=1,cex=0.75,xlim=range(c(x1,x1new)),ylim=range(c(y,ynew)))
lines(x1new.vec,conf.interval[,1],col='red')
lines(x1new.vec,conf.interval[,2],col='red',lty=2)
lines(x1new.vec,conf.interval[,3],col='red',lty=2)
lines(x1new.vec,pred.interval[,2],col='blue',lty=2)
lines(x1new.vec,pred.interval[,3],col='blue',lty=2)
legend(0,max(c(y,ynew)),
       c('Point prediction','Confidence interval','Prediction interval'),
       lty=c(1,2,2),col=c('red','red','blue'))
```

Our prediction interval is constructed such that, if the model is correct, 95\% of all `new' observations will lie within the reported interval. In this simulation, with the random number generator seed set using the command \texttt{set.seed(237)}, 19 out of the 20 new points lie within the interval.
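As an informal check on this coverage statement, the prediction interval can be evaluated at the held-out predictor values themselves, and we can count how many of the new responses fall inside. The short chunk below is an added illustration using the objects already created above; the names \texttt{pred.new} and \texttt{inside} are introduced only for this check.

```{r}
#Illustrative check: coverage of the 95% prediction interval on Batch 2
pred.new<-predict(fit1,newdata=data.frame(x1=x1new),interval='prediction')
inside<-(ynew>=pred.new[,2])&(ynew<=pred.new[,3]) #TRUE if the response lies in [lwr,upr]
sum(inside)                                       #Number of the m=20 new points covered
```

A count near 19 out of 20 is in line with the nominal 95\% level.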