---
title: "Prediction and Prediction Variability"
output: pdf_document
---

We have seen in lectures that when considering the variability of predictions for new $x$ values, we may wish to account for future residual errors in the calculation. Specifically, our model
\[
Y_i = \beta_0 + \beta_1 x_{i1} + \epsilon_i
\]
is regarded as relating to \textbf{all} outcome data we will observe, whether they be part of the original sample based on predictor values $x_{i1}, i=1,\ldots,n$, or part of the `new' sample with predictor values $x_{i1}^\text{new}, i=1,\ldots,m$.

We noted the relationship between a predicted value at $x_{i1}^\text{new}$ \textit{\textbf{without}} residual error, $\widehat Y_{i}^\text{new}$, and the prediction \textit{\textbf{with}} residual error, $\widehat Y_{\text{O}i}^\text{new}$, as
\[
\widehat Y_{i}^\text{new} = \widehat \beta_0 + \widehat \beta_1 x_{i1}^\text{new}
\]
with estimators $(\widehat \beta_0, \widehat \beta_1)$, whereas
\begin{align*}
\widehat Y_{\text{O}i}^\text{new} & = \widehat Y_{i}^\text{new} + \epsilon_i^\text{new} \\[6pt]
& = \widehat \beta_0 + \widehat \beta_1 x_{i1}^\text{new} + \epsilon_i^\text{new}.
\end{align*}
In both cases, the point prediction is simply $\widehat y_i^\text{new} = \textbf{x}_i^\text{new} \widehat \beta$, but the associated uncertainty intervals are different: for a $(1-\alpha) 100 \%$ interval, we have
\[
\begin{array}{rcl}
\textrm{Confidence interval} & : & \widehat y_i^\text{new} \pm t_{\alpha/2,n-2} \sqrt{\left(\widehat \sigma^2 \textbf{H}^\text{new}\right)_{ii}} \\[6pt]
\textrm{Prediction interval} & : & \widehat y_i^\text{new} \pm t_{\alpha/2,n-2} \sqrt{\widehat \sigma^2 \left(\textbf{I}_m + \textbf{H}^\text{new}\right)_{ii}}
\end{array}
\]
where
\[
\textbf{H}^\text{new} = \textbf{X}^\text{new} (\textbf{X}^\top \textbf{X})^{-1} \left\{ \textbf{X}^\text{new} \right\}^\top.
\]

To illustrate the difference between a confidence interval and a prediction interval, consider the following experiment. We observe $n=100$ data points in an initial data batch, and then $m=20$ data points subsequently, with all observations independent. Thus we have a total of 120 observations. The data are simulated using the model
\[
Y_i = 2 + 3 x_{i1} + \epsilon_i
\]
where $\epsilon_i \sim \mathcal{N}(0,20^2)$.

```{r}
n<-100                              #Batch 1 size
m<-20                               #Batch 2 size
set.seed(237)
x1<-runif(n,0,100)                  #Batch 1 predictors and responses
y<-2.0+3.0*x1+rnorm(n)*20
x1new<-runif(m,0,100)               #Batch 2 predictors and responses
ynew<-2.0+3.0*x1new+rnorm(m)*20
par(mar=c(4,4,0,0))
plot(x1,y,pch=19,cex=0.75,xlim=range(c(x1,x1new)),ylim=range(c(y,ynew)))
points(x1new,ynew,pch=1,cex=0.75)
legend(0,max(c(y,ynew)),c('Batch 1','Batch 2'),pch=c(19,1))
```

However, suppose that we fit the model only on the first batch: we may compute the line of best fit, and then predict using both a confidence interval and a prediction interval calculation based on the first 100 data points.
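Before turning to R's built-in \texttt{predict} function, it may help to see the interval formulas above written out directly in code. The chunk below is a minimal illustrative sketch, not part of the original analysis: it builds the Batch 1 design matrix by hand, evaluates the relevant diagonal entry of $\textbf{H}^\text{new}$ at a single, arbitrarily chosen new value $x_{1}^\text{new}=50$, and assembles the two 95\% intervals. The object names (\texttt{X}, \texttt{beta.hat}, \texttt{sigma2.hat}, \texttt{h.new}) are introduced only for this sketch.

```{r}
#Illustrative sketch: intervals computed directly from the formulas above
X<-cbind(1,x1)                                    #Design matrix for Batch 1
XtXinv<-solve(t(X)%*%X)
beta.hat<-XtXinv%*%t(X)%*%y                       #Least squares estimates
sigma2.hat<-sum((y-X%*%beta.hat)^2)/(n-2)         #Residual variance estimate
x.new<-c(1,50)                                    #One new predictor value, x1 = 50
h.new<-as.numeric(t(x.new)%*%XtXinv%*%x.new)      #Diagonal entry of H^new
y.hat<-as.numeric(t(x.new)%*%beta.hat)            #Point prediction
tval<-qt(0.975,n-2)                               #Critical value for alpha = 0.05
y.hat+c(-1,1)*tval*sqrt(sigma2.hat*h.new)         #Confidence interval
y.hat+c(-1,1)*tval*sqrt(sigma2.hat*(1+h.new))     #Prediction interval
```

These two intervals can be compared with the rows of the \texttt{predict} output below at $x_{1}^\text{new}=50$, which carries out the same calculation across a whole grid of new values at once.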
Here we use $\alpha=0.05$, and construct 95\% intervals:

```{r}
fit1<-lm(y~x1)                      #Fit using Batch 1 only
x1new.vec<-seq(0,100,by=0.1)        #Grid of new predictor values
conf.interval<-predict(fit1,newdata=data.frame(x1=x1new.vec),interval='confidence')
pred.interval<-predict(fit1,newdata=data.frame(x1=x1new.vec),interval='prediction')
par(mar=c(4,4,0,0))
plot(x1,y,pch=19,cex=0.75,xlim=range(c(x1,x1new)),ylim=range(c(y,ynew)))
lines(x1new.vec,conf.interval[,1],col='red')
lines(x1new.vec,conf.interval[,2],col='red',lty=2)
lines(x1new.vec,conf.interval[,3],col='red',lty=2)
lines(x1new.vec,pred.interval[,2],col='blue',lty=2)
lines(x1new.vec,pred.interval[,3],col='blue',lty=2)
legend(0,max(c(y,ynew)),
       c('Point prediction','Confidence interval','Prediction interval'),
       lty=c(1,2,2),col=c('red','red','blue'))
conf.interval[1:5,]                 #Confidence interval
pred.interval[1:5,]                 #Prediction interval
```

The red dashed lines reflect the uncertainty in where the `true' straight line lies, whereas the blue dashed lines indicate the uncertainty in where future observed responses would lie if a collection of new observations were made. Here, however, we can compare the intervals with the second batch of data, which was observed but not used in the fit.

```{r}
par(mar=c(4,4,0,0))
plot(x1new,ynew,pch=1,cex=0.75,xlim=range(c(x1,x1new)),ylim=range(c(y,ynew)))
lines(x1new.vec,conf.interval[,1],col='red')
lines(x1new.vec,conf.interval[,2],col='red',lty=2)
lines(x1new.vec,conf.interval[,3],col='red',lty=2)
lines(x1new.vec,pred.interval[,2],col='blue',lty=2)
lines(x1new.vec,pred.interval[,3],col='blue',lty=2)
legend(0,max(c(y,ynew)),
       c('Point prediction','Confidence interval','Prediction interval'),
       lty=c(1,2,2),col=c('red','red','blue'))
```

Our prediction interval is constructed such that, if the model is correct, 95\% of all `new' observations will lie within the reported interval. In this simulation, with the random number generator seed set using the command \texttt{set.seed(237)}, 19 out of the 20 new points lie within the interval.
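As an informal check on this coverage statement, the prediction interval can be evaluated at the held-out predictor values themselves, and we can count how many of the new responses fall inside. The short chunk below is an added illustration using the objects already created above; the names \texttt{pred.new} and \texttt{inside} are introduced only for this check.

```{r}
#Illustrative check: coverage of the 95% prediction interval on Batch 2
pred.new<-predict(fit1,newdata=data.frame(x1=x1new),interval='prediction')
inside<-(ynew>=pred.new[,2])&(ynew<=pred.new[,3]) #TRUE if the response lies in [lwr,upr]
sum(inside)                                       #Number of the m=20 new points covered
```

A count near 19 out of 20 is in line with the nominal 95\% level.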