Abstract

The Tweedie GLM is a widely used method for predicting insurance premiums. However, the Tweedie GLM restricts the logarithmic mean to a linear form, which can be too rigid for many applications. As a more flexible alternative, we propose a gradient tree-boosting algorithm and apply it to Tweedie compound Poisson models for pure premiums.

As an application, we apply our method to an auto-insurance claim data set and show that the new method is superior to existing methods in that it generates more accurate premium predictions, which helps alleviate the adverse selection issue. We have implemented our method in a user-friendly R package that also includes visualization tools for interpreting the fitted model.

Data

One of the most important problems in the insurance business is setting the premiums for the insurer's customers (policyholders). To set these premiums appropriately, one crucial task is to predict the size of actual (currently unforeseeable) claims.

An example of the data for the insurance premium prediction problem is shown below (only the first ten observations are displayed). The data set contains 10,296 driver vehicle records; each record includes an individual driver's total claim amount \((z_{i})\) over the last five years \((w_{i}=5)\) and 17 characteristics \(x_{i}=(x_{i,1},\ldots,x_{i,17})\) of the driver and the insured vehicle. We want to predict the expected pure premium based on \(x_{i}\).

library(pander)                      # formatted table output below
data(AutoClaim, package = "cplm")    # AutoClaim data (shipped, e.g., with the cplm package)
out <- AutoClaim[1:10, ]
pander(out)
POLICYNO  PLCYDATE    CLM_FREQ5  CLM_AMT5  CLM_AMT  KIDSDRIV  TRAVTIME
--------  ----------  ---------  --------  -------  --------  --------
00000160  1997-11-08  2          4461      0        0         14
00024836  1998-07-29  0          0         0        0         22
00028046  1995-10-28  0          0         0        0         26
00028960  1994-06-11  2          38690     0        0         5
00040933  1993-09-20  0          0         0        0         32
00055277  1997-09-08  2          19217     0        0         36
00063212  1995-01-18  0          0         2946     0         46
00069651  1993-07-11  0          0         0        0         33
00088070  1997-08-04  1          3295      6477     1         21
00093553  1994-05-06  0          0         0        0         30

CAR_USE     BLUEBOOK  RETAINED  NPOLICY  CAR_TYPE    RED_CAR  REVOLKED
----------  --------  --------  -------  ----------  -------  --------
Private     14230     11        1        Sedan       yes      No
Commercial  14940     1         1        Sedan       yes      No
Private     21970     1         1        Van         yes      No
Private     4010      4         1        SUV         no       No
Private     15440     7         1        Sedan       yes      No
Private     18000     1         1        SUV         no       Yes
Commercial  17430     1         1        Sports Car  no       No
Private     8780      1         1        SUV         no       No
Private     18930     6         1        Sedan       no       No
Commercial  5900      10        1        SUV         no       No

MVR_PTS  CLM_FLAG  AGE  HOMEKIDS  YOJ  INCOME  GENDER  MARRIED
-------  --------  ---  --------  ---  ------  ------  -------
3        No        60   0         11   67349   M       No
0        No        43   0         11   91449   M       No
2        No        48   0         11   52881   M       No
3        No        35   1         10   16039   F       Yes
0        No        51   0         14   NA      M       Yes
3        No        50   0         NA   114986  F       Yes
0        Yes       34   1         12   125301  F       No
0        No        54   0         NA   18755   F       Yes
2        Yes       40   1         11   50815   M       No
0        No        44   2         12   43486   F       No

PARENT1  JOBCLASS      MAX_EDUC      HOME_VAL  SAMEHOME  AREA   IN_YY
-------  ------------  ------------  --------  --------  -----  -----
No       Professional  PhD           0         18        Urban  FALSE
No       Blue Collar   High School   257252    1         Urban  TRUE
No       Manager       Bachelors     0         10        Urban  TRUE
No       Clerical      High School   124191    10        Urban  FALSE
No       Blue Collar   <High School  306251    6         Urban  FALSE
No       Doctor        PhD           243925    17        Urban  TRUE
Yes      Blue Collar   Bachelors     0         7         Urban  TRUE
No       Blue Collar   <High School  NA        1         Urban  TRUE
Yes      Manager       High School   0         1         Urban  FALSE
Yes      Blue Collar   High School   0         10        Rural  FALSE
#knitr::kable(AutoClaim[1:10,], caption = "Data from Yip and Yau 2005", format = 'markdown')

One difficulty in modeling the claims (CLM_AMT5) is that their distribution is usually highly right-skewed and mixed with a point mass at zero. Such data cannot be transformed to normality by a power transformation, and zero claims often require special treatment. The histogram of the claim amounts below shows that the empirical distribution is highly skewed: approximately \(61.1\%\) of policyholders had no claims, and approximately \(29.6\%\) of the policyholders had a positive claim amount of up to 10,000 dollars.

da <- AutoClaim   # data from the Yip and Yau (2005) paper
# CLM_AMT5 is the 5-year total in dollars; /5000 gives per-policy-year amounts in $1000
da <- transform(da, CLM_AMT5 = CLM_AMT5 / 5000, INCOME = INCOME / 10000)
hist(da$CLM_AMT5, col = "grey", breaks = 50, ylim = c(0, 7500),
     main = "Total Insurance Claim Amount (in $1000) Per Policy Year",
     xlab = "Claim Amount (in $1000)", cex.lab = 1.3, cex.axis = 1.3)

Model

In this article, we model the insurance claim size with a nonparametric Tweedie compound Poisson model and propose a gradient tree-boosting algorithm (TDboost henceforth) to fit it. We have also implemented the proposed method in an easy-to-use R package, which is publicly available.

We estimate the following model \[F^{*}(\mathbf{x})=\underset{F\in\mathcal{F}}{\arg\!\min}\,\big\{-\ell(F(\cdot),\phi,\rho|\{y_{i},\mathbf{x}_{i},w_{i}\}_{i=1}^{n})\big\}=\underset{F\in\mathcal{F}}{\arg\!\min}\sum_{i=1}^{n}\Psi(y_{i},F(\mathbf{x}_{i})|\rho), \] where \[\Psi(y_{i},F(\mathbf{x}_{i})|\rho)=w_{i}\Bigg\{-\frac{y_{i}\exp[(1-\rho)F(\mathbf{x}_{i})]}{1-\rho}+\frac{\exp[(2-\rho)F(\mathbf{x}_{i})]}{2-\rho}\Bigg\}.\] Here \(F(\mathbf{x})=\log\mu(\mathbf{x})\) is the logarithm of the expected pure premium, \(\rho\in(1,2)\) is the index parameter of the Tweedie distribution, \(\phi\) is the dispersion parameter, and \(w_{i}\) is the policy-year weight of the \(i\)th record.
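
For concreteness, the loss \(\Psi\) can be written directly in R; this is a minimal sketch (the function and argument names are ours), with F the predictor on the log-mean scale, rho the index parameter, and w the weight.

# Psi(y, F | rho): per-observation loss, transcribed from the formula above
Psi <- function(y, F, rho, w = 1) {
    w * (-y * exp((1 - rho) * F) / (1 - rho) + exp((2 - rho) * F) / (2 - rho))
}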

Algorithm

We estimate the predictor function \(F(\cdot)\) with a tree-based gradient boosting algorithm, applying the forward stagewise procedure described below to the Tweedie loss (a minimal R sketch follows the algorithm).


  1. Initialize \(\hat{F}^{[0]}\) \[\hat{F}^{[0]}=\log\Bigg(\frac{\sum_{i=1}^{n}w_{i}y_{i}}{\sum_{i=1}^{n}w_{i}}\Bigg).\]
  2. For \(m=1,\ldots,M\) repeatedly do steps 2.(a)–2.(d)
    • 2.(a) Compute the negative gradient \((u_{1}^{[m]},\ldots,u_{n}^{[m]})^{T}\) \[u_{i}^{[m]}=w_{i}\big\{-y_{i}\exp[(1-\rho)\hat{F}^{[m-1]}(\mathbf{x}_{i})]+\exp[(2-\rho)\hat{F}^{[m-1]}(\mathbf{x}_{i})]\big\}\qquad i=1,\ldots,n.\]
    • 2.(b) Fit the negative gradient vector \((u_{1}^{[m]},\ldots,u_{n}^{[m]})^{T}\) to \((\mathbf{x}_{1},\ldots,\mathbf{x}_{n})^{T}\) by an \(L\)-terminal node regression tree, where \(\mathbf{x}_{i}=(x_{i1},\ldots, x_{ip})^{T}\) for \(i=1,\ldots,n\), giving us the partitions \(\{\widehat{R}_{l}^{[m]}\}_{l=1}^{L}\).
    • 2.(c) Compute the optimal terminal node predictions \(\eta_{l}^{[m]}\) for each region \(\widehat{R}_{l}^{[m]}\), \(l=1,2,\ldots,L\) \[\hat{\eta}_{l}^{[m]}=\log\Bigg\{\frac{\sum_{i:\mathbf{x}_{i}\in\widehat{R}_{l}^{[m]}}w_{i}y_{i}\exp[(1-\rho)\hat{F}^{[m-1]}(\mathbf{x}_{i})]}{\sum_{i:\mathbf{x}_{i}\in\widehat{R}_{l}^{[m]}}w_{i}\exp[(2-\rho)\hat{F}^{[m-1]}(\mathbf{x}_{i})]}\Bigg\}.\]
    • 2.(d) Update \(\hat{F}^{[m]}(\mathbf{x})\) for each region \(\widehat{R}_{l}^{[m]}\), \(l=1,2,\ldots,L\). \[\hat{F}^{[m]}(\mathbf{x})=\hat{F}^{[m-1]}(\mathbf{x})+\nu\hat{\eta}_{l}^{[m]}I(\mathbf{x}\in\widehat{R}_{l}^{[m]})\qquad l=1,2,\ldots,L.\]
  3. Report \(\hat{F}^{[M]}(\mathbf{x})\) as the final estimate.
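
For illustration, the steps above can be written compactly in R, with rpart supplying the regression tree in step 2.(b). This is our own minimal sketch, not the TDboost implementation: it grows each tree to a fixed depth as a stand-in for an \(L\)-terminal-node tree and only returns the in-sample fit \(\hat{F}^{[M]}\) on the training data.

library(rpart)

# Minimal forward stagewise sketch of the algorithm above (illustrative only).
# x: data frame of predictors; y: responses; w: weights; rho, nu, L, M as above.
tweedie_boost <- function(x, y, w = rep(1, length(y)),
                          rho = 1.5, M = 200, nu = 0.05, L = 8) {
    Fhat <- rep(log(sum(w * y) / sum(w)), length(y))            # step 1
    trees <- vector("list", M)
    for (m in seq_len(M)) {
        # step 2.(a): gradient vector from the formula above (its sign does not
        # affect the induced partitions, since node values are refit in 2.(c))
        u <- w * (-y * exp((1 - rho) * Fhat) + exp((2 - rho) * Fhat))
        # step 2.(b): regression tree on u; depth log2(L) mimics L terminal nodes
        tree <- rpart(u ~ ., data = data.frame(u = u, x),
                      control = rpart.control(maxdepth = ceiling(log2(L)),
                                              cp = 0, xval = 0))
        node <- tree$where                                      # leaf of each record
        # step 2.(c): optimal prediction eta_l within each terminal node
        num <- tapply(w * y * exp((1 - rho) * Fhat), node, sum)
        den <- tapply(w * exp((2 - rho) * Fhat), node, sum)
        eta <- ifelse(num > 0, log(num / den), 0)               # skip all-zero nodes
        # step 2.(d): shrunken update of F
        Fhat <- Fhat + nu * eta[as.character(node)]
        trees[[m]] <- list(tree = tree, eta = eta)
    }
    list(Fhat = Fhat, trees = trees, nu = nu)                   # Fhat = F^[M](x_i)
}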

Simulation

In this simulation study, we demonstrate that TDboost is well suited to fit target functions that are nonlinear or involve complex interactions. We consider a true target function with two hills and two valleys: \[F(x_1,x_2)=e^{-5(1-x_1)^2+x_2^2}+e^{-5x_1^2+(1-x_2)^2},\] which corresponds to a common scenario in which the effect of one variable changes depending on the value of another. We assume \(x_1,x_2\sim \mathrm{Unif}(0,1)\) and \(y\sim\mathrm{Tw}(\mu,\phi,\rho)\) with \(\rho=1.5\) and \(\phi=0.5\). We generate \(n=1000\) observations for training and \(n^{\prime}=1000\) for testing, and fit the training data using TDboost, MGCV, and TGLM. The true function and the fitted functions are plotted below. We find that TDboost outperforms TGLM and MGCV in its ability to recover the true function and gives the smallest prediction errors.
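
The simulated data and the three fits behind the surfaces below can be reproduced along the following lines. This is a sketch under stated assumptions: responses are drawn with rtweedie() from the tweedie package; we assume the TDboost() interface mirrors gbm (formula, data, distribution = list(name = "EDM", alpha = 1.5), n.trees, shrinkage, interaction.depth) and that its predict() method returns the fitted log-mean; TGLM uses glm() with statmod's tweedie() family and MGCV uses mgcv::gam() with its Tweedie() family. The tuning values are illustrative, and test_dat is arranged to match the wireframe() calls that follow (pred_f1 = MGCV, pred_f2 = TDboost, pred_f3 = TGLM).

library(tweedie)    # rtweedie(): simulate Tweedie responses
library(statmod)    # tweedie(): Tweedie family for glm() (the TGLM fit)
library(mgcv)       # gam() with a Tweedie family (the MGCV fit)
library(TDboost)

f_true <- function(x1, x2) {   # same two-hill/two-valley F as h2() below
    exp(-5 * (1 - x1)^2 + x2^2) + exp(-5 * x1^2 + (1 - x2)^2)
}

set.seed(1)
n <- 1000
train <- data.frame(X1 = runif(n), X2 = runif(n))
train$Y <- rtweedie(n, mu = exp(f_true(train$X1, train$X2)), phi = 0.5, power = 1.5)

# Assumed gbm-like interface for TDboost (tuning values are illustrative)
fit_td  <- TDboost(Y ~ X1 + X2, data = train,
                   distribution = list(name = "EDM", alpha = 1.5),
                   n.trees = 3000, shrinkage = 0.005, interaction.depth = 6)
fit_glm <- glm(Y ~ X1 + X2, data = train,
               family = tweedie(var.power = 1.5, link.power = 0))     # log link
fit_gam <- gam(Y ~ s(X1) + s(X2), data = train,
               family = Tweedie(p = 1.5, link = power(0)))            # log link

# Fitted log-mean surfaces on a regular grid, arranged for the wireframe() calls
# below; predict() for TDboost is assumed to return the log-mean, as in gbm.
grid <- expand.grid(X1 = seq(0, 1, by = 0.05), X2 = seq(0, 1, by = 0.05))
test_dat <- data.frame(V1 = grid$X1, V2 = grid$X2,
                       pred_f1 = as.numeric(predict(fit_gam, grid, type = "link")),
                       pred_f2 = predict(fit_td, grid, n.trees = 3000),
                       pred_f3 = predict(fit_glm, grid, type = "link"))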

library(lattice)   # wireframe() for the 3D surface plots

# True target function with two hills and two valleys
h2 = function(x) {
    exp( -5 * (1-x[,1])^2 + (x[,2])^2 ) + exp( -5 * (x[,1])^2 + (1-x[,2])^2 )
}

# Evaluate the true function on a regular grid
T1 = seq(0,1,0.03)
T2 = seq(0,1,0.03)
X = data.frame(expand.grid(T1,T2))
colnames(X) <- c("X1","X2")
hx = h2(X)

## True function plot
wireframe(hx ~ X1 + X2, data = X, shade = TRUE,
    scales = list(arrows = FALSE, col = 1),
    par.settings = list(axis.line = list(col = "transparent")),
    xlab = "x1", ylab = "x2", zlab = "F(x1,x2)", cex.lab = 1.2,
    main = "(a) True")

## TDboost plot
wireframe(pred_f2~V1+V2,
data=test_dat, scales=list(arrows=F,col=1), 
par.settings = list(axis.line = list(col = "transparent")),
xlab = "x1",
ylab = "x2",
zlab = "F(x1,x2)",
cex.lab=1.2, 
main="(b) TDboost", shade=TRUE
)

## GLM plot
wireframe(pred_f3~V1+V2,
data=test_dat, scales=list(arrows=F,col=1), 
par.settings = list(axis.line = list(col = "transparent")),
xlab = "x1",
ylab = "x2",
zlab = "F(x1,x2)",
cex.lab=1.2, 
main="(c) TGLM", shade=TRUE
)

## MGCV plot
wireframe(pred_f1~V1+V2,
data=test_dat, scales=list(arrows=F,col=1), 
par.settings = list(axis.line = list(col = "transparent")),
xlab = "x1",
ylab = "x2",
cex.lab=1.2, 
zlab = "F(x1,x2)",
main="(d) MGCV", shade=TRUE,
light.source = c(-10,0,10)
)

Results

To examine the performance of TGLM, MGCV, and TDboost, we fit each model on the training set and then predict the pure premium \(P(\mathbf{x})=\hat{\mu}(\mathbf{x})\) on an independent held-out test set.
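
Concretely, with models fitted to the training portion of the claim data (fit_td, fit_glm, fit_gam, the selected iteration best.iter, and the held-out data frame test_da are all placeholder names of ours; the TDboost prediction scale is assumed to be the log-mean, as above):

# Pure premium P(x) = mu-hat(x) on the held-out test set (a sketch)
P_td  <- exp(predict(fit_td, test_da, n.trees = best.iter))   # TDboost
P_glm <- predict(fit_glm, test_da, type = "response")         # TGLM
P_gam <- predict(fit_gam, test_da, type = "response")         # MGCV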

Performance comparison

Following Frees et al. (2014), we successively specify the prediction from each model as the base premium \(B(\mathbf{x})\) and use predictions from the remaining models as the competing premium \(P(\mathbf{x})\) to compute the Gini indices. The figure below shows that when TGLM (or MGCV) is selected as the base premium, the area between the line of equality and the ordered Lorenz curve is larger when choosing TDboost as the competing premium, indicating again that the TDboost model represents the most favorable choice.
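
The ordered Lorenz curve behind these plots can be computed directly: sort policies by the relativity \(P(\mathbf{x})/B(\mathbf{x})\), accumulate the premium and loss shares, and take the Gini index as twice the area between the curve and the line of equality (Frees et al. 2014). The base-R sketch below follows these standard definitions; the function and variable names are ours.

# Ordered Lorenz curve and Gini index for base premium B, competing premium P,
# and observed losses y (vectors of the same length) -- a base-R sketch.
ordered_lorenz <- function(B, P, y) {
    o    <- order(P / B)                  # sort policies by relativity P(x)/B(x)
    prem <- cumsum(B[o]) / sum(B)         # cumulative premium share
    loss <- cumsum(y[o]) / sum(y)         # cumulative loss share
    area <- sum(diff(c(0, prem)) * (c(0, head(loss, -1)) + loss) / 2)
    list(Premium = prem, Loss = loss,
         gini = 1 - 2 * area)             # twice the area between curve and equality line
}
# e.g. ordered_lorenz(B = P_glm, P = P_td, y = loss_test)$gini  (loss_test: held-out losses)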

library(ggplot2)
# `pd` holds the ordered Lorenz curve coordinates (cumulative premium vs. loss
# shares) for each Base/Model pair
pp <- ggplot(pd, aes(Premium, Loss))
pp <- pp + geom_line(aes(linetype = Model)) +
    geom_line(aes(Premium, Premium, linetype = Base))   # line of equality
pp <- pp + facet_wrap('Base')
none <- element_blank()
pp <- pp + theme(panel.background = element_rect(fill = 'white', colour = 'black'),
    legend.key = none)
print(pp)

Variable importance

There are several explanatory variables significantly related to the pure premium. The variable importance (VI) measure and the baseline value of each explanatory variable are shown in the figure below. We find that REVOKED, MVR_PTS, AREA and BLUEBOOK have high VI measure scores (the vertical line), and their scores all surpass the corresponding baselines (the horizontal line length), indicating that the importance of those explanatory variables is real. We also find that the variables AGE, JOBCLASS, CAR_TYPE, NPOLICY, MAX_EDUC, MARRIED, KIDSDRIV and CAR_USE have larger-than-baseline VI measure scores, but on a much smaller absolute scale than the aforementioned four variables. On the other hand, although the VI measure of, e.g., TRAVTIME is quite large, it does not significantly surpass the baseline importance.
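
The relative influence values plotted below (the rel.inf column of out1) can be extracted from a fitted model. The one-liner below is a sketch that assumes the TDboost summary method follows the gbm-style interface (an assumption on our part), with fit_td the model fitted to the claim data and best.iter its selected number of trees.

# Relative influence per predictor (assumed gbm-style summary method)
vi <- summary(fit_td, n.trees = best.iter, plotit = FALSE)
head(vi)   # one row per predictor with its relative influence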

ltheme <- canonical.theme(color = FALSE)        ## built-in B&W theme
ltheme$strip.background$col <- "transparent"    ## change strip background
lattice.options(default.theme = ltheme)         ## set as default

my.key = list(space = "top", border = TRUE, padding.text = 7, columns = 2,
    # points = list(pch = list(16, 2), col = c(4, 1), cex = c(1, 1)),
    lines = list(pch = c("l", NA), type = c("p", "l"), col = c(1, 1),
        lty = c(NA, 1), lwd = c(1, 4)),
    text = list(c("Relative Influence", "Baseline")), cex = 1)

# Plot each variable's relative influence ("l" ticks) and its baseline (segments)
panelfun = function(x, y, subscripts, ...) {
    panel.dotplot(x, y, pch = "l", cex = 1, col = 1)
    panel.segments(0, as.numeric(y), out1$baseline[subscripts], as.numeric(y),
        lty = 1, col = 1, lwd = 4)
}

pl <- dotplot(reorder(var, rel.inf) ~ rel.inf, groups = w, data = out1,
    layout = c(1, 1), key = my.key,
    xlab = "Fraction of Reduction in Sum of Squared Error in Gradient Prediction",
    panel = panel.superpose,
    panel.groups = panelfun)
print(pl)

Partial dependence

We now use partial dependence plots to visualize the fitted model. The plot below shows the main effects of four important explanatory variables on the pure premium. We clearly see strong nonlinear effects for the predictors BLUEBOOK and MVR_PTS: for policyholders whose vehicle value is below 40K, the pure premium is negatively associated with the vehicle value, and once the vehicle value passes 40K the pure premium curve reaches a plateau. Similarly, the pure premium is positively associated with the motor vehicle record points MVR_PTS, but the curve reaches a plateau once MVR_PTS exceeds six. The partial dependence plots also suggest that a policyholder who lives in an urban area (AREA=“URBAN”) or whose driver's license has been revoked (REVOKED=“YES”) typically has a relatively high pure premium.
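
Such a partial dependence curve can also be computed by hand: hold BLUEBOOK at each value of a grid, average the model's predicted pure premium over the observed values of all other predictors, and plot the averages against the grid. The sketch below assumes, as before, a fitted model fit_td with best.iter trees whose predictions are on the log-mean scale, and uses the claim data da from earlier.

# Manual partial dependence of the pure premium on BLUEBOOK (a sketch)
grid_bb <- seq(min(da$BLUEBOOK), max(da$BLUEBOOK), length.out = 50)
pd_bluebook <- sapply(grid_bb, function(v) {
    tmp <- da
    tmp$BLUEBOOK <- v                                        # hold BLUEBOOK fixed for every record
    mean(exp(predict(fit_td, tmp, n.trees = best.iter)))     # average mu-hat(x)
})
plot(grid_bb, pd_bluebook, type = "l", xlab = "BLUEBOOK",
    ylab = "Partial dependence of pure premium")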

library(latticeExtra)   # c() method for combining the trellis objects below

pl <- xyplot(y ~ x | var, data = mat,
    type = "l",
    ylab = "Pure Premium (in $1000)",
    xlab = "x",
    lty = 1:3,
    scales = "free",
    lwd = 1.5)

pl1 = barchart(y ~ REVOKED, data = out1,
    beside = TRUE, strip = strip.custom(factor.levels = "a"),
    col = grey(c(0.5, 1)))

pl2 = barchart(y ~ AREA, data = out2,
    beside = TRUE, strip = strip.custom(factor.levels = "a"),
    col = grey(c(0.5, 1)))

print(c(pl, REVOKED = pl1, AREA = pl2, merge.legends = TRUE))

Higher order interactions

In the plot below, we visualize the effects of four important second-order interactions using joint partial dependence plots: AREA × MVR_PTS, AREA × NPOLICY, AREA × REVOKED and AREA × TRAVTIME. All four involve the variable AREA: the marginal effects of MVR_PTS, NPOLICY, REVOKED and TRAVTIME on the pure premium are greater for policyholders living in an urban area (AREA=“URBAN”) than for those living in a rural area (AREA=“RURAL”). For example, the strong AREA × MVR_PTS interaction suggests that motor vehicle record points have a weaker positive marginal effect on the expected pure premium for policyholders living in a rural area than for those living in an urban area.
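
Joint partial dependence surfaces like those below can be computed the same way, averaging predictions over the remaining predictors for each combination of the two variables of interest. The sketch below (same assumptions as above; the 0–10 grid for MVR_PTS is illustrative) produces a data frame with the same layout as the tmp1 object plotted in the first panel.

# Joint partial dependence of mu(x) on AREA and MVR_PTS (a sketch)
combos <- expand.grid(AREA = levels(da$AREA), MVR_PTS = 0:10)
combos$marginal.effect <- sapply(seq_len(nrow(combos)), function(i) {
    tmp <- da
    tmp$AREA    <- combos$AREA[i]      # length-1 values are recycled over all records
    tmp$MVR_PTS <- combos$MVR_PTS[i]
    mean(exp(predict(fit_td, tmp, n.trees = best.iter)))
})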

load("partial_image1.rda")

wireframe(marginal.effect~AREA+MVR_PTS,
scales = list(arrows = FALSE, x = list(at = c(1.25, 1.75), lab = c('Rural', 'Urban'))),
data=tmp1,screen = list(z = 30, x = -60), main="(a)",
zlab="mu(x)",ylab="MVR_PTS",
par.settings = list(axis.line = list(col = "transparent")),
lwd=0.6
)

wireframe(marginal.effect~AREA+NPOLICY,
data=tmp2,screen = list(z = 120, x = -60), main="(b)",
scales = list(arrows = FALSE, x = list(at = c(1.25, 1.75), lab = c('Rural', 'Urban'))),
zlab="mu(x)",
par.settings = list(axis.line = list(col = "transparent")),
lwd=0.6
)

wireframe(marginal.effect~AREA+REVOKED,
data=tmp3,screen = list(z = 30, x = -60), main="(c)",
scales = list(arrows = FALSE, x = list(at = c(1.25, 1.75), lab = c('Rural', 'Urban')),
y = list(at = c(1.25, 1.75), lab = c('No', 'Yes'))
),
zlab="mu(x)",
par.settings = list(axis.line = list(col = "transparent")),
lwd=0.6
)

wireframe(marginal.effect~AREA+TRAVTIME,
data=tmp4,screen = list(z = 120, x = -60), main="(d)",
scales = list(arrows = FALSE, x = list(at = c(1.25, 1.75), lab = c('Rural', 'Urban'))),
zlab="mu(x)",
par.settings = list(axis.line = list(col = "transparent")),
lwd=0.6
)