Abstract

The Tweedie GLM is a widely used method for predicting insurance premiums. However, the Tweedie GLM restricts the logarithmic mean to a linear form, which can be too rigid for many applications. As a more flexible alternative, we propose a gradient tree-boosting algorithm and apply it to Tweedie compound Poisson models for pure premiums.

As an application, we apply our method to an auto-insurance claim data set and show that the new method is superior to existing methods in that it generates more accurate premium predictions, which helps alleviate the adverse selection issue. We have implemented our method in a user-friendly R package that also includes visualization tools for interpreting the fitted model.

Data

One of the most important problems in the insurance business is setting the premiums for the insurer's customers (policyholders). To set these premiums appropriately, one crucial task is to predict the size of actual (currently unforeseeable) claims.

An example of the data for the insurance premium prediction problem is shown below (only the first ten observations are displayed). The data set contains 10,296 driver vehicle records; each record includes an individual driver's total claim amount \((z_{i})\) over the last five years \((w_{i}=5)\) and 17 characteristics \(x_{i}=(x_{i,1},\ldots,x_{i,17})\) of the driver and the insured vehicle. We want to predict the expected pure premium based on \(x_{i}\).

library(pander)                      # formatted table output below
data(AutoClaim, package = "cplm")    # AutoClaim data (shipped, e.g., with the cplm package)
out <- AutoClaim[1:10, ]
pander(out)
POLICYNO  PLCYDATE    CLM_FREQ5  CLM_AMT5  CLM_AMT  KIDSDRIV  TRAVTIME
--------  ----------  ---------  --------  -------  --------  --------
00000160  1997-11-08  2          4461      0        0         14
00024836  1998-07-29  0          0         0        0         22
00028046  1995-10-28  0          0         0        0         26
00028960  1994-06-11  2          38690     0        0         5
00040933  1993-09-20  0          0         0        0         32
00055277  1997-09-08  2          19217     0        0         36
00063212  1995-01-18  0          0         2946     0         46
00069651  1993-07-11  0          0         0        0         33
00088070  1997-08-04  1          3295      6477     1         21
00093553  1994-05-06  0          0         0        0         30

CAR_USE     BLUEBOOK  RETAINED  NPOLICY  CAR_TYPE    RED_CAR  REVOLKED
----------  --------  --------  -------  ----------  -------  --------
Private     14230     11        1        Sedan       yes      No
Commercial  14940     1         1        Sedan       yes      No
Private     21970     1         1        Van         yes      No
Private     4010      4         1        SUV         no       No
Private     15440     7         1        Sedan       yes      No
Private     18000     1         1        SUV         no       Yes
Commercial  17430     1         1        Sports Car  no       No
Private     8780      1         1        SUV         no       No
Private     18930     6         1        Sedan       no       No
Commercial  5900      10        1        SUV         no       No

MVR_PTS  CLM_FLAG  AGE  HOMEKIDS  YOJ  INCOME  GENDER  MARRIED
-------  --------  ---  --------  ---  ------  ------  -------
3        No        60   0         11   67349   M       No
0        No        43   0         11   91449   M       No
2        No        48   0         11   52881   M       No
3        No        35   1         10   16039   F       Yes
0        No        51   0         14   NA      M       Yes
3        No        50   0         NA   114986  F       Yes
0        Yes       34   1         12   125301  F       No
0        No        54   0         NA   18755   F       Yes
2        Yes       40   1         11   50815   M       No
0        No        44   2         12   43486   F       No

PARENT1  JOBCLASS      MAX_EDUC      HOME_VAL  SAMEHOME  AREA   IN_YY
-------  ------------  ------------  --------  --------  -----  -----
No       Professional  PhD           0         18        Urban  FALSE
No       Blue Collar   High School   257252    1         Urban  TRUE
No       Manager       Bachelors     0         10        Urban  TRUE
No       Clerical      High School   124191    10        Urban  FALSE
No       Blue Collar   <High School  306251    6         Urban  FALSE
No       Doctor        PhD           243925    17        Urban  TRUE
Yes      Blue Collar   Bachelors     0         7         Urban  TRUE
No       Blue Collar   <High School  NA        1         Urban  TRUE
Yes      Manager       High School   0         1         Urban  FALSE
Yes      Blue Collar   High School   0         10        Rural  FALSE
#knitr::kable(AutoClaim[1:10,], caption = "Data from Yip and Yau 2005", format = 'markdown')

One difficulty in modeling the claims (CLM_AMT5) is that their distribution is usually highly right-skewed and mixed with a point mass at zero. Such data cannot be transformed to normality by a power transformation, and zero claims often require special treatment. The histogram of the claim amounts below shows that the empirical distribution is highly skewed: approximately \(61.1\%\) of policyholders had no claims, and approximately \(29.6\%\) of the policyholders had a positive claim amount of up to 10,000 dollars.

da <- AutoClaim   # data from the Yip and Yau (2005) paper
# CLM_AMT5 is the 5-year total in dollars; /5000 gives per-policy-year amounts in $1000
da <- transform(da, CLM_AMT5 = CLM_AMT5 / 5000, INCOME = INCOME / 10000)
hist(da$CLM_AMT5, col = "grey", breaks = 50, ylim = c(0, 7500),
     main = "Total Insurance Claim Amount (in $1000) Per Policy Year",
     xlab = "Claim Amount (in $1000)", cex.lab = 1.3, cex.axis = 1.3)

Model

In this article, we model the insurance claim size with a nonparametric Tweedie compound Poisson model and propose a gradient tree-boosting algorithm (TDboost henceforth) to fit it. We have also implemented the proposed method in an easy-to-use R package, which is publicly available.

We estimate the following model \[F^{*}(\mathbf{x})=\underset{F\in\mathcal{F}}{\arg\!\min}\,\big\{-\ell(F(\cdot),\phi,\rho|\{y_{i},\mathbf{x}_{i},w_{i}\}_{i=1}^{n})\big\}=\underset{F\in\mathcal{F}}{\arg\!\min}\sum_{i=1}^{n}\Psi(y_{i},F(\mathbf{x}_{i})|\rho), \] where \[\Psi(y_{i},F(\mathbf{x}_{i})|\rho)=w_{i}\Bigg\{-\frac{y_{i}\exp[(1-\rho)F(\mathbf{x}_{i})]}{1-\rho}+\frac{\exp[(2-\rho)F(\mathbf{x}_{i})]}{2-\rho}\Bigg\}.\] Here \(F(\mathbf{x})=\log\mu(\mathbf{x})\) is the logarithm of the expected pure premium, \(\rho\in(1,2)\) is the index parameter of the Tweedie distribution, \(\phi\) is the dispersion parameter, and \(w_{i}\) is the policy-year weight of the \(i\)th record.
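
For concreteness, the loss \(\Psi\) can be written directly in R; this is a minimal sketch (the function and argument names are ours), with F the predictor on the log-mean scale, rho the index parameter, and w the weight.

# Psi(y, F | rho): per-observation loss, transcribed from the formula above
Psi <- function(y, F, rho, w = 1) {
    w * (-y * exp((1 - rho) * F) / (1 - rho) + exp((2 - rho) * F) / (2 - rho))
}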

Algorithm

We estimate the predictor function \(F(\cdot)\) with a tree-based gradient boosting algorithm, applying the forward stagewise procedure described below to the Tweedie loss (a minimal R sketch follows the algorithm).


  1. Initialize \(\hat{F}^{[0]}\) \[\hat{F}^{[0]}=\log\Bigg(\frac{\sum_{i=1}^{n}w_{i}y_{i}}{\sum_{i=1}^{n}w_{i}}\Bigg).\]
  2. For \(m=1,\ldots,M\) repeatedly do steps 2.(a)–2.(d)
    • 2.(a) Compute the negative gradient \((u_{1}^{[m]},\ldots,u_{n}^{[m]})^{T}\) \[u_{i}^{[m]}=w_{i}\big\{-y_{i}\exp[(1-\rho)\hat{F}^{[m-1]}(\mathbf{x}_{i})]+\exp[(2-\rho)\hat{F}^{[m-1]}(\mathbf{x}_{i})]\big\}\qquad i=1,\ldots,n.\]
    • 2.(b) Fit the negative gradient vector \((u_{1}^{[m]},\ldots,u_{n}^{[m]})^{T}\) to \((\mathbf{x}_{1},\ldots,\mathbf{x}_{n})^{T}\) by an \(L\)-terminal node regression tree, where \(\mathbf{x}_{i}=(x_{i1},\ldots, x_{ip})^{T}\) for \(i=1,\ldots,n\), giving us the partitions \(\{\widehat{R}_{l}^{[m]}\}_{l=1}^{L}\).
    • 2.(c) Compute the optimal terminal node predictions \(\eta_{l}^{[m]}\) for each region \(\widehat{R}_{l}^{[m]}\), \(l=1,2,\ldots,L\) \[\hat{\eta}_{l}^{[m]}=\log\Bigg\{\frac{\sum_{i:\mathbf{x}_{i}\in\widehat{R}_{l}^{[m]}}w_{i}y_{i}\exp[(1-\rho)\hat{F}^{[m-1]}(\mathbf{x}_{i})]}{\sum_{i:\mathbf{x}_{i}\in\widehat{R}_{l}^{[m]}}w_{i}\exp[(2-\rho)\hat{F}^{[m-1]}(\mathbf{x}_{i})]}\Bigg\}.\]
    • 2.(d) Update \(\hat{F}^{[m]}(\mathbf{x})\) for each region \(\widehat{R}_{l}^{[m]}\), \(l=1,2,\ldots,L\). \[\hat{F}^{[m]}(\mathbf{x})=\hat{F}^{[m-1]}(\mathbf{x})+\nu\hat{\eta}_{l}^{[m]}I(\mathbf{x}\in\widehat{R}_{l}^{[m]})\qquad l=1,2,\ldots,L.\]
  3. Report \(\hat{F}^{[M]}(\mathbf{x})\) as the final estimate.
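
For illustration, the steps above can be written compactly in R, with rpart supplying the regression tree in step 2.(b). This is our own minimal sketch, not the TDboost implementation: it grows each tree to a fixed depth as a stand-in for an \(L\)-terminal-node tree and only returns the in-sample fit \(\hat{F}^{[M]}\) on the training data.

library(rpart)

# Minimal forward stagewise sketch of the algorithm above (illustrative only).
# x: data frame of predictors; y: responses; w: weights; rho, nu, L, M as above.
tweedie_boost <- function(x, y, w = rep(1, length(y)),
                          rho = 1.5, M = 200, nu = 0.05, L = 8) {
    Fhat <- rep(log(sum(w * y) / sum(w)), length(y))            # step 1
    trees <- vector("list", M)
    for (m in seq_len(M)) {
        # step 2.(a): gradient vector from the formula above (its sign does not
        # affect the induced partitions, since node values are refit in 2.(c))
        u <- w * (-y * exp((1 - rho) * Fhat) + exp((2 - rho) * Fhat))
        # step 2.(b): regression tree on u; depth log2(L) mimics L terminal nodes
        tree <- rpart(u ~ ., data = data.frame(u = u, x),
                      control = rpart.control(maxdepth = ceiling(log2(L)),
                                              cp = 0, xval = 0))
        node <- tree$where                                      # leaf of each record
        # step 2.(c): optimal prediction eta_l within each terminal node
        num <- tapply(w * y * exp((1 - rho) * Fhat), node, sum)
        den <- tapply(w * exp((2 - rho) * Fhat), node, sum)
        eta <- ifelse(num > 0, log(num / den), 0)               # skip all-zero nodes
        # step 2.(d): shrunken update of F
        Fhat <- Fhat + nu * eta[as.character(node)]
        trees[[m]] <- list(tree = tree, eta = eta)
    }
    list(Fhat = Fhat, trees = trees, nu = nu)                   # Fhat = F^[M](x_i)
}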

Simulation

In this simulation study, we demonstrate that TDboost is well suited to fit target functions that are nonlinear or involve complex interactions. We consider a true target function with two hills and two valleys: \[F(x_1,x_2)=e^{-5(1-x_1)^2+x_2^2}+e^{-5x_1^2+(1-x_2)^2},\] which corresponds to a common scenario in which the effect of one variable changes depending on the value of another. We assume \(x_1,x_2\sim \mathrm{Unif}(0,1)\) and \(y\sim\mathrm{Tw}(\mu,\phi,\rho)\) with \(\rho=1.5\) and \(\phi=0.5\). We generate \(n=1000\) observations for training and \(n^{\prime}=1000\) for testing, and fit the training data using TDboost, MGCV, and TGLM. The true function and the fitted functions are plotted below. We find that TDboost outperforms TGLM and MGCV in its ability to recover the true function and gives the smallest prediction errors.
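
The simulated data and the three fits behind the surfaces below can be reproduced along the following lines. This is a sketch under stated assumptions: responses are drawn with rtweedie() from the tweedie package; we assume the TDboost() interface mirrors gbm (formula, data, distribution = list(name = "EDM", alpha = 1.5), n.trees, shrinkage, interaction.depth) and that its predict() method returns the fitted log-mean; TGLM uses glm() with statmod's tweedie() family and MGCV uses mgcv::gam() with its Tweedie() family. The tuning values are illustrative, and test_dat is arranged to match the wireframe() calls that follow (pred_f1 = MGCV, pred_f2 = TDboost, pred_f3 = TGLM).

library(tweedie)    # rtweedie(): simulate Tweedie responses
library(statmod)    # tweedie(): Tweedie family for glm() (the TGLM fit)
library(mgcv)       # gam() with a Tweedie family (the MGCV fit)
library(TDboost)

f_true <- function(x1, x2) {   # same two-hill/two-valley F as h2() below
    exp(-5 * (1 - x1)^2 + x2^2) + exp(-5 * x1^2 + (1 - x2)^2)
}

set.seed(1)
n <- 1000
train <- data.frame(X1 = runif(n), X2 = runif(n))
train$Y <- rtweedie(n, mu = exp(f_true(train$X1, train$X2)), phi = 0.5, power = 1.5)

# Assumed gbm-like interface for TDboost (tuning values are illustrative)
fit_td  <- TDboost(Y ~ X1 + X2, data = train,
                   distribution = list(name = "EDM", alpha = 1.5),
                   n.trees = 3000, shrinkage = 0.005, interaction.depth = 6)
fit_glm <- glm(Y ~ X1 + X2, data = train,
               family = tweedie(var.power = 1.5, link.power = 0))     # log link
fit_gam <- gam(Y ~ s(X1) + s(X2), data = train,
               family = Tweedie(p = 1.5, link = power(0)))            # log link

# Fitted log-mean surfaces on a regular grid, arranged for the wireframe() calls
# below; predict() for TDboost is assumed to return the log-mean, as in gbm.
grid <- expand.grid(X1 = seq(0, 1, by = 0.05), X2 = seq(0, 1, by = 0.05))
test_dat <- data.frame(V1 = grid$X1, V2 = grid$X2,
                       pred_f1 = as.numeric(predict(fit_gam, grid, type = "link")),
                       pred_f2 = predict(fit_td, grid, n.trees = 3000),
                       pred_f3 = predict(fit_glm, grid, type = "link"))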

library(lattice)   # wireframe() for the 3D surface plots

# True target function with two hills and two valleys
h2 = function(x) {
    exp( -5 * (1-x[,1])^2 + (x[,2])^2 ) + exp( -5 * (x[,1])^2 + (1-x[,2])^2 )
}

# Evaluate the true function on a regular grid
T1 = seq(0,1,0.03)
T2 = seq(0,1,0.03)
X = data.frame(expand.grid(T1,T2))
colnames(X) <- c("X1","X2")
hx = h2(X)

## True function plot
wireframe(hx ~ X1 + X2, data = X, shade = TRUE,
    scales = list(arrows = FALSE, col = 1),
    par.settings = list(axis.line = list(col = "transparent")),
    xlab = "x1", ylab = "x2", zlab = "F(x1,x2)", cex.lab = 1.2,
    main = "(a) True")

## TDboost plot
wireframe(pred_f2~V1+V2,
data=test_dat, scales=list(arrows=F,col=1), 
par.settings = list(axis.line = list(col = "transparent")),
xlab = "x1",
ylab = "x2",
zlab = "F(x1,x2)",
cex.lab=1.2, 
main="(b) TDboost", shade=TRUE
)

## GLM plot
wireframe(pred_f3~V1+V2,
data=test_dat, scales=list(arrows=F,col=1), 
par.settings = list(axis.line = list(col = "transparent")),
xlab = "x1",
ylab = "x2",
zlab = "F(x1,x2)",
cex.lab=1.2, 
main="(c) TGLM", shade=TRUE
)

## MGCV plot
wireframe(pred_f1~V1+V2,
data=test_dat, scales=list(arrows=F,col=1), 
par.settings = list(axis.line = list(col = "transparent")),
xlab = "x1",
ylab = "x2",
cex.lab=1.2, 
zlab = "F(x1,x2)",
main="(d) MGCV", shade=TRUE,
light.source = c(-10,0,10)
)

Results

To examine the performance of TGLM, MGCV, and TDboost, we fit each model on the training set and then predict the pure premium \(P(\mathbf{x})=\hat{\mu}(\mathbf{x})\) on an independent held-out test set.
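
Concretely, with models fitted to the training portion of the claim data (fit_td, fit_glm, fit_gam, the selected iteration best.iter, and the held-out data frame test_da are all placeholder names of ours; the TDboost prediction scale is assumed to be the log-mean, as above):

# Pure premium P(x) = mu-hat(x) on the held-out test set (a sketch)
P_td  <- exp(predict(fit_td, test_da, n.trees = best.iter))   # TDboost
P_glm <- predict(fit_glm, test_da, type = "response")         # TGLM
P_gam <- predict(fit_gam, test_da, type = "response")         # MGCV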

Performance comparison

Following Frees et al. (2014), we successively specify the prediction from each model as the base premium \(B(\mathbf{x})\) and use predictions from the remaining models as the competing premium \(P(\mathbf{x})\) to compute the Gini indices. The figure below shows that when TGLM (or MGCV) is selected as the base premium, the area between the line of equality and the ordered Lorenz curve is larger when choosing TDboost as the competing premium, indicating again that the TDboost model represents the most favorable choice.
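
The ordered Lorenz curve behind these plots can be computed directly: sort policies by the relativity \(P(\mathbf{x})/B(\mathbf{x})\), accumulate the premium and loss shares, and take the Gini index as twice the area between the curve and the line of equality (Frees et al. 2014). The base-R sketch below follows these standard definitions; the function and variable names are ours.

# Ordered Lorenz curve and Gini index for base premium B, competing premium P,
# and observed losses y (vectors of the same length) -- a base-R sketch.
ordered_lorenz <- function(B, P, y) {
    o    <- order(P / B)                  # sort policies by relativity P(x)/B(x)
    prem <- cumsum(B[o]) / sum(B)         # cumulative premium share
    loss <- cumsum(y[o]) / sum(y)         # cumulative loss share
    area <- sum(diff(c(0, prem)) * (c(0, head(loss, -1)) + loss) / 2)
    list(Premium = prem, Loss = loss,
         gini = 1 - 2 * area)             # twice the area between curve and equality line
}
# e.g. ordered_lorenz(B = P_glm, P = P_td, y = loss_test)$gini  (loss_test: held-out losses)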

library(ggplot2)
# `pd` holds the ordered Lorenz curve coordinates (cumulative premium vs. loss
# shares) for each Base/Model pair
pp <- ggplot(pd, aes(Premium, Loss))
pp <- pp + geom_line(aes(linetype = Model)) +
    geom_line(aes(Premium, Premium, linetype = Base))   # line of equality
pp <- pp + facet_wrap('Base')
none <- element_blank()
pp <- pp + theme(panel.background = element_rect(fill = 'white', colour = 'black'),
    legend.key = none)
print(pp)

Variable importance

There are several explanatory variables significantly related to the pure premium. The variable importance (VI) measure and the baseline value of each explanatory variable are shown in the figure below. We find that REVOKED, MVR_PTS, AREA and BLUEBOOK have high VI measure scores (the vertical line), and their scores all surpass the corresponding baselines (the horizontal line length), indicating that the importance of those explanatory variables is real. We also find that the variables AGE, JOBCLASS, CAR_TYPE, NPOLICY, MAX_EDUC, MARRIED, KIDSDRIV and CAR_USE have larger-than-baseline VI measure scores, but on a much smaller absolute scale than the aforementioned four variables. On the other hand, although the VI measure of, e.g., TRAVTIME is quite large, it does not significantly surpass the baseline importance.
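
The relative influence values plotted below (the rel.inf column of out1) can be extracted from a fitted model. The one-liner below is a sketch that assumes the TDboost summary method follows the gbm-style interface (an assumption on our part), with fit_td the model fitted to the claim data and best.iter its selected number of trees.

# Relative influence per predictor (assumed gbm-style summary method)
vi <- summary(fit_td, n.trees = best.iter, plotit = FALSE)
head(vi)   # one row per predictor with its relative influence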

ltheme <- canonical.theme(color = FALSE)        ## built-in B&W theme
ltheme$strip.background$col <- "transparent"    ## change strip background
lattice.options(default.theme = ltheme)         ## set as default

my.key = list(space = "top", border = TRUE, padding.text = 7, columns = 2,
    # points = list(pch = list(16, 2), col = c(4, 1), cex = c(1, 1)),
    lines = list(pch = c("l", NA), type = c("p", "l"), col = c(1, 1),
        lty = c(NA, 1), lwd = c(1, 4)),
    text = list(c("Relative Influence", "Baseline")), cex = 1)

# Plot each variable's relative influence ("l" ticks) and its baseline (segments)
panelfun = function(x, y, subscripts, ...) {
    panel.dotplot(x, y, pch = "l", cex = 1, col = 1)
    panel.segments(0, as.numeric(y), out1$baseline[subscripts], as.numeric(y),
        lty = 1, col = 1, lwd = 4)
}

pl <- dotplot(reorder(var, rel.inf) ~ rel.inf, groups = w, data = out1,
    layout = c(1, 1), key = my.key,
    xlab = "Fraction of Reduction in Sum of Squared Error in Gradient Prediction",
    panel = panel.superpose,
    panel.groups = panelfun)
print(pl)

Partial dependence

We now use partial dependence plots to visualize the fitted model. The plot below shows the main effects of four important explanatory variables on the pure premium. We clearly see strong nonlinear effects for the predictors BLUEBOOK and MVR_PTS: for policyholders whose vehicle value is below 40K, the pure premium is negatively associated with the vehicle value, and once the vehicle value passes 40K the pure premium curve reaches a plateau. Similarly, the pure premium is positively associated with the motor vehicle record points MVR_PTS, but the curve reaches a plateau once MVR_PTS exceeds six. The partial dependence plots also suggest that a policyholder who lives in an urban area (AREA=“URBAN”) or whose driver's license has been revoked (REVOKED=“YES”) typically has a relatively high pure premium.
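
Such a partial dependence curve can also be computed by hand: hold BLUEBOOK at each value of a grid, average the model's predicted pure premium over the observed values of all other predictors, and plot the averages against the grid. The sketch below assumes, as before, a fitted model fit_td with best.iter trees whose predictions are on the log-mean scale, and uses the claim data da from earlier.

# Manual partial dependence of the pure premium on BLUEBOOK (a sketch)
grid_bb <- seq(min(da$BLUEBOOK), max(da$BLUEBOOK), length.out = 50)
pd_bluebook <- sapply(grid_bb, function(v) {
    tmp <- da
    tmp$BLUEBOOK <- v                                        # hold BLUEBOOK fixed for every record
    mean(exp(predict(fit_td, tmp, n.trees = best.iter)))     # average mu-hat(x)
})
plot(grid_bb, pd_bluebook, type = "l", xlab = "BLUEBOOK",
    ylab = "Partial dependence of pure premium")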

library(latticeExtra)   # c() method for combining the trellis objects below

pl <- xyplot(y ~ x | var, data = mat,
    type = "l",
    ylab = "Pure Premium (in $1000)",
    xlab = "x",
    lty = 1:3,
    scales = "free",
    lwd = 1.5)

pl1 = barchart(y ~ REVOKED, data = out1,
    beside = TRUE, strip = strip.custom(factor.levels = "a"),
    col = grey(c(0.5, 1)))

pl2 = barchart(y ~ AREA, data = out2,
    beside = TRUE, strip = strip.custom(factor.levels = "a"),
    col = grey(c(0.5, 1)))

print(c(pl, REVOKED = pl1, AREA = pl2, merge.legends = TRUE))

Higher order interactions

In the plot below, we visualize the effects of four important second-order interactions using joint partial dependence plots: AREA × MVR_PTS, AREA × NPOLICY, AREA × REVOKED and AREA × TRAVTIME. All four involve the variable AREA: the marginal effects of MVR_PTS, NPOLICY, REVOKED and TRAVTIME on the pure premium are greater for policyholders living in an urban area (AREA=“URBAN”) than for those living in a rural area (AREA=“RURAL”). For example, the strong AREA × MVR_PTS interaction suggests that motor vehicle record points have a weaker positive marginal effect on the expected pure premium for policyholders living in a rural area than for those living in an urban area.
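
Joint partial dependence surfaces like those below can be computed the same way, averaging predictions over the remaining predictors for each combination of the two variables of interest. The sketch below (same assumptions as above; the 0–10 grid for MVR_PTS is illustrative) produces a data frame with the same layout as the tmp1 object plotted in the first panel.

# Joint partial dependence of mu(x) on AREA and MVR_PTS (a sketch)
combos <- expand.grid(AREA = levels(da$AREA), MVR_PTS = 0:10)
combos$marginal.effect <- sapply(seq_len(nrow(combos)), function(i) {
    tmp <- da
    tmp$AREA    <- combos$AREA[i]      # length-1 values are recycled over all records
    tmp$MVR_PTS <- combos$MVR_PTS[i]
    mean(exp(predict(fit_td, tmp, n.trees = best.iter)))
})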

load("partial_image1.rda")

wireframe(marginal.effect~AREA+MVR_PTS,
scales = list(arrows = FALSE, x = list(at = c(1.25, 1.75), lab = c('Rural', 'Urban'))),
data=tmp1,screen = list(z = 30, x = -60), main="(a)",
zlab="mu(x)",ylab="MVR_PTS",
par.settings = list(axis.line = list(col = "transparent")),
lwd=0.6
)

wireframe(marginal.effect~AREA+NPOLICY,
data=tmp2,screen = list(z = 120, x = -60), main="(b)",
scales = list(arrows = FALSE, x = list(at = c(1.25, 1.75), lab = c('Rural', 'Urban'))),
zlab="mu(x)",
par.settings = list(axis.line = list(col = "transparent")),
lwd=0.6
)

wireframe(marginal.effect~AREA+REVOKED,
data=tmp3,screen = list(z = 30, x = -60), main="(c)",
scales = list(arrows = FALSE, x = list(at = c(1.25, 1.75), lab = c('Rural', 'Urban')),
y = list(at = c(1.25, 1.75), lab = c('No', 'Yes'))
),
zlab="mu(x)",
par.settings = list(axis.line = list(col = "transparent")),
lwd=0.6
)

wireframe(marginal.effect~AREA+TRAVTIME,
data=tmp4,screen = list(z = 120, x = -60), main="(d)",
scales = list(arrows = FALSE, x = list(at = c(1.25, 1.75), lab = c('Rural', 'Urban'))),
zlab="mu(x)",
par.settings = list(axis.line = list(col = "transparent")),
lwd=0.6
)