multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
load("gltd.rda")
#load("resultsGltd.rData")
load("results_gltd.RData")
library(ggplot2)
The dataset analyzed is the Group Long Term Disability data, excluding maternity records.
We removed the observations with unknown values for the variable Gross_Indexed_Benefit_Amount.
The numerical variable to be predicted is Actual_Recovery_Rate. The predictors are AgeBand, Duration_12_49 and Gross_Indexed_Benefit_Amount (numerical), and Disability_Category, OwnOccToAnyTransition_MOD, Integration_with_STD and Gender (categorical).
$76\%$ of the values of Actual_Recovery_Rate are equal to $0$. A histogram of the actual recovery rate is presented next:
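This proportion can be checked directly from the loaded data (a one-line check, assuming the data frame gltd loaded above contains the Actual_Recovery_Rate column):
# Share of claims with a zero actual recovery rate (about 0.76)
mean(gltd$Actual_Recovery_Rate == 0)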
library(repr)
options(repr.plot.width=5, repr.plot.height=4)
library(ggplot2)
ggplot(gltd, aes(x = Actual_Recovery_Rate)) +
  geom_histogram(binwidth = 0.01, fill = "cornsilk", color = "black") +
  ylab("Counts") + xlab("Actual Recovery Rate") +
  scale_y_continuous(breaks = c(0, 200000, 400000, 600000),
                     labels = c("0", "200,000", "400,000", "600,000"))
We also present a histogram of the variable with the zero values excluded.
library(repr)
options(repr.plot.width=5, repr.plot.height=4)
library(ggplot2)
ggplot(gltd[gltd$Actual_Recovery_Rate > 0,], aes(x = Actual_Recovery_Rate)) +
  geom_histogram(binwidth = 0.01, fill = "cornsilk", color = "black") +
  ylab("Counts") + xlab("Actual Recovery Rate, Excluding Zero Values") +
  scale_y_continuous(breaks = c(0, 200000, 400000, 600000),
                     labels = c("0", "200,000", "400,000", "600,000"))
To evaluate the performance of the models, the dataset is split into a training set and a test set. After fitting each model to the training set, predictions are made for the test set, and predictive performance is measured by the mean squared error (MSE) on the test set.
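For reference, the test-set MSE used throughout is simply the mean of the squared prediction errors. A small helper (a convenience sketch; the code below computes the same quantity inline) would be:
# Test-set mean squared error; equivalent to sum((pred - actual)^2) / n used inline below
mse <- function(pred, actual) mean((pred - actual)^2)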
# Read the data
data <- read.csv(file = "data_gltd.csv")
# Select the training data (50% of the observations) by sampling without replacement
set.seed(1)
Train <- sample(nrow(data), floor(0.5 * nrow(data)), replace = FALSE)
dataTrain <- data[Train, ]
dataTest <- data[-Train, ]
# Create the model matrices
x.train <- model.matrix(Actual_Recovery_Rate ~ Disability_Category + AgeBand + Duration_12_49 +
                          OwnOccToAnyTransition_MOD + Integration_with_STD +
                          Gender + Gross_Indexed_Benefit_Amount, dataTrain)[,-1]
y.train <- dataTrain$Actual_Recovery_Rate
x.test <- model.matrix(Actual_Recovery_Rate ~ Disability_Category + AgeBand + Duration_12_49 +
                         OwnOccToAnyTransition_MOD + Integration_with_STD +
                         Gender + Gross_Indexed_Benefit_Amount, dataTest)[,-1]
Kopinsky (2017) develops a CART model for this dataset, so we first replicate that fit.
A full tree is grown first, which is likely to overfit the data; the full tree is then pruned back to minimize the cross-validation error.
The parameters used to tune the algorithm are the maximum depth of the tree, the minimum number of observations required in a node before a split is attempted, and the complexity parameter (cp) threshold used to decide whether a split is attempted.
The test MSE for the best model found with this procedure is 0.01171608.
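The tuning search itself is not shown in the original code. The sketch below illustrates the kind of grid search described above, with hypothetical parameter grids; the fit reported next uses the selected values.
library(rpart)
# Sketch only: grid search over the rpart tuning parameters (hypothetical grids)
maxdepths <- c(10, 15, 20)
minsplits <- c(50, 110, 200)
cps       <- c(1e-7, 1e-5, 1e-3)
grid <- expand.grid(maxdepth = maxdepths, minsplit = minsplits, cp = cps)
grid$mse <- NA
for (g in seq_len(nrow(grid))) {
  fit <- rpart(Actual_Recovery_Rate ~ Disability_Category + AgeBand + Duration_12_49 +
                 OwnOccToAnyTransition_MOD + Integration_with_STD +
                 Gender + Gross_Indexed_Benefit_Amount,
               data = dataTrain, method = "anova",
               maxdepth = grid$maxdepth[g], minsplit = grid$minsplit[g], cp = grid$cp[g])
  # Prune back to the cp value minimizing the cross-validation error, then score on the test set
  pruned <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
  grid$mse[g] <- mean((predict(pruned, dataTest) - dataTest$Actual_Recovery_Rate)^2)
}
grid[which.min(grid$mse), ]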
library(rpart)
rpart_model <- rpart(Actual_Recovery_Rate ~ Disability_Category + AgeBand + Duration_12_49 +
OwnOccToAnyTransition_MOD + Integration_with_STD +
Gender + Gross_Indexed_Benefit_Amount,
data = dataTrain,
maxdepth=15, # maximum depth of the tree
minsplit=110, # minimum number of obs in node to attempt a split
cp=0.0000001, # Cp criterion to determine if attempting a split
method = "anova")
# Prune back model to minimize cross-validation errors
fit.pruned <- prune(rpart_model, cp = rpart_model$cptable[which.min(rpart_model$cptable[,"xerror"]),"CP"] )
rpart_predictions <- predict(fit.pruned, dataTest)
MSE_RPART <- sum((rpart_predictions - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
The test MSE for a Gaussian GLM (fitted below) is 0.01255619, higher than for the pruned CART, suggesting that CART and other nonlinear models can improve the predictions for these data. However, an alternative choice of distribution or link function might improve the GLM fit, given the large point mass at zero in the unconditional distribution of the response.
glm_model <- glm(Actual_Recovery_Rate ~ Disability_Category + AgeBand + Duration_12_49 +
OwnOccToAnyTransition_MOD + Integration_with_STD +
Gender + Gross_Indexed_Benefit_Amount,
data = dataTrain,
gaussian(link = "identity"))
glm_predictions <- predict(glm_model, newdata = dataTest, type = "response")
MSE_GLM <- sum((glm_predictions - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
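The alternative-link idea is not pursued in the original analysis. As a hedged illustration only, a Tweedie GLM with a log link is a common actuarial choice for a continuous response with a point mass at zero; the variance power 1.5 below is an arbitrary illustrative value that would itself need tuning.
# Sketch only: Tweedie GLM via the statmod family (assumes the statmod package is installed)
library(statmod)
glm_tweedie <- glm(Actual_Recovery_Rate ~ Disability_Category + AgeBand + Duration_12_49 +
                     OwnOccToAnyTransition_MOD + Integration_with_STD +
                     Gender + Gross_Indexed_Benefit_Amount,
                   data = dataTrain,
                   family = tweedie(var.power = 1.5, link.power = 0))  # link.power = 0 is the log link
tweedie_predictions <- predict(glm_tweedie, newdata = dataTest, type = "response")
MSE_TWEEDIE <- mean((tweedie_predictions - dataTest$Actual_Recovery_Rate)^2)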
To address the large share of zero recovery rates, we first fit a GLM classifier separating claims with a recovery rate greater than $0$ from claims with a recovery rate of $0$. We then fit a regression on the claims with a recovery rate greater than $0$ to predict the actual recovery rate.
The test MSE of this two-part approach is 0.0134502, worse than before, which suggests that a linear model cannot capture the interactions between variables in this dataset as well as a tree-based model.
glm_model1 <- glm(as.factor(Actual_Recovery_Rate > 0) ~ Disability_Category + AgeBand + Duration_12_49 +
OwnOccToAnyTransition_MOD + Integration_with_STD +
Gender + Gross_Indexed_Benefit_Amount,
data = dataTrain,
family = binomial(link = "logit"))
glm_model2 <- glm(log(Actual_Recovery_Rate) ~ Disability_Category + AgeBand + Duration_12_49 +
OwnOccToAnyTransition_MOD + Integration_with_STD +
Gender + Gross_Indexed_Benefit_Amount,
data = dataTrain[dataTrain$Actual_Recovery_Rate != 0,],
gaussian(link = "identity"))
# Classify claims as recovery vs. no recovery (TRUE/FALSE), then replace the predicted
# recoveries with the back-transformed predictions of the log-scale regression
glm_predictions <- predict(glm_model1, newdata = dataTest, type = "response") > 0.5
glm_predictions[glm_predictions != 0] <- exp(predict(glm_model2, newdata = dataTest[glm_predictions != 0, ], type = "response"))
MSE_GLM <- sum((glm_predictions - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
We now fit the GLM on the dataset augmented with all two-way interactions between the variables.
glm_model2 <- glm(Actual_Recovery_Rate ~ (Disability_Category + AgeBand + Duration_12_49 +
OwnOccToAnyTransition_MOD + Integration_with_STD +
Gender + Gross_Indexed_Benefit_Amount)^2,
data = dataTrain,
gaussian(link = "identity"))
glm_predictions2 <- predict(glm_model2, newdata = dataTest, type = "response")
(MSE_GLM <- sum((glm_predictions2 - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest))
However, the MSE only decreases slightly, to 0.01225949, and remains above the CART error. This is probably because many of the added interaction terms are not significant, which suggests using a regularization method to select the most important variables.
To apply the LASSO, we first augment the dataset with all the two-way interactions.
library(glmnet)
x.train_2 <- model.matrix(Actual_Recovery_Rate ~ (Disability_Category + AgeBand + Duration_12_49 +
                            OwnOccToAnyTransition_MOD + Integration_with_STD +
                            Gender + Gross_Indexed_Benefit_Amount)^2, dataTrain)[,-1]
x.test_2 <- model.matrix(Actual_Recovery_Rate ~ (Disability_Category + AgeBand + Duration_12_49 +
                           OwnOccToAnyTransition_MOD + Integration_with_STD +
                           Gender + Gross_Indexed_Benefit_Amount)^2, dataTest)[,-1]
enet_model2 <- cv.glmnet(x.train_2, y.train,
                         family = "gaussian", # for regression
                         lambda = seq(0.1, 0.00000001, length.out = 100),
                         standardize = TRUE,
                         alpha = 1) # 0.5 for elastic net, 1 for LASSO
Enet_Predictions2 <- predict(enet_model2, x.test_2)
(MSE_ENET <- sum((Enet_Predictions2 - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest))
The MSE is 0.01225979, which is not an improvement over the GLM with all the two-way interactions; the regularization method also has relatively poor performance.
The only interaction terms dropped from the model (see the snippet after the list) are:
Disability_CategoryRespiratory:Integration_with_STDU: Unknown
Disability_CategoryIll-defined and Misc Conditions:AgeBand
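A minimal way to read these off the fitted object (this uses the same default value of s as the predict() call above, so the exact set depends on the chosen lambda):
# Terms whose LASSO coefficient is shrunk exactly to zero
cf <- as.matrix(coef(enet_model2))
rownames(cf)[cf[, 1] == 0]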
Next, we use the package varbvs which implements Bayesian variable selection methods.
library(varbvs)
bvs_fit <- varbvs(x.train_2, y.train, Z=NULL, family="gaussian")
BVS.predictions <- predict(bvs_fit, x.test_2, NULL)
MSE_BVS <- sum((BVS.predictions - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
The MSE on the test set is 0.01228582; the model improves on the plain GLM but not on the LASSO.
We can evaluate the importance of the variables by looking at the posterior inclusion probabilities stored in bvs_fit$pip. The variables with probabilities smaller than $0.0001$ are listed below (the snippet after the list shows one way to extract them):
Disability_CategoryCancer:AgeBand
Disability_CategoryCirculatory:AgeBand
Disability_CategoryDiabetes:AgeBand
Disability_CategoryDigestive:AgeBand
Disability_CategoryIll-defined and Misc Conditions:AgeBand
Disability_CategoryMental and Nervous:AgeBand
Disability_CategoryNervous System:AgeBand
Disability_CategoryOther:AgeBand
Disability_CategoryOther Musculoskeletal:AgeBand
Disability_CategoryDiabetes:Duration_12_49
Disability_CategoryMental and Nervous:Duration_12_49
Disability_CategoryRespiratory:Duration_12_49
AgeBand:OwnOccToAnyTransition_MODOwn+1
AgeBand:Integration_with_STDN: Not Integrated with STD
AgeBand:Integration_with_STDU: Unknown
AgeBand:Gender2:Female
Duration_12_49:OwnOccToAnyTransition_MODOwn+1
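A minimal way to extract this list from the fit (assuming the entries of bvs_fit$pip are ordered as the columns of x.train_2):
# Variables with posterior inclusion probability below 1e-4
colnames(x.train_2)[bvs_fit$pip < 1e-4]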
We now return to the CART model. To improve its performance, we first try bagging. The idea behind bagging is to average many trees (which individually have low bias but high variance) in order to reduce the variance.
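A standard variance argument (not spelled out in the source) makes this precise: averaging $B$ identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$, gives
$$ \operatorname{Var}\!\left( \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x) \right) = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2, $$
so increasing $B$ removes the second term, while reducing the correlation $\rho$ between trees (the goal of random forests below) reduces the first.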
library(Rborist)
tree_sequence <- seq(50, 500, by = 5)
MSE_bag <- rep(NA, length(tree_sequence))
for (i in seq_along(tree_sequence)) {
print(i)
nTree <- tree_sequence[i]
bag_Model <- Rborist(dataTrain[,-c(1:2)], dataTrain[,1],
nTree = nTree,
minNode = 150,
predFixed = ncol(dataTrain[,-c(1:2)]))
bag_predictions <- predict(bag_Model, newdata = dataTest[,-c(1:2)])
MSE_bag[i] <- sum((bag_predictions$yPred - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
}
As shown in the graph, the test error stabilizes as the number of trees increases.
library(repr)
options(repr.plot.width=6.5, repr.plot.height=3)
ggplot(data = NULL, aes(x = nTrees, y = errors_bag)) +
geom_point(size = 0.75) + xlab("Number of trees") + ylab("Test error") +
theme(plot.background = element_rect(color = "white", fill = "white"),
axis.title = element_text(size = 11, face = "bold"),
plot.title = element_text(size = 10, face = "bold", hjust = 0.5)) +
scale_x_continuous(breaks = seq(0,500,50)) +
ggtitle("Bagging error")
The lowest test error achieved is 0.01161801.
A further improvement over bagging is the random forest. In random forests, to de-correlate the trees, only a random subset of $m$ variables is considered as candidates at each split. A split is attempted only if a node contains at least $n_{min}$ observations.
The number of trees is fixed at $250$, since the analysis in the previous section shows that the error is essentially stable once the number of trees exceeds $250$.
library(Rborist)
rb_Model <- Rborist(dataTrain[,-c(1,2)], dataTrain[,1],
                    nTree = 250,    # number of trees
                    minNode = 150,  # minimum number of observations in a node to attempt a split
                    predFixed = 3)  # number of predictors sampled at each split
rb_predictions <- predict(rb_Model, newdata = dataTest[,-c(1,2)])
MSE_RF <- sum((rb_predictions$yPred - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
# Find below the code to perform the grid search over the parameters
# mtrys <- c(2, 3, 4, 5, 6, 7)
# minNodes <- c(75, 100, 125, 150, 200)
# errors_rb <- matrix(NA, nrow = length(mtrys), ncol = length(minNodes))
#
# for (i in seq_along(mtrys)) {
#   for (j in seq_along(minNodes)) {
#
#     mtry <- mtrys[i]
#     minNode <- minNodes[j]
#
#     print(paste0(i, " - ", j))
#
#     rb_Model <- Rborist(dataTrain[,-c(1,2)], dataTrain[,1],
#                         nTree = 300,
#                         minNode = minNode,
#                         predFixed = mtry)
#
#     rb_predictions <- predict(rb_Model, newdata = dataTest[,-c(1,2)])
#     MSE_RB <- sum((rb_predictions$yPred - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
#
#     errors_rb[i, j] <- MSE_RB
#   }
# }
library(repr)
options(repr.plot.width=4.5, repr.plot.height=3)
ggplot(errors_rbdf, aes(x = Minimum.nodes, y = Error, group = as.factor(Mtry), color = as.factor(Mtry))) +
geom_point() + geom_line() +
scale_x_continuous(breaks = errors_rbdf[,2], labels = errors_rbdf[,2],
minor_breaks = c(),
name = c("minNode")) +
scale_y_continuous(breaks = seq(min(errors_rbdf[,3]),max(errors_rbdf[,3]),length.out = 5),
minor_breaks = c(), labels = round(seq(min(errors_rbdf[,3]),max(errors_rbdf[,3]),length.out = 5),6)) +
theme(axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 9),
panel.background = element_rect(fill = 'white', colour = 'black'),
panel.grid.major = element_line(colour = "grey"),
panel.grid.minor = element_line(colour = "grey")) + guides(color=guide_legend(title="predFixed"))
The lowest error achieved is 0.01158461.
To help interpret the results of the random forest algorithm, we evaluate the relative importance of each variable; later, this is repeated for the boosting method. Because the boosting algorithm takes as input the full model matrix x.train, with all the dummy variables, the relative importance analysis is performed on the same data to obtain comparable results. The relative importance of each original variable is obtained by adding up the importances of its dummy variables.
This procedure is performed in the code below.
# Names of the dummy variables in the model matrix
namesVariables <- colnames(x.train[,-1])
# Per-predictor importance reported by Rborist
importanceVariables <- rb_Model$training$info
# Map each original variable to the dummy variables derived from it
originalNamesVariables <- colnames(dataTrain[,-c(1,2)])
correspondancesVariables <- sapply(originalNamesVariables, function(x) {
  grepl(x, namesVariables)
})
# Aggregate the dummy-variable importances by original variable
newImportances <- apply(correspondancesVariables, 2, function(x){
  sum(importanceVariables[x])
})
namesVariables_rb <- gsub("_", "\n", names(newImportances))
importanceVariables_rb <- as.numeric(newImportances)
Results are presented in the graph below.
library(repr)
options(repr.plot.width=6.5, repr.plot.height=4.5)
ggplot(data = NULL, aes(x = reorder(namesVariables_rb, importanceVariables_rb),
y = importanceVariables_rb)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(size = 10, face = "bold"),
axis.title = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 14, hjust = 0.5, face = "bold"),
panel.background = element_rect(fill = "white", color = "white")) +
coord_flip() + ylab("Importance") + xlab("Variable names") +
ggtitle("Relative variable importance in Random Forests")
Gradient boosting builds a sum of weak learners (usually trees) estimated sequentially: at each step a weak learner is fitted to the current residuals of the model, and the process continues until a maximum number $M$ of trees has been grown. The contribution of each tree is shrunk by a parameter $\lambda$, which slows down the learning as the number of steps increases.
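In symbols (a standard formulation of the squared-error boosting update, not taken from the source), the model after $m$ steps is
$$ f_m(x) = f_{m-1}(x) + \lambda \, T_m(x), \qquad T_m \text{ fitted to the residuals } y_i - f_{m-1}(x_i). $$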
The parameters we tune are the number of trees $M$, the parameter $\lambda$ and the depth of each individual tree.
library(xgboost)
xgb_model <- xgboost(data = x.train[,-1],
label = y.train,
nrounds = 1000, # number of trees, default = 100
eta = 0.3, # shrinkage parameter, default = 0.3
max_depth = 4)
xgb_predictions <- predict(xgb_model, x.test[,-1], ntreelimit = 1000)
MSE_XGB <- sum((xgb_predictions - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
# Find below the code used to perform the grid search
# etas <- c(25, 80, 150, 200)
# max_depths <- c(3, 4, 5, 6, 7)
# ntrees <- c(150, 250, 500, 750, 1000)
# errors_xgb <- array(NA, dim = c(length(etas), length(max_depths), length(ntrees)))
#
# for(i in seq_along(etas)){
# for(j in seq_along(max_depths)){
#
# eta <- etas[i] / max(ntrees)
# max_depth <- max_depths[j]
#
# print(paste0(i," - ",j))
#
# xgb_model <- xgboost(data = x.train[,-1],
# label = y.train,
# nrounds = max(ntrees),
# eta = eta,
# max_depth = max_depth)
#
# for(k in seq_along(ntrees)){
#
# ntree <- ntrees[k]
#
# xgb_predictions <- predict(xgb_model, ntreelimit = ntree, x.test[,-1])
#
# (MSE_XGB <- sum((xgb_predictions - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest))
#
# errors_xgb[i,j, k] <- MSE_XGB
#
# }
# }
# }
library(repr)
options(repr.plot.width=6, repr.plot.height=10)
multiplot(p1,p2,p3, cols = 1)
The lowest test error found is 0.01159593.
We now investigate the effect of other, secondary parameters: gamma, subsample, colsample_bytree and min_child_weight.
Instead of searching over a grid spanning all parameters, which would be computationally too expensive, we start from the best configuration found above ($\lambda = 0.15$, maximum depth $= 4$ and $M = 250$) and change one parameter at a time to test its individual effect on the MSE.
Below we plot the MSE for several values of each parameter; the default value of each parameter is marked by a red vertical line.
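The code for these one-at-a-time runs is not shown in the original; a sketch for the gamma parameter, under the base configuration above and with a hypothetical grid of values, would look like the following (the other parameters are varied analogously):
# Sketch only: vary gamma while holding eta = 0.15, max_depth = 4, nrounds = 250 fixed
gammas <- c(0, 0.1, 0.5, 1, 5)  # hypothetical grid
errors_gamma <- sapply(gammas, function(g) {
  m <- xgboost(data = x.train[,-1], label = y.train,
               nrounds = 250, eta = 0.15, max_depth = 4,
               gamma = g, verbose = 0)
  mean((predict(m, x.test[,-1]) - dataTest$Actual_Recovery_Rate)^2)
})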
library(repr)
options(repr.plot.width=7.5, repr.plot.height=6)
p_gamma <- ggplot(xgb_gamma_err, aes(x = Gamma, y = Errors)) + geom_point(size = 2) + geom_line() +
ylab("MSE") + xlab("Gamma") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 9),
panel.background = element_rect(fill = 'white', colour = 'black'),
panel.grid.major = element_line(colour = "grey"),
panel.grid.minor = element_line(colour = "grey")) +
geom_vline(aes(xintercept = 0), size = 1, color = "red") +
ggtitle("Gamma parameter")
p_subs <- ggplot(xgb_subs_err, aes(x = Subsample, y = Errors)) + geom_point(size = 2) + geom_line() +
ylab("MSE") + xlab("Subsample") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 9),
panel.background = element_rect(fill = 'white', colour = 'black'),
panel.grid.major = element_line(colour = "grey"),
panel.grid.minor = element_line(colour = "grey")) +
geom_vline(aes(xintercept = 1), size = 1, color = "red")+
ggtitle("Subsample parameter")
p_colsubs <- ggplot(xgb_colsubs_err, aes(x = Subsample, y = Errors)) + geom_point(size = 2) + geom_line() +
ylab("MSE") + xlab("Column subsample") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(size = 12, face = "bold", vjust = 0.5),
axis.text = element_text(size = 9),
panel.background = element_rect(fill = 'white', colour = 'black'),
panel.grid.major = element_line(colour = "grey"),
panel.grid.minor = element_line(colour = "grey")) +
geom_vline(aes(xintercept = 1), size = 1, color = "red") +
ggtitle("Column subsamples parameter")
p_mcw <- ggplot(xgb_mcw_err, aes(x = min_child_weights, y = Errors)) + geom_point(size = 2) + geom_line() +
ylab("MSE") + xlab("Minimum child weight") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 9),
panel.background = element_rect(fill = 'white', colour = 'black'),
panel.grid.major = element_line(colour = "grey"),
panel.grid.minor = element_line(colour = "grey"))+
geom_vline(aes(xintercept = 1), size = 1, color = "red") +
ggtitle("Minimum child weight parameter")
multiplot(p_gamma, p_mcw, p_subs, p_colsubs, cols = 2)
The parameter min_child_weight is found to have the largest effect, while colsample_bytree and gamma yield only a small decrease in the MSE. Changing subsample away from its default value only increases the MSE.
As for random forests, we can plot the relative importance of each variable. As explained in the random forest case, the importance of each original variable is obtained by adding up the importances of its dummy variables.
importance_matrix <- xgb.importance(colnames(x.train[,-1]), model = xgb_model)
originalNamesVariables <- colnames(dataTrain[,-c(1,2)])
correspondancesVariables <- sapply(originalNamesVariables, function(x) {
grepl(x, importance_matrix$Feature)
})
newImportances <- apply(correspondancesVariables, 2, function(x){
sum(importance_matrix$Gain[x])
})
namesVariables_xgb <- gsub("_", "\n", names(newImportances))
importanceVariables_xgb <- as.numeric(newImportances)
ggplot(data = NULL, aes(x = reorder(namesVariables_xgb, importanceVariables_xgb),
y = importanceVariables_xgb)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(size = 10, face = "bold"),
axis.title = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 14, hjust = 0.5, face = "bold"),
panel.background = element_rect(fill = "white", color = "white")) +
coord_flip() + ylab("Importance") + xlab("Variable names") +
ggtitle("Relative variable importance in Gradient Boosting")
Bayesian Additive Regression Trees (BART) is an ensemble of trees: a sum of $m$ trees fitted via a backfitting algorithm, in which each tree is refitted to the residuals of the current sum of the other trees. The main tuning parameters of the algorithm come from the prior on the size of each tree. Specifically, the prior probability that a node at depth $d$ is split is
$$ \alpha (1 + d)^{-\beta}, $$
where $\alpha \in (0, 1)$ controls the overall tendency of nodes to split and $\beta \ge 0$ penalizes splits at greater depth.
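For reference, the model itself can be written in the standard sum-of-trees form (not spelled out in the source) as
$$ y_i = \sum_{j=1}^{m} g(x_i;\, T_j, M_j) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), $$
where $T_j$ is the $j$-th tree structure and $M_j$ its set of leaf values.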
For this demonstration, we leave the parameters $\alpha$ and $\beta$ as suggested by the authors of the algorithm and tune only the number of trees $m$.
library(BART)
bart_model <- wbart(x.train, y.train, x.test,
                    ntree = 80,     # number of trees
                    a = 0.5,        # prior parameter (package default value)
                    b = 1,          # prior parameter (package default value)
                    nskip = 500,    # burn-in iterations to skip
                    ndpost = 300,   # posterior iterations to keep
                    printevery = 25)
bart_predictions <- bart_model$yhat.test.mean
MSE_BART <- sum((bart_predictions - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
# Find below the code used to perform the grid search
# ntrees <- c(80, 100, 120)
# errors_bart <- rep(NA, length(ntrees))
#
# for(j in seq_along(ntrees)){
#
# ntree <- ntrees[j]
#
# bart_model <- wbart(x.train, y.train, x.test,
# ntree = ntree,
# a = 0.5,
# b = 1,
# printevery = 25,
# nskip = 500,
# ndpost = 300)
#
# bart_predictions <- bart_model$yhat.test.mean
#
# (MSE_BART <- sum((bart_predictions - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest))
#
# errors_bart[j] <- MSE_BART
#
# }
library(repr)
options(repr.plot.width=4, repr.plot.height=3)
ggplot(bart_err, aes(x = Number.of.trees, y = Errors)) + geom_point() + geom_line() +
scale_x_continuous(breaks = bart_err[,2], labels = bart_err[,2],
name = gsub("\\."," ",colnames(bart_err)[2])) +
scale_y_continuous(breaks = seq(min(bart_err[,1]),max(bart_err[,1]),length.out = 5),
labels = round(seq(min(bart_err[,1]),max(bart_err[,1]),length.out = 5),6)) +
theme(axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 7),
panel.background = element_rect(fill = 'white', colour = 'black'),
panel.grid.major = element_line(colour = "grey"),
panel.grid.minor = element_line(colour = "grey"))
The lowest error achieved is 0.01159851.
Multivariate Adaptive Regression Splines (MARS) can be seen as a generalization of forward regression, suited for high dimensional problems.
The final model is of the form:
$$ f(x) = \beta_0 + \sum_{i=1}^m \beta_i h_i(x) $$
where each function $h_i(x)$ is either a hinge function from the set $C = \{ (X_j - t)_+, (t - X_j)_+ \}$, with $X_j$ an independent variable and $t$ a knot, or a product of functions from this set.
The model is trained like a forward regression, adding terms from $C$ and products of terms of $C$ with functions already in the basis.
As the full model is expected to overfit the data, some terms are removed at the end of the fitting; in this respect the procedure is similar to the pruning step in CART.
The main parameter to tune is the maximum degree of interaction allowed in the products between functions.
library(earth)
mars_model <- earth(Actual_Recovery_Rate ~ Disability_Category + AgeBand + Duration_12_49 +
OwnOccToAnyTransition_MOD + Integration_with_STD +
Gender + Gross_Indexed_Benefit_Amount,
degree = 5,
data = dataTrain)
mars_prediction <- predict(mars_model, dataTest, type = "response")
mars_prediction <- mars_prediction[,1]
MSE_MARS <- sum((mars_prediction - dataTest$Actual_Recovery_Rate)^2) / nrow(dataTest)
ggplot(mars_err, aes(x = Sizes, y = Errors)) + geom_point() + geom_line() +
scale_x_continuous(breaks = mars_err[,2], labels = mars_err[,2],
minor_breaks = c(),
name = gsub("\\."," ",colnames(mars_err)[2])) +
scale_y_continuous(breaks = seq(min(mars_err[,1]),max(mars_err[,1]),length.out = 5),
minor_breaks = c(),
labels = round(seq(min(mars_err[,1]),max(mars_err[,1]),length.out = 5),6)) +
theme(axis.title = element_text(size = 12, face = "bold"),
axis.text = element_text(size = 9),
panel.background = element_rect(fill = 'white', colour = 'black'),
panel.grid.major = element_line(colour = "grey"),
panel.grid.minor = element_line(colour = "grey"))
The lowest error achieved is 0.01169729.
The clear improvement over the GLM suggests that MARS captures non-linear relationships better than a linear model.
For this dataset, the best-performing models are the tree ensembles: BART, gradient boosting and random forests.