May 2016

R Corner: Predictive Modeling - Tree Models

By Steven Craighead

One benefit of using predictive models is that they can give you a quick grasp of what is in the data. Sometimes, that is all you need. In this article, we will look at using tree regression models. You will need to install the “tree” package into R and then load that package into an R workspace. Also, you will need to load the “datasets” package that comes with the standard R installation. We will look at several datasets from that package. Assuming that you have installed the tree package, you can get started with these commands:

Note: Remember not to copy the “>”.

>library(datasets)

>library(tree)

Classification or Taxonomy

One benefit of tree regression models is that you can use them to classify data. For instance, we may need to classify a policy as substandard and determine the risk level of that substandard policy, or we may need to classify whether a policyholder is a risky driver. Risk taxonomy is also a current topic in ERM. The classification results are based upon prior observations and how a tree model is fit to those data. For instance, we will likely want to rate an individual who has a non-commercial pilot’s license, poor credit or high blood pressure as a higher risk than someone who does not smoke and runs marathons as a hobby. We may also want to approximate what the policyholder’s rating would be if he or she had all three risks, or just two, or just one. Tree models may aid us in capturing the information that we have observed from our prior policyholders, but they may not help us with the actual rating of the policy. The rating component may be estimated by another type of predictive model, such as a generalized linear model (GLM) or generalized additive model (GAM). We will look at those model types in later articles.
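To make the idea concrete, here is a purely hypothetical sketch: since no real underwriting data is available here, we simulate a small, invented data set (the variable names, risk factors and rating classes are all made up for illustration) and fit a classification tree to it with the same tree function used below for the iris data:

># simulate invented risk factors for 500 hypothetical policyholders
>set.seed(1)
>uw <- data.frame(pilot=factor(sample(c("Y","N"),500,replace=TRUE,prob=c(0.1,0.9))), credit=factor(sample(c("Poor","Good"),500,replace=TRUE,prob=c(0.3,0.7))), bp=round(rnorm(500,125,15)))
># invented rating rule: more risk factors push a policy toward a worse rating class
>score <- (uw$pilot=="Y") + (uw$credit=="Poor") + (uw$bp>140)
>uw$rating <- factor(ifelse(score>=2,"Substandard",ifelse(score==1,"Rated","Standard")))
>plot(tree(rating~.,data=uw))
>text(tree(rating~.,data=uw))

The resulting tree recovers the invented rule from the simulated observations, which is exactly the sense in which a tree model "captures" what we have seen in prior policyholders.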

Now, since I don’t have any great underwriting examples to apply this technique to, we will look at developing the taxonomy of some flower (specifically iris) data. Use these two commands to do that:

>plot(tree(Species~.,data=iris))

>text(tree(Species~.,data=iris))
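Each of these two commands refits the tree from scratch. Equivalently, you can fit the model once, store it (the object name iristree is just an arbitrary choice here), and reuse the stored object for plotting, labeling and printing:

>iristree <- tree(Species~.,data=iris)
>plot(iristree)
>text(iristree)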

A plot should appear in a graphics window similar to Figure 1 below, which gives a means to identify different species. Notice how the different species of the iris data can be classified by the petal length, the petal width and the sepal length. So, if you have a short petal (a petal length less than 2.45 cm), you have a setosa iris. If you were a botanist, you might find that this classification is all you need.


Figure 1: Iris Taxonomy

Sometimes the plot is either too complex or displays only the node number instead of the rule that defines the node, so you may find it necessary to read the text output of the model. You can obtain the model results of the regression with this command:

>tree(Species~.,data=iris)

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 150 329.600 setosa ( 0.33333 0.33333 0.33333 )
   2) Petal.Length < 2.45 50   0.000 setosa ( 1.00000 0.00000 0.00000 ) *
   3) Petal.Length > 2.45 100 138.600 versicolor ( 0.00000 0.50000 0.50000 )
     6) Petal.Width < 1.75 54  33.320 versicolor ( 0.00000 0.90741 0.09259 )
      12) Petal.Length < 4.95 48   9.721 versicolor ( 0.00000 0.97917 0.02083 )
        24) Sepal.Length < 5.15 5   5.004 versicolor ( 0.00000 0.80000 0.20000 ) *
        25) Sepal.Length > 5.15 43   0.000 versicolor ( 0.00000 1.00000 0.00000 ) *
      13) Petal.Length > 4.95 6   7.638 virginica ( 0.00000 0.33333 0.66667 ) *
     7) Petal.Width > 1.75 46   9.635 virginica ( 0.00000 0.02174 0.97826 )
      14) Petal.Length < 4.95 6   5.407 virginica ( 0.00000 0.16667 0.83333 ) *
      15) Petal.Length > 4.95 40   0.000 virginica ( 0.00000 0.00000 1.00000 ) *

The root, which is node 1, has 150 total irises; its deviance[i] is 329.6; its default y value is setosa; and there are 50 setosa, 50 versicolor and 50 virginica irises in the data.

Node 2, which applies when Petal.Length < 2.45, contains 50 examples, has 0.000 deviance, and 100 percent of its irises are setosa. Note how it is indented from the root node. Also note that it is marked with a “*”, which means that the node is terminal.

Node 3, where Petal.Length > 2.45, contains the other 100 irises, has a deviance of 138.6, and its default value is versicolor; 50 percent of its irises are versicolor and 50 percent are virginica.

Continue to compare the plot to the text and you should be able to reason out the entire regression model.
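Once you can read the tree, you can also use it to classify new observations with the predict function. As a small illustration, the measurements below are invented; following the tree above, this observation ends up in node 25 and is classified as versicolor:

>newiris <- data.frame(Sepal.Length=6.1,Sepal.Width=2.9,Petal.Length=4.7,Petal.Width=1.4)
>predict(tree(Species~.,data=iris),newdata=newiris,type="class")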

Regression—Rules of Thumb

Tree models can also be regression models, but you are not estimating the mean response against the predictors as in standard linear regression. These models are well suited to developing rule-of-thumb models. For example, a model of this type may show you that the majority of a company’s complex policyholder behavior is seen in the top ½ percent of policies, ranked by NAR or by fund value.

Tree regression creates a basic tree that models the data, but it usually also uses an automatic means of pruning the tree to improve the predictive behavior of the regression. The tree package does allow the user to control this feature, but all of the models in this article assume that the default settings for pruning are being used. If you want to look at how to control pruning, see the prune.tree function in this package.
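As a brief sketch of manual pruning, using the iris tree from the previous section: cv.tree estimates by cross-validation how the deviance changes with tree size, and prune.tree snips the tree back to a chosen number of terminal nodes, here three:

>iristree <- tree(Species~.,data=iris)
>cv.tree(iristree)
>plot(prune.tree(iristree,best=3))
>text(prune.tree(iristree,best=3))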

Let’s look at some antique car data (the cars dataset) that relates the speed of a car in mph to the number of feet it takes to stop. These commands will produce Figure 2:

> plot(tree(dist~speed,data=cars))

> text(tree(dist~speed,data=cars))

Note that if your speed is less than 17.5 mph but greater than 12.5 mph, your stopping distance would be about 40 feet. Also, if your speed is greater than 23.5 mph, your stopping distance is about 92 feet.
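You can read these rule-of-thumb values directly off the plot in Figure 2, or ask the fitted tree for them with the predict function; the speeds below are just example values:

> cartree <- tree(dist~speed,data=cars)
> predict(cartree,newdata=data.frame(speed=c(10,15,20,25)))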

To compare the tree model to a linear regression, use these commands, which will produce Figure 3:

> lmmodel <- lm(dist~speed,data=cars)

> plot(cars,ylim=c(0,120),ylab="",xlab="")

> abline(lmmodel,ylab="",xlab="",lwd=5)

> title("Antique Cars")

> par(new=T)

> plot(cars$speed,predict(tree(dist~speed,data=cars)),col="2",ylim=c(0,120),type="l",xlab="Speed",ylab="Stopping Distance",lwd=5)

The linear regression line is in black and the tree regression line is in red.


Figure 2 - Stopping Distances vs. Speed


Figure 3 - Antique Cars, Linear and Tree Regression

Predictive—Using Forest Models

In our prior two examples, we have looked at how a tree model can be used for classification or regression. Now, what if we developed a “forest” of tree models and used this agglomerative model to predict the response from the predictors? The R package randomForest creates multiple tree models but does not do any pruning of extraneous branches. The predict function then uses all of the tree models in the forest to predict the responses. In the example above, we were able to plot the actual tree model since there was only one, but now we must largely take the tree structure on faith, because it is difficult to extract the separate tree models from the agglomerative forest model. We will repeat the above example, comparing the forest model against the linear regression. Figure 4 is created by using these commands:
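Before running these commands, you will need to install and load the randomForest package, just as we did for tree at the start of the article:

> install.packages("randomForest")

> library(randomForest)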

> forestmod <- randomForest(dist~speed,data=cars,ntree=1500)

> plot(cars,ylim=c(0,120),ylab="",xlab="")

> abline(lmmodel,ylab="",xlab="",lwd=5)

> title("Antique Cars")

> par(new=T)

> plot(cars$speed,predict(forestmod),col="2",ylim=c(0,120), type="l",xlab="Speed",ylab="Stopping Distance",lwd=5)

Above, I set the number of tree models to ntree=1500, which is definitely overkill, but you can see from the red line that the forest regression follows the data.
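If you do want to peek inside the forest, the randomForest package provides a getTree function, which returns a matrix description of one of the component trees; for example, this displays the first of the 1,500 trees:

> getTree(forestmod, k=1, labelVar=TRUE)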


Figure 4 - Regression vs. Forest Model

To compare how well the two models fit the data, consider the square root of the sum of squared differences between the predicted and actual values. Here are the commands that you can use to do this:

> sum((predict(lmmodel) - cars[,2])^2)^.5

[1] 106.5529

> sum((predict(forestmod) - cars[,2])^2)^.5

[1] 112.4596

In the above situation, we see that the linear model, as expected, performed the best, since linear regression is designed to minimize the squared distance between the predicted and actual values. However, in other situations, where the data do not meet the underlying assumptions that allow you to use linear regression, you may find the predictive behavior of a forest model to be useful.

This article has outlined how to use tree and forest models in predictive modeling. Tree models have the advantage of creating taxonomy models or nice rules of thumb to explain the underlying relationships within the data. Predictive modeling also includes the concept of agglomeration, where the modeler uses multiple models in conjunction, in place of separate individual models, to produce a better prediction. We looked at a simple example of this by building a forest model of 1,500 separate tree models to represent the underlying relationships.

In our next article, we will look at generalized linear models, where we examine the offset feature to generate useful information for mortality or other underwriting studies.

Steven Craighead, ASA, CERA, MAAA, is a director with Pacific Life Insurance Company. He can be reached at steven.craighead@pacificlife.com.


[i] Deviance is one type of measure of distance between two different probability models. It can be used to measure how different a model is from a null model, where there is no predictive element, or it can be used as a measure of goodness of fit between different models. In tree models, deviance is used to aid in the decision of when to snip off branches of the tree using the prune.tree function. You will see, as you look further down the list of nodes, that the deviance is reduced and frequently becomes 0 at a terminal node, since that node is a certainty. In node 24 the deviance is greater than 0, since four out of the five irises are versicolor but one of the five is virginica. In node 25 the deviance is 0, since all 43 of its irises are versicolor.