July 2012

The R Corner – Aggregate Function

By Steve Craighead

The aggregate function in R gives you the ability to aggregate your data inside of R instead of having to use exterior software to aggregate it. For instance, say that you have mortality data that has various categorical fields such as underwriting class (Smoker, Nonsmoker, etc.) and multiple products (UL, VUL, Traditional, etc.) and you want to estimate the raw mortality rates against just Smoker/Nonsmoker for all products, the aggregate command allows you to collapse your data easily. Below is some more fabricated mortality data. If you want to replicate the result, you can download the data here.

This data reflects mortality by policy count. It has six columns, which are attained age, sex, underwriting, product, exposed and deaths. Once you have downloaded the data, you can read it into R by using the read.cvs command:

>Ult< -read.csv("RawMortality.csv",header=TRUE)

A summary of this data.frame is:


AttainedAge Sex UnderWriting
Min. : 55.00 Female:264 Nonsmoker:264
1st Qu.: 65.75 Male :264 Smoker :264
Median :76.50    
Mean : 76.50    
3rd Qu.: 87.25    
Max. : 98.00    


Product Exposed Deaths
Trad:176 Min. : 2394 Min. : 476
UL :176 1st Qu.: 78294 1st Qu.: 1263
VUL :176 Median :122131 Median : 3674
  Mean : 110307 Mean : 4701
  3rd Qu.: 148370 3rd Qu.: 7833
  Max. : 210353 Max. : 13254


 Say, you want to find the crude death rate for males and females (we will aggregate only on sex and the attained age, underwriting and products, will not be aggregated). To do this we create a “by” variable. We want to aggregate by sex, so use these commands:

>(tab1< -aggregate(x=Ult[,5:6],by = list(bySex),FUN=”sum”))
Group.1 Exposed Deaths
1 Female 26658017 1001693
2 Male 31584000 1480493

[1] 0.03757567 0.04687478

We store the aggregated results into tab1, which we surround by “()”—that tells R to display its contents.

Regarding the aggregate function, the x=Ult[,5:6] option is used to specify which columns that will be summed (Exposed and Deaths). They were summed because of the FUN=”sum” option. The by option has to be a list, so our bySex variable has to be converted to a list, by using the list() function. The last command, finds the crude death rates, by division of the total deaths by exposed.

Similarly, you can find the crude rates across product by creating another “byProduct” variable, but say you want to find the average death rate, by sex and product, you would use these commands:

> cbind((tab2< -aggregate(x=Ult[,5:6],by=list(bySex,byProduct),FUN="mean")),
Group.1 Group.2 Exposed Deaths tab2$Deaths/tab2$Exposed
1 Female Trad 100050.9 3757.682 0.03755769
2 Male Trad 118538.8 5553.761 0.04685185
3 Female UL 100050.9 3812.591 0.03810650
4 Male UL 118538.8 5635.034 0.04753747
5 Female VUL 102830.1 3812.602 0.03707670
6 Male VUL 121831.5 5634.989 0.04625230

Notice how the cbind command appends the average death rate on the right hand side of the table above. The tab2 table is first created and it is then used in the rate calculation on the right.

In actual studies, you would keep the attained ages separate, but to do this you just need a “byAttainedAge” variable in the aggregate function. Also, the aggregation occurs in the order of the “by” variables. For instance, by=list(byProduct,bySex,byAttainedAge) aggregates by product, then sex and finally by attained age.

Before discovering the aggregate function, I had to manipulate the data in Excel or Access and then re-import the aggregated data. However, by being able to remain within R, it was easier to fully automate the necessary modeling process.

You can use the aggregate function, to do record compression, by using categorical data values as “by” variables and the numeric variables need to be summed (i.e., use the FUN=”sum” parameter).

To learn more about the other features of aggregate, while in R, use this command:

Steven Craighead, CERA, ASA, MAAA, is an actuarial consultant at Pacific Life Insurance. He can be reached at steven.craighead@pacificlife.com.