July 2012

# The R Corner – Aggregate Function

The aggregate function in R gives you the ability to aggregate your data inside of R instead of having to use exterior software to aggregate it. For instance, say that you have mortality data that has various categorical fields such as underwriting class (Smoker, Nonsmoker, etc.) and multiple products (UL, VUL, Traditional, etc.) and you want to estimate the raw mortality rates against just Smoker/Nonsmoker for all products, the aggregate command allows you to collapse your data easily. Below is some more fabricated mortality data. If you want to replicate the result, you can download the data here.

This data reflects mortality by policy count. It has six columns, which are attained age, sex, underwriting, product, exposed and deaths. Once you have downloaded the data, you can read it into R by using the read.cvs command:

A summary of this data.frame is:

>summary(Ult)

 AttainedAge Sex UnderWriting Min. : 55.00 Female:264 Nonsmoker:264 1st Qu.: 65.75 Male :264 Smoker :264 Median :76.50 Mean : 76.50 3rd Qu.: 87.25 Max. : 98.00

 Product Exposed Deaths Trad:176 Min. : 2394 Min. : 476 UL :176 1st Qu.: 78294 1st Qu.: 1263 VUL :176 Median :122131 Median : 3674 Mean : 110307 Mean : 4701 3rd Qu.: 148370 3rd Qu.: 7833 Max. : 210353 Max. : 13254

Say, you want to find the crude death rate for males and females (we will aggregate only on sex and the attained age, underwriting and products, will not be aggregated). To do this we create a “by” variable. We want to aggregate by sex, so use these commands:

>bySex<-Ult\$Sex
>(tab1< -aggregate(x=Ult[,5:6],by = list(bySex),FUN=”sum”))
Group.1 Exposed Deaths
1 Female 26658017 1001693
2 Male 31584000 1480493

>tab1\$Deaths/tab1\$Exposed
[1] 0.03757567 0.04687478

We store the aggregated results into tab1, which we surround by “()”—that tells R to display its contents.

Regarding the aggregate function, the x=Ult[,5:6] option is used to specify which columns that will be summed (Exposed and Deaths). They were summed because of the FUN=”sum” option. The by option has to be a list, so our bySex variable has to be converted to a list, by using the list() function. The last command, finds the crude death rates, by division of the total deaths by exposed.

Similarly, you can find the crude rates across product by creating another “byProduct” variable, but say you want to find the average death rate, by sex and product, you would use these commands:

>byProduct<-Ult\$Product
> cbind((tab2< -aggregate(x=Ult[,5:6],by=list(bySex,byProduct),FUN="mean")),
tab2\$Deaths/tab2\$Exposed)
Group.1 Group.2 Exposed Deaths tab2\$Deaths/tab2\$Exposed
1 Female Trad 100050.9 3757.682 0.03755769
2 Male Trad 118538.8 5553.761 0.04685185
3 Female UL 100050.9 3812.591 0.03810650
4 Male UL 118538.8 5635.034 0.04753747
5 Female VUL 102830.1 3812.602 0.03707670
6 Male VUL 121831.5 5634.989 0.04625230

Notice how the cbind command appends the average death rate on the right hand side of the table above. The tab2 table is first created and it is then used in the rate calculation on the right.

In actual studies, you would keep the attained ages separate, but to do this you just need a “byAttainedAge” variable in the aggregate function. Also, the aggregation occurs in the order of the “by” variables. For instance, by=list(byProduct,bySex,byAttainedAge) aggregates by product, then sex and finally by attained age.

Before discovering the aggregate function, I had to manipulate the data in Excel or Access and then re-import the aggregated data. However, by being able to remain within R, it was easier to fully automate the necessary modeling process.

You can use the aggregate function, to do record compression, by using categorical data values as “by” variables and the numeric variables need to be summed (i.e., use the FUN=”sum” parameter).