March 2016

R Corner – Predictive Modeling – Data

By Steven Craighead

When developing any actuarial model, be it a business model or a predictive model, locating and analyzing data frequently takes the majority of your time. So, in this article, we will look at two different sources of data that we will return to in future predictive modeling articles.

Historical Earthquake Data

One dataset that I have used before, in my Ring of Fire R Corner article, is the complete history of earthquakes since 1898. You can retrieve this historical earthquake data from the Advanced National Seismic System (ANSS) catalog, which is hosted by the Northern California Earthquake Data Center (NCEDC), through the catalog search page on the NCEDC website.

The search page lets you extract seismic data from 1898 to the present in various data formats, filtered by magnitude, depth, latitude and longitude. You can also choose which event types to extract, such as earthquakes, blasts or both, whether to include events with no recorded magnitude, and other advanced parameters. There are more than 2.9 million records, so you also need to raise the line limit to capture all of the output. To obtain an actual file, you need to direct the output to an anonymous FTP site on the NCEDC server.

For my example below, I have chosen to output all of the data to a CSV-formatted file with these settings:
Your search parameters are:
catalog=ANSS
start_time=1895/01/01,00:00:00
end_time=2016/01/08,00:00:00
minimum_latitude=-90
maximum_latitude=90
minimum_longitude=-180
maximum_longitude=180
minimum_magnitude=0
maximum_magnitude=10
minimum_depth=0
maximum_depth=4000
event_type=A

Once you submit the request to send your output to the anonymous NCEDC FTP site, the above information will be displayed, and you will need to wait (possibly several minutes) before the catalog search returns a URL from which you can download the file. For instance, my file was: ftp://www.ncedc.org/outgoing/userdata/web/catsearch.4052. Your results will be at the same location, just with a different number after “catsearch.” I downloaded the “catsearch.4052” file into “catsearch.csv” in my R working directory.

I then used the “read.csv” function to load the information into the “x” object:

> x <- read.csv("catsearch.csv")

You can examine the available fields by using this command:

>names(x)

[1] "DateTime" "Latitude" "Longitude" "Depth" "Magnitude" "MagType" "NbStations" "Gap"
[9] "Distance" "RMS" "Source" "EventID"

Now, let’s look at a different source of data.

The datasets Package

The second data source that we will use in our models comes from the various datasets available in the R package datasets (the name is lowercase). This package ships with the base R distribution and is loaded by default, so there is nothing to download or install.

You can examine the full list of available datasets by using this command:

library(help = "datasets")

You can also find the full list of the datasets in the online R documentation.
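
From within an R session, the data function will also list everything bundled with the package:

> data(package = "datasets")   # browsable index of the bundled datasets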

Staying with the geological theme, there are two datasets of interest: volcano and quakes. The volcano dataset contains topographic information for Maunga Whau (Mt. Eden), one of the volcanoes in the Auckland volcanic field.
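
Because volcano is simply a matrix of elevation values, it can be plotted directly. As a minimal sketch, the base graphics function filled.contour renders the topography:

> dim(volcano)   # an 87 x 61 grid of elevations
> filled.contour(volcano, color.palette = terrain.colors,
+                main = "Maunga Whau topography")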

The quakes dataset has the locations of 1,000 seismic events of magnitude greater than 4.0 near Fiji. It provides the latitude, longitude, depth, magnitude and the number of seismic stations reporting each event. The full earthquake catalog described in the first section should contain all of these events.
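
A simple scatter plot of the event locations traces out the arc of seismic activity near Fiji; a minimal sketch:

> head(quakes)   # lat, long, depth, mag, stations
> plot(quakes$long, quakes$lat, pch = 20,
+      xlab = "Longitude", ylab = "Latitude",
+      main = "Seismic events near Fiji")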

Now for something completely different, and to demonstrate the variety of datasets within the package: the cars dataset records the stopping distance of a car relative to its speed. The data are a bit stale, having been collected in the 1920s; they were among the datasets Mordecai Ezekiel used in his work on correlation analysis.
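
The relationship between speed and stopping distance is roughly linear, which makes cars a convenient warm-up for the regression techniques previewed at the end of this article. A quick sketch, with a simple linear fit overlaid:

> plot(dist ~ speed, data = cars,
+      xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
> abline(lm(dist ~ speed, data = cars))   # overlay an ordinary least-squares line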

There are other datasets related to intelligence quotients, airline passenger miles, air quality and, amusingly, the body temperature of two beavers. I’m unsure what the two beavers are doing, but maybe we can use predictive modeling to find out.
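
The beaver data actually ship as two data frames, beaver1 and beaver2, each a series of body temperature readings taken at 10-minute intervals. A quick look:

> head(beaver1)   # columns: day, time, temp, activ
> plot(beaver1$temp, type = "l",
+      xlab = "Observation", ylab = "Body temperature (deg C)")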

In the next article, we will look at how to build regression trees, and possibly some random forest models, on one or more of these datasets.

Steven Craighead, ASA, CERA, MAAA, is a consultant with Pacific Life Insurance Company. He can be reached at steven.craighead@pacificlife.com.