What Machine Learning Can Do for You
By David Romoff
Risks & Rewards, February 2022
You’re quantitative. You have facility with numbers and analytics and basic statistics. But you don’t happen to be a data scientist. Maybe you’re an engineer, or a physicist, or an architect, or an actuary. You know that machine learning (ML) is out there and it’s all the rage and you've heard the stories. You even have an idea of what ML is. But what can it do for you?
You deserve to hear a sales pitch on what ML can do for you. Here it is.
Let’s Start with Linear Regression
Remember hearing about independent and dependent variables and how all of the independent variables should be independent of each other? Maybe you were taught that when you were too young to question it. If the independent variables are all related to your dependent variable, then why wouldn’t you expect the independent variables to be related to each other?
Interesting things in the real world occur in systems of variables that are related to each other. For example, in order to predict traffic accidents, you will want to consider road slipperiness, precipitation, and visibility. And of course, these predictors correlate with each other.
The main reason to require that independent variables be independent of each other is mechanical: Ordinary linear regression breaks down when predictors are highly correlated. With perfectly collinear predictors, the matrix that must be inverted to solve the regression (X'X, which is proportional to the covariance matrix of the centered predictors) has no inverse. With nearly collinear predictors, the inverse exists but the coefficient estimates become unstable. But sprinkling in a little ML (regularization) can fix this. And now you are free to explore the world with “independent” variables that are related to each other. To avoid terminology issues, we can call our independent variables predictors.
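As a rough illustration of the regularization trick, here is a minimal sketch of ridge regression, which is one common form of regularization (the article does not say which form it has in mind). The function name `ridge_fit`, the penalty `lam`, and the simulated data are all assumptions made for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X'X + lam * I) beta = X'y. Any lam > 0 makes the system solvable."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Simulated data (hypothetical): x2 is an exact copy of x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1.copy()
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

# X has two columns but only rank 1, so X'X has no inverse
# and ordinary least squares has no unique solution.
rank = np.linalg.matrix_rank(X)

# The ridge penalty restores a unique solution; the total effect of
# roughly 3 is split evenly between the two identical predictors.
beta = ridge_fit(X, y, lam=1.0)
print(rank, beta)
```

The penalty shrinks coefficients slightly toward zero, which is the price paid for a stable answer. In practice `lam` would be chosen by cross-validation rather than fixed at 1.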
It turns out that not only can predictors correlate with each other, but they can interact as well. Building further on the traffic accident example, high slipperiness and low visibility can be more pernicious when they occur together than either would be on its own.
If there are p predictors, then there are p(p − 1)/2, or about p²/2, pairwise interactions, and that’s just the beginning. The world can be more complicated than pairwise interactions. Some ML algorithms (e.g., decision trees) can explore and discover interactions between predictor variables without user intervention.
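To make the count concrete, the short sketch below enumerates the pairwise interaction terms for the three traffic predictors mentioned earlier. The helper name `pairwise_interactions` is hypothetical.

```python
from itertools import combinations

def pairwise_interactions(names):
    """List every unordered pair of predictors as an interaction term."""
    return [f"{a}*{b}" for a, b in combinations(names, 2)]

predictors = ["slipperiness", "precipitation", "visibility"]
terms = pairwise_interactions(predictors)
print(terms)       # 3 predictors give 3 * 2 / 2 = 3 pairs

# With 10 predictors the count is already 10 * 9 / 2 = 45.
many = pairwise_interactions([f"x{i}" for i in range(10)])
print(len(many))   # 45
```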
Now on to Logistic Regression
For those unfamiliar with it, logistic regression is a version of regression analysis used for predicting categories (e.g., spam versus legitimate email, fraud versus no fraud, default versus no default). Just as with linear regression, the problem of correlated predictors exists here too and can be remedied. Furthermore, after the classification has been completed, we want to test whether the category estimates are reasonable. That’s where another ML algorithm called cluster analysis comes in: It helps the user group similar predictions. Then you can see how the observed classification rate of each group compares to the predicted categorizations.
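The reasonableness check can be sketched as follows. For simplicity, this example swaps in equal-size groups of sorted predictions in place of a full cluster analysis, and it simulates predictions from a well-calibrated model. The function name `calibration_by_group` and the data are assumptions.

```python
import numpy as np

def calibration_by_group(p_hat, y, n_groups=5):
    """Group observations with similar predicted probabilities and
    return (mean predicted probability, observed rate, size) per group."""
    order = np.argsort(p_hat)
    rows = []
    for idx in np.array_split(order, n_groups):
        rows.append((float(p_hat[idx].mean()), float(y[idx].mean()), len(idx)))
    return rows

# Simulated (hypothetical) predictions and outcomes.
rng = np.random.default_rng(0)
n = 20000
p_hat = rng.uniform(0.05, 0.95, size=n)
y = (rng.uniform(size=n) < p_hat).astype(int)   # outcomes drawn at the predicted rates

groups = calibration_by_group(p_hat, y)
for predicted, observed, size in groups:
    print(f"predicted {predicted:.2f}  observed {observed:.2f}  n={size}")
```

If the observed rate in a group is far from the average prediction for that group, the model deserves a closer look in that range.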
Got Missing Data? Of Course You Do!
Some ML algorithms (e.g., random forests) work very nicely with missing data. No imputation step is required when using these algorithms. In addition to not breaking down amid missing data, these algorithms use the fact of “missingness” as a feature to predict with. This helps when the data are not missing at random.
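One simple way to expose “missingness” as a feature is to append an indicator column for each predictor. This is a sketch of the idea only, not a description of how any particular random forest library handles missing values internally. The helper name and the fill value are assumptions.

```python
import numpy as np

def add_missing_indicators(X, fill_value=0.0):
    """Replace NaNs with a placeholder and append one 0/1 column per
    predictor marking where the value was missing."""
    mask = np.isnan(X)
    filled = np.where(mask, fill_value, X)
    return np.hstack([filled, mask.astype(float)])

X = np.array([[1.0, np.nan],
              [np.nan, 4.0]])
augmented = add_missing_indicators(X)
print(augmented)
# [[1. 0. 0. 1.]
#  [0. 4. 1. 0.]]
```

A tree-based learner can then split on the indicator columns, so the pattern of missing values itself becomes predictive information.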
Or, rather than dodge the problem (although dodging might be the best approach), you can impute the missing values and work from there. Here, very simple ML algorithms work well, such as K-Nearest Neighbors, which finds the data points nearest to the incomplete one and infers the missing value from them. Simplicity here can be optimal because the modeling in data cleaning should not be mixed with the modeling in forecasting.
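A minimal sketch of nearest-neighbor imputation is shown below. It assumes that complete rows serve as donors, that distance is Euclidean over the observed columns, and that the k nearest donors are averaged; other conventions are possible. The function name `knn_impute` and the data are hypothetical.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each missing entry with the average of that column over the
    k complete rows nearest to the incomplete row."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]    # donor rows with no gaps
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Distance to each donor, measured on the observed columns only.
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]
        out[i, miss] = complete[nearest][:, miss].mean(axis=0)
    return out

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [10.0, 100.0],
              [1.5, np.nan]])
completed = knn_impute(X, k=2)
print(completed[3, 1])   # 15.0, the average of the two nearest donors (10 and 20)
```

When predictors are on very different scales, standardizing them first keeps one column from dominating the distance.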
There are also remedies for missing data in time series. The challenge of time series data is that relationships exist, not just between variables, but between variables and their preceding states. And, from the point of view of a historical data point, relationships exist with the future states of the variables.
For the sake of predicting missing values, a data set can be augmented by including lagged values and negative-lagged values (i.e., future values, or leads). This now-wider augmented data set will have correlated predictors. The regularization trick can be used to forecast missing points with the available data. In addition, a strategy of repeatedly sampling, forecasting, and then averaging the forecasts can be used. Or a similar turnkey approach is to use principal component analysis (PCA), where a meta-algorithm will repeatedly impute, project, and refit until the imputed points stop changing. This is easier said than done, but it is doable.
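The impute, project, and refit loop can be sketched on a toy data set. This is one simple variant under stated assumptions: missing entries start at their column means, the projection keeps a fixed number of principal components (`rank`), and the loop stops when the imputed values move by less than `tol`. The function name and all parameters are hypothetical.

```python
import numpy as np

def pca_impute(X, rank=1, tol=1e-8, max_iter=500):
    """Iteratively fill missing entries using a low-rank PCA reconstruction."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)   # initial guess: column means
    for _ in range(max_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        # Project onto the leading components and add the mean back.
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        updated = np.where(miss, low_rank, X)           # observed values never change
        change = np.max(np.abs(updated - filled))
        filled = updated
        if change < tol:
            break
    return filled

# Toy data: the second column is exactly 2 * t + 1, with one value removed.
t = np.arange(6.0)
X = np.column_stack([t, 2 * t + 1])
X[2, 1] = np.nan                                        # the true value is 5
completed = pca_impute(X, rank=1)
print(completed[2, 1])                                  # should land very close to 5
```

On real data the relationship is noisy, so the imputed values will be smoother than the truth, which is the loss of volatility that the disclaimer in this section warns about.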
Disclaimer: Volatility can be lost when imputing missing values. Model at your own risk!
Got Intuitions?
There are other things that ML algorithms can do that you may have been wondering about. For example, is there a way to look through a data set of categories and determine the probability of one category given some other categories? This is the famous market basket algorithm that answers the ubiquitous question, what is the probability you buy milk given you already bought bread and butter?
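At its core, the milk question is a conditional frequency count. The sketch below computes it directly from a handful of made-up baskets. Production market basket tools add efficient search over many item combinations, which is not shown here. The function name and the data are hypothetical.

```python
def conditional_probability(baskets, given, target):
    """Estimate P(target in basket | all 'given' items in basket)."""
    given = set(given)
    with_given = [b for b in baskets if given <= set(b)]
    if not with_given:
        return None   # the condition never occurs, so there is no estimate
    hits = sum(1 for b in with_given if target in b)
    return hits / len(with_given)

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "milk", "eggs"},
    {"milk"},
]
p = conditional_probability(baskets, {"bread", "butter"}, "milk")
print(p)   # 2 of the 3 baskets with bread and butter also contain milk
```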
Another interesting analytical problem that ML addresses is how numerically measured observations can be grouped together because of their proximity to each other. That is what the cluster analysis algorithm achieves. In marketing, it can be used to create paradigm customers: This is Harry; he is age 50; he walks 2 miles per day, and spends 500 dollars a month on groceries. Let's sell him these shoes!
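Here is a minimal sketch of one popular clustering method, k-means, applied to a few invented customers. The article does not name a specific clustering algorithm, so k-means is an assumption, as are the function name, the data, and the choice of two clusters.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic k-means: assign each point to its nearest center, then
    move each center to the mean of its points, until nothing changes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical customers: age, miles walked per day, monthly grocery spend.
customers = np.array([
    [50, 2.0, 500], [52, 2.2, 520], [48, 1.8, 480],
    [25, 5.0, 250], [27, 5.5, 260], [23, 4.5, 240],
])
# Standardize so that no single unit of measure dominates the distances.
Z = (customers - customers.mean(axis=0)) / customers.std(axis=0)
labels, centers = kmeans(Z, k=2)

# Each cluster's average customer, in the original units.
for j in range(2):
    print(j, customers[labels == j].mean(axis=0))
```

The average of each cluster plays the role of the paradigm customer. In this toy data, one of them looks a lot like Harry.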
Finally, ML provides a statistical process that gives the user a decision tree for predicting a result. For example, if a merchant places a given product in the front of the store and the business is located in an urban area, what level of sales can the merchant expect? Alternatively, if he places the same product in the back of the store in a rural area, how does that affect the level of sales? The algorithm that generates this decision tree is simply called a decision tree.
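A toy version of the merchant example is sketched below. It grows a small regression tree on two yes/no predictors by choosing, at each step, the split that most reduces the squared error. The sales figures are invented, and real decision tree software handles numeric predictors, pruning, and much more.

```python
def mean(values):
    return sum(values) / len(values)

def sse(rows):
    """Sum of squared errors around the mean outcome of the rows."""
    m = mean([y for _, y in rows])
    return sum((y - m) ** 2 for _, y in rows)

def build_tree(rows, features):
    """rows: list of (dict of yes/no predictors, outcome)."""
    best = None
    for f in features:
        yes = [r for r in rows if r[0][f]]
        no = [r for r in rows if not r[0][f]]
        if not yes or not no:
            continue
        score = sse(yes) + sse(no)
        if best is None or score < best[0]:
            best = (score, f, yes, no)
    if best is None or best[0] >= sse(rows):
        return {"predict": mean([y for _, y in rows])}   # leaf node
    _, f, yes, no = best
    rest = [g for g in features if g != f]
    return {"split": f, "yes": build_tree(yes, rest), "no": build_tree(no, rest)}

def predict(tree, x):
    while "split" in tree:
        tree = tree["yes"] if x[tree["split"]] else tree["no"]
    return tree["predict"]

# Hypothetical sales by product placement and store location.
data = [
    ({"front": True, "urban": True}, 120), ({"front": True, "urban": True}, 130),
    ({"front": True, "urban": False}, 80), ({"front": True, "urban": False}, 90),
    ({"front": False, "urban": True}, 70), ({"front": False, "urban": True}, 60),
    ({"front": False, "urban": False}, 30), ({"front": False, "urban": False}, 40),
]
tree = build_tree(data, ["front", "urban"])
front_urban = predict(tree, {"front": True, "urban": True})    # 125.0
back_rural = predict(tree, {"front": False, "urban": False})   # 35.0
print(tree["split"], front_urban, back_rural)
```

In this made-up data, the tree splits on placement first because it explains more of the variation in sales than location does.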
But what can ML do for you on the job?
Easy Challenger Models
Imagine the following plausible scenario. Someone has built a model and it is your job to kick the tires and check it for reasonableness. This can be done quickly and efficiently using a challenger model. This involves building your own model and setting a benchmark for prediction quality. This benchmark can be set by using the regularization trick to toss all the predictor variables you can think of into a regression model. Alternatively, the benchmark can be established by tossing all the predictor variables, along with their missing values, into a random forest algorithm. Job done and value added.
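The challenger idea can be sketched with simulated data. Here the model under review is a stand-in that uses only one predictor, and the challenger is a ridge regression on every available predictor. Both are scored on held-out data. All names, data, and the penalty value are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression coefficients (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Simulated (hypothetical) data with correlated predictors.
rng = np.random.default_rng(42)
n = 300
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + x3 + rng.normal(scale=0.1, size=n)

train, test = slice(0, 200), slice(200, n)

# Stand-in for the model under review: a one-predictor regression on x1.
b_review = (x1[train] @ y[train]) / (x1[train] @ x1[train])
rmse_review = rmse(y[test], b_review * x1[test])

# Challenger: toss every predictor into a regularized regression.
beta = ridge_fit(X[train], y[train], lam=1.0)
rmse_challenger = rmse(y[test], X[test] @ beta)

print(rmse_review, rmse_challenger)
```

If the challenger beats the reviewed model by a wide margin on held-out data, that is a signal the reviewed model is leaving information on the table.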
Enhanced Time Series Analysis
Vector autoregression (VAR) is a nice, effective tool for time series analysis. The regularization trick appears to add value here too, since the lagged predictors in a VAR tend to be correlated with each other. Be sure to check that out if time series analysis is part of your quantitative world.
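As a rough sketch of how the two ideas combine, the example below fits a one-lag vector autoregression with a ridge penalty on simulated data. The function name, the penalty `lam`, and the true coefficient matrix are assumptions. A real application would include an intercept, consider more lags, and tune the penalty.

```python
import numpy as np

def fit_var1_ridge(Y, lam=1.0):
    """Estimate A in y_t = A @ y_{t-1} + noise using ridge regression."""
    X, Z = Y[:-1], Y[1:]              # lagged values and current values
    k = Y.shape[1]
    B = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ Z)
    return B.T                        # transpose so that y_t is roughly A @ y_{t-1}

# Simulate a stable two-variable system (hypothetical coefficients).
rng = np.random.default_rng(1)
A_true = np.array([[0.5, 0.1],
                   [0.0, 0.3]])
Y = np.zeros((2000, 2))
for t in range(1, len(Y)):
    Y[t] = A_true @ Y[t - 1] + rng.normal(size=2)

A_hat = fit_var1_ridge(Y, lam=1.0)
print(np.round(A_hat, 2))
```

With a long history and only two series, the penalty barely matters. It earns its keep when there are many series and many lags relative to the amount of data.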
So that’s the sales pitch of what machine learning can do for you. Are you a buyer?
Statements of fact and opinions expressed herein are those of the individual authors and are not necessarily those of the Society of Actuaries, the editors, or the respective authors’ employers.
David Romoff, M.S., MBA, is associate director and lecturer in the Enterprise Risk Management program in the School of Professional Studies at Columbia University. He can be reached at djr2132@columbia.edu.