Ensuring Model Wellbeing Through Monitoring

This article discusses techniques for monitoring predictive models and measuring degradation in their data or predictions over time.


The world changes: Data drifts, coding schemas are modified, and the implicit assumptions made when first creating a model lose validity over time. Sometimes the changes occur rapidly and the model simply “breaks”; at other times the degradation is slow and incremental. Regardless of the speed, it is vital that after implementing a model you also monitor its performance so you can react appropriately. This is particularly true when the model is deployed in a real-time environment where predictions are being generated continuously. Such checks will help ensure proper model maintenance takes place and, when the time is appropriate, model retraining.

What is Model Monitoring?

Monitoring production models can in many ways look similar to the original model validation exercise. For instance, for regression models, part of model monitoring might be tracking measurements like mean absolute error or mean squared error for model predictions over time to see how they are evolving. Ultimately, whatever metrics constitute your key indicators of performance should be recomputed as often as is feasible. However, the response values won’t be available at the time of prediction, if ever, so model validation statistics that utilize the response will always be a lagging indicator.
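As a minimal sketch of this kind of lagging performance tracking, the snippet below recomputes MAE and MSE over whichever predictions have received their actual response values so far. The function name and the NaN convention for not-yet-arrived responses are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def lagging_error_metrics(y_true, y_pred):
    """Compute MAE and MSE over predictions whose responses have arrived.

    Responses that have not arrived yet are represented as NaN and excluded,
    which is why these metrics always lag the predictions themselves.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = ~np.isnan(y_true)           # score only observations with a response
    err = y_true[mask] - y_pred[mask]
    return {
        "n_scored": int(mask.sum()),
        "mae": float(np.mean(np.abs(err))),
        "mse": float(np.mean(err ** 2)),
    }

# Example: three of four responses have arrived so far.
metrics = lagging_error_metrics([1.0, 2.0, np.nan, 4.0], [1.1, 1.8, 3.0, 4.5])
```

In practice you would run this on a schedule (daily, weekly) and chart the resulting series to see how the metrics evolve.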

Consequently, a big part of monitoring models is checking for consistency in the data. Is the underlying population changing from the model training? Are the model predictions or features drifting? These are key questions, and how you answer them matters, because different methods differ in setup cost, maintenance burden, and sensitivity in detecting differences. There are myriad ways of looking at them: Measures of central tendency, data visualization, information theory measures, and statistical measures.

Where to Begin?

While measures of central tendency such as the median or average are useful when evaluating the consistency of your predictions and features, one thing to be aware of is that data drift can occur in a way that is hidden in aggregate statistics. For instance, perhaps your predictions are increasing for lower ages and decreasing for upper ages such that the average stays the same. Any cut of the data that considers only how the average model prediction is changing might miss this trend completely. So, while high-level metrics like the mean or median of a distribution can be useful, looking at the distributions or percentiles can also be valuable.
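A small simulation illustrates how drift can hide in the mean. The two synthetic populations below (the numbers are invented for illustration) shift half of the distribution up and half down by the same amount, so the averages remain essentially identical while the upper percentiles diverge.

```python
import numpy as np

rng = np.random.default_rng(0)

# Baseline "training" distribution of model predictions.
baseline = rng.normal(loc=1.0, scale=0.10, size=10_000)

# Drifted version: half the population shifts up, half shifts down
# by the same amount, leaving the overall mean unchanged.
drifted = np.concatenate([baseline[:5_000] + 0.2, baseline[5_000:] - 0.2])

print("means:", round(baseline.mean(), 3), round(drifted.mean(), 3))
print("90th percentiles:",
      round(np.percentile(baseline, 90), 3),
      round(np.percentile(drifted, 90), 3))
```

The means match, but the 90th percentile of the drifted population is noticeably higher, which is exactly the kind of change a mean-only dashboard would miss.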

There are multiple ways of going beyond the average. Data visualizations, such as simple histograms, are one of the most intuitive ways to check for consistency in the data. However, for real-time models with hundreds to thousands of features, visualizations are impractical, at least as a first-order filter. Information theoretic measures and statistical measures can bridge the gap by comparing the training distribution to the current distribution (or some recent segment) of values in the production environment in a more holistic fashion. Information theory measures include measurements like KL-divergence[1], which tells you how different a particular distribution is from another reference distribution. Statistical methods might test whether two samples plausibly come from the same distribution, using p-values as thresholds to determine significance. For instance, you might use a Wilcoxon rank-sum test of the null hypothesis that there has been no shift in the distribution.
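Both approaches can be sketched in a few lines with SciPy. In the snippet below, the bin count, the size of the "recent production" sample, and the significance cutoff are all illustrative choices rather than recommendations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
train = rng.normal(0.0, 1.0, 5_000)   # feature values at training time
prod = rng.normal(0.3, 1.0, 5_000)    # recent production values (shifted)

# KL-divergence between binned empirical distributions.
bins = np.histogram_bin_edges(np.concatenate([train, prod]), bins=30)
p, _ = np.histogram(train, bins=bins, density=True)
q, _ = np.histogram(prod, bins=bins, density=True)
eps = 1e-9                            # avoid log(0) in empty bins
kl = stats.entropy(p + eps, q + eps)  # scipy's entropy(pk, qk) is KL(pk || qk)

# Wilcoxon rank-sum test: null hypothesis is no shift in location.
stat, p_value = stats.ranksums(train, prod)
if p_value < 0.01:
    print(f"possible drift: KL={kl:.3f}, p={p_value:.2g}")
```

With a genuine shift like the one simulated here, the p-value is tiny and the KL estimate is clearly above zero; in production you would tune the cutoff to your tolerance for false alarms.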

Even when data or prediction drift is identified, it’s not necessarily a problem. Superficial differences between the model training and the production population are often explainable as shifts in the underlying population (i.e., your model is still working as intended, it’s simply being applied to a somewhat different mix of people). The best way to understand if this is happening is simply to monitor your model features and predictions across the various subcategories of interest such as business type, demographics, or geography. As a simplified example, you might run your model in territories A and B with anticipated average model predictions of 0.8 and 1.2 respectively. If your new business mix shifts from 50 percent/50 percent in territories A and B respectively in year 1 to 25 percent/75 percent in year 2, then your average score will be 10 percent higher. Only by evaluating your business mix will you understand the dynamic of what is occurring. If your business is dynamic across multiple dimensions, you might need to normalize using a simple model with control variables to control for all the features changing simultaneously.
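The territory arithmetic above is easy to verify directly; `avg_prediction` is a hypothetical helper written for this illustration, not an established API.

```python
def avg_prediction(mix, territory_means):
    """Weighted average model prediction given a business-mix of weights."""
    return sum(mix[t] * territory_means[t] for t in mix)

means = {"A": 0.8, "B": 1.2}
year1 = avg_prediction({"A": 0.50, "B": 0.50}, means)  # 1.0
year2 = avg_prediction({"A": 0.25, "B": 0.75}, means)  # 1.1
print(f"shift: {year2 / year1 - 1:.0%}")               # 10% higher
```

The model itself is unchanged; the 10 percent rise in the average score is driven entirely by the change in business mix.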

But My Data is Special!

Different sorts of data require idiosyncratic adjustments, some of which can be ad hoc. For instance, medical data such as electronic health records or claims data require monitoring for changes in drug prescribing patterns and for newly added medical codes that would have limited to no historical data. Likewise, financial and credit data require recognition of economic and credit cycles and how those feed back into your model features and predictions. More generally, just about every data source is susceptible to system-wide shocks like the COVID-19 pandemic. These trends need to be monitored and accounted for appropriately when making any diagnoses of model health.

Consistency is Key

The most important part of model monitoring is developing consistent habits and integrating such checks into repeatable, consistently performed processes. As with most things, automatic or semi-automatic report generation beats a purely manual process, but virtually everything beats negligence. Unfortunately, model monitoring work isn’t glamorous and “it doesn’t matter until it does,” which can make it easy to push aside in favor of more pressing work.

Getting Sophisticated

For real-time models, manual or periodic checks of the data may not cut it. Changes to data can happen faster than humans can identify, either due to the nature of the data itself or simply due to errors in automatic data processing. If you waited until the next set of monthly dashboards is updated, you might be too late. In these cases, it is crucial to have a system in place that can spot changes rapidly, perhaps even in real time and without human intervention. These real-time checks could simply be programmatic checks of the data and/or predictions for stability. More sophisticated methods can include anomaly detection[2] and change detection[3] methods.

Anomaly detection methods in essence search out data points that are significantly different from the data they were trained on. Since anomaly detection methods are themselves predictive models, they can be run concurrently with your model in a production environment and potentially spot anomalies that a human can’t practically detect. For instance, imagine that among the hundreds of variables in your training data, the model you are monitoring has encountered a credible number of observations with Feature X=3 and a credible number of observations with Feature Y=100, but has never seen an observation with both X=3 and Y=100 simultaneously. That’s pretty anomalous and potentially something you’d like to flag, but spotting such instances with manual checks is intractable due to the large number of possible combinations. Instead, an anomaly detection model could identify anomalous instances and flag these observations for further review, suppress results, or issue a warning. Algorithms in the anomaly detection area include Isolation Forest[4], clustering methods like DBSCAN[5], and time series models.
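The joint-anomaly scenario above can be sketched with scikit-learn's `IsolationForest`. The data-generating choices below are invented for illustration: each value of X and Y is individually common in training, but the combination X=3 with Y near 100 never occurs, and the fitted model scores that combination as less normal than a typical point.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
n = 2_000

# Training data: X=3 occurs often, Y near 100 occurs often,
# but never together (when X=3, Y is always near 10).
x = rng.choice([0, 1, 2, 3], size=n)
y = np.where(x == 3, rng.normal(10.0, 2.0, n), rng.normal(100.0, 5.0, n))
train = np.column_stack([x, y])

clf = IsolationForest(n_estimators=200, random_state=0).fit(train)

# score_samples: higher means more normal. The first point combines
# two individually common values in a never-before-seen way.
scores = clf.score_samples(np.array([[3.0, 100.0], [3.0, 10.0]]))
```

No univariate check on X or Y alone would catch the first point; only a model that considers the joint distribution can.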

While anomaly detection algorithms typically look for individual instances that are outliers, change detection methods can be used to detect sudden shifts in a distribution. CUSUM (cumulative sum) is a prototypical change detection algorithm. As a somewhat simplistic example of how change detection might work, you could compute a rolling average of the last 50 values of a model feature and create an alarm if it passes some predetermined threshold. For instance, if the expected average is 100 but the rolling average of the last 50 observations is 120, that might trigger a notification.
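The rolling-average alarm described above, together with a simple one-sided CUSUM, can be sketched as follows; the window size, slack, and thresholds are illustrative choices, not recommendations.

```python
from collections import deque

def rolling_mean_alarm(values, window=50, expected=100.0, threshold=10.0):
    """Yield True once the rolling mean drifts past expected +/- threshold."""
    buf = deque(maxlen=window)
    for v in values:
        buf.append(v)
        yield len(buf) == window and abs(sum(buf) / window - expected) > threshold

def cusum_alarm(values, expected=100.0, k=2.0, h=50.0):
    """One-sided CUSUM: accumulate upward deviations beyond slack k, alarm at h."""
    s = 0.0
    for v in values:
        s = max(0.0, s + (v - expected - k))  # resets when evidence fades
        yield s > h

# Simulated feature stream with a level shift halfway through.
stream = [100.0] * 100 + [120.0] * 100
rolling_hits = [i for i, alarm in enumerate(rolling_mean_alarm(stream)) if alarm]
cusum_hits = [i for i, alarm in enumerate(cusum_alarm(stream)) if alarm]
```

On this stream, the CUSUM fires within a few observations of the shift, while the rolling average needs the window to fill with enough drifted values first; that difference in detection delay is one reason CUSUM-style methods are popular for change detection.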

One thing that needs to be carefully managed in automated or algorithmic checking is the frequency at which your process identifies true positives and false positives. Care should be taken to make sure errors aren’t slipping through the process, but this needs to be balanced against frequent system shutdowns and/or kicking out a high percentage of cases for manual review.

Model monitoring need not be a homegrown process. Several vendors offer solutions to automate the process and address the pain points, and multiple open-source solutions exist as well. An example of an open-source solution is Evidently AI, which implements many of the metrics and visualizations mentioned previously[6]. Dedicated software may be particularly useful in large organizations with multiple data flows and many predictive models, since monitoring all the inner workings of the models can become a drag on a data science team.

Whatever tools or methods you ultimately use, the most important step is having a thoughtful process for monitoring that can identify problems as they arise and allows you to intervene rapidly.  

Statements of fact and opinions expressed herein are those of the individual author and are not necessarily those of the Society of Actuaries, the newsletter editors, or the respective author’s employer.


[2] Al-amri, R.; Murugesan, R.K.; Man, M.; Abdulateef, A.F.; Al-Sharafi, M.A.; Alkahtani, A.A. "A Review of Machine Learning and Deep Learning Techniques for Anomaly Detection in IoT Data." Appl. Sci. 2021, 11, 5320. doi:10.3390/app11125320.

[3] van den Burg, Gerrit J. J.; Williams, Christopher K. I. (May 26, 2020). "An Evaluation of Change Point Detection Algorithms." arXiv:2003.06222

[4] F. T. Liu, K. M. Ting and Z. Zhou, "Isolation Forest," 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 413-422, doi: 10.1109/ICDM.2008.17.

[5] Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996, pp. 226–231.