Estimating Member-Level Treatment Effects with Uplift Modeling

By Randy Beams and Tanner Boyle

Long-Term Care News, August 2024


In the dynamic landscape of long-term care (LTC), insurers are continually seeking innovative ways to proactively manage their LTC blocks, often finding inspiration by looking at the methods and statistical analyses in adjacent fields of study. For example, in the last 15 years, there has been an explosion in value-based care in the Medicare market, shifting the paradigm away from traditional fee-for-service models to ones that focus on quality of care, provider performance and the patient experience.[1] Private LTC insurers are adopting value-based care principles, implementing care management programs that improve the lives of their policyholders before and during a claim.[2] We are witnessing the expansion of aging-in-place initiatives in the private LTC insurance market and the development of advanced risk modeling techniques to better allocate care resources. The improved care and targeted initiatives may improve patients’ health by promoting aging-in-place while reducing claim costs for the insurer. However, measuring the effectiveness of these new programs is a challenging task.

In this article, we will explore powerful tools that can enhance traditional effectiveness studies and address some of their existing weaknesses. Propensity score matching allows for the creation of synthetic control groups, which are key for observational studies that lack a properly defined control group.[3] Uplift modeling allows us to directly estimate the effect of a treatment on each member individually, opening the door to new types of analyses that can augment the traditional approaches focused on average effects across groups of people. These new analyses could be used to assist with assigning personalized wellness programs to specific individuals.[4]

To better illustrate this, we will be exploring these methods through the lens of a hypothetical LTC fall prevention program aimed at preventing or delaying future claims for insureds. Our fictional insurer will be providing in-home visits from experienced care staff to install handrails and provide rehabilitation services and education regarding fall prevention.

Experimental Design Considerations

The core of any causal inference problem is the following: if we knew someone’s future claim probability given that a treatment was received and their future claim probability given that no treatment was received, then the difference between the two claim probabilities would be the treatment’s effectiveness. However, in almost all cases, it is impossible both to give a treatment and to withhold a treatment from the same person. Because of this, a proxy is typically used, instead looking at the difference in claim probability between two different insureds (or groups of insureds) that are as statistically similar as possible.
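
In the standard potential-outcomes notation (a sketch of the formalism underlying this idea), if Y(1) denotes an insured's claim outcome when the treatment is received and Y(0) the outcome when it is withheld, the individual treatment effect is Y(1) – Y(0); the fundamental problem is that only one of the two outcomes is ever observed for any given person, so studies estimate averages of this quantity across comparable groups.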

The gold standard in experimental design for interpretable effectiveness studies is a randomized controlled trial (RCT), where a group of people is randomly split into a control group and a treatment group.[5] Then the difference in average claim probabilities between the two groups is the estimated average treatment effect.

However, when it comes to effectiveness studies in the LTC space, RCTs are not always feasible for a few main reasons:

  • Treatments are often expensive and limited in supply, so they might be offered only to people who are expected to benefit most.
  • Ethical concerns can be raised when a potential life-altering treatment is withheld from someone who could seriously benefit from it in order to place them in the control group.
  • Relatively few wellness and care management programs have enough insureds to design statistically credible RCTs. Additionally, for those that do, it is common to see low credibility in subpopulations of interest (e.g., insureds in specific age/gender bands).

In lieu of an RCT, many effectiveness studies begin with a treatment group (of people who were selected by the provider or who self-selected into the program) and a nontreatment group (of people who chose not to participate or were not offered participation). These two populations are likely to differ in meaningful ways and cannot be treated as a control and treatment group, since they were not selected in a controlled and random way. In this situation, it is imperative to understand how bias can affect the reliability of an effectiveness study’s estimates, regardless of the method used to estimate effectiveness. For example, if individuals were given the option to participate in the program, this introduces selection bias, as it is difficult to determine the true reasons why someone would or would not choose to participate. The choice to participate likely correlates with the outcome being measured, which can distort the estimated effectiveness (e.g., individuals who chose to participate in a wellness program might already be proactively managing their health, causing the treatment to appear more effective than it truly was).

Selection bias is a difficult problem to solve and is a topic reserved for another (lengthier) paper. Therefore, throughout the rest of our discussion, in the context of our hypothetical example, we will assume that the effects of selection bias are minimal, and instead focus on introducing methods to estimate effectiveness and reduce other biases – and their potential applications in the space of personalized wellness programs.

Creating a Synthetic Control Group

When an effectiveness study doesn’t start with robust control and treatment groups, it is common to use propensity score matching to create a synthetic control group that mimics the conditions of an RCT. Propensity score matching attempts to simulate randomized assignment between those who receive the treatment and those who do not. It aims to maximize the similarity of the two groups across all confounding variables that could be related to participation in the wellness program or to potential future incurred claims. This helps ensure a fairer comparison of the groups’ outcomes, simulating a scenario where each individual had an equal chance of being assigned to either group. A confounding variable, or confounder, is something that can distort our understanding of whether the treatment really works, because it affects both the treatment and the outcome of interest.[6] All causal inference methods require an assumption of no uncontrolled (or unmeasured) confounding in order to make a clear statement about causality.

In the context of our hypothetical LTC wellness program, a logistic regression model would be trained to predict the probability (or propensity) of participation for two populations: insureds who were enrolled in the wellness program and insureds who were not offered participation. This model would be trained with variables like age, gender, marriage status, policy-level characteristics like duration and past utilization, and other characteristics related to participation in the program or the likelihood of incurring an LTC claim. Most importantly, the model would only be trained with data that was available and known before the start of the program to ensure that the two populations are otherwise comparable.
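
As a rough illustration, a minimal sketch of this step in Python (using scikit-learn and hypothetical column names, since the actual data layout will vary by carrier) might look like the following:

    # Minimal propensity-model sketch, assuming a pandas DataFrame `df` with one
    # row per insured, a 0/1 `participated` flag, and pre-program covariates
    # (all column names here are hypothetical).
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    covariates = ["age", "gender", "married", "policy_duration", "past_utilization"]
    X = pd.get_dummies(df[covariates], drop_first=True)  # encode categorical fields
    y = df["participated"]

    propensity_model = LogisticRegression(max_iter=1000).fit(X, y)
    df["propensity_score"] = propensity_model.predict_proba(X)[:, 1]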

Once the propensity scores are calculated, a synthetic control group is created by matching insureds in the treatment group with the insureds outside of the treatment group who have similar propensity scores. There are many ways to do the matching process, but the most common is one-to-one matching without replacement, where each treatment member is “matched” with the single non-treatment member with the closest propensity score.
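
Continuing the sketch above, a bare-bones version of greedy one-to-one matching without replacement could look like this (production implementations typically add a caliper on the allowable score difference and more careful tie-breaking):

    # Greedy one-to-one nearest-neighbor matching without replacement (a sketch
    # that assumes more non-participants than participants are available).
    treated = df[df["participated"] == 1]
    untreated = df[df["participated"] == 0].copy()

    matches = {}  # treated member index -> matched control member index
    for t_idx, t_score in treated["propensity_score"].items():
        # Find the remaining non-participant with the closest propensity score.
        c_idx = (untreated["propensity_score"] - t_score).abs().idxmin()
        matches[t_idx] = c_idx
        untreated = untreated.drop(c_idx)  # "without replacement"

    synthetic_control = df.loc[list(matches.values())]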

Propensity score matching can be a helpful tool to simulate an RCT, but it is not a panacea for the methodological challenges that non-RCT effectiveness studies face. Two groups having similar propensity scores alone does not ensure that they were defined properly. There are a wide range of diagnostic tools available to test the statistical similarity between two groups, and as with any data analysis tool, they are limited to the data that were collected. Any key information that is missing related to the true propensity to participate in the wellness program or the true probability of experiencing a claim can bias the entire process and result in overconfidence and/or misstatements in the treatment’s estimated effect. In some instances, limited data on key confounding variables may prevent a meaningful interpretation of the treatment’s effect altogether.
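
One common diagnostic (by no means the only one) is the standardized mean difference of each covariate between the treatment group and the matched control group; values near zero suggest good balance on the variables that were actually collected. A sketch, continuing the example above:

    # Standardized mean difference (SMD) balance check; |SMD| below roughly 0.1
    # is a common rule-of-thumb indication of adequate balance on a covariate.
    import numpy as np

    def standardized_mean_difference(treated_col, control_col):
        pooled_sd = np.sqrt((treated_col.var() + control_col.var()) / 2)
        return (treated_col.mean() - control_col.mean()) / pooled_sd

    for col in X.columns:
        smd = standardized_mean_difference(
            X.loc[treated.index, col], X.loc[synthetic_control.index, col]
        )
        print(f"{col}: SMD = {smd:.3f}")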

Imagine that our hypothetical fall prevention program was offered to all insureds across a large state, but the bulk of the medical support was only available in the state’s capital. If our dataset didn’t include variables related to insureds’ home addresses (or their distances to the nearest care facility), then it is possible that the propensity scores would be heavily biased. In this case, a synthetic control group would likely not be a suitable comparison for our treatment group to draw conclusions on program effectiveness. Motivation to change behaviors (e.g., compliance with medication protocols) or follow safety recommendations is another example of a variable that may impact the probability of experiencing a claim, but it is rarely included in the matching process because it is very difficult to measure.

Traditional Effectiveness Methods

Once the treatment group has been balanced with a well-constructed control group through propensity score matching or an RCT, the next step is to estimate the effect of the treatment or wellness program. For example, in the year following the fall prevention program, perhaps the treatment group experiences a 1% claim incidence rate, and the control group experiences a 2.5% claim incidence rate. Then we would estimate that the treatment reduced claim incidence rates by an additive 1.5%.

This direct comparison is only possible through an RCT design or with a propensity score matching framework that effectively balances control and treatment groups according to their propensities and all confounding variables. However, in practice, it is still possible that neither method can produce completely balanced control and treatment groups—particularly when dealing with studies that have small sample sizes or include strong components related to noncompliance and/or adherence. Additionally, the treatment effect of any specific LTC wellness program, like our hypothetical example, is likely nonhomogeneous (i.e., heterogeneous), in that there may be subgroups of the population with varying treatment effects. Analyzing treatment effects at the subgroup level can be challenging, as the control and treatment groups may be balanced overall but imbalanced for specific subgroups, and the subgroups themselves may not be of sufficient size to draw statistically credible conclusions.

One approach that can be used to help mitigate these issues is the addition of a well-specified expectation to further control for any imbalances between the control and treatment groups. An actual-to-expected (A:E) analysis can be used to compare the treatment and control groups’ actual experience to their expected outcomes. If the A:E ratio of the treatment group is significantly different from the A:E ratio of the control group, then the difference between the two is an estimate of the program’s effectiveness. An A:E analysis can also assist with propensity score matching by ensuring that the expected-to-expected (E:E) ratio of the treatment and synthetic control groups is near 1.0.
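
As a simple sketch (again with hypothetical column names, where claimed is the actual 0/1 claim outcome and expected is the expected claim probability from the baseline model), the calculation reduces to a few ratios:

    # A:E comparison sketch for the matched treatment and control groups.
    actual_treat = df.loc[treated.index, "claimed"].sum()
    expected_treat = df.loc[treated.index, "expected"].sum()
    actual_ctrl = df.loc[synthetic_control.index, "claimed"].sum()
    expected_ctrl = df.loc[synthetic_control.index, "expected"].sum()

    treatment_ae = actual_treat / expected_treat
    control_ae = actual_ctrl / expected_ctrl
    estimated_effect = treatment_ae - control_ae  # difference in A:E ratios

    # E:E check: the two groups' expectations should be close (ratio near 1.0).
    e_to_e = expected_treat / expected_ctrl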

The A:E method can be a great way to reduce imbalance issues in effectiveness studies and identify varying treatment effect trends in subpopulations. For subpopulations of interest with limited credibility, however, the treatment’s true effect can be misstated—sometimes drastically. Limited credibility in the presence of heterogeneous treatment effects is not a problem that can be “solved” by any specific technique or method, so instead we look for solutions that can help pool experience across cohorts and dampen effects that are not credible.

What Is Uplift Modeling?

In principle, most traditional effectiveness measurement methods rely on grouping insureds and estimating an average treatment effect (ATE). Uplift modeling can be used alongside traditional methods by estimating the individual treatment effect (ITE) for all members within the population being studied. In our hypothetical example, a member’s ITE can be interpreted as the estimated reduction in claim probability due to the fall prevention program.

One of the simplest methods to estimate uplift is called the “two model approach” and involves training two separate “base models” to predict claim probability: one on the treatment group and one on the control group. Each member of the total population is then passed through both models to predict their claim probability if they participated in the wellness program and if they did not participate (the counterfactual). The difference between the two probabilities is the estimated ITE. After training, the uplift model can be validated by averaging the ITEs for all members within a certain cohort (e.g., the entire treatment group, all females above 65, etc.) and comparing the result to the ATE calculated from traditional methods.
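
A bare-bones sketch of the two model approach, building on the matched groups and hypothetical columns from the earlier sketches, might look like the following (in practice the base models would be specified and validated far more carefully):

    # Two-model uplift sketch: one claim-probability model per group.
    from sklearn.ensemble import GradientBoostingClassifier

    model_treat = GradientBoostingClassifier().fit(
        X.loc[treated.index], df.loc[treated.index, "claimed"]
    )
    model_ctrl = GradientBoostingClassifier().fit(
        X.loc[synthetic_control.index], df.loc[synthetic_control.index, "claimed"]
    )

    # Score every member with both models; the difference is the estimated ITE
    # (negative values mean the program is estimated to reduce claim probability).
    p_if_treated = model_treat.predict_proba(X)[:, 1]
    p_if_untreated = model_ctrl.predict_proba(X)[:, 1]
    df["ite"] = p_if_treated - p_if_untreated

    # Validation check: the average ITE within a cohort should be broadly
    # consistent with the ATE estimated for that cohort by traditional methods.
    print(df.loc[treated.index, "ite"].mean())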

Finally, there is ongoing research surrounding the construction of confidence intervals to add insight into the reliability of the ITE estimates.[7] If the ITE confidence interval includes zero, then this indicates that there was not significant evidence of a treatment effect or that the effect is not credible. It is important to note that uplift modeling doesn’t remove the need for a well-balanced control group constructed from an RCT or synthetically via propensity score matching. If the groups being compared are not similar across confounding variables, then the estimates from an uplift model will not be reliable.

One of the main benefits of incorporating uplift modeling into your existing effectiveness study framework is that it can help address limited credibility in cohorts of interest. For example, the size of some age/gender bands might be relatively small after being split into control and treatment groups. However, the predictive models underlying an uplift analysis are theoretically capable of learning general relationships between age, gender, and future incurred claim probability. This allows us to pool experience across individual variables (or combinations of variables) instead of pooling only across groups of members with the same combination of variables. Additionally, estimated ITEs can be used to analyze treatment effect heterogeneity and gain insights on individual insureds.

For example, perhaps the small cohort of males over 75 has an estimated ATE of –0.5% (i.e., claim incidence is estimated to be reduced by an additive 0.5% because of the treatment), which might be a smaller impact than we expected. Using uplift modeling, we might identify that the insureds in this cohort who benefit least from the treatment are all married or living with children, and that the insureds who are single or widowed all have estimated ITEs of –2% or lower (i.e., estimated reductions of at least 2%). This relationship would support the hypothesis that a fall prevention service is more effective for someone who is living alone than for someone who is living with a spouse or children. As mentioned, there is no technique that can “solve” limited credibility, but this is just one example of how uplift modeling can help traditional methods mitigate its effects.

To further address potential credibility issues, similar to the A:E approach discussed earlier, the base outcome prediction models can be constructed to utilize a robust expectation (e.g., expected probability of future incurred claims) as their backbone. For example, assume that we will measure the effectiveness of our hypothetical wellness program through the actual 12-month claim incidence rate following the end of the program. We could utilize a highly predictive 12-month claim incidence model trained on LTC industry experience, traditional policy characteristics and third-party data[8] as a baseline expectation for the “two model” uplift approach. Then the control and treatment models would each make separate adjustments to this refined expectation according to their own experience. This framework allows for the blending of industry experience with that of the study experience, potentially helping further identify the true treatment effect.
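
One simple way to sketch this idea, continuing the earlier example, is to feed the baseline expectation to each base model as an additional predictor so that the control and treatment models learn adjustments to it; more formal offset-based formulations are also possible:

    # Sketch: add a baseline 12-month expected claim probability (the
    # hypothetical `expected` column, e.g., from an industry-calibrated model)
    # as a predictor, so each base model learns adjustments to that expectation.
    X_base = X.copy()
    X_base["expected"] = df["expected"]

    model_treat = GradientBoostingClassifier().fit(
        X_base.loc[treated.index], df.loc[treated.index, "claimed"]
    )
    model_ctrl = GradientBoostingClassifier().fit(
        X_base.loc[synthetic_control.index], df.loc[synthetic_control.index, "claimed"]
    )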

Data-Driven Decision-making

At the core of many uplift methods are outcome prediction models like linear regression and random forests. Desirable attributes of these models, such as regression coefficients and feature importance measures, therefore carry over to uplift contexts, allowing for the exploration of each feature’s contribution to the ITE estimates.

One of the most popular feature importance methods in machine learning today is SHapley Additive exPlanations (SHAP).[9] It is rooted in economic game theory and works by taking the prediction for a given observation and estimating how much each variable contributed. For example, a certain insured might have a predicted ITE of –2.5% (i.e., an estimated additive reduction in future incurred claim probability of 2.5%), which might be decomposed as follows (a code sketch of the calculation appears after this list):

  1. Start with the population’s average ITE of 0.5%.
  2. Add in the contribution from age of –1.5% (0.5% → –1%).
  3. Add in the contribution from gender of +1% (–1% → 0%).
  4. Add in the contribution from claim history of –2.5% (0% → –2.5%).
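
A heavily simplified sketch of how such attributions could be produced under the two model approach is shown below (assuming the open-source shap package and the base models from the earlier sketches); because SHAP values are additive, differencing the attributions of the two base models decomposes each member’s estimated ITE by feature:

    # SHAP sketch: attribute each member's estimated ITE to individual features.
    # Assumes `model_treat` and `model_ctrl` are the fitted base models from the
    # two-model sketch and `X` is the feature matrix they were trained on.
    import shap

    background = X.sample(min(100, len(X)), random_state=0)

    explain_treat = shap.Explainer(lambda d: model_treat.predict_proba(d)[:, 1], background)
    explain_ctrl = shap.Explainer(lambda d: model_ctrl.predict_proba(d)[:, 1], background)

    sv_treat = explain_treat(X)
    sv_ctrl = explain_ctrl(X)

    # Because SHAP values are additive, the per-feature differences (plus the
    # difference in baseline values) sum to each member's estimated ITE.
    ite_contributions = sv_treat.values - sv_ctrl.values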

This opens the doors to new possibilities in effectiveness studies because (1) you can estimate how helpful the treatment was for a given insured, and (2) you can estimate why it was so effective. This information can be used to inform future studies and to help match the right treatments and programs to the right insureds, potentially opening doors to personalized wellness programs and value-based care initiatives.

For example, a trained uplift model can be packaged up and taken to new populations to predict potential effectiveness—to the extent that the training population generalizes well to any new populations. Imagine the case where a wellness program pilot was rolled out for a small population of interest. Uplift modeling could be used to estimate which subpopulations of insureds benefited most and quantify how much each insured benefited. The trained uplift model could then be run on an LTC carrier’s entire block of policyholders to estimate who might benefit most from this wellness program. This information could be used to plan follow-up pilots in other areas where those specific subpopulations are most prevalent—and even provide an estimated return on investment based on the policyholder mix.

Long-er Term Care

As the LTC sector continues to invest in the health and well-being of its policyholders through new wellness programs, it is crucial that it also invests in the ability to analyze the effectiveness of such programs. Uplift modeling is not the only solution (for all the reasons that make effectiveness studies difficult to execute), but the estimation of member-level treatment effects could be a valuable tool for any actuary or statistician working in causal inference. Used alongside traditional methods, it can help open doors to the personalized and individualized care that will help seniors be healthier, feel safer and stay at home longer.

Statements of fact and opinions expressed herein are those of the individual authors and are not necessarily those of the Society of Actuaries, the editors, or the respective authors’ employers.


Randy Beams, FSA, is a life and long-term care actuary at Milliman. Randy can be reached at randy.beams@milliman.com.

Tanner Boyle, MS, is a managing data scientist at Milliman. Tanner can be reached at tanner.boyle@milliman.com.


Endnotes

[1] Scott Rifkin and James M. Berklan, “Improving Outcomes through Value-Based Care,” webinar, McKnight’s Long-Term Care News, October 19, 2023, https://www.mcknights.com/events/webinars/improving-outcomes-through-value-based-care/.

[2] Thomas Rapp, Implementing Value-Based Aging in Our Long-Term Care Systems, Value & Outcomes Spotlight 7, no. 4, July/August 2021, https://www.ispor.org/publications/journals/value-outcomes-spotlight/vos-archives/issue/view/the-benefits-and-challenges-of-aging-in-place/implementing-value-based-aging-in-our-long-term-care-systems.

[3] Peter C. Austin, “An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies,” Multivariate Behavioral Research 46, no. 3 (2011): 399–424, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/.

[4] Xiajing Gong, Meng Hu, Mahashweta Basu and Liang Zhao, “Heterogeneous Treatment Effect Analysis Based on Machine-Learning Methodology,” CPT Pharmacometrics & Systems Pharmacology 10, no. 11 (2021): 1433–43, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8592515/.

[5] Robert Eaton, Juliet Spector, Jeff Anderson et al., “Evaluating LTC Population Health Programs,” Long-Term Care News, February 2023, https://www.soa.org/sections/long-term-care/long-term-care-newsletter/2023/february/ltc-2023-02-eaton/.

[6] Jeff Y. Yang, Michael Webster-Clark, Jennifer L. Lund et al., “Propensity Score Methods to Control for Confounding in Observational Cohort Studies: A Statistical Primer and Application to Endoscopy Research,” Gastrointestinal Endoscopy 90, no. 3 (2019): 360–69, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6715456.

[7] Lihua Lei and Emmanuel J. Candès, “Conformal Inference of Counterfactuals and Individual Treatment Effects,” arXiv preprint, May 6, 2021, https://arxiv.org/pdf/2006.06138.pdf.

[8] Milliman, “Milliman LTC Advanced Risk Analytics™ (Milliman LARA™): Superior Predictive Performance of Milliman LARA Models,” Milliman, 2023, https://www.milliman.com/-/media/products/lara/superior-predictive-performance-of-milliman-lara-models.ashx.

[9] Scott M. Lundberg and Su-In Lee, “A Unified Approach to Interpreting Model Predictions,” in NIPS 2017: Proceedings of the 31st International Conference on Neural Information Processing Systems, ed. Ulrike von Luxburg and Isabelle Guyon (Long Beach, CA: Curran Associates, 2017), 4768–77.