Evaluating LTC Population Health Programs

By Robert Eaton, Juliet Spector, Jeff Anderson, Joe Long, Brian Hartman and Missy Gordon

Long-Term Care News, February 2023


The case for proactively managing current and future long-term care (LTC) insurance (LTCI) claims is stronger than ever. While insurers have anticipated these claims within their reserves, LTCI claims are accelerating, and companies are looking for additional methods to manage their LTC blocks and serve their policyholders.

We see LTC insurers in the market acting accordingly, implementing programs to improve the lives of their policyholders before and during a claim. Because reserves already include an actuarial expectation of future claim levels, business managers look to improve mutual policyholder and company value to justify the cost of these programs.

Measuring the financial impact of the programs can be difficult for a variety of reasons: the volume of business, the availability of historical data, the length of time it takes to see the impact on future claims and the paucity of pre-claimant information. Researchers use many methods to evaluate treatments or programs. Some that may be considered for evaluating LTCI health improvement programs include a randomized control trial (RCT), a temporal control trial and an observational study. Of these, the randomized control trial is accepted as the most robust; it provides the most accurate measure of program effect. Researchers rely on the law of large numbers and random treatment assignment to produce unbiased estimates of the treatment effect.

In observational studies, study and control groups are not randomly assigned at the start; therefore, sampling techniques such as propensity score matching[1] should be applied to create a control group sampled from the full population that is similar to the study group.

This article proposes two different methods of randomized control trials, including the required data and methodology, to evaluate such programs through a case study approach.

  1. Randomized control trial. If the researcher (e.g., actuary) has the benefit of designing the study in advance, they can take advantage of some of the powerful features of the RCT. This can be done by randomly sampling the full population to identify a subset of members who will be used in the trial. This subset is randomly split into study and control groups.
  2. Randomized control trial with selection criteria. When performing a study on a smaller block of business or when measuring infrequent events, one alternative is to evaluate the success of a program on a high-risk subset of the population, where the treatment is hypothesized to have the largest impact. This approach is similar to an RCT in that the researcher randomly splits the study and control groups. However, instead of being sampled from the full population, the groups are sampled from a stratified subset of the population (e.g., those at highest risk of having a claim).

Designing Studies

Randomized Control Trial

The business case for many (but not all) LTCI health improvement programs relies on the ability to measure this improvement. These programs are intended to produce positive effects in the health of LTCI policyholders, which may be realized in many ways, including (but not limited to) the following:

  • New LTC claims may be delayed.
  • LTC claimants may have less severe claim journeys; for instance, by delaying a transition from receiving care in the home to receiving care in assisted living or a nursing home or utilizing lower-cost benefits due to improved overall health.
  • LTC claim recoveries may improve.
  • Mortality may decrease as overall health improves.

In each case the researcher will want to define what program success looks like. For instance, if the goal of the program is to delay a policyholder’s transition to facility care, perhaps success can be measured in changes to the rates of transitions (and recoveries). In each case, the nature and quantity of events that the researcher is measuring is key to understanding if the program is effecting change in the population.

If the events are relatively infrequent (such as LTC claim incidence or transitions), a large study will be needed to recognize a meaningful difference in the groups’ responses to the program. A useful metric to determine whether a study can measure this difference is the power of the study. If our study is intended to measure whether there is a difference in outcomes between two groups, our null hypothesis may be that there is no difference in outcomes. A type II error occurs when we fail to reject a null hypothesis that is false; that is, there is a difference we do not detect. The power of a study is the probability of avoiding a type II error: correctly rejecting the null hypothesis when the alternative is true.

As an example applied to LTCI policyholders, imagine that you have two groups of active policyholders, one that has been included in a health improvement program and one that has not. You would like to know whether the claim incidence rates for the group included in the program are lower than those of the other group. At a significance level of 0.10, you would like an 80 percent probability of detecting an actual difference in incidence rates between the groups (the power of the hypothesis test). If your baseline claim incidence rate is around 2.50 percent and you would like to be able to detect a difference of 25 basis points (e.g., the difference between an incidence rate of 2.25 percent and 2.50 percent), you would need 33,447 active lives (samples) in each group.[2]
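The group size quoted above can be reproduced with the two-proportion sample size formula described in the cited Boston University module. The sketch below is an illustration rather than the authors' own code: the function name `lives_per_group` is hypothetical, and it assumes a one-sided test with equal group sizes, which is the assumption under which the formula returns the 33,447 figure.

```python
import math
from statistics import NormalDist


def lives_per_group(p_control, p_study, alpha=0.10, power=0.80):
    """Lives needed in each group to detect p_control vs. p_study.

    Assumes a one-sided z-test of two proportions and equal group sizes
    (an assumption of this sketch, not stated in the article).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_power = NormalDist().inv_cdf(power)      # quantile for the target power
    p_bar = (p_control + p_study) / 2          # average of the two rates
    # Effect size: difference in rates scaled by the binomial standard deviation.
    effect_size = abs(p_control - p_study) / math.sqrt(p_bar * (1 - p_bar))
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Baseline incidence of 2.50 percent versus 2.25 percent under the program.
n = lives_per_group(0.0250, 0.0225)
print(n)  # 33447
```

Because the detectable difference enters the formula squared, halving the difference to be detected roughly quadruples the required lives.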

With a large enough population, selecting a control and a study group at random will suffice to demonstrate differences in the observed outcomes. Because researchers rarely have groups of sufficient size for these studies (for multiple reasons, including that implementation is costly), they can select random policyholders with the constraint that the aggregate risk levels and demographic profiles of each group are somewhat balanced, using a stratified random sample to split the groups.
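A stratified random split of this kind can be sketched as follows. The field names (`age_band`, `risk_tier`), the strata and the population are all hypothetical; in practice the strata would come from whatever risk and demographic dimensions the carrier has available.

```python
import random
from collections import defaultdict


def stratified_split(policyholders, stratum_of, seed=2023):
    """Randomly split policyholders into study and control within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in policyholders:
        by_stratum[stratum_of(person)].append(person)

    study, control = [], []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)             # random order within the stratum
        half = len(members) // 2
        study.extend(members[:half])     # first half goes to the study group
        control.extend(members[half:])   # remainder (incl. any odd one) to control
    return study, control


# HYPOTHETICAL population: 900 policyholders spread evenly over nine strata.
AGE_BANDS = ["70-79", "80-89", "90+"]
RISK_TIERS = ["low", "medium", "high"]
population = [
    {"id": i, "age_band": AGE_BANDS[i % 3], "risk_tier": RISK_TIERS[(i // 3) % 3]}
    for i in range(900)
]

study, control = stratified_split(
    population, lambda p: (p["age_band"], p["risk_tier"])
)
print(len(study), len(control))
```

Assignment stays random within each stratum, so the design keeps the main benefit of randomization while guaranteeing that the two groups have the same mix on the stratification variables.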

For higher frequency events, such as claim terminations, a study will require fewer disabled policyholders to reach the same power. Given that we expect approximately one third of claims to terminate within a year, we can predict about how large a study and a control group we need to be relatively sure that we will correctly detect (using the power analysis) differences in one-year claim termination rates. However, this approach can be constrained by the reality that few LTCI carriers have large enough mature blocks to produce the requisite number of new claims required to reach a certain power.
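To illustrate how quickly power builds for a higher-frequency event, the sketch below computes the approximate power of a one-sided two-proportion test for several group sizes. The one-third baseline comes from the text; the 5-percentage-point program effect, the group sizes and the function name `power_one_sided` are hypothetical choices made only for illustration.

```python
import math
from statistics import NormalDist


def power_one_sided(p_control, p_study, n_per_group, alpha=0.10):
    """Approximate power of a one-sided two-proportion z-test, equal groups."""
    p_bar = (p_control + p_study) / 2
    std_error = math.sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(abs(p_control - p_study) / std_error - z_alpha)


baseline = 1 / 3                 # one-year termination rate cited in the text
with_program = baseline + 0.05   # HYPOTHETICAL program effect, for illustration

for n in (250, 500, 1000, 2000):
    print(n, round(power_one_sided(baseline, with_program, n), 2))
```

Under these assumed inputs, groups of roughly a thousand claimants each are enough to pass the 80 percent power threshold, far fewer than the tens of thousands of active lives needed for the incidence example.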

This approach—designing a study to reach a certain power level under tight confidence definitions with an RCT—can be thought of as the gold standard. It is rare that studies can be designed in this way and executed without interruption. In particular, external factors (such as the COVID-19 pandemic) abound that will cause results to deviate, and what we may initially think is a random selection across certain variables is no longer random.

Randomized Control Trial with Selection Criteria

An RCT with selection criteria employs the concept of randomness in the same way as the RCT method already discussed, but it allows the researcher to isolate smaller (presumably more accessible and therefore less costly) groups to use in testing a hypothesis. The RCT with selection criteria method relies on the fact that certain policyholders are more likely to trigger an event and researchers have the information necessary to stratify the population to ascertain different levels of risk that can be used as selection criteria for the study.

The researcher will stratify the population to select policyholders for inclusion in the study who have a higher likelihood of triggering the event. In this way, a group of similar size to that in an RCT sample would produce more triggering events and, by extension, a greater expected absolute difference in the study and control group outcomes. Thus, a smaller overall group is needed to produce the number of events the researcher needs to make statements with confidence.

If we knew more information about these policyholders’ risks, we could stratify their risk profiles and select the top 30 percent of risks that have (as an example) a 12 percent likelihood of claiming. That additional information could come in many forms: whether their spouse has had a long-term care claim recently (or is deceased), whether the person has been hospitalized recently, or whether they live in a single- or two-story home.
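The selection step can be sketched as ranking policyholders by a risk score, keeping the top 30 percent, and then splitting that subset at random. The scores below are random numbers standing in for the output of a risk model; the function name and inputs are hypothetical.

```python
import random


def select_high_risk_and_split(risk_scores, top_share=0.30, seed=2023):
    """Keep the highest-scoring share of policyholders, then split at random.

    risk_scores maps a policyholder id to a model score (higher = riskier).
    """
    ranked = sorted(risk_scores, key=risk_scores.get, reverse=True)
    cutoff = round(len(ranked) * top_share)
    high_risk = ranked[:cutoff]

    rng = random.Random(seed)
    rng.shuffle(high_risk)               # randomize before assigning groups
    half = len(high_risk) // 2
    return high_risk[:half], high_risk[half:]


# HYPOTHETICAL scores standing in for the output of a risk model.
score_rng = random.Random(7)
scores = {policy_id: score_rng.random() for policy_id in range(10_000)}

study_ids, control_ids = select_high_risk_and_split(scores)
print(len(study_ids), len(control_ids))
```

Selection is deterministic (by score) but assignment within the selected subset is still random, so any conclusion drawn applies to the high-risk subset rather than to the full block.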

Assuming a baseline incidence rate of 12 percent, rather than the 2.5 percent assumed previously, we can recalculate the necessary group sizes. Using the same power calculation to detect a 120–basis point difference in incidence rates (the same 10 percent relative reduction as before), we would need 6,324 people in each group to have an 80 percent chance of detecting the difference at a significance level of 0.10.
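As a check on the arithmetic, both group sizes can be written out with the two-proportion formula from the cited Boston University module. This assumes a one-sided test at the 0.10 significance level, which is the assumption under which the formula reproduces both figures; the symbols are introduced here only for illustration.

```latex
n = 2\left(\frac{z_{1-\alpha} + z_{1-\beta}}{ES}\right)^{2},
\qquad
ES = \frac{|p_1 - p_2|}{\sqrt{\bar{p}\,(1-\bar{p})}},
\qquad
\bar{p} = \frac{p_1 + p_2}{2}

% alpha = 0.10 (one-sided), power = 1 - beta = 0.80
z_{0.90} + z_{0.80} \approx 1.28155 + 0.84162 = 2.12317

% Full population: p_1 = 0.0250, p_2 = 0.0225, \bar{p} = 0.02375
n \approx 2\,(2.12317)^{2}\,\frac{0.02375 \times 0.97625}{(0.0025)^{2}}
  \approx 33{,}446.1 \;\Rightarrow\; 33{,}447 \text{ (rounded up)}

% High-risk subset: p_1 = 0.120, p_2 = 0.108, \bar{p} = 0.114
n \approx 2\,(2.12317)^{2}\,\frac{0.114 \times 0.886}{(0.012)^{2}}
  \approx 6{,}323.8 \;\Rightarrow\; 6{,}324 \text{ (rounded up)}
```

The high-risk design needs a little under one-fifth of the lives per group because the absolute difference to be detected is almost five times larger, which more than offsets the higher variance at a 12 percent incidence rate.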

The difficult part of creating high-risk stratified random samples is collecting the appropriate information to create segments of risks. There are several data and analytics products available for insurers to obtain and evaluate these LTC risks.

Actual and Expected

In theory, an RCT that tests the difference in outcomes between two groups, given sufficient power for the outcome events being measured, will detect a real difference in results. Because studies of LTCI health program effectiveness are hard to implement and cannot be rerun easily, researchers should spend substantial time determining the proper group structure that will produce two groups appropriate for comparison.

Conceptually, the study and control groups in an RCT will produce similar results, all else being equal. Implicit in this statement is an assumption that the expected events (or rates, such as claim incidence) are equal. To ensure that two groups can be compared appropriately, rather than hope that the groupings are roughly equal (and therefore should produce similar expected events or rates), the actuary can apply traditional and enhanced techniques to set the expected value for each group and measure how actual experience emerges. This matters most when group sizes are small, because small differences in population mix can then produce larger differences in outcomes.

Traditional Expectation Setting

Imagine an actuary selecting two groups to analyze for a population health study: a control and a study group of LTC claimants, each with 1,000 people. The random selection of claimants into two buckets produces claimants of about the same age (say, the average age in each group is 86), but when the actuary reviews the benefit periods in each group, the study group has more lifetime benefit period claimants (60 percent) than the control group (50 percent).

The actuary then sets an expectation of claim termination rates within each group that differs between lifetime benefit claimants and non-lifetime benefit claimants. When measuring the outcomes of the program intervention, the actuary has a different expectation for the study and control groups based on their mix of claimant benefit periods.
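The mix adjustment described above can be sketched as an actual-to-expected (A/E) comparison. The group sizes and benefit-period mixes come from the example; the termination rates by benefit period and the observed termination counts are hypothetical values chosen only to show the mechanics.

```python
def expected_terminations(group_size, lifetime_share, rate_lifetime, rate_other):
    """Expected one-year claim terminations given the benefit-period mix."""
    lifetime = group_size * lifetime_share
    other = group_size - lifetime
    return lifetime * rate_lifetime + other * rate_other


# HYPOTHETICAL one-year termination rates by benefit period, for illustration.
RATE_LIFETIME = 0.30
RATE_OTHER = 0.36

# Mixes from the example: 60 percent lifetime (study), 50 percent (control).
expected_study = expected_terminations(1000, 0.60, RATE_LIFETIME, RATE_OTHER)
expected_control = expected_terminations(1000, 0.50, RATE_LIFETIME, RATE_OTHER)

# HYPOTHETICAL observed terminations after the program has run for a year.
actual_study, actual_control = 340, 330

ae_study = actual_study / expected_study
ae_control = actual_control / expected_control
print(f"Study A/E:   {ae_study:.3f}")
print(f"Control A/E: {ae_control:.3f}")
```

With these made-up inputs, the raw termination counts differ by about 3 percent, while the A/E ratios differ by about 5 percent, because the study group's heavier lifetime-benefit mix implies fewer expected terminations to begin with.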

Enhanced Expectation Setting

In the prior example, imagine that the actuary selects the study and control groups such that not only are the average ages similar but so is every policy and policyholder dimension found in traditional LTCI admin systems. These include attained age, duration on claim, benefit period, elimination period, product generation, sex, starting situs and others. The actuary has created two very similar groups for all intents and purposes.

Adding even more information—as available and compliant with regulations and laws—can further refine the expected outcomes of each group. For instance, because claimants have filed a HIPAA authorization to receive benefits, the company can now access prescription drug and medical records for each person in the study and control groups. The actuary can use that information to conduct a study that determines the impact that certain drug and medical events have on LTCI claims or use that information to select more similar control and study groups.

If study and control groups are already selected, the actuary can use the additional information to refine the expected events or rates and allow for a more accurate measurement of the program.

This refinement of expected values is particularly salient when measuring programs that may produce very small (yet real) changes in population outcomes.

Actuarial Credibility

Credibility is an important concept to actuaries and should be considered when determining how many policyholders should be included in the study of the effectiveness of a pilot program, as well as when determining if the results of a study are statistically significant. Credibility is a measure of the predictive value of the data.[3] Credibility theory can be complex, and a complete review of credibility procedures is beyond the scope of this article.[4] With that in mind, one way to review the credibility of a population, per Actuarial Standard of Practice (ASOP) 25, is to review the confidence interval[5] and hypothesis test. Actuarial judgment is used to establish the desired level of accuracy, typically expressed as a confidence level and an acceptable interval width. A stochastic approach such as Monte Carlo simulation or bootstrapping can also help to develop these statistics.[6]

Monte Carlo Method and Bootstrapping

Stochastic methods can help actuaries understand the underlying credibility, process risk and parameter risk of the underlying block when we want to study the results of a program. This will help us to consider whether the differences in results are due to actual program impacts or random variation. Additionally, these methods can be used to help determine how large a study should be to identify a statistical difference between the study and control groups. For example, if it is hypothesized that a treatment will reduce claims by 5 percent, we can use these methods to estimate the group size we will need to show that a 5 percent difference isn’t due to random fluctuation. This is another expression of the power estimates made earlier.
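The link between simulation and the earlier power estimates can be illustrated with a small Monte Carlo sketch. It reuses the high-risk example from earlier in the article (12.0 percent versus 10.8 percent incidence, 6,324 lives per group), treats each member's outcome as independent, and counts how often a one-sided test at the 0.10 level detects the difference. The function names, trial count and seed are hypothetical choices.

```python
import math
import random
from statistics import NormalDist


def simulate_claims(n_lives, incidence_rate, rng):
    """Count claims by drawing one uniform random number per member."""
    return sum(rng.random() < incidence_rate for _ in range(n_lives))


def simulated_power(n_per_group, p_control, p_study, alpha=0.10,
                    trials=200, seed=2023):
    """Share of simulated trials in which a one-sided z-test detects the gap."""
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    detections = 0
    for _ in range(trials):
        rate_c = simulate_claims(n_per_group, p_control, rng) / n_per_group
        rate_s = simulate_claims(n_per_group, p_study, rng) / n_per_group
        pooled = (rate_c + rate_s) / 2
        std_error = math.sqrt(2 * pooled * (1 - pooled) / n_per_group)
        # Detect only when the study group's rate is significantly lower.
        if std_error > 0 and (rate_c - rate_s) / std_error > z_alpha:
            detections += 1
    return detections / trials


power = simulated_power(n_per_group=6324, p_control=0.120, p_study=0.108)
print(f"Estimated power: {power:.2f}")
```

The estimate should land near the 80 percent target, within simulation noise; rerunning with smaller values of `n_per_group` shows how quickly the chance of detecting the same difference falls away.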

To help with this, actuaries can use a variety of stochastic methods, including Monte Carlo simulation or bootstrapping. Monte Carlo methods can simulate the claim incidence, claim termination and other pertinent rates of each member within a population using a randomly generated number. This is then repeated for a specified number of trials to simulate the expected range of results. The Monte Carlo method assumes that each member’s incidence or claim termination is independent. Bootstrapping, instead of generating outcomes from assumed rates like the Monte Carlo simulation, relies on repeatedly drawing random samples (with replacement) from a sufficiently large data set of actual experience. To use this technique, you will need a large data set of members and their claim experience, the population size and the number of trials.[7]
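A bootstrap of the difference in incidence rates can be sketched as follows. The outcome lists are synthetic stand-ins for a carrier's actual experience, and the function name, trial count and 90 percent interval are hypothetical choices for illustration.

```python
import random


def bootstrap_rate_difference(control, study, trials=1000, seed=2023):
    """Resample both groups with replacement; return sorted rate differences."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        c = rng.choices(control, k=len(control))   # resample control members
        s = rng.choices(study, k=len(study))       # resample study members
        diffs.append(sum(c) / len(c) - sum(s) / len(s))
    return sorted(diffs)


# SYNTHETIC experience for illustration: 1 = claim, 0 = no claim.
control_outcomes = [1] * 120 + [0] * 880   # 12.0 percent observed incidence
study_outcomes = [1] * 100 + [0] * 900     # 10.0 percent observed incidence

diffs = bootstrap_rate_difference(control_outcomes, study_outcomes)
lower = diffs[int(0.05 * len(diffs))]        # about the 5th percentile
upper = diffs[int(0.95 * len(diffs)) - 1]    # about the 95th percentile
print(f"90% bootstrap interval for the rate difference: {lower:.3f} to {upper:.3f}")
```

If the resulting interval includes zero, the observed difference is within the range that random variation alone could produce at this group size.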


Evaluating the effectiveness of LTC population health programs will be challenging. We intend for the statistical approaches we outline here to help actuaries and business managers understand key factors in designing these programs, with an eye toward measuring their financial and business impact. Establishing these parameters at the beginning of such a program is critical, as other business and environmental factors will complicate the evaluation over time.

Statements of fact and opinions expressed herein are those of the individual authors and are not necessarily those of the Society of Actuaries, the editors, or the respective authors’ employers.

Robert Eaton, FSA, MAAA, is a principal and consulting actuary at Milliman. Robert can be reached at robert.eaton@milliman.com.

Juliet Spector, FSA, MAAA, is a principal and consulting actuary at Milliman. Juliet can be reached at juliet.spector@milliman.com.

Jeff Anderson, FSA, MAAA, is a consulting actuary at Milliman. Jeff can be reached at jeff.anderson@milliman.com.

Joe Long, ASA, MAAA, is a consulting actuary and data scientist at Milliman. Joe can be reached at Joe.Long@milliman.com.

Brian Hartman, Ph.D., is a statistician at Milliman. Brian can be reached at Brian.Hartman@milliman.com.

Missy Gordon, FSA, MAAA, is a principal and consulting actuary at Milliman. Missy can be reached at missy.gordon@milliman.com.


[1] Mailman School of Public Health. Propensity score analysis, Columbia Public Health, updated Jan. 10, 2023, https://www.publichealth.columbia.edu/research/population-health-methods/propensity-score-analysis (accessed Jan. 23, 2023).

[2] Lisa Sullivan. Power and sample size determination, Boston University School of Public Health, n.d., https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_power/bs704_power_print.html (accessed Jan. 23, 2023).

[3] Actuarial Standards Board. ASOP no. 25: Credibility procedures (revision, second exposure draft), ASB, revised June 2013, http://www.actuarialstandardsboard.org/asops/credibility-procedures-3/ (accessed Jan. 23, 2023).

[4] For more information on credibility theory application, please refer to the American Academy of Actuaries’ Long-Term Care Credibility Monograph (https://www.actuary.org/sites/default/files/files/imce/LTC_Credibility_Monograph.pdf) and Credibility Theory Practice Note (https://www.actuary.org/sites/default/files/files/publications/Practice_note_on_applying_credibility_theory_july2008.pdf).

[5] A confidence interval is a range of values, constructed at a stated confidence level (e.g., 90 percent), that is expected to contain the true value of the quantity being estimated.

[6] Juliet Spector, Cory Gusland and Carol Kim. Insurance risk and its impact on provider shared risk payment models, SOA, Jan. 2018, https://www.soa.org/49345d/globalassets/assets/files/resources/research-report/2018/insurance-risk-impact.pdf (accessed Jan. 23, 2023).

[7] Spector et al.