Why Is Health Care Data So Challenging?

by David Cusick and Nathan Mathews

Data is the lifeblood of the health care business, giving insight to an organization's current status and thus helping actuaries and other analysts make informed decisions and projections on a daily basis. Because of the complexity and size of health care data, working with it can be frustrating and challenging as each new data set has unique issues and problems. In addition, health care data is a rapidly evolving and difficult-to-understand breed of business data.

However, there are useful strategies and tools to make health care data easier to interpret and work with. Data problems throughout the health care industry often stem from one of four origins:

misuse of data sources,
changing coding schemes,
non-standard time dimensions, and
the ongoing status of claims.

Becoming more aware of these four potential problems may increase understanding and the ability to use available data more efficiently.

Common Problems

Data Sources

No first look at a set of claims is ever enough to understand the complexity of the data. Using reference tables is necessary to understand the information contained in claims data. Claims data uses reference codes to save space and quickly convey information. If the data analyst does not take the time to use the reference codes, there is a danger of misinterpreting the data. However, using reference tables can add additional setup and processing time.

An example of data where these references are used is the Universal Billing Form 92 (UB-92), the official form used by hospitals and health care centers when submitting bills to Medicare and other payors for reimbursement. The reference codes used on the UB-92 form include: patient status, condition codes, revenue codes, HCPCS or CPT procedure codes, patient relationship codes, employment status codes, and diagnosis codes. Data analysts have to be very careful about accuracy and timeliness in reading reference codes because updates can create multiple versions for one data set.

Changing Code Schemes

In addition to paying attention to reference codes, data analysts have to be aware of the layout of the data before they begin loading. Code schemes or data formats are updated and changed on a fairly regular basis. One large-scale example is the conversion from the International Classification of Diseases (ICD) 9th standard to its 10th (ICD-10). On a smaller scale, health care carriers and pharmacy benefit managers (PBMs) make frequent incremental changes, such as adding new fields, dropping old ones, and updating field names. This can cause disruptions to existing load programs and/or summary queries.

Pharmacy pricing is a good example of data that is frequently updated. Average Wholesale Price (AWP) data, published by Medispan and FirstDataBank, are updated on a weekly basis. Not all drug prices are updated on a weekly basis, but regardless, the pricing reference table changes frequently. The carriers and PBMs typically send out notices of upcoming changes, but changes can come as a surprise if the data from a particular carrier or PBM has not recently been dealt with. Before beginning any data loading it is always a good idea to check the sources of reference data to be sure to have the correct versions to match the raw data.

Non-Standard Time Dimension

Another unstable field in data is the time dimension. Different medical data sets may contain different dates; the date supplied in the data may be based on the date the claim was incurred or on the date the claim was paid. In some cases only a year and month is indicated, whereas in others it may be a year, month, and day. Some systems base their transactional date on when the claim was last adjudicated for payment instead of the actual date paid. It is important to be aware of the types of dates (paid versus service date) when trying to keep consistencies between different data sets.

No Transaction Is Ever Final

Claims data can be in a constant state of flux as claims are reversed or adjusted well after the initial transaction is recorded. Typically, health care data is set up to be examined as a snapshot in time or as a transactional system, but not both within the same system. For example, a client might ask for summaries of the cost for 2010 brand and generic drugs by month, or the paid claim count for certain CPT codes during the month of March. However, when loading data on a consistent basis from a transactional system into a data warehouse environment, it is a critical concern for consultants to ensure that data is linked correctly between the two systems so that the dollar amounts are correct in the data warehouse.

Suggested Best Practice

One of the best ways to deal with these four related issues is to define an internal standardized format for using various data types. Standardized formats provide a fixed set of defined fields and data types that fit most needs for analysis and reporting. All incoming data is loaded into defined formats before doing any analysis.

One set of standardized formats will not meet the needs of every data user. We suggest sitting down with your team and creating a standard format that will fit your data needs. Be sure all reference tables are readily accessible and determine which data fields will be consistent and which will be changing over time, including the frequency of change. These standard formats should take into account future needs and have all data fields that may be needed for analysis.

Once a standard format is created, data can be more efficiently loaded using a load program or query, with necessary reference tables attached. As part of the loading process, make sure that any reversed or adjusted claims are accounted for or removed and confirm consistency of dates (incurred or paid). If incoming claim data formats change, the only updating is to the loading program or query. All reports and summary analysis will not need to be changed as they are run from the standard data format, saving hours of time and the headache of trying to update all previous work and analysis piece by piece. For example, our practice produces a summary report for a client from a monthly data file. As the incoming raw data format changes, the load programs are updated to incorporate these changes.

About the Authors:

David Cusick has more than 13 years of data experience and works as a health care technology consultant in the San Diego office of Milliman. He specializes in the design and implementation of data management systems for health care plans and providers. David can be reached at david.cusick@milliman.com or 760-688-6654.

Nathan Mathews has more than six years of data experience and works as a data analyst for the San Diego office of Milliman. His areas of expertise include data warehousing, data analysis, and reporting with a focus on medical and pharmacy data. Nathan can be reached at nathan.mathews@milliman.com or 858-587-5329.