Applications of Record Linkage Techniques

When it comes to record linkage, the challenge is bringing together the right information from several sources. Here's how to make things match up.

Record linkage is simply the bringing together of information from two or more records that is believed to relate to the same entity, e.g., the same individual, the same family or the same business. This may entail the linking of records within a single computer file to identify duplicate records. Alternatively, record linkage may entail the linking of records across two or more files. The challenge lies in bringing together the records for the same individual entities. Such a linkage is known as an exact match. The task is easiest when 1) the files have nearly unique identification numbers (e.g., social security numbers), 2) information is recorded in standardized formats and 3) the files are small. In the absence of nearly unique identifiers, names, addresses, dates-of-birth or other indirect identifiers are frequently used in the matching process.

Why Did Analysts Begin Linking Records?
Computerized record linkage techniques were introduced about 50 years ago. (The Social Security Administration had already used punch cards for record linkage studies in the 1930s.) The move to computerized linkage resulted from the confluence of several developments:

  • The expansion of both federal benefit programs and federal taxation systems after World War II forced federal governments to maintain large numbers of records on individual citizens and businesses.
  • New computer technology dramatically increased the ability of federal governments to maintain large computerized databases as well as to analyze them. In particular, this technology enabled researchers to merge two or more databases and to extract complex information from them at will.
  • The expansion of the role of the government resulted in a large demand for detailed information that could often be met at less cost by extracting the information from existing administrative files rather than from conducting new studies that required the collection of additional information from the public.

On the other hand, the governments' appetite for more information was met, in some countries, by public concern that it posed a major threat to individual privacy.

General Applications
There are three main types of applications of record linkage. In the first, two or more databases are merged to produce a database that has data fields that are not on any other single database. In the second, two or more databases are merged to improve the quality of the data on one of the databases. In the third, record linkage techniques are used to improve the quality of a single database. Following are examples of each of these applications.

Disabled Airplane Pilots–A Successful Application of Record Linkage
The following example shows how record linkage techniques can be used to detect fraud, waste or abuse of federal government programs. Here, two databases were merged to get information not previously available from a single database.

A database consisting of records on 40,000 airplane pilots licensed by the U.S. Federal Aviation Administration (FAA) and residing in Northern California was matched to a database consisting of individuals receiving disability payments from the Social Security Administration. Forty pilots whose records turned up on both databases were arrested. A prosecutor in the U.S. Attorney's Office in Fresno, California stated, according to an AP [2005] report, "there was probably criminal wrongdoing." The pilots were "either lying to the FAA or wrongfully receiving benefits.

"The pilots...claimed to be medically fit to fly airplanes. However, they may have been flying with debilitating illnesses that should have kept them grounded, ranging from schizophrenia and bipolar disorder to drug and alcohol addiction and heart conditions."

At least twelve of these individuals "had commercial or airline transport licenses," the report stated.

"The FAA...revoked 14 pilots' licenses." The "other pilots...were found to be lying about having illnesses in order to collect Social Security payments," said the report.

The quality of the linkage of the files was highly dependent on the quality of the names and addresses of the licensed pilots within both of the files being linked. The detection of the fraud was also dependent on the completeness and accuracy of the information in a particular Social Security Administration database.

CPS-IRS-SSA Exact Match Study
During the 1970s, the U.S. Bureau of the Census and the Social Security Administration (SSA) jointly carried out the 1973 CPS-IRS-SSA Exact Match Study. The goals of this study included the following:

  1. Improve policy simulation models of the tax-transfer system.
  2. Study the effects of alternative ways of pricing social security benefits.
  3. Examine age reporting differences among matched sources.
  4. Summarize lifetime covered earnings patterns of persons contributing to social security.
  5. Obtain additional information about non-covered earnings, i.e., those not subject to social security taxes.

The primary source of data for this study was the U.S. Census Bureau's March 1973 Current Population Survey (CPS). The study attempted to link the survey records of all of the individuals included in the March 1973 CPS to 1) the earnings and benefit information contained in their SSA administrative records (see footnote 1) and 2) selected items from their 1972 Internal Revenue Service (IRS) individual income tax returns (see footnote 2).

The CPS is a monthly household survey of the entire civilian non–institutionalized population of the United States, i.e., the 50 states and the District of Columbia. Members of the armed forces are included if they are living 1) on post with their families or 2) off post. The CPS collects demographic data as well as data on work experience, income and family composition. The March 1973 CPS consisted of about 50,000 households and more than 100,000 individuals ages 14 or older within those households.

So, the idea here was to combine the demographic data from the CPS with the more reliable income data from the IRS and the SSA.

Duplicate Single–Family Mortgage Records
The ABC Mortgage Guarantee Insurance Company (this is not a real company) was founded in 1970. Since its inception, it has insured over 40 million single–family mortgages under its Single–Family Insurance Fund (SFIF).

Unfortunately, some mortgages have been entered into ABC's single–family data warehouse under two or more identification numbers, usually with only slight differences between the identification numbers and few, if any, differences in the case data. When such records are displayed together as in the following table, it is usually apparent that both represent the same mortgage; however, because the identification numbers are different, the data warehouse treats both records as unique, individual mortgages.

Here, the two records match on the serial portion of their ABC identification numbers (113261), as well as on their initial mortgage amounts ($33,000). It is clear that they also match on their begin amortization dates, interest rates, property addresses and borrower name(s).

A much more pervasive problem is one of incorrect termination status. When a borrower refinances or prepays an ABC–insured single–family mortgage, the lender servicing that mortgage is supposed to so notify ABC, and ABC in turn should make the appropriate changes to its databases. But this process does not always work as intended. As a consequence, ABC has hundreds of thousands of mortgage records in its single–family data warehouse with a termination status of active when, in fact, the underlying mortgage has been refinanced, paid in full or terminated for some other reason. Because 1) ABC only insures first mortgages, i.e., those with a primary lien on the underlying property, and 2) no condominium units are supposed to be insured under the SFIF, it follows that there should be at most one active SFIF–insured mortgage per property address. Too often, this is not the case.
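The rule that each property address should carry at most one active SFIF–insured mortgage can be checked mechanically. The following is a minimal Python sketch of such a check. The field names (loan_id, address, status) and the simple address normalization are assumptions made for illustration; the layout of ABC's data warehouse is not described here.

```python
from collections import defaultdict

def addresses_with_multiple_active(records):
    """Return {address: [loan_id, ...]} for addresses with more than one active loan."""
    active = defaultdict(list)
    for rec in records:
        if rec["status"] == "active":
            # Crude normalization (assumption): ignore case and repeated spaces.
            key = " ".join(rec["address"].upper().split())
            active[key].append(rec["loan_id"])
    return {addr: ids for addr, ids in active.items() if len(ids) > 1}

# Hypothetical records, for illustration only.
sample = [
    {"loan_id": "A1", "address": "12 Elm St", "status": "active"},
    {"loan_id": "A2", "address": "12  elm st", "status": "active"},
    {"loan_id": "A3", "address": "9 Oak Ave", "status": "active"},
    {"loan_id": "A4", "address": "9 Oak Ave", "status": "terminated"},
]
violations = addresses_with_multiple_active(sample)
```

Any address returned by such a check is a candidate for an incorrect termination status and would still need to be confirmed against the underlying loan records.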

In this instance, the data warehouse lists two active mortgages on a property located in Queens Village, New York, where in reality the first mortgage that was originated in March of 1982 at an interest rate of 16.5 percent has been refinanced at least once, i.e., during September 2004.

Here again, we used a (deterministic) exact–match scheme. We first blocked the data based on the first four digits of the zip code of the property address, so that only records within the same block were compared with one another. This meant excluding all records that lacked a zip code. For each remaining record, we created a string consisting of 18 characters. The first part of the string consisted of the first 10 alphanumeric characters of the street address of the insured property. The next four characters of the string were the first four alphabetic characters of the name of the borrower. The next two characters represented the month the loan began amortizing while the last two characters represented the last two digits of the year the loan began amortizing.
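The blocking step and the 18-character string described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used at ABC: the function names, the record fields, the upper-casing and the "_" padding of short fields are all hypothetical choices.

```python
import re
from collections import defaultdict

def build_match_key(street_address, borrower_name, amort_month, amort_year):
    """Build the 18-character key: 10 address chars + 4 name chars + MM + YY."""
    addr = re.sub(r"[^A-Za-z0-9]", "", street_address).upper()[:10]
    name = re.sub(r"[^A-Za-z]", "", borrower_name).upper()[:4]
    # Padding short fields with "_" is an assumption that keeps the key length fixed.
    return f"{addr:_<10}{name:_<4}{amort_month:02d}{amort_year % 100:02d}"

def block_by_zip4(records):
    """Group records by the first four digits of the zip code."""
    blocks = defaultdict(list)
    for rec in records:
        zip_code = (rec.get("zip") or "").strip()
        if len(zip_code) < 4 or not zip_code[:4].isdigit():
            continue  # records lacking a usable zip code are excluded
        blocks[zip_code[:4]].append(rec)
    return dict(blocks)

# Hypothetical example values.
key = build_match_key("123-45 Hillside Ave.", "Smith, John", 3, 1982)
blocks = block_by_zip4([
    {"id": "M1", "zip": "11427"},
    {"id": "M2", "zip": "11428"},
    {"id": "M3", "zip": ""},
])
```

Comparing keys only within a block keeps the number of candidate pairs manageable on a file of this size, at the cost of never comparing records whose zip codes differ in the first four digits.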

We then identified pairs of mortgage records having duplicate strings and considered these to be duplicate records and deleted the erroneous records from our database. We also deleted the last four characters from the remaining strings, and again identified pairs of mortgage records having duplicate strings. The hope was that these represented pairs of mortgages with identical property addresses and borrowers. We considered most of these to represent ABC–insured mortgages that refinanced into new ABC–insured mortgages. If the loan record with the earlier begin–amortization–date was still listed as being an active loan, we changed the termination status of that loan record to "termination by prepayment." In most instances, we added the begin amortization date of the later loan as the termination date of the earlier one.

Finally, we deleted the last four characters of the rest of the string, leaving us with only the 10 characters from the property address field. We again identified pairs of mortgage records having duplicate strings. The hope was that these represented pairs of mortgages with identical property addresses. We considered many of these to represent ABC–insured mortgages on houses that were sold and where the new homebuyer also used ABC insurance. These pairs of records required staff review. If the loan record with the earlier begin–amortization–date was still listed as being an active loan and other criteria were met as well, we changed the termination status of that loan record to "termination by prepayment." In most instances, we used the date the later mortgage began amortizing as the termination date of the earlier mortgage.
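The three passes, on the full 18-character string, then on its first 14 characters, then on its first 10, can be sketched as below. This is a simplified Python illustration: it assumes the records have already been keyed and fall within one zip-code block, and it subtracts pairs found in earlier passes rather than deleting records between passes as described above. The function names and sample keys are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def duplicate_pairs(keyed_records, length):
    """Pairs of record ids whose keys agree on the first `length` characters."""
    groups = defaultdict(list)
    for rec_id, key in keyed_records:
        groups[key[:length]].append(rec_id)
    pairs = set()
    for ids in groups.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

def three_pass_match(keyed_records):
    # Pass 1: address + borrower + date agree (likely duplicate records).
    full = duplicate_pairs(keyed_records, 18)
    # Pass 2: address + borrower agree (likely refinances).
    addr_name = duplicate_pairs(keyed_records, 14) - full
    # Pass 3: address only agrees (possible resale; needs staff review).
    addr_only = duplicate_pairs(keyed_records, 10) - full - addr_name
    return full, addr_name, addr_only

# Hypothetical keyed records within a single zip-code block.
keyed = [
    ("M1", "12345HILLSSMIT0382"),
    ("M2", "12345HILLSSMIT0382"),
    ("M3", "12345HILLSSMIT0904"),
    ("M4", "12345HILLSJONE0610"),
    ("M5", "77MAINSTRELEEA0199"),
]
full, addr_name, addr_only = three_pass_match(keyed)
```

Each successive pass loosens the criterion, so the pairs it adds are less certain; that is why the address-only pairs are routed to staff review rather than updated automatically.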

Deterministic Record Linkage
In deterministic record linkage, a pair of records is said to be a link if the two records agree exactly on each element within a collection of identifiers called the match key. For example, when comparing two records on last name, street name, year of birth and street number, the pair of records is deemed to be a link only if the names agree on all characters, the years of birth are the same and the street numbers are identical. In the ABC Mortgage Company example, we say we have a match if the two strings to be compared are identical. In many instances this criterion is too strict; too many actual matches are missed and too many duplicate records remain on the database.
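A deterministic rule of this kind amounts to an all-or-nothing comparison on the match key. The short Python sketch below uses the four identifiers from the example above; the field names and sample records are hypothetical.

```python
MATCH_KEY = ("last_name", "street_name", "street_number", "birth_year")

def is_deterministic_link(rec_a, rec_b, match_key=MATCH_KEY):
    """True only if the two records agree exactly on every element of the match key."""
    return all(rec_a[field] == rec_b[field] for field in match_key)

# Hypothetical records: b is the same person as a, but with a misspelled last name.
a = {"last_name": "SMITH", "street_name": "HILLSIDE", "street_number": "123", "birth_year": 1950}
b = {"last_name": "SMYTH", "street_name": "HILLSIDE", "street_number": "123", "birth_year": 1950}
c = dict(a)

exact = is_deterministic_link(a, c)
missed = is_deterministic_link(a, b)
```

The pair (a, b) is rejected because of a single differing character, which illustrates how a strict criterion can miss actual matches.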

Probabilistic Record Linkage
Probabilistic record linkage is a way to obtain a less stringent matching criterion. Fellegi and Sunter [1969] developed a record linkage model based on earlier work of Newcombe [1959]. Under the Fellegi–Sunter model, pairs of records are treated as links, possible links or non–links. In the terminology of statistics, the problem of choosing one status from among link, possible link and non–link can be viewed as a double–hypothesis testing problem.
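One common way to apply the Fellegi–Sunter model is to score each pair of records with a sum of agreement and disagreement weights and to compare the total with two thresholds. The Python sketch below assumes that agreement on different fields is conditionally independent. The m-probabilities (chance of agreement on a field given a true match), the u-probabilities (chance of agreement given a non-match) and both thresholds are invented for illustration; in practice they would be estimated from the files being linked.

```python
import math

# Hypothetical (m, u) probabilities per field; not taken from any real study.
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "street_number": (0.90, 0.02),
}

def match_weight(rec_a, rec_b, probs=FIELD_PROBS):
    """Sum of log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, (m, u) in probs.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total

def classify(weight, lower=0.0, upper=8.0):
    """Assign one of the three statuses using two assumed thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

# Hypothetical records.
a = {"last_name": "SMITH", "birth_year": 1950, "street_number": "123"}
b = {"last_name": "SMYTH", "birth_year": 1950, "street_number": "123"}
c = {"last_name": "JONES", "birth_year": 1971, "street_number": "8"}

status_same = classify(match_weight(a, a))
status_typo = classify(match_weight(a, b))
status_diff = classify(match_weight(a, c))
```

Unlike the deterministic rule, a pair that disagrees on one field can still accumulate enough weight from the other fields to be held for clerical review as a possible link.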

Record linkage models that are based on the Bayesian paradigm of statistics are described in Belin and Rubin [1995], Larsen [2004], and McGlincy [2006].

The authors' forthcoming text, Data Quality and Record Linkage Techniques, to be published during 2007 by Springer, is a comprehensive treatment of record linkage models and related tools.

References
Belin, T.R. and D.B. Rubin, "A Method for Calibrating False Matches in Record Linkage," Journal of the American Statistical Association, Vol. 90, pages 694–707, 1995.

Curtis, K., "Pilots Arrested For Disability Payments," The Associated Press, July 25, 2005.

Fellegi, I.P. and A.B. Sunter, "A Theory of Record Linkage," Journal of the American Statistical Association, Vol. 64, pages 1183–1210, 1969.

Herzog, T.N., F.J. Scheuren, and W.E. Winkler, Data Quality and Record Linkage Techniques, New York: Springer, 2007 (to appear).

Larsen, M.D., "Record Linkage Using Finite Mixture Models," in Applied Bayesian Modeling and Causal Inference from Incomplete Data Perspectives, ed. A. Gelman and X.-L. Meng, New York: John Wiley & Sons, 2004.

McGlincy, M.H., "Using Test Databases to Evaluate Record Linkage Models and Train Linkage Practitioners," paper presented at the 2006 annual meeting of the American Statistical Association (to appear).

Author Information
Thomas N. Herzog, PhD, ASA, is chief actuary at the Federal Housing Administration in Washington, D.C. He can be reached at Thomas_N._Herzog@hud.gov

Fritz J. Scheuren, PhD, is vice president for statistics at National Opinion Research Center. He is the 100th President of the American Statistical Association and a Fellow of both the American Statistical Association and the American Association for the Advancement of Science.

William E. Winkler, PhD, is principal researcher at the U.S. Census Bureau in Suitland, MD. He is a fellow of the American Statistical Association.

Footnotes

  1. Specifically, this information was taken from SSA's Summary Earnings Record (SER) and Master Beneficiary Record (MBR) databases.
  2. More precisely, this information was taken from the IRS Individual Master Tax File for 1972.