Merging large administrative data sets is challenging due to missing identifiers and inaccurate records. This paper introduces a new algorithm for probabilistic record linkage that handles these issues efficiently at scale.
Data & Methods: We developed a fast, scalable algorithm based on the canonical probabilistic model of record linkage (sketched after the list below). The method accommodates millions of observations while accounting for:
- Missing data
- Measurement error
- Incorporation of auxiliary information
- Adjustment for linkage uncertainty in post-merge analyses
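To fix ideas, the canonical model is usually written as a two-component mixture over field-level agreement patterns; the notation below is a standard sketch of that model, not the paper's exact specification. For a candidate pair of records $(i, j)$, let $\gamma_k(i, j)$ indicate agreement on linkage field $k = 1, \dots, K$, and let $M_{ij} \in \{0, 1\}$ be the latent match indicator. Assuming fields are conditionally independent given match status,

$$
P\bigl(\gamma(i, j)\bigr) = \lambda \prod_{k=1}^{K} P\bigl(\gamma_k(i, j) \mid M_{ij} = 1\bigr) + (1 - \lambda) \prod_{k=1}^{K} P\bigl(\gamma_k(i, j) \mid M_{ij} = 0\bigr),
$$

where $\lambda$ is the overall probability that a pair is a true match. The parameters are estimated (commonly via the EM algorithm), and a pair is declared a link when its posterior match probability $P(M_{ij} = 1 \mid \gamma(i, j))$ exceeds a chosen threshold.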
Simulation Studies: We assessed the algorithm's reliability through extensive simulations under a range of realistic scenarios.
Real Applications: Case studies demonstrate its use in merging campaign contribution records, survey data, and voter files. An open-source implementation is available for researchers.
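For intuition only, the following is a minimal, self-contained Python sketch of how such a mixture model can be fit with EM over field-level agreement patterns. It is a toy illustration under standard assumptions (binary agreement indicators, conditional independence), not the authors' open-source implementation; the agreement data in `gamma` and the starting values are made up.

```python
import numpy as np

# Toy agreement patterns: each row is a candidate record pair, each column a
# linkage field (1 = the pair agrees on that field, 0 = it disagrees).
# These values are made up for illustration.
gamma = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
], dtype=float)

n_pairs, n_fields = gamma.shape

# Mixture parameters: lam = share of true matches,
# m[k] = P(agree on field k | match), u[k] = P(agree on field k | non-match).
lam = 0.5
m = np.full(n_fields, 0.9)
u = np.full(n_fields, 0.2)

for _ in range(100):
    # E-step: posterior match probability for each pair, assuming fields are
    # conditionally independent given match status.
    log_match = gamma @ np.log(m) + (1 - gamma) @ np.log(1 - m)
    log_nonmatch = gamma @ np.log(u) + (1 - gamma) @ np.log(1 - u)
    num = lam * np.exp(log_match)
    zeta = num / (num + (1 - lam) * np.exp(log_nonmatch))

    # M-step: update the mixture weight and per-field agreement probabilities
    # (clipped away from 0 and 1 to keep the logs finite).
    lam = zeta.mean()
    m = np.clip((zeta[:, None] * gamma).sum(0) / zeta.sum(), 1e-6, 1 - 1e-6)
    u = np.clip(((1 - zeta)[:, None] * gamma).sum(0) / (1 - zeta).sum(), 1e-6, 1 - 1e-6)

# Pairs with a high posterior probability would be declared links.
print("posterior match probabilities:", np.round(zeta, 3))
```

A full-scale implementation must additionally handle missing and partially agreeing fields, auxiliary information, and the propagation of linkage uncertainty into downstream analyses, all of which this toy omits.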