🔍 The Challenge — Missing Covariates Are Everywhere
Datasets built from text, images, merged surveys, and voter files often lack key covariates because those features are latent (for example, sentiment in text) or simply not collected (for example, race in voter files).
⚠️ A Common Shortcut — And Why It Fails
A widespread approach is to hand-label the true covariate for a subset of observations, train a machine learning model to predict that covariate for the remainder, and then plug those predictions into regressions as if they were the true values. Doing so without accounting for prediction error yields biased and inconsistent coefficient estimates and overconfident inference.
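To make the failure concrete, here is a minimal simulation sketch. It is not the procedure proposed in this work, only an illustration of the naive plug-in problem. Everything in it is an assumption chosen for illustration: the function name `simulate_naive_plugin`, a binary true covariate with balanced classes, a classifier whose errors are symmetric and unrelated to the outcome (`flip_prob`), and a linear outcome model with true coefficient `beta`. Real prediction errors can be less well behaved than this.

```python
import numpy as np


def simulate_naive_plugin(n=20000, beta=2.0, flip_prob=0.2, seed=0):
    """Compare OLS on the true covariate with OLS on a noisy prediction of it.

    Hypothetical setup (all assumptions, not taken from the source):
    a binary covariate, a classifier that mislabels each unit with
    probability `flip_prob`, and outcome y = 1 + beta * x + noise.
    """
    rng = np.random.default_rng(seed)

    # True binary covariate (e.g. a hand-labelable attribute), balanced classes.
    x = rng.integers(0, 2, size=n)

    # Simulated "machine learning prediction": the true label, flipped at random.
    flips = rng.random(n) < flip_prob
    x_hat = np.where(flips, 1 - x, x)

    # Outcome depends on the TRUE covariate, not on the prediction.
    y = 1.0 + beta * x + rng.normal(0.0, 1.0, size=n)

    def ols_slope(covariate):
        # Least-squares fit of y on an intercept and one covariate.
        design = np.column_stack([np.ones(n), covariate])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coef[1]

    oracle_slope = ols_slope(x)      # infeasible benchmark: uses the truth
    naive_slope = ols_slope(x_hat)   # plug-in: treats predictions as truth
    return oracle_slope, naive_slope


if __name__ == "__main__":
    oracle, naive = simulate_naive_plugin()
    print(f"slope using true covariate:      {oracle:.3f}")
    print(f"slope using predicted covariate: {naive:.3f}")
```

Under these assumptions the plug-in slope is pulled toward zero by a factor of about 1 - 2 * `flip_prob`, so with the defaults it lands near 1.2 rather than the true 2.0. Collecting more data does not help: the estimate converges to the wrong value, which is what inconsistency means here, and its conventional standard error shrinks around that wrong value.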
🛠️ A Practical Fix That Restores Validity
This work characterizes how severe the problems from prediction error can be and describes a procedure that avoids these inconsistencies under comparatively general assumptions. Key features:
- Explicitly accounts for prediction error when using predicted covariates in regressions
- Produces consistent estimates and honest uncertainty quantification under broad conditions
🔬 How the Method Is Evaluated
Performance is demonstrated through:
- Simulation studies that evaluate estimator behavior across scenarios
- An applied study of hostile political dialogue on the Internet that tests real-world performance
💾 Tools for Applied Researchers
Software implementing the proposed approach is provided to facilitate adoption.
📈 Why It Matters
When machine learning is used to impute missing covariates across text, images, merged surveys, or voter files, naive plug-in of predictions into regressions can produce misleading results. The proposed procedure enables valid effect estimation and accurate uncertainty reporting in such settings.