Why Plugging ML Predictions Into Regressions Misleads

Insights from the Field

missing data

machine learning

regression

prediction error

text analysis

"Machine Learning Predictions as Regression Covariates" was authored by Christian Fong and Matthew Tyler. It was published by Cambridge University Press in Political Analysis in 2021.

🔍 The Challenge — Missing Covariates Are Everywhere

Datasets built from text, images, merged surveys, and voter files often lack key covariates because those features are latent (for example, sentiment in text) or simply not collected (for example, race in voter files).

⚠️ A Common Shortcut — And Why It Fails

A widespread approach is to hand-label the true covariate for a subset of observations, train a machine learning model to predict that covariate for the remainder, and then plug those predictions into regressions. Doing so without accounting for prediction error leads to biased, inconsistent, and overconfident inference.
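To make the failure concrete, here is a minimal simulation sketch in Python using NumPy. It assumes, purely for illustration, a single continuous covariate and a hypothetical "ML prediction" equal to the truth plus independent noise. Real prediction error need not take this form, and the function name `simulate_plug_in` and all parameter values are invented for the example rather than taken from the paper.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from an OLS regression of y on an intercept and x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def simulate_plug_in(n=20000, beta=2.0, noise_sd=1.0, seed=0):
    """Compare a regression on the true covariate with a plug-in regression.

    Assumption (illustrative only): the prediction is the true covariate
    plus independent Gaussian noise with standard deviation noise_sd.
    """
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=n)                            # covariate we wish we had
    y = beta * x_true + rng.normal(size=n)                 # outcome depends on the truth
    x_pred = x_true + rng.normal(scale=noise_sd, size=n)   # hypothetical ML prediction

    oracle = ols_slope(x_true, y)    # infeasible benchmark using the true covariate
    plug_in = ols_slope(x_pred, y)   # the shortcut: treat predictions as data
    return oracle, plug_in


if __name__ == "__main__":
    oracle, plug_in = simulate_plug_in()
    print(f"slope using true covariate: {oracle:.3f}")
    print(f"slope using predictions:    {plug_in:.3f}")
```

Under these assumptions the plug-in slope is pulled toward zero: with equal signal and noise variance, it shrinks to roughly half the true coefficient, while the regression on the true covariate recovers it. Collecting more data does not help, because the plug-in estimate converges to the wrong value.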

🛠️ A Practical Fix That Restores Validity

This work characterizes how severe the problems from prediction error can be and describes a procedure that avoids these inconsistencies under comparatively general assumptions. Key features:

Explicitly accounts for prediction error when using predicted covariates in regressions
Produces consistent estimates and honest uncertainty quantification under broad conditions
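One generic way to account for prediction error, sketched here as an illustration and not as a reproduction of the paper's procedure or software, is to use the hand-labeled subset to learn how predictions relate to the truth. The example below treats the prediction as an instrument for the true covariate on the labeled rows. It assumes that the prediction error is unrelated to the outcome noise and, for simplicity, that the error is additive and Gaussian. The function name `compare_estimators` and every parameter value are hypothetical.

```python
import numpy as np


def compare_estimators(n=50000, n_labeled=5000, beta=2.0, noise_sd=1.0, seed=1):
    """Contrast a naive plug-in slope with an instrumental-variables slope.

    Assumptions (illustrative only): additive Gaussian prediction error that
    is independent of the outcome noise; the first n_labeled rows are the
    hand-labeled subset where the true covariate is observed.
    """
    rng = np.random.default_rng(seed)
    x_true = rng.normal(size=n)
    x_pred = x_true + rng.normal(scale=noise_sd, size=n)   # hypothetical ML prediction
    y = beta * x_true + rng.normal(size=n)

    # Naive shortcut: regress the outcome on predictions for every row.
    naive = np.cov(x_pred, y)[0, 1] / np.var(x_pred, ddof=1)

    # Instrumental-variables slope on labeled rows only: the prediction
    # serves as the instrument for the true covariate.
    labeled = np.arange(n) < n_labeled
    x_lab, p_lab, y_lab = x_true[labeled], x_pred[labeled], y[labeled]
    iv = np.cov(p_lab, y_lab)[0, 1] / np.cov(p_lab, x_lab)[0, 1]

    return naive, iv


if __name__ == "__main__":
    naive, iv = compare_estimators()
    print(f"naive plug-in slope:        {naive:.3f}")
    print(f"labeled-subset IV slope:    {iv:.3f}")
```

In this stylized setting the instrumental-variables slope lands near the true coefficient while the naive slope does not. A full treatment would also need standard errors that reflect the extra uncertainty from prediction error, which this sketch omits.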

🔬 How the Method Is Evaluated

Performance is demonstrated through:

Simulation studies that evaluate estimator behavior across scenarios
An applied study of hostile political dialogue on the Internet that tests real-world performance

💾 Tools for Applied Researchers

Software implementing the proposed approach is provided to facilitate adoption.

📈 Why It Matters

When machine learning is used to impute missing covariates across text, images, merged surveys, or voter files, plugging those predictions into regressions without correcting for prediction error can produce biased estimates and overconfident inference. The proposed procedure enables valid effect estimation and accurate uncertainty reporting in such settings.