Why Plugging ML Predictions Into Regressions Misleads — And How to Fix It
Insights from the Field
missing data
machine learning
regression
prediction error
text analysis
Methodology
Pol. An.
10 R files
1 Stata file
7 text files
14 other files
9 PDF files
27 datasets
Dataverse
"Machine Learning Predictions as Regression Covariates" was authored by Christian Fong and Matthew Tyler. It was published by Cambridge University Press in Political Analysis in 2021.

🔍 The Challenge — Missing Covariates Are Everywhere

Datasets built from text, images, merged surveys, and voter files often lack key covariates because those features are latent (for example, sentiment in text) or simply not collected (for example, race in voter files).

⚠️ A Common Shortcut — And Why It Fails

A widespread approach is to hand-label the true covariate for a subset of observations, train a machine learning model to predict that covariate for the remainder, and then plug those predictions into regressions. Doing so without accounting for prediction error leads to biased, inconsistent, and overconfident inference.
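To make the bias concrete, here is a minimal simulation sketch. It is not taken from the paper or its replication files. It assumes a binary covariate, a classifier whose errors are random label flips independent of the outcome, and a true slope of 2; the function names and parameter values are hypothetical. For simplicity the predictions are plugged in for every row, and the first rows play the role of the hand-labeled subset.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with intercept)."""
    x_c = x - x.mean()
    return float(x_c @ (y - y.mean()) / (x_c @ x_c))


def simulate_plug_in(n=200_000, n_labeled=2_000, beta=2.0, accuracy=0.8, seed=0):
    """Compare slopes from the true covariate, naive plug-in, and labeled rows only.

    Assumption (illustrative): the classifier flips the true binary label
    with probability 1 - accuracy, independently of everything else.
    """
    rng = np.random.default_rng(seed)
    x = rng.integers(0, 2, size=n).astype(float)   # true covariate
    y = 1.0 + beta * x + rng.normal(size=n)        # outcome
    flipped = rng.random(n) > accuracy             # misclassified rows
    x_hat = np.where(flipped, 1.0 - x, x)          # ML prediction
    return {
        "true_covariate": ols_slope(x, y),
        "naive_plug_in": ols_slope(x_hat, y),
        "labeled_only": ols_slope(x[:n_labeled], y[:n_labeled]),
    }


results = simulate_plug_in()
for name, slope in results.items():
    print(f"{name:>15}: {slope:.3f}")
```

Under these assumptions the naive plug-in slope shrinks toward zero (to roughly 60% of the true value when the classifier is 80% accurate on a balanced covariate), while the regression on labeled rows alone stays centered on the truth but is noisier because it uses far fewer observations.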

🛠️ A Practical Fix That Restores Validity

This work characterizes how severe the problems from prediction error can be and describes a procedure that avoids these inconsistencies under comparatively general assumptions. Key features:

  • Explicitly accounts for prediction error when using predicted covariates in regressions
  • Produces consistent estimates and honest uncertainty quantification under broad conditions

🔬 How the Method Is Evaluated

Performance is demonstrated through:

  • Simulation studies that evaluate estimator behavior across scenarios
  • An applied study of hostile political dialogue on the Internet that tests real-world performance
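As a rough illustration of what a simulation study of this kind can measure, the sketch below estimates how often a nominal 95% confidence interval from the naive plug-in regression covers the true slope. It is not the paper's simulation design. The data-generating process (binary covariate, random label flips, true slope of 2) and all function names are assumptions made for illustration.

```python
import numpy as np


def naive_ci_covers(rng, n, beta, accuracy):
    """Run one replication; return True if the naive 95% CI covers beta."""
    x = rng.integers(0, 2, size=n).astype(float)            # true covariate
    y = 1.0 + beta * x + rng.normal(size=n)                 # outcome
    x_hat = np.where(rng.random(n) > accuracy, 1.0 - x, x)  # noisy prediction
    x_c = x_hat - x_hat.mean()
    sxx = x_c @ x_c
    slope = x_c @ (y - y.mean()) / sxx
    resid = (y - y.mean()) - slope * x_c
    se = np.sqrt(resid @ resid / (n - 2) / sxx)             # classical OLS SE
    return bool(abs(slope - beta) <= 1.96 * se)


def naive_coverage(accuracy, reps=500, n=1000, beta=2.0, seed=0):
    """Share of replications in which the naive CI contains the true slope."""
    rng = np.random.default_rng(seed)
    hits = [naive_ci_covers(rng, n, beta, accuracy) for _ in range(reps)]
    return float(np.mean(hits))


for acc in (1.0, 0.95, 0.8):
    print(f"accuracy={acc:.2f}  coverage={naive_coverage(acc):.3f}")
```

With perfect predictions the interval should cover the truth about 95% of the time. In this stylized setup coverage collapses as classifier accuracy falls, which is one way to see the overconfidence described above.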

💾 Tools for Applied Researchers

Software implementing the proposed approach is provided to facilitate adoption.

📈 Why It Matters

When machine learning is used to impute missing covariates across text, images, merged surveys, or voter files, naive plug-in of predictions into regressions can produce misleading results. The proposed procedure enables valid effect estimation and accurate uncertainty reporting in such settings.
