📌 What Was Compared
This study compares two common multiple imputation (MI) approaches. Joint multivariate normal (MVN) MI models the complete data as a sample from a single multivariate normal distribution, and its implementations typically handle discrete variables by drawing continuous latent values and converting them into categories. Conditional MI instead models each variable conditional on all of the others, so every variable can be imputed with a model matched to its type.
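To make the contrast concrete, here is a minimal toy sketch of the two ideas for a single binary variable with missing values. It uses hand-rolled models and scikit-learn rather than the packages the study itself evaluates; the variable names, the 0.5 rounding rule, and the two-variable setup are illustrative assumptions, not the paper's procedure.

```python
# Toy sketch (not the paper's code): one fully observed continuous predictor x
# and one binary variable y with roughly 30% of its values missing.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                                   # fully observed
y = rng.binomial(1, 1 / (1 + np.exp(-1.5 * x)))          # binary, depends on x
miss = rng.random(n) < 0.3
y_obs = y.astype(float)
y_obs[miss] = np.nan
obs = ~miss
X_obs, X_miss = x[obs].reshape(-1, 1), x[miss].reshape(-1, 1)

# Joint-MVN flavour: treat (x, y) as jointly normal, draw y for the missing
# cases from the conditional normal given x, then force the continuous draw
# back into {0, 1} by rounding.
lin = LinearRegression().fit(X_obs, y_obs[obs])
resid_sd = np.std(y_obs[obs] - lin.predict(X_obs))
latent = lin.predict(X_miss) + rng.normal(scale=resid_sd, size=miss.sum())
y_mvn = y_obs.copy()
y_mvn[miss] = (latent > 0.5).astype(float)

# Conditional flavour: model y given x with a model that respects y's type
# (logistic regression here) and draw imputations as Bernoulli outcomes.
logit = LogisticRegression().fit(X_obs, y[obs])
p_hat = logit.predict_proba(X_miss)[:, 1]
y_cond = y_obs.copy()
y_cond[miss] = rng.binomial(1, p_hat).astype(float)
```

Real joint MVN MI estimates one mean vector and covariance matrix for all variables at once rather than working pairwise as above, but the essential contrast survives the simplification: the joint model reaches categories through continuous latent draws, while the conditional model lets each variable keep an imputation model suited to its type.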
📊 How The Comparison Was Done
- Two performance targets were assessed:
- Accuracy of the imputed values themselves
- Accuracy of coefficients and fitted values from analysis models run on completed datasets
- Simulations covered a range of variable types:
- Continuous, binary, ordinal, and unordered-categorical
- Two simulation sources were used:
- Synthetic data drawn from a multivariate normal distribution
- Realistic data drawn from the 2008 American National Election Studies (ANES)
- Missingness was generated so that the Missing At Random (MAR) conditions held by construction, with the probability of a value being missing depending only on observed values of other variables; this is a less restrictive and more realistic setup than the Missing Completely At Random designs often used in missing-data simulation studies (the sketch after this list illustrates the MAR setup together with the two accuracy targets above).
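Below is a compact sketch of how such a design can be wired up, covering the MAR missingness rule and both accuracy targets from the list above. The variable names, the logistic missingness rule, and the use of scikit-learn's IterativeImputer to produce a single completed dataset are my illustrative assumptions, not the paper's actual simulation code.

```python
# Sketch of MAR missingness generation plus the two accuracy targets.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 2000
z = rng.normal(size=n)                       # always observed
x = 0.8 * z + rng.normal(size=n)             # will receive missing values

# MAR: the probability that x is missing depends only on the observed z,
# not on the possibly unseen value of x itself (that would be MNAR) and
# not on pure chance alone (that would be MCAR).
p_miss = 1 / (1 + np.exp(-(z - 0.5)))
miss = rng.random(n) < p_miss
data = np.column_stack([z, x])
data[miss, 1] = np.nan

# Target 1: accuracy of the imputed values against the held-back truth.
completed = IterativeImputer(random_state=1).fit_transform(data)
rmse_imputed = np.sqrt(np.mean((completed[miss, 1] - x[miss]) ** 2))

# Target 2: accuracy of an analysis-model coefficient estimated from the
# completed data, compared with the coefficient from the full data.
beta_full = LinearRegression().fit(z.reshape(-1, 1), x).coef_[0]
beta_imputed = LinearRegression().fit(completed[:, [0]], completed[:, 1]).coef_[0]
print(rmse_imputed, beta_full, beta_imputed)
```

A full MI workflow would repeat the imputation step several times and pool the analysis-model estimates across the completed datasets; a single completed dataset is shown here only to keep the sketch short.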
🔍 What Was Found
- In these simulations, conditional MI produced more accurate imputations and more accurate analysis results than joint MVN MI whenever the dataset included categorical variables.
- The advantage of conditional MI held across both the MVN-generated data and the ANES-based simulations, and across binary, ordinal, and unordered-categorical variables.
- Joint MVN MI’s common practice of treating discrete outcomes as derived from continuous latent values appears to undercut its accuracy when categorical variables are present.
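As a toy illustration of that last point (my own simplification, not an example from the paper), take a binary variable and impute it marginally the latent-continuous way: fit a normal with the observed mean and variance, draw a latent value, and round at 0.5. The imputed rate drifts away from the true rate, and the size and direction of the drift depend on that rate.

```python
import numpy as np

rng = np.random.default_rng(2)
for p_true in (0.1, 0.2, 0.4):
    y = rng.binomial(1, p_true, size=200_000).astype(float)
    # Normal model with the observed mean and variance, latent draw, rounding.
    latent = rng.normal(loc=y.mean(), scale=y.std(), size=y.size)
    p_rounded = (latent > 0.5).mean()
    print(p_true, round(p_rounded, 3))   # e.g. 0.1 -> ~0.09, 0.2 -> ~0.23, 0.4 -> ~0.42
```

Practical joint MVN implementations condition on the other variables rather than imputing marginally, but the rounding step is where this kind of distortion can creep in.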
💡 Why It Matters
These results suggest that applied researchers using MI on datasets containing any categorical variables should favor conditional imputation approaches over standard joint MVN implementations: in these simulations, conditional MI yielded more accurate imputations and more accurate downstream inferences under realistic MAR conditions.