Machine vs Human Geocoding: Similar Validity, Different Errors in Colombian Rights Data
Insights from the Field
geocoding
Colombia
human-rights
spatial-probit
INLA
Methodology
Pol. An.
17 R files
15 other files
1 PDF file
9 text files
Dataverse
Human Rights Violations in Space: Assessing the External Validity of Machine Geo-coded vs. Human Geo-coded Data was authored by Logan Stundal, Benjamin Bagozzi, John Freeman, and Jennifer Holmes. It was published by Cambridge University Press in Political Analysis in 2022.

🔎 What Was Compared:

This study compares human- and machine-geocoded records of human rights violations in Colombia against an independent ground-truth source to assess external validity. Agreement rates between the two geocoding approaches are evaluated for an eight-year focal period, for three consecutive two-year subperiods, and for a selected set of (non)journalistically remote municipalities.

📊 How The Data Were Tested:

  • Event type: human rights violations in Colombia.
  • Temporal scope: one eight-year focal period and three consecutive two-year subperiods.
  • Spatial scope: nationwide with targeted analysis of (non)journalistically remote municipalities.
  • Benchmark: an independent ground-truth dataset used to measure agreement between human and machine geocodes.
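The agreement comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' replication code: the municipality names and event assignments below are toy values, and the real study works with full event records rather than simple label lists.

```python
# Minimal sketch: municipality-level agreement between two geocoded
# event lists and an independent ground-truth source.

def agreement_rate(coded_a, coded_b):
    """Share of events assigned to the same municipality by both sources."""
    assert len(coded_a) == len(coded_b)
    matches = sum(a == b for a, b in zip(coded_a, coded_b))
    return matches / len(coded_a)

# Toy assignments of the same five events to municipalities
machine = ["Bogota", "Cali", "Medellin", "Cali", "Tumaco"]
human   = ["Bogota", "Cali", "Medellin", "Pasto", "Tumaco"]
truth   = ["Bogota", "Cali", "Medellin", "Pasto", "Tumaco"]

print(agreement_rate(machine, human))  # machine vs. human: 0.8
print(agreement_rate(machine, truth))  # machine vs. ground truth: 0.8
print(agreement_rate(human, truth))    # human vs. ground truth: 1.0
```

The same rate can be computed within subperiods or within the (non)journalistically remote municipalities by filtering the event lists first.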

🧮 How The Models Compared Predictive Performance:

Spatial probit models were estimated separately on each of the three datasets to compare predictive patterns. These models incorporate Gaussian Markov Random Field (GMRF) error processes, are constructed via a stochastic partial differential equation (SPDE) approach, and are estimated using integrated nested Laplace approximation (INLA). The models test whether datasets:

  • Produce comparable predictions;
  • Underreport events relative to the same covariates; and
  • Share similar patterns of prediction error.
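The third question, whether the datasets share prediction errors, can be illustrated with a toy calculation. This sketch stands in for the paper's SPDE/INLA spatial probit analysis: it takes predicted probabilities from two hypothetical models as given, flags per-unit classification errors, and measures how much the two error sets overlap. The outcomes, probabilities, and 0.5 cutoff are all illustrative assumptions.

```python
# Sketch: do two models fitted to different datasets err on the
# same spatial units? Overlap near 1 means shared errors; overlap
# near 0 means distinct error patterns despite similar accuracy.

def error_indicators(y_obs, p_hat, cutoff=0.5):
    """1 where the model's classification misses the observed outcome."""
    return [int((p >= cutoff) != bool(y)) for y, p in zip(y_obs, p_hat)]

def error_overlap(err_a, err_b):
    """Jaccard overlap of the two models' error sets across units."""
    both = sum(a and b for a, b in zip(err_a, err_b))
    either = sum(a or b for a, b in zip(err_a, err_b))
    return both / either if either else 1.0

y = [1, 0, 0, 1, 1, 0]                      # observed violations by unit
p_machine = [0.7, 0.2, 0.6, 0.4, 0.8, 0.1]  # errs on units 2 and 3
p_human   = [0.9, 0.6, 0.3, 0.2, 0.7, 0.2]  # errs on units 1 and 3

overlap = error_overlap(error_indicators(y, p_machine),
                        error_indicators(y, p_human))
print(round(overlap, 3))  # 0.333: both models err, but mostly on different units
```

In the paper's setting the error indicators come from spatial probit fits, so the interesting comparison is not just the overlap but where on the map the non-overlapping errors fall.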

🔑 Key Findings:

  • Agreement analysis against the ground truth shows that machine- and human-geocoded datasets are comparable in terms of external validity for this subnational conflict.
  • Geostatistical (spatial probit) models reveal that prediction errors differ in important respects across the datasets, indicating distinct spatial error structures despite similar overall validity.

🌍 Why It Matters:

These results caution researchers and practitioners: machine-geocoded event data can be externally valid at the subnational level, but spatially structured prediction errors may affect inference and mapping of conflict risk. Choosing between human and machine geocoding should consider not only agreement with ground truth but also how geocoding method shapes spatial error patterns and subsequent model-based predictions.
