Google Translate Works for Bag-of-Words Text Analysis
Insights from the Field
machine translation
bag-of-words
topic models
Europarl
LDA
Methodology
Pol. An.
No Longer Lost in Translation: Evidence That Google Translate Works for Comparative Bag-of-Words Text Applications, by Erik De Vries, Martijn Schoonvelde, and Gijs Schumacher, was published by Cambridge University Press in Political Analysis in 2018.

Comparative text analysis faces a basic hurdle: texts are written in different languages. Some researchers have proposed translating all texts into English using Google Translate before analysis. This study tests whether that shortcut undermines bag-of-words approaches such as topic models or whether machine translation preserves the features scholars rely on.

🔍 What Was Compared

  • Two versions of the same multilingual corpus (Europarl): a gold-standard human-translated English corpus and a machine-translated English corpus produced by Google Translate.
  • Two analytical outputs: term–document matrices (TDMs) and Latent Dirichlet Allocation (LDA) topic models.
  • Evaluation at both the document level and the overall corpus level to capture fine-grained and aggregate effects.
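
To make the comparison above concrete, here is a minimal sketch of how paired term–document rows could be compared at the document level and at the corpus level. It is an illustration under stated assumptions, not the authors' procedure: the summary does not say which similarity measure the study used, so cosine similarity is an assumed choice, and the whitespace tokenizer and the four toy sentences are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: document i in `human` and document i in `machine`
# stand for two English translations of the same source text.
human = [
    "the committee adopted the budget report",
    "members debated the fishing quota",
]
machine = [
    "the committee approved the budget report",
    "members discussed the fishing quota",
]

# One row of the term-document matrix per document.
human_rows = [bag_of_words(d) for d in human]
machine_rows = [bag_of_words(d) for d in machine]

# Document level: compare each human-translated row with its machine-translated twin.
doc_similarity = [cosine(h, m) for h, m in zip(human_rows, machine_rows)]

# Corpus level: sum the rows into one count vector per corpus, then compare.
corpus_similarity = cosine(sum(human_rows, Counter()), sum(machine_rows, Counter()))

print(doc_similarity, corpus_similarity)
```

In this toy case each pair differs by a single word, so the document-level scores stay high and the corpus-level score is higher still, because one-off substitutions carry less weight once counts are aggregated.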

🧪 Key Findings

  • TDMs from human-translated and machine-translated texts are highly similar, with only minor differences across languages.
  • A substantial proportion of features (terms) overlap between the gold-standard and machine-translated corpora.
  • LDA topic models show strong resemblance in both topical prevalence (how common topics are) and topical content (what topics look like), again with only small cross-language differences.
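
The feature-overlap and topical-content findings above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names `feature_overlap` and `match_topics` are hypothetical, the vocabularies and topic-word probabilities are invented toy values, and greedy matching on cosine similarity is one simple way to align topics from two separately fitted models, not necessarily the method used in the study.

```python
import math


def feature_overlap(reference_vocab, other_vocab):
    """Share of the reference vocabulary that also appears in the other vocabulary."""
    return len(reference_vocab & other_vocab) / len(reference_vocab)


def cosine(p, q):
    """Cosine similarity between two topic-word distributions stored as dicts."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


def match_topics(topics_a, topics_b):
    """Greedily pair each topic in topics_a with its most similar unused topic in topics_b.

    Assumes len(topics_a) <= len(topics_b). Returns (index_a, index_b, similarity) tuples.
    """
    unused = set(range(len(topics_b)))
    pairs = []
    for i, p in enumerate(topics_a):
        j = max(unused, key=lambda k: cosine(p, topics_b[k]))
        unused.remove(j)
        pairs.append((i, j, cosine(p, topics_b[j])))
    return pairs


# Hypothetical vocabularies from the human- and machine-translated corpora.
human_vocab = {"committee", "adopted", "budget", "report"}
machine_vocab = {"committee", "approved", "budget", "report"}
overlap = feature_overlap(human_vocab, machine_vocab)

# Hypothetical topic-word distributions from two separately fitted LDA models.
# Topic order is arbitrary in LDA, so topics must be aligned before comparison.
human_topics = [
    {"budget": 0.5, "tax": 0.3, "report": 0.2},
    {"fishing": 0.6, "quota": 0.4},
]
machine_topics = [
    {"fishing": 0.5, "quota": 0.4, "sea": 0.1},
    {"budget": 0.6, "tax": 0.2, "report": 0.2},
]
pairs = match_topics(human_topics, machine_topics)

print(overlap)
for i, j, sim in pairs:
    print(f"human topic {i} <-> machine topic {j}: cosine {sim:.3f}")
```

Greedy matching is easy to read but can give a suboptimal alignment when several topics are close to each other; an optimal one-to-one assignment would be a reasonable alternative for larger models.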

💡 Why It Matters

  • For researchers using bag-of-words techniques, Google Translate provides a practical and reliable way to harmonize multilingual corpora without substantially distorting term-level or topic-level results.
  • These results support the practice of translating non-English texts into English for comparative bag-of-words applications, while acknowledging small language-specific deviations that merit caution in sensitive or high-stakes contexts.