Comparative text analysis faces a basic hurdle: texts are written in different languages. Some researchers have proposed translating all texts into English with Google Translate before analysis. This study tests whether that shortcut undermines bag-of-words approaches such as topic models, or whether machine translation preserves the features scholars rely on.
🔍 What Was Compared
- Two versions of the same multilingual corpus (Europarl): a gold-standard human-translated English corpus and a machine-translated English corpus produced by Google Translate.
- Two analytical outputs: term–document matrices (TDMs) and Latent Dirichlet Allocation (LDA) topic models.
- Evaluation at both the document level and the overall corpus level to capture fine-grained and aggregate effects.
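To make the comparison concrete, the sketch below builds term counts for a human-translated and a machine-translated version of the same documents, then scores their similarity per document and for the pooled corpus. It is only an illustration: the two toy sentences, the whitespace tokenizer, and the choice of cosine similarity as the metric are assumptions made here, not the study's data or its documented procedure.

```python
from collections import Counter
import math


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase, split on whitespace,
    # strip surrounding punctuation, drop empty tokens.
    tokens = (raw.strip(".,;:!?()\"'").lower() for raw in text.split())
    return [tok for tok in tokens if tok]


def term_counts(text):
    # One row of a term-document matrix, stored sparsely as a Counter.
    return Counter(tokenize(text))


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: the same two documents in two English versions.
human_docs = [
    "The committee approved the budget for regional development.",
    "Members debated the new rules on fisheries policy.",
]
machine_docs = [
    "The committee approved the budget for regional development.",
    "Members discussed the new rules on fishing policy.",
]

# Document level: compare each document with its counterpart.
doc_similarities = [
    cosine(term_counts(h), term_counts(m))
    for h, m in zip(human_docs, machine_docs)
]

# Corpus level: pool all term counts per version, then compare once.
corpus_similarity = cosine(
    sum((term_counts(d) for d in human_docs), Counter()),
    sum((term_counts(d) for d in machine_docs), Counter()),
)

print(doc_similarities, corpus_similarity)
```

The document-level scores show where individual translations diverge, while the pooled score can stay high even when a few documents differ, which is why looking at both levels is informative.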
🧪 Key Findings
- TDMs from human-translated and machine-translated texts are highly similar, with only minor differences across languages.
- A substantial proportion of features (terms) overlap between the gold-standard and machine-translated corpora.
- LDA topic models show strong resemblance in both topical prevalence (how common topics are) and topical content (what topics look like), again with only small cross-language differences.
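One way to picture the feature-overlap and topic-resemblance checks is sketched below. It measures shared vocabulary with a Jaccard ratio and pairs topics from two models by the cosine similarity of their topic-word distributions. The topic-word weights are invented for illustration, and both the Jaccard measure and the greedy matching rule are assumptions of this sketch rather than the study's reported method.

```python
import math


def vocabulary_overlap(vocab_a, vocab_b):
    # Jaccard overlap: shared terms divided by all distinct terms.
    a, b = set(vocab_a), set(vocab_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cosine(p, q):
    # Cosine similarity between two topic-word distributions (dicts).
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def match_topics(topics_a, topics_b):
    # Greedy one-to-one matching: take the most similar pair first,
    # then the next most similar pair among the topics still unmatched.
    pairs = sorted(
        ((cosine(ta, tb), i, j)
         for i, ta in enumerate(topics_a)
         for j, tb in enumerate(topics_b)),
        reverse=True,
    )
    used_a, used_b, matches = set(), set(), []
    for sim, i, j in pairs:
        if i not in used_a and j not in used_b:
            matches.append((i, j, sim))
            used_a.add(i)
            used_b.add(j)
    return sorted(matches)


# Hypothetical topic-word weights, standing in for fitted LDA output.
human_topics = [
    {"budget": 0.5, "funding": 0.3, "regional": 0.2},
    {"fisheries": 0.6, "quota": 0.3, "policy": 0.1},
]
machine_topics = [
    {"fishing": 0.5, "quota": 0.4, "policy": 0.1},
    {"budget": 0.6, "funding": 0.2, "regional": 0.2},
]

human_vocab = {term for topic in human_topics for term in topic}
machine_vocab = {term for topic in machine_topics for term in topic}

overlap = vocabulary_overlap(human_vocab, machine_vocab)
matches = match_topics(human_topics, machine_topics)

print(overlap, matches)
```

Topic indices are arbitrary in LDA, so some alignment step like this is needed before topical content or prevalence can be compared across the two models. A single translation choice ("fisheries" versus "fishing") lowers the similarity of the matched pair without breaking the match, which mirrors the kind of small deviation the findings describe.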
💡 Why It Matters
- For researchers using bag-of-words techniques, Google Translate provides a practical and reliable way to harmonize multilingual corpora without substantially distorting term-level or topic-level results.
- These results support translating non-English texts into English for comparative bag-of-words applications. The small language-specific deviations still merit caution in sensitive or high-stakes contexts.