Active Learning Cuts Text Labeling Costs, Especially for Rare Categories
Insights from the Field
Tags: active learning · text classification · intercoder reliability · simulation
Category: Methodology | Journal: Political Analysis | Data: Dataverse
Active Learning Approaches for Labeling Text: Review and Assessment of the Performance of Active Learning Approaches was authored by Blake Miller, Fridolin Linder, and Walter Mebane. It was published by Cambridge University Press in Political Analysis in 2020.

🔎 What Was Studied

Supervised machine learning is increasingly used in political science, but training these models requires costly manual labeling of documents. Active learning is a framework that strategically selects which documents human coders label, rather than drawing them at random, with the goal of minimizing the amount of labeled data needed to train an accurate model.
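To make the setup concrete, here is a minimal sketch of pool-based active learning with uncertainty sampling, one of the query strategies the paper reviews. Everything in it is a hypothetical stand-in, not the paper's design: a tiny synthetic corpus plays the document pool, and the `truth` array plays the human coder.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# Hypothetical setup: synthetic corpus, oracle labels, fixed query budget.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Synthetic pool with a rare positive class (10% of documents).
docs = ([f"protest march in city {i}" for i in range(50)]
        + [f"budget meeting in city {i}" for i in range(450)])
truth = np.array([1] * 50 + [0] * 450)

X = TfidfVectorizer().fit_transform(docs)

# Seed set: one document from each class so the first model fit is valid.
labeled = [0, 50]
pool = [i for i in range(len(docs)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
for _ in range(30):                            # labeling budget: 30 queries
    model.fit(X[labeled], truth[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    # Query the pool document the model is least certain about (p near 0.5).
    pick = pool[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(pick)                       # the "coder" supplies its label
    pool.remove(pick)

print(f"labeled {len(labeled)} documents, found {truth[labeled].sum()} positives")
```

At each step the model requests a label for the document it is least sure about, so labeling effort concentrates near the decision boundary instead of being spread uniformly over the pool.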

🧾 How Text Was Tested

  • Simulation studies run on three distinct text corpora that vary in size, document length, and domain.
  • Comparisons made between active learning procedures and random ("passive") sampling for assembling labeled training sets.
  • Experiments also vary intercoder reliability to gauge how coder disagreement affects performance (a toy version of this comparison is sketched below).
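The sketch below illustrates the logic of that comparison: uncertainty sampling versus random ("passive") sampling at a fixed labeling budget, with a simulated coder who flips the true label with probability `noise` as a crude stand-in for imperfect intercoder reliability. The corpus, noise rate, and budget are all hypothetical; the paper's simulations use three real corpora and are far more extensive.

```python
# Toy comparison: active (uncertainty sampling) vs. passive (random) labeling
# under simulated coder noise. All parameters here are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(1)

docs = ([f"protest report number {i}" for i in range(100)]
        + [f"routine notice number {i}" for i in range(900)])
truth = np.array([1] * 100 + [0] * 900)        # rare class: 10% positive
X = TfidfVectorizer().fit_transform(docs)

def run(active, budget=60, noise=0.2):
    labeled = [0, 100]                         # seed: one positive, one negative
    y = [1, 0]                                 # assume seed labels are correct
    pool = [i for i in range(len(docs)) if i not in labeled]
    model = LogisticRegression(max_iter=1000)
    while len(labeled) < budget:
        model.fit(X[labeled], y)
        if active:                             # uncertainty sampling
            p = model.predict_proba(X[pool])[:, 1]
            pick = pool[int(np.argmin(np.abs(p - 0.5)))]
        else:                                  # passive / random sampling
            pick = int(rng.choice(pool))
        labeled.append(pick)
        pool.remove(pick)
        # Simulated coder: flips the true label with probability `noise`.
        wrong = rng.random() < noise
        y.append(int(1 - truth[pick]) if wrong else int(truth[pick]))
    model.fit(X[labeled], y)
    # Evaluate against true labels on the full corpus (a simplification:
    # the labeled documents are included in the evaluation set).
    return f1_score(truth, model.predict(X))

print("active :", round(run(active=True), 2))
print("passive:", round(run(active=False), 2))
```

Because the positive class is rare, random sampling mostly draws negatives, while uncertainty sampling steers the same budget toward informative documents near the decision boundary.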

📊 Key Findings

  • Active learning can substantially reduce the labeling burden for text data compared to random sampling.
  • When the document class of interest is rare (i.e., the corpus is imbalanced across classes), active learning often requires only a fraction of the documents that random sampling needs to produce classifiers of equivalent performance.
  • Even under conditions of low intercoder reliability, active learning procedures remain more efficient than random sampling in producing effective classifiers.

โš™๏ธ Why It Matters

  • Active learning offers a practical way to cut text labeling costs for researchers using supervised text classification in political science.
  • The approach is especially valuable for studies focused on rare or unevenly distributed document categories and remains advantageous when human coders disagree.
  • These results can guide researchers' decisions about sampling strategies and resource allocation when building labeled training data for text-based supervised learning.