📖 What Was Studied
Supervised machine learning is increasingly used in political science, but training these models requires costly manual labeling of documents. Active learning is a framework that uses the model itself to choose which documents human coders label, rather than selecting them at random, with the goal of minimizing the amount of labeled data needed to train a classifier.
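As a concrete illustration, here is a minimal sketch of uncertainty sampling, one common query strategy in this framework; the paper's exact procedure may differ, and the names `uncertainty_query`, `clf`, `X_pool`, and `batch` are illustrative assumptions.

```python
import numpy as np

def uncertainty_query(clf, X_pool, batch=10):
    """Select the unlabeled documents the current model is least sure about.

    `clf` is any fitted binary classifier exposing predict_proba; `X_pool`
    holds feature vectors for the unlabeled pool. Names are illustrative.
    """
    probs = clf.predict_proba(X_pool)[:, 1]   # predicted P(class of interest)
    order = np.argsort(np.abs(probs - 0.5))   # closest to 0.5 = most uncertain
    return order[:batch]                      # indices to send to human coders
```

In a full loop, the classifier is refit on the growing labeled set and this query step repeats until the labeling budget is exhausted.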
🧾 How Text Was Tested
- Simulation studies run on three distinct text corpora that vary in size, document length, and domain.
- Comparisons made between active learning procedures and random ("passive") sampling for assembling labeled training sets.
- Experiments also vary intercoder reliability to gauge how coder disagreement affects performance (a simplified simulation in this spirit is sketched just after this list).
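To make that design concrete, the following self-contained simulation pits uncertainty sampling against random ("passive") sampling on a synthetic imbalanced dataset with an imperfect coder. The synthetic data stands in for the paper's three corpora, and `noisy_oracle`, `reliability`, and the budget figures are illustrative assumptions rather than the authors' settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a labeled corpus: binary task with a rare (~5%) positive class.
X, y = make_classification(n_samples=5000, n_features=100, weights=[0.95], random_state=0)
X_pool, y_pool, X_test, y_test = X[:4000], y[:4000], X[4000:], y[4000:]

def noisy_oracle(i, reliability=0.8):
    # Imperfect human coder: returns the wrong label with probability 1 - reliability.
    return y_pool[i] if rng.random() < reliability else 1 - y_pool[i]

def run(strategy, budget=400, seed_size=40, batch=40):
    labeled = list(rng.choice(len(y_pool), seed_size, replace=False))
    labels = {i: noisy_oracle(i) for i in labeled}
    while len(labeled) < budget:
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_pool[labeled], [labels[i] for i in labeled])
        rest = np.setdiff1d(np.arange(len(y_pool)), labeled)
        if strategy == "active":
            # Uncertainty sampling: query the documents nearest the decision boundary.
            p = clf.predict_proba(X_pool[rest])[:, 1]
            picks = rest[np.argsort(np.abs(p - 0.5))[:batch]]
        else:
            # Passive baseline: query documents uniformly at random.
            picks = rng.choice(rest, batch, replace=False)
        for i in picks:
            labels[i] = noisy_oracle(i)
        labeled.extend(picks)
    clf.fit(X_pool[labeled], [labels[i] for i in labeled])
    return f1_score(y_test, clf.predict(X_test), zero_division=0)

print("active  F1:", run("active"))
print("passive F1:", run("passive"))
```

Sweeping `reliability` and the class `weights` corresponds to the two experimental axes described above: coder disagreement and class imbalance.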
🔑 Key Findings
- Active learning can substantially reduce the labeling burden for text data compared to random sampling.
- When the document class of interest is rare (i.e., documents are unevenly distributed across classes), active learning often requires only a fraction of the documents that random sampling needs to produce classifiers with equivalent performance.
- Even when intercoder reliability is low, active learning procedures remain more efficient than random sampling at producing effective classifiers.
⚖️ Why It Matters
- Active learning offers a practical way to cut annotation costs for researchers using text classification in political science.
- The approach is especially valuable for studies focused on rare or unevenly distributed document categories and remains advantageous when human coders disagree.
- These results offer researchers guidance on sampling strategies and resource allocation when building labeled training data for text-based supervised learning.