CrowS-Pairs - A Challenge Dataset for Measuring Social Biases in Masked Language Models

Summary

This paper introduces CrowS-Pairs (Crowdsourced Stereotype Pairs), a challenge dataset for measuring social biases in masked language models (MLMs). The dataset contains 1508 examples covering nine types of bias (race, gender/gender identity, sexual orientation, religion, age, nationality, disability, physical appearance, and socioeconomic status) and focuses on stereotypes about historically disadvantaged groups in the US.

Key Contributions

Creation of a crowdsourced dataset specifically designed to measure social biases in MLMs
Development of a metric to measure model bias while controlling for word frequency effects
Comprehensive evaluation of three popular MLMs (BERT, RoBERTa, ALBERT) showing substantial bias across all categories
Comparison with existing bias measurement datasets showing higher validation rates

Methodology

Dataset Creation:
- Crowdworkers write pairs of minimally distant sentences where one involves a historically disadvantaged group
- Each example can either demonstrate a stereotype or violate it (anti-stereotype)
- Extensive validation process with 5 additional annotators per example
- 80% inter-annotator agreement on final dataset
Bias Measurement:
- Uses pseudo-log-likelihood scoring for MLMs
- Controls for word frequency effects by conditioning on modified tokens
- Measures percentage of examples where model prefers stereotyping sentence
- Ideal unbiased score would be 50%

Key Findings

All evaluated MLMs (BERT, RoBERTa, ALBERT) show substantial bias:
- BERT: 60.5%
- RoBERTa: 64.1%
- ALBERT: 67.0%
Bias varies across categories:
- Religion shows highest bias scores
- Gender and race categories show comparatively lower bias
- Models show less bias on anti-stereotype examples
Model size and training data impact:
- Larger models tend to show more bias
- Models trained on web content show higher bias than those trained on curated sources

Limitations

Dataset is specific to US cultural context
Limited to nine categories of bias
Cannot be used for autoregressive language models
May not capture all forms of bias or stereotypes
Risk of dataset being misused to claim complete model debiasing

Future Directions

Development of metrics for autoregressive language models
Creation of training data for debiasing
Investigation of debiasing methods that don’t harm model performance
Extension to other cultural contexts and bias types
Development of more comprehensive bias evaluation frameworks

Personal Notes

The paper is particularly notable for its careful methodology in dataset creation and validation
Important contribution to responsible AI development by providing concrete ways to measure harmful biases
The finding that larger models show more bias raises important questions about scaling and responsible AI
The paper’s ethical considerations section shows thoughtful engagement with potential misuse

📖 tl;dr Responsible AI

Explorer

CrowS-Pairs - A Challenge Dataset for Measuring Social Biases in Masked Language Models

Summary

Key Contributions

Methodology

Key Findings

Limitations

Future Directions

Personal Notes

Recent

Sociotechnical Safety Evaluation of Generative AI Systems

The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers

Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Summaries of RAI concepts, research, and frameworks

AART AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

Table of Contents