Summary
This paper introduces STAR (SocioTechnical Approach to Red teaming), a framework that improves red teaming of large language models through two key innovations:
- Enhanced steerability via parameterized instructions for red teamers
- Improved signal quality through demographic matching and arbitration
Key Contributions
Steerability Improvements
- Uses procedurally generated instructions with multiple parameters
- Parameters include: rule to test, adversariality level, use case, topic, and demographic targets
- Enables more systematic coverage of the risk surface
- Allows for targeted exploration of specific failure modes
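The parameterized instructions described above can be illustrated with a minimal sketch. It assumes a full grid over the five named parameters that is shuffled and cycled through, so every combination is used equally often. The parameter values, the template wording, and the function names are hypothetical and are not taken from the paper.

```python
import itertools
import random
from collections import Counter

# Hypothetical values: the summary names the five parameters but not their levels.
PARAMETERS = {
    "rule": ["hate speech", "medical advice"],
    "adversariality": ["low", "medium", "high"],
    "use_case": ["information search", "entertainment"],
    "topic": ["health", "employment"],
    "demographic_target": ["group A", "group B"],
}

# Hypothetical instruction wording shown to a red teamer.
TEMPLATE = (
    "Test the rule on {rule}. Adversariality: {adversariality}. "
    "Use case: {use_case}. Topic: {topic}. Target group: {demographic_target}."
)


def generate_instructions(parameters, n, seed=0):
    """Return n parameter assignments that cycle through the shuffled full grid."""
    keys = list(parameters)
    grid = [dict(zip(keys, combo)) for combo in itertools.product(*parameters.values())]
    random.Random(seed).shuffle(grid)
    # Cycling keeps usage counts of any two combinations within one of each other.
    return [grid[i % len(grid)] for i in range(n)]


def render(instruction):
    """Turn one parameter assignment into instruction text."""
    return TEMPLATE.format(**instruction)


instructions = generate_instructions(PARAMETERS, n=96)
coverage = Counter(tuple(i.values()) for i in instructions)
print(render(instructions[0]))
print(len(coverage), "distinct combinations covered")
```

Cycling through a shuffled grid is only one way to get even coverage; weighted sampling would suit cases where some parameter combinations deserve more attention than others.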
Signal Quality Enhancements
- Matches annotators' demographics to the groups targeted by the harms they assess
- Employs expert annotators (fact-checkers, medical professionals) for relevant domains
- Introduces an arbitration step to leverage diverse viewpoints
- Treats annotator disagreement as valuable signal rather than noise
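A rough sketch of how arbitration might sit on top of first-stage annotation, while keeping disagreement available as signal, is shown here. The data model, the binary `violated` label, the rule that any label mismatch triggers arbitration, and all names are assumptions for illustration; the summary does not specify these details.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    annotator_id: str
    violated: bool   # assumed binary judgment: did the model break the rule?
    rationale: str   # free-text reasoning, kept so the arbitrator can read it


@dataclass
class Outcome:
    dialogue_id: str
    annotations: List[Annotation]
    disagreement: bool
    final_label: Optional[bool] = None
    arbitrator_note: Optional[str] = None


def first_stage(dialogue_id, annotations):
    """Accept unanimous labels; flag any split for arbitration."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    labels = {a.violated for a in annotations}
    disagreement = len(labels) > 1
    final = None if disagreement else next(iter(labels))
    return Outcome(dialogue_id, list(annotations), disagreement, final)


def arbitrate(outcome, arbitrator_label, note):
    """Record the arbitrator's decision without discarding the original annotations."""
    if outcome.disagreement:
        outcome.final_label = arbitrator_label
        outcome.arbitrator_note = note
    return outcome


agreed = first_stage("d1", [
    Annotation("a1", True, "demeaning stereotype"),
    Annotation("a2", True, "targets a protected group"),
])
split = first_stage("d2", [
    Annotation("a1", True, "reads as an insult"),
    Annotation("a2", False, "quoted, not endorsed"),
])
arbitrate(split, True, "quotation lacks any distancing context")
print(agreed.final_label, split.disagreement, split.final_label)
```

Keeping the original annotations and rationales on the outcome record, rather than overwriting them with the final label, is what allows disagreement to be analyzed later instead of being averaged away.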
Methodology
- Collected 8,360 dialogues from 225 red teamers
- Dialogues annotated by 286 annotators and arbitrators
- Used demographic matching, with attackers split 50% in-group and 50% out-group
- Two-stage annotation pipeline with arbitration for disagreements
- Analyzed results using UMAP embeddings and cluster analysis
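The in-group/out-group split in the methodology could be implemented along these lines. This is a sketch under stated assumptions: each task names a single target group, each red teamer belongs to a single group, and tasks for a given target alternate between in-group and out-group attackers. The record fields and the `assign_tasks` helper are hypothetical.

```python
import random


def assign_tasks(tasks, red_teamers, seed=0):
    """Alternate in-group and out-group red teamers for each target group."""
    rng = random.Random(seed)
    seen_per_target = {}
    assignments = []
    for task in tasks:
        target = task["target_group"]
        k = seen_per_target.get(target, 0)
        seen_per_target[target] = k + 1
        want_in_group = k % 2 == 0  # even-numbered tasks go to in-group attackers
        pool = [r for r in red_teamers if (r["group"] == target) == want_in_group]
        if not pool:
            raise ValueError(f"no eligible red teamer for target {target!r}")
        chosen = rng.choice(pool)
        assignments.append(
            {"task_id": task["id"], "red_teamer": chosen["id"], "in_group": want_in_group}
        )
    return assignments


# Hypothetical toy data.
red_teamers = [
    {"id": "r1", "group": "A"},
    {"id": "r2", "group": "B"},
    {"id": "r3", "group": "C"},
]
tasks = [{"id": i, "target_group": "A" if i < 4 else "B"} for i in range(8)]

assignments = assign_tasks(tasks, red_teamers)
in_group_share = sum(a["in_group"] for a in assignments) / len(assignments)
print(f"in-group share: {in_group_share:.0%}")
```

With an even number of tasks per target group, the alternation gives an exact 50/50 split; with an odd number, the in-group side receives one extra task.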
Key Findings
- Achieved more even coverage of the attack surface than previous approaches
- In-group annotators were more likely than out-group annotators to identify hate speech violations (45% vs 30%)
- Arbitration process provided valuable insights into annotation disagreements
- Revealed complex interactions in model failures across intersectional identities
Limitations
- Instructions limited to 5 parameters due to cognitive load on human raters
- Tested only in English
- Demographic intersections limited to two-way combinations
- May not fully reflect real-world usage patterns
Impact & Applications
- Provides a reproducible framework for red teaming evaluations
- Can be adapted for different contexts, languages, and failure modes
- Useful for both human-led and hybrid automated/human red teaming
- Demonstrates importance of sociotechnical perspective in AI evaluation