📖 tl;dr Responsible AI

❯

Evaluation and Testing

❯

❯

STAR: SocioTechnical Approach to Red Teaming Language Models

STAR: SocioTechnical Approach to Red Teaming Language Models

2 min read

Source: https://arxiv.org/abs/2406.11757

paper-notes
red-teaming
safety
methodology

Summary

This paper introduces STAR (SocioTechnical Approach to Red teaming), a framework that improves red teaming of large language models through two key innovations:

Enhanced steerability via parameterized instructions for red teamers
Improved signal quality through demographic matching and arbitration

Key Contributions

Steerability Improvements

Uses procedurally generated instructions with multiple parameters
Parameters include: rule to test, adversariality level, use case, topic, and demographic targets
Enables more systematic coverage of the risk surface
Allows for targeted exploration of specific failure modes

Signal Quality Enhancements

Matches annotator demographics to assessed harms for specific groups
Employs expert annotators (fact-checkers, medical professionals) for relevant domains
Introduces arbitration step to leverage diverse viewpoints
Treats annotator disagreement as valuable signal rather than noise

Methodology

Collected 8,360 dialogues from 225 red teamers
Annotated by 286 annotators/arbitrators
Used demographic matching: 50% in-group vs 50% out-group attackers
Two-stage annotation pipeline with arbitration for disagreements
Analyzed results using UMAP embeddings and cluster analysis

Key Findings

Achieved more even coverage of attack surface compared to previous approaches
In-group annotators more likely to identify hate speech violations (45% vs 30%)
Arbitration process provided valuable insights into annotation disagreements
Revealed complex interactions in model failures across intersectional identities

Limitations

Limited to 5 parameters due to cognitive load on human raters
Only tested in English language
Demographic intersections limited to two-way combinations
May not fully reflect real-world usage patterns

Impact & Applications

Provides reproducible framework for red teaming evaluations
Can be adapted for different contexts, languages, and failure modes
Useful for both human-led and hybrid automated/human red teaming
Demonstrates importance of sociotechnical perspective in AI evaluation

Recent

Sociotechnical Safety Evaluation of Generative AI Systems
The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers
Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
Summaries of RAI concepts, research, and frameworks
AART AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

Summary
Key Contributions
Steerability Improvements
Signal Quality Enhancements
Methodology
Key Findings
Limitations
Impact & Applications

Created with Quartz v4.4.0 © 2025

GitHub
Bluesky