Summary
WildGuard is an open-source, lightweight moderation tool for LLM safety that addresses three critical tasks simultaneously:
- Identifying malicious intent in user prompts
- Detecting safety risks in model responses
- Determining model refusal rates
The authors create WildGuardMix, a large-scale dataset with 92K labeled examples covering both direct prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuard outperforms existing open-source safety moderation tools across all three tasks and sometimes exceeds GPT-4 performance. When deployed in an LLM interface, it reduces the success rate of jailbreak attacks from 79.8% to 2.4%.
Key Contributions
- Creation of WildGuard, a unified multi-task moderation tool for LLM safety
- Development of WildGuardMix dataset (92K examples), which includes WildGuardTrain (87K examples) and WildGuardTest (5K human-annotated examples)
- Comprehensive evaluation across 10 existing benchmarks plus their own test sets
- State-of-the-art performance in open-source safety moderation across all three tasks
- Demonstrated effectiveness at reducing successful jailbreak attacks in real systems
Method
The researchers developed WildGuard using three main components:
-
WildGuardTrain Dataset Creation:
- Four data sources: synthetic adversarial prompts, synthetic vanilla prompts, in-the-wild real user queries, and annotator-written data
- Covers 13 risk categories across privacy, misinformation, harmful language, and malicious uses
- Includes both harmful and benign prompts, with matched refusal and compliance responses
- Uses a pipeline that incorporates fine-grained risk taxonomy and carefully balanced subsets
-
WildGuardTest Creation:
- Human-annotated evaluation data with 5,299 items for the three tasks
- Used multiple annotators and majority voting to establish gold labels
- Includes quality control through inter-annotator agreement metrics
-
Model Training:
- Based on Mistral-7b-v0.3 as the foundation model
- Unified input/output format for all three classification tasks
- Two-epoch training using a batch size of 128 and learning rate of 2e-6
Results
WildGuard achieves impressive performance across all three moderation tasks:
-
Prompt Harmfulness Detection:
- Outperforms GPT-4 by 1.8% on public benchmarks (86.1% vs 84.6% F1)
- Especially effective on adversarial prompts (85.5% vs 81.6% F1 compared to GPT-4)
-
Response Harmfulness Detection:
- Outperforms the strongest open models by at least 1.8% on public benchmarks
- Comparable to MD-Judge on WildGuardTest (75.4% vs 76.8% F1)
-
Refusal Detection:
- Dramatically improves over LibrAI-LongFormer-ref by 26.4% (88.6% vs 70.1% F1)
- Performs within 4.1% of GPT-4’s performance (88.6% vs 92.4% F1)
-
LLM Moderation Interface:
- When used as a filter, reduced jailbreak attack success rate from 79.8% to 2.4%
- Achieved this with minimal increase in refusals to benign prompts
Takeaways
Strengths
- Unified approach tackles three critical safety tasks simultaneously
- Fully open-source for both model and dataset, enabling further research
- Highly effective against adversarial jailbreaks, which are a significant weakness of prior systems
- Robust performance across diverse evaluation sets
- Practical utility demonstrated in real-world LLM interface testing
Limitations
- Synthetic data may not perfectly approximate all natural human inputs
- Limited in-the-wild data (only 2% of the dataset)
- Makes specific commitments about harm taxonomies and refusal definitions that might not align with all use cases
- No fine-grained classification of harm categories (only binary harmful/not harmful)
Notable References
- Llama-Guard2 [16]: Prior SOTA open-source moderation tool
- WILDJAILBREAK [20]: Dataset for testing adversarial jailbreak attempts
- XSTest [30]: Test suite for identifying exaggerated safety behaviors
- Weidinger et al. [39]: Framework for harm taxonomies in language models