Summary

WildGuard is an open-source, lightweight moderation tool for LLM safety that addresses three critical tasks simultaneously:

Identifying malicious intent in user prompts
Detecting safety risks in model responses
Determining model refusal rates

The authors create WildGuardMix, a large-scale dataset with 92K labeled examples covering both direct prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuard outperforms existing open-source safety moderation tools across all three tasks and sometimes exceeds GPT-4 performance. When deployed in an LLM interface, it reduces the success rate of jailbreak attacks from 79.8% to 2.4%.

Key Contributions

Creation of WildGuard, a unified multi-task moderation tool for LLM safety
Development of WildGuardMix dataset (92K examples), which includes WildGuardTrain (87K examples) and WildGuardTest (5K human-annotated examples)
Comprehensive evaluation across 10 existing benchmarks plus their own test sets
State-of-the-art performance in open-source safety moderation across all three tasks
Demonstrated effectiveness at reducing successful jailbreak attacks in real systems

Method

The researchers developed WildGuard using three main components:

WildGuardTrain Dataset Creation:
- Four data sources: synthetic adversarial prompts, synthetic vanilla prompts, in-the-wild real user queries, and annotator-written data
- Covers 13 risk categories across privacy, misinformation, harmful language, and malicious uses
- Includes both harmful and benign prompts, with matched refusal and compliance responses
- Uses a pipeline that incorporates fine-grained risk taxonomy and carefully balanced subsets
WildGuardTest Creation:
- Human-annotated evaluation data with 5,299 items for the three tasks
- Used multiple annotators and majority voting to establish gold labels
- Includes quality control through inter-annotator agreement metrics
Model Training:
- Based on Mistral-7b-v0.3 as the foundation model
- Unified input/output format for all three classification tasks
- Two-epoch training using a batch size of 128 and learning rate of 2e-6

Results

WildGuard achieves impressive performance across all three moderation tasks:

Prompt Harmfulness Detection:
- Outperforms GPT-4 by 1.8% on public benchmarks (86.1% vs 84.6% F1)
- Especially effective on adversarial prompts (85.5% vs 81.6% F1 compared to GPT-4)
Response Harmfulness Detection:
- Outperforms the strongest open models by at least 1.8% on public benchmarks
- Comparable to MD-Judge on WildGuardTest (75.4% vs 76.8% F1)
Refusal Detection:
- Dramatically improves over LibrAI-LongFormer-ref by 26.4% (88.6% vs 70.1% F1)
- Performs within 4.1% of GPT-4’s performance (88.6% vs 92.4% F1)
LLM Moderation Interface:
- When used as a filter, reduced jailbreak attack success rate from 79.8% to 2.4%
- Achieved this with minimal increase in refusals to benign prompts

Takeaways

Strengths

Unified approach tackles three critical safety tasks simultaneously
Fully open-source for both model and dataset, enabling further research
Highly effective against adversarial jailbreaks, which are a significant weakness of prior systems
Robust performance across diverse evaluation sets
Practical utility demonstrated in real-world LLM interface testing

Limitations

Synthetic data may not perfectly approximate all natural human inputs
Limited in-the-wild data (only 2% of the dataset)
Makes specific commitments about harm taxonomies and refusal definitions that might not align with all use cases
No fine-grained classification of harm categories (only binary harmful/not harmful)

Notable References

Llama-Guard2 [16]: Prior SOTA open-source moderation tool
WILDJAILBREAK [20]: Dataset for testing adversarial jailbreak attempts
XSTest [30]: Test suite for identifying exaggerated safety behaviors
Weidinger et al. [39]: Framework for harm taxonomies in language models

📖 tl;dr Responsible AI

Explorer

WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Summary

Key Contributions

Method

Results

Takeaways

Strengths

Limitations

Notable References

Recent

Sociotechnical Safety Evaluation of Generative AI Systems

The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers

Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Summaries of RAI concepts, research, and frameworks

AART AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

Table of Contents