Tags: security, red-teaming, responsible-ai
Overview
This document outlines various red teaming methods used in AI security testing and evaluation. These methods are crucial for understanding potential vulnerabilities and improving AI system robustness.
Methods
GCG (Greedy Coordinate Gradient)
- Description: Optimizes an adversarial suffix appended to a request by using gradients with respect to the input tokens to greedily swap one token at a time, maximizing the probability that the model produces a targeted (typically affirmative) completion
- Example: Iteratively refining a suffix, guided by the loss on a target completion, until the prompt makes an AI system reveal information it would otherwise withhold (sketched below)
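A minimal sketch of one GCG optimization step in Python/PyTorch. The `loss_fn` callback (which would embed the suffix one-hot encoding into the model and return the loss on the target completion) and the hyperparameters are illustrative assumptions, not a reference implementation:

```python
import torch

def gcg_step(loss_fn, suffix_ids, vocab_size, top_k=256, n_candidates=64):
    """One greedy coordinate gradient step (sketch).

    loss_fn(one_hot) -> scalar loss on the target completion, where one_hot
    is a (suffix_len, vocab_size) one-hot encoding of the adversarial suffix.
    """
    one_hot = torch.nn.functional.one_hot(suffix_ids, vocab_size).float()
    one_hot.requires_grad_(True)

    # Gradient of the loss w.r.t. each one-hot coordinate estimates how much
    # substituting that token at that position would change the loss.
    loss = loss_fn(one_hot)
    loss.backward()
    top_tokens = (-one_hot.grad).topk(top_k, dim=-1).indices  # (suffix_len, top_k)

    # Evaluate random single-token substitutions drawn from the top-k
    # candidates and keep the best one (greedy coordinate descent).
    best_ids, best_loss = suffix_ids, loss.item()
    for _ in range(n_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        tok = top_tokens[pos, torch.randint(top_k, (1,)).item()]
        candidate = suffix_ids.clone()
        candidate[pos] = tok
        with torch.no_grad():
            cand_loss = loss_fn(
                torch.nn.functional.one_hot(candidate, vocab_size).float()
            ).item()
        if cand_loss < best_loss:
            best_ids, best_loss = candidate, cand_loss
    return best_ids, best_loss
```

Running this step repeatedly until the target completion becomes the model's most likely continuation is the core of the attack.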
GCG (Multi)
- Description: Extends GCG by optimizing a single adversarial suffix against multiple models simultaneously, yielding more universal adversarial prompts
- Example: Generating suffixes that consistently bypass content filters across different LLM implementations (sketched below)
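Under the same assumptions as the GCG sketch above, the multi-model variant only changes the objective: per-model losses are averaged so one suffix is optimized against every model at once. `single_model_losses` is a hypothetical list of per-model loss callbacks:

```python
def multi_model_loss(single_model_losses):
    """Build a combined objective for GCG (Multi) (sketch).

    single_model_losses: list of callables, each mapping the suffix one-hot
    encoding to one model's loss on the target completion.
    """
    def loss_fn(one_hot):
        # Averaging (or summing) pushes the suffix to work on all models.
        return sum(fn(one_hot) for fn in single_model_losses) / len(single_model_losses)
    return loss_fn
```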
GCG (Transfer)
- Description: Applies adversarial suffixes optimized on one model to attack different models, without further optimization on the new target
- Example: Using prompts that successfully extract personal information from Model A to test Model B's vulnerabilities
PEZ
- Description: A gradient-based discrete prompt optimization method that maintains a continuous "soft" prompt, projects it onto the nearest vocabulary tokens at each step, and backpropagates through the projected prompt so the final result is a usable hard prompt
- Example: Discovering a short sequence of real tokens that, when appended to a request, steers the model toward a targeted completion (sketched below)
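A minimal sketch of the projection trick, assuming a hypothetical `loss_fn` that scores a `(prompt_len, dim)` matrix of prompt embeddings; the hyperparameters are illustrative:

```python
import torch

def pez_optimize(loss_fn, embedding_matrix, prompt_len, steps=500, lr=0.1):
    """Hard prompt discovery via projected embedding optimization (sketch)."""
    # Start the continuous ("soft") prompt from random vocabulary embeddings.
    init_ids = torch.randint(embedding_matrix.shape[0], (prompt_len,))
    soft = embedding_matrix[init_ids].detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([soft], lr=lr)

    for _ in range(steps):
        # Project each soft embedding onto its nearest real token embedding.
        hard = embedding_matrix[torch.cdist(soft, embedding_matrix).argmin(dim=-1)]
        # Straight-through estimator: evaluate the loss on the projected
        # (discrete) prompt, but let the gradient flow into the soft prompt.
        loss = loss_fn(soft + (hard - soft).detach())
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Final discrete prompt: token ids of the nearest embeddings.
    return torch.cdist(soft, embedding_matrix).argmin(dim=-1)
```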
GBDA (Gradient-based Distributional Attack)
- Description: Optimizes a probability distribution over adversarial tokens using the Gumbel-softmax relaxation so gradients can flow through a discrete prompt; adversarial prompts are then sampled or decoded from the learned distribution
- Example: Learning token distributions whose samples maximize the probability of a targeted completion, with soft constraints keeping the prompt looking innocuous (sketched below)
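A minimal sketch of the distributional idea, assuming a hypothetical `loss_fn` that accepts a `(prompt_len, vocab_size)` matrix of soft token weights (e.g. used to mix the embedding table inside the model); the fluency and similarity terms used by GBDA are omitted:

```python
import torch

def gbda_optimize(loss_fn, vocab_size, prompt_len, steps=500, lr=0.3, tau=1.0):
    """Optimize a distribution over adversarial tokens (sketch)."""
    # Parameterize a categorical distribution per prompt position.
    logits = torch.zeros(prompt_len, vocab_size, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)

    for _ in range(steps):
        # Gumbel-softmax gives differentiable "soft" samples of token choices.
        soft_tokens = torch.nn.functional.gumbel_softmax(logits, tau=tau, hard=False)
        loss = loss_fn(soft_tokens)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Discretize by taking the most likely token at each position
    # (in practice many prompts can also be sampled from the distribution).
    return logits.argmax(dim=-1)
```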
UAT (Universal Adversarial Triggers)
- Description: Finds input sequences that consistently trigger specific undesired outputs across different contexts
- Example: Identifying phrase patterns that reliably cause models to generate biased responses regardless of the surrounding context
AutoPrompt
- Description: Uses a gradient-guided search to automatically select and test trigger tokens that fill slots in a prompt template, exposing vulnerabilities
- Example: Systematically testing different prompt prefixes to find ones that bypass content filters
Stochastic Few-Shot
- Description: Samples attack prompts from an attacker ("red team") model conditioned on a randomly chosen set of few-shot examples, typically drawn from previously effective test cases
- Example: Repeatedly sampling candidate prompts at nonzero temperature, each time with a different random subset of known-effective prompts in context (sketched below)
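A minimal sketch of the sampling loop; `generate(prompt, temperature)` stands in for a call to an attacker/red-team LM and `example_pool` for a list of previously written test prompts, both assumptions rather than a specific API:

```python
import random

def stochastic_few_shot(generate, example_pool, n_shots=5, n_cases=20, temperature=1.0):
    """Sample new test cases from a red-team model (sketch)."""
    cases = []
    for _ in range(n_cases):
        # A fresh random subset of examples each time broadens the coverage
        # of the sampled test cases.
        shots = random.sample(example_pool, k=min(n_shots, len(example_pool)))
        prompt = (
            "Write another test question in the style of the examples:\n"
            + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(shots))
            + f"\n{len(shots) + 1}."
        )
        cases.append(generate(prompt, temperature).strip())
    return cases
```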
Zero-Shot
- Description: Generates attack prompts from an attacker model using instructions alone, without any in-context examples
- Example: Asking an attacker model to write probing queries for a behavior directly, with no examples in context, and sending them to the target
PAIR (Prompt Automatic Iterative Refinement)
- Description: Uses an attacker LLM to iteratively refine a jailbreak prompt based on the target model's responses and a judge model's scores
- Example: An attacker model proposes a prompt, reads the target's refusal, and rewrites the prompt over several rounds until the judge rates the response as a successful jailbreak (sketched below)
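A minimal sketch of the PAIR loop. `attacker`, `target`, and `judge` are hypothetical callables wrapping three LLM calls (not a specific library API), and the 1-10 judge scale follows the common convention where 10 means the behavior was elicited:

```python
def pair_attack(attacker, target, judge, goal, max_turns=10):
    """Iteratively refine a jailbreak prompt with an attacker LLM (sketch).

    attacker(goal, history) -> candidate prompt (str)
    target(prompt)          -> target model response (str)
    judge(goal, prompt, response) -> integer score from 1 to 10
    """
    history = []
    for _ in range(max_turns):
        prompt = attacker(goal, history)
        response = target(prompt)
        score = judge(goal, prompt, response)
        if score >= 10:  # judge deems the goal behavior elicited
            return prompt, response
        # Feed the failed attempt back so the attacker can refine it.
        history.append({"prompt": prompt, "response": response, "score": score})
    return None, None
```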
TAP (Tree of Attacks with Pruning)
- Description: Extends PAIR by having the attacker LLM branch several refinements from each candidate prompt, building a tree of attacks and pruning branches that are off-topic or score poorly
- Example: Expanding multiple variations of a candidate jailbreak at each step and keeping only the most promising, on-topic leaves for further refinement (sketched below)
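A minimal sketch of the tree search, reusing the hypothetical `attacker`/`target`/`judge` callables from the PAIR sketch plus an `on_topic` relevance check; the branching factor, width, and depth are illustrative:

```python
def tap_attack(attacker, target, judge, on_topic, goal,
               depth=5, branching=3, width=5):
    """Tree of Attacks with Pruning (sketch)."""
    frontier = [{"history": []}]
    for _ in range(depth):
        children = []
        for node in frontier:
            for _ in range(branching):
                prompt = attacker(goal, node["history"])
                # Phase-1 pruning: drop candidates that drifted off-topic
                # before spending a query on the target model.
                if not on_topic(goal, prompt):
                    continue
                response = target(prompt)
                score = judge(goal, prompt, response)
                if score >= 10:
                    return prompt, response
                children.append({
                    "score": score,
                    "history": node["history"] + [(prompt, response, score)],
                })
        # Phase-2 pruning: keep only the highest-scoring leaves.
        frontier = sorted(children, key=lambda c: c["score"], reverse=True)[:width]
        if not frontier:
            break
    return None, None
```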
TAP (Transfer)
- Description: Applies jailbreak prompts found by TAP on one model to attack other models
- Example: Using successful attack prompts from GPT-3 to test GPT-4's vulnerabilities
AutoDAN
- Description: Uses a genetic algorithm to automatically evolve DAN (Do Anything Now) style jailbreak prompts from handcrafted seeds, keeping them fluent and harder to filter
- Example: Mutating and recombining variations of prompts that try to convince the model to act without restrictions, keeping the highest-scoring variants each generation (sketched below)
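A minimal sketch of the genetic search, with `fitness`, `mutate`, and `crossover` as hypothetical operators (AutoDAN itself scores candidates by the target's likelihood of the desired completion and uses LLM-based paraphrasing and sentence-level swaps as its operators):

```python
import random

def autodan_search(fitness, mutate, crossover, seed_prompts,
                   generations=50, population_size=20, elite_frac=0.2):
    """Evolve jailbreak prompts with a genetic algorithm (sketch)."""
    # Initialize the population from handcrafted DAN-style seed prompts.
    population = list(seed_prompts)[:population_size]
    while len(population) < population_size:
        population.append(mutate(random.choice(seed_prompts)))

    for _ in range(generations):
        # Selection: rank prompts by how strongly they elicit the behavior.
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:max(2, int(elite_frac * population_size))]

        # Refill the population from the elites via crossover and mutation.
        population = list(elites)
        while len(population) < population_size:
            parent_a, parent_b = random.sample(elites, 2)
            population.append(mutate(crossover(parent_a, parent_b)))

    return max(population, key=fitness)
```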
PAP (Persuasive Adversarial Prompts)
- Description: Rephrases harmful requests using human persuasion techniques (e.g. authority endorsement, evidence-based persuasion) so the request sounds legitimate and is harder to detect
- Example: Reframing a refused request as an appeal to expert authority or an evidence-based argument, iterating across persuasion techniques until one succeeds (sketched below)
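A minimal sketch of the persuasion-based rewriting, where `paraphrase(request, technique=...)` is a hypothetical LLM call and the technique list is a small illustrative subset of the full PAP taxonomy:

```python
# Illustrative subset of persuasion techniques (the full PAP taxonomy is larger).
TECHNIQUES = [
    "evidence-based persuasion",
    "authority endorsement",
    "expert testimonial",
    "logical appeal",
]

def pap_variants(paraphrase, request):
    """Rewrite a request in the style of each persuasion technique (sketch)."""
    return {t: paraphrase(request, technique=t) for t in TECHNIQUES}
```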
Human Jailbreaks
- Description: Manual creation of prompts designed to bypass model safeguards
- Example: Using creative writing or role-play scenarios to induce harmful behaviors
Direct Request
- Description: Simply asking the model directly for harmful content
- Example: Directly requesting instructions for harmful activities to establish a baseline attack success rate (sketched below)
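A minimal sketch of the baseline measurement; `target(prompt)` and `classify(behavior, response)` (a hypothetical compliance/refusal classifier) stand in for the actual model and evaluator:

```python
def direct_request_baseline(target, classify, behaviors):
    """Measure how often the target complies with verbatim requests (sketch)."""
    results = {}
    for behavior in behaviors:
        response = target(behavior)
        # True if the classifier judges the response as compliant (unsafe).
        results[behavior] = classify(behavior, response)
    attack_success_rate = sum(results.values()) / max(len(results), 1)
    return results, attack_success_rate
```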