Tags: security, red-teaming, responsible-ai
Overview
This document outlines various red teaming methods used in AI security testing and evaluation. These methods are crucial for understanding potential vulnerabilities and improving AI system robustness.
Methods
GCG (Greedy Coordinate Gradient)
- Description: Optimizes an adversarial suffix appended to a request by using gradients with respect to the input tokens to greedily swap one token at a time, maximizing the probability that the model produces a targeted (typically affirmative) completion
- Example: Iteratively refining a suffix, guided by the loss on a target completion, until the prompt makes an AI system reveal information it would otherwise withhold (sketched below)
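A minimal sketch of one GCG optimization step in Python/PyTorch. The `loss_fn` callback (which would embed the suffix one-hot encoding into the model and return the loss on the target completion) and the hyperparameters are illustrative assumptions, not a reference implementation:

```python
import torch

def gcg_step(loss_fn, suffix_ids, vocab_size, top_k=256, n_candidates=64):
    """One greedy coordinate gradient step (sketch).

    loss_fn(one_hot) -> scalar loss on the target completion, where one_hot
    is a (suffix_len, vocab_size) one-hot encoding of the adversarial suffix.
    """
    one_hot = torch.nn.functional.one_hot(suffix_ids, vocab_size).float()
    one_hot.requires_grad_(True)

    # Gradient of the loss w.r.t. each one-hot coordinate estimates how much
    # substituting that token at that position would change the loss.
    loss = loss_fn(one_hot)
    loss.backward()
    top_tokens = (-one_hot.grad).topk(top_k, dim=-1).indices  # (suffix_len, top_k)

    # Evaluate random single-token substitutions drawn from the top-k
    # candidates and keep the best one (greedy coordinate descent).
    best_ids, best_loss = suffix_ids, loss.item()
    for _ in range(n_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        tok = top_tokens[pos, torch.randint(top_k, (1,)).item()]
        candidate = suffix_ids.clone()
        candidate[pos] = tok
        with torch.no_grad():
            cand_loss = loss_fn(
                torch.nn.functional.one_hot(candidate, vocab_size).float()
            ).item()
        if cand_loss < best_loss:
            best_ids, best_loss = candidate, cand_loss
    return best_ids, best_loss
```

Running this step repeatedly until the target completion becomes the model's most likely continuation is the core of the attack.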
GCG (Multi)
- Description: Extends GCG by optimizing a single adversarial suffix against multiple models simultaneously, yielding more universal adversarial prompts
- Example: Generating suffixes that consistently bypass content filters across different LLM implementations (sketched below)
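Under the same assumptions as the GCG sketch above, the multi-model variant only changes the objective: per-model losses are averaged so one suffix is optimized against every model at once. `single_model_losses` is a hypothetical list of per-model loss callbacks:

```python
def multi_model_loss(single_model_losses):
    """Build a combined objective for GCG (Multi) (sketch).

    single_model_losses: list of callables, each mapping the suffix one-hot
    encoding to one model's loss on the target completion.
    """
    def loss_fn(one_hot):
        # Averaging (or summing) pushes the suffix to work on all models.
        return sum(fn(one_hot) for fn in single_model_losses) / len(single_model_losses)
    return loss_fn
```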
GCG (Transfer)
- Description: Applies adversarial suffixes optimized on one model to attack different models, without further optimization on the new target
- Example: Using prompts that successfully extract personal information from Model A to test Model B's vulnerabilities
PEZ
- Description: A gradient-based discrete prompt optimization method that maintains a continuous "soft" prompt, projects it onto the nearest vocabulary tokens at each step, and backpropagates through the projected prompt so the final result is a usable hard prompt
- Example: Discovering a short sequence of real tokens that, when appended to a request, steers the model toward a targeted completion (sketched below)
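A minimal sketch of the projection trick, assuming a hypothetical `loss_fn` that scores a `(prompt_len, dim)` matrix of prompt embeddings; the hyperparameters are illustrative:

```python
import torch

def pez_optimize(loss_fn, embedding_matrix, prompt_len, steps=500, lr=0.1):
    """Hard prompt discovery via projected embedding optimization (sketch)."""
    # Start the continuous ("soft") prompt from random vocabulary embeddings.
    init_ids = torch.randint(embedding_matrix.shape[0], (prompt_len,))
    soft = embedding_matrix[init_ids].detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([soft], lr=lr)

    for _ in range(steps):
        # Project each soft embedding onto its nearest real token embedding.
        hard = embedding_matrix[torch.cdist(soft, embedding_matrix).argmin(dim=-1)]
        # Straight-through estimator: evaluate the loss on the projected
        # (discrete) prompt, but let the gradient flow into the soft prompt.
        loss = loss_fn(soft + (hard - soft).detach())
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Final discrete prompt: token ids of the nearest embeddings.
    return torch.cdist(soft, embedding_matrix).argmin(dim=-1)
```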
GBDA (Gradient-based Distributional Attack)
- Description: Optimizes a probability distribution over adversarial tokens using the Gumbel-softmax relaxation so gradients can flow through a discrete prompt; adversarial prompts are then sampled or decoded from the learned distribution
- Example: Learning token distributions whose samples maximize the probability of a targeted completion, with soft constraints keeping the prompt looking innocuous (sketched below)
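A minimal sketch of the distributional idea, assuming a hypothetical `loss_fn` that accepts a `(prompt_len, vocab_size)` matrix of soft token weights (e.g. used to mix the embedding table inside the model); the fluency and similarity terms used by GBDA are omitted:

```python
import torch

def gbda_optimize(loss_fn, vocab_size, prompt_len, steps=500, lr=0.3, tau=1.0):
    """Optimize a distribution over adversarial tokens (sketch)."""
    # Parameterize a categorical distribution per prompt position.
    logits = torch.zeros(prompt_len, vocab_size, requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)

    for _ in range(steps):
        # Gumbel-softmax gives differentiable "soft" samples of token choices.
        soft_tokens = torch.nn.functional.gumbel_softmax(logits, tau=tau, hard=False)
        loss = loss_fn(soft_tokens)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Discretize by taking the most likely token at each position
    # (in practice many prompts can also be sampled from the distribution).
    return logits.argmax(dim=-1)
```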
UAT (Universal Adversarial Triggers)
- Description: Finds input sequences that consistently trigger specific undesired outputs across different contexts
- Example: Identifying phrase patterns that reliably cause models to generate biased responses regardless of the surrounding context
AutoPrompt
- Description: Uses a gradient-guided search to automatically select and test trigger tokens that fill slots in a prompt template, exposing vulnerabilities
- Example: Systematically testing different prompt prefixes to find ones that bypass content filters
Stochastic Few-Shot
- Description: Samples attack prompts from an attacker ("red team") model conditioned on a randomly chosen set of few-shot examples, typically drawn from previously effective test cases
- Example: Repeatedly sampling candidate prompts at nonzero temperature, each time with a different random subset of known-effective prompts in context (sketched below)
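A minimal sketch of the sampling loop; `generate(prompt, temperature)` stands in for a call to an attacker/red-team LM and `example_pool` for a list of previously written test prompts, both assumptions rather than a specific API:

```python
import random

def stochastic_few_shot(generate, example_pool, n_shots=5, n_cases=20, temperature=1.0):
    """Sample new test cases from a red-team model (sketch)."""
    cases = []
    for _ in range(n_cases):
        # A fresh random subset of examples each time broadens the coverage
        # of the sampled test cases.
        shots = random.sample(example_pool, k=min(n_shots, len(example_pool)))
        prompt = (
            "Write another test question in the style of the examples:\n"
            + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(shots))
            + f"\n{len(shots) + 1}."
        )
        cases.append(generate(prompt, temperature).strip())
    return cases
```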
Zero-Shot
- Description: Generates attack prompts from an attacker model using instructions alone, without any in-context examples
- Example: Asking an attacker model to write probing queries for a behavior directly, with no examples in context, and sending them to the target
PAIR (Prompt Automatic Iterative Refinement)
- Description: Uses an attacker LLM to iteratively refine a jailbreak prompt based on the target model's responses and a judge model's scores
- Example: An attacker model proposes a prompt, reads the target's refusal, and rewrites the prompt over several rounds until the judge rates the response as a successful jailbreak (sketched below)
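A minimal sketch of the PAIR loop. `attacker`, `target`, and `judge` are hypothetical callables wrapping three LLM calls (not a specific library API), and the 1-10 judge scale follows the common convention where 10 means the behavior was elicited:

```python
def pair_attack(attacker, target, judge, goal, max_turns=10):
    """Iteratively refine a jailbreak prompt with an attacker LLM (sketch).

    attacker(goal, history) -> candidate prompt (str)
    target(prompt)          -> target model response (str)
    judge(goal, prompt, response) -> integer score from 1 to 10
    """
    history = []
    for _ in range(max_turns):
        prompt = attacker(goal, history)
        response = target(prompt)
        score = judge(goal, prompt, response)
        if score >= 10:  # judge deems the goal behavior elicited
            return prompt, response
        # Feed the failed attempt back so the attacker can refine it.
        history.append({"prompt": prompt, "response": response, "score": score})
    return None, None
```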
TAP (Tree of Attacks with Pruning)
- Description: Extends PAIR by having the attacker LLM branch several refinements from each candidate prompt, building a tree of attacks and pruning branches that are off-topic or score poorly
- Example: Expanding multiple variations of a candidate jailbreak at each step and keeping only the most promising, on-topic leaves for further refinement (sketched below)
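A minimal sketch of the tree search, reusing the hypothetical `attacker`/`target`/`judge` callables from the PAIR sketch plus an `on_topic` relevance check; the branching factor, width, and depth are illustrative:

```python
def tap_attack(attacker, target, judge, on_topic, goal,
               depth=5, branching=3, width=5):
    """Tree of Attacks with Pruning (sketch)."""
    frontier = [{"history": []}]
    for _ in range(depth):
        children = []
        for node in frontier:
            for _ in range(branching):
                prompt = attacker(goal, node["history"])
                # Phase-1 pruning: drop candidates that drifted off-topic
                # before spending a query on the target model.
                if not on_topic(goal, prompt):
                    continue
                response = target(prompt)
                score = judge(goal, prompt, response)
                if score >= 10:
                    return prompt, response
                children.append({
                    "score": score,
                    "history": node["history"] + [(prompt, response, score)],
                })
        # Phase-2 pruning: keep only the highest-scoring leaves.
        frontier = sorted(children, key=lambda c: c["score"], reverse=True)[:width]
        if not frontier:
            break
    return None, None
```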
TAP (Transfer)
- Description: Applies jailbreak prompts found by TAP on one model to attack other models
- Example: Using successful attack prompts from GPT-3 to test GPT-4's vulnerabilities
AutoDAN
- Description: Uses a genetic algorithm to automatically evolve DAN (Do Anything Now) style jailbreak prompts from handcrafted seeds, keeping them fluent and harder to filter
- Example: Mutating and recombining variations of prompts that try to convince the model to act without restrictions, keeping the highest-scoring variants each generation (sketched below)
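A minimal sketch of the genetic search, with `fitness`, `mutate`, and `crossover` as hypothetical operators (AutoDAN itself scores candidates by the target's likelihood of the desired completion and uses LLM-based paraphrasing and sentence-level swaps as its operators):

```python
import random

def autodan_search(fitness, mutate, crossover, seed_prompts,
                   generations=50, population_size=20, elite_frac=0.2):
    """Evolve jailbreak prompts with a genetic algorithm (sketch)."""
    # Initialize the population from handcrafted DAN-style seed prompts.
    population = list(seed_prompts)[:population_size]
    while len(population) < population_size:
        population.append(mutate(random.choice(seed_prompts)))

    for _ in range(generations):
        # Selection: rank prompts by how strongly they elicit the behavior.
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[:max(2, int(elite_frac * population_size))]

        # Refill the population from the elites via crossover and mutation.
        population = list(elites)
        while len(population) < population_size:
            parent_a, parent_b = random.sample(elites, 2)
            population.append(mutate(crossover(parent_a, parent_b)))

    return max(population, key=fitness)
```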
PAP (Persuasive Adversarial Prompts)
- Description: Rephrases harmful requests using human persuasion techniques (e.g. authority endorsement, evidence-based persuasion) so the request sounds legitimate and is harder to detect
- Example: Reframing a refused request as an appeal to expert authority or an evidence-based argument, iterating across persuasion techniques until one succeeds (sketched below)
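A minimal sketch of the persuasion-based rewriting, where `paraphrase(request, technique=...)` is a hypothetical LLM call and the technique list is a small illustrative subset of the full PAP taxonomy:

```python
# Illustrative subset of persuasion techniques (the full PAP taxonomy is larger).
TECHNIQUES = [
    "evidence-based persuasion",
    "authority endorsement",
    "expert testimonial",
    "logical appeal",
]

def pap_variants(paraphrase, request):
    """Rewrite a request in the style of each persuasion technique (sketch)."""
    return {t: paraphrase(request, technique=t) for t in TECHNIQUES}
```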
Human Jailbreaks
- Description: Manual creation of prompts designed to bypass model safeguards
- Example: Using creative writing or role-play scenarios to induce harmful behaviors
Direct Request
- Description: Simply asking the model directly for harmful content
- Example: Directly requesting instructions for harmful activities to establish a baseline attack success rate (sketched below)
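A minimal sketch of the baseline measurement; `target(prompt)` and `classify(behavior, response)` (a hypothetical compliance/refusal classifier) stand in for the actual model and evaluator:

```python
def direct_request_baseline(target, classify, behaviors):
    """Measure how often the target complies with verbatim requests (sketch)."""
    results = {}
    for behavior in behaviors:
        response = target(behavior)
        # True if the classifier judges the response as compliant (unsafe).
        results[behavior] = classify(behavior, response)
    attack_success_rate = sum(results.values()) / max(len(results), 1)
    return results, attack_success_rate
```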