security red-teaming responsible-ai

Overview

This document outlines red teaming methods used in AI security testing and evaluation. These methods help uncover potential vulnerabilities and improve the robustness of AI systems.

Methods

GCG (Greedy Coordinate Gradient)

  • Description: Optimizes an adversarial suffix with a greedy, gradient-guided token search: gradients with respect to the one-hot token encodings rank candidate substitutions, and the swap that most reduces the target loss is kept at each step
  • Example: Appending an optimized suffix to a harmful request so the target model begins its reply with compliance rather than a refusal (a minimal sketch follows this list)
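
A minimal sketch of the greedy coordinate step, shown against a toy embedding-plus-linear scoring model rather than a real LLM loss; the vocabulary size, suffix length, and `toy_loss` are illustrative assumptions, and a real run would also batch the candidate evaluations.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, SUFFIX_LEN, TOPK = 1000, 32, 8, 8

# Toy stand-ins for a real LLM: an embedding table and a linear "target loss".
embed = torch.nn.Embedding(VOCAB, DIM)
score = torch.nn.Linear(DIM, 1)

def toy_loss(one_hot):
    # Differentiable path from token choices to the scalar loss being minimized.
    return score(one_hot @ embed.weight).mean()

suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))

for step in range(50):
    one_hot = F.one_hot(suffix, VOCAB).float().requires_grad_(True)
    toy_loss(one_hot).backward()
    with torch.no_grad():
        # Gradients w.r.t. the one-hot coordinates rank candidate token swaps.
        candidates = (-one_hot.grad).topk(TOPK, dim=1).indices
        best_loss, best_suffix = toy_loss(one_hot).item(), suffix.clone()
        for pos in range(SUFFIX_LEN):          # greedy coordinate step
            for tok in candidates[pos]:
                trial = suffix.clone()
                trial[pos] = tok
                loss = toy_loss(F.one_hot(trial, VOCAB).float()).item()
                if loss < best_loss:
                    best_loss, best_suffix = loss, trial
        suffix = best_suffix

print("optimized suffix token ids:", suffix.tolist())
```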

GCG (Multi)

  • Description: Extends GCG by optimizing a single suffix against several targets at once (multiple behaviors and/or multiple models), which tends to yield more universal adversarial prompts
  • Example: Finding one suffix that consistently bypasses the content safeguards of several different LLMs by summing their losses during optimization (a short sketch follows this list)
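
The only change relative to the single-model sketch above is the objective; this sketch, again with toy stand-ins, sums the loss over several targets before the gradient is taken.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, SUFFIX_LEN = 1000, 32, 8

# Several toy "targets"; a real run would aggregate the losses of real LLMs
# (or of one LLM across several behaviors) instead.
targets = [(torch.nn.Embedding(VOCAB, DIM), torch.nn.Linear(DIM, 1))
           for _ in range(3)]

def joint_loss(one_hot):
    # A universal suffix has to lower the loss on every target at once.
    return sum(score(one_hot @ emb.weight).mean() for emb, score in targets)

suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))
one_hot = F.one_hot(suffix, VOCAB).float().requires_grad_(True)
joint_loss(one_hot).backward()
# one_hot.grad now ranks token swaps by their joint effect on all targets;
# the greedy coordinate step then proceeds exactly as in the sketch above.
print(one_hot.grad.shape)
```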

GCG (Transfer)

  • Description: Applies suffixes optimized against one or more source models, unchanged, to a different target model, measuring how well the attack carries over
  • Example: Taking suffixes that reliably extract personal information from Model A and reporting how often the same suffixes succeed against Model B (a small evaluation harness is sketched after this list)
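
A tiny evaluation harness for the transfer setting; `query_target` and `is_unsafe` are hypothetical stand-ins for a real target-model API call and a real attack-success classifier.

```python
# Hypothetical stubs: query_target and is_unsafe stand in for a real target
# model call and a real attack-success judge.
def query_target(prompt: str) -> str:
    return "I can't help with that."          # placeholder response

def is_unsafe(response: str) -> bool:
    return "sure, here" in response.lower()   # crude placeholder judge

behaviors = ["<harmful request placeholder>"]
transferred_suffixes = ["<suffix optimized on the source model>"]

# Count how many behavior/suffix pairs still succeed on the new target model.
successes = sum(
    is_unsafe(query_target(f"{b} {s}"))
    for b in behaviors
    for s in transferred_suffixes
)
total = len(behaviors) * len(transferred_suffixes)
print(f"transfer attack success rate: {successes / total:.0%}")
```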

PEZ

  • Description: Optimizes a continuous prompt embedding with gradient descent while repeatedly projecting it onto the nearest real token embeddings, so the result is a discrete ("hard") prompt that can actually be fed to the model
  • Example: Learning a short token sequence that, when prepended to a request, steers the model toward the restricted output (see the sketch after this list)
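
A minimal sketch of projected hard-prompt optimization with a straight-through gradient, again using a toy embedding table and a toy loss in place of a real model; the dimensions and learning rate are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
VOCAB, DIM, PROMPT_LEN = 1000, 32, 8

# Toy stand-ins: a frozen embedding table and a linear "task loss".
embed_table = torch.nn.Embedding(VOCAB, DIM).weight.detach()
score = torch.nn.Linear(DIM, 1)

soft_prompt = torch.randn(PROMPT_LEN, DIM, requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=0.3)

for step in range(100):
    # Project each continuous vector onto its nearest real token embedding.
    with torch.no_grad():
        token_ids = torch.cdist(soft_prompt, embed_table).argmin(dim=1)
    hard = embed_table[token_ids]
    # Forward pass uses the projected (hard) prompt; the gradient is applied
    # to the continuous prompt via a straight-through estimator.
    forward = soft_prompt + (hard - soft_prompt).detach()
    loss = score(forward).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("discrete prompt token ids:", token_ids.tolist())
```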

GBDA (Gradient-Based Distributional Attack)

  • Description: Optimizes a probability distribution over adversarial token sequences rather than the tokens themselves, using a Gumbel-softmax relaxation so gradients can flow through the discrete choices; fluency and similarity penalties keep sampled text plausible
  • Example: Learning a token distribution whose samples maximize the probability of the harmful completion while still reading as natural text (see the sketch after this list)
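
A minimal sketch of the distributional objective with the Gumbel-softmax relaxation; the toy embedding and loss stand in for a real model, and the fluency and similarity penalties used by the full method are omitted.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, SEQ_LEN = 1000, 32, 8

# Toy stand-ins for the target model's embedding table and adversarial loss.
embed = torch.nn.Embedding(VOCAB, DIM)
score = torch.nn.Linear(DIM, 1)

# The attack optimizes a *distribution* over token sequences, not the tokens.
logits = torch.zeros(SEQ_LEN, VOCAB, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    # Differentiable sample from the token distribution (Gumbel-softmax).
    soft_tokens = F.gumbel_softmax(logits, tau=1.0, hard=False)
    loss = score(soft_tokens @ embed.weight).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After optimization, draw a discrete adversarial sequence from the distribution.
adv_tokens = torch.distributions.Categorical(logits=logits).sample()
print("adversarial token ids:", adv_tokens.tolist())
```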

UAT (Universal Adversarial Triggers)

  • Description: Searches for a short, fixed token sequence (a trigger) that, when prepended to many unrelated inputs, consistently pushes the model toward a specific undesired output; the search is gradient-guided and averaged over a batch of inputs
  • Example: Identifying a phrase that reliably causes the model to generate biased responses regardless of the surrounding context (a minimal sketch follows this list)
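
A minimal sketch of a trigger search over a batch of contexts, with toy stand-ins for the embedding table and the undesired-output loss; the single argmax update shown here is a simplification of the first-order candidate scoring and beam search used in practice.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, DIM, TRIG_LEN, BATCH, CTX_LEN = 1000, 32, 3, 16, 10

embed = torch.nn.Embedding(VOCAB, DIM)      # toy embedding table
score = torch.nn.Linear(DIM, 1)             # toy "undesired output" loss

contexts = torch.randint(0, VOCAB, (BATCH, CTX_LEN))   # many unrelated inputs
trigger = torch.randint(0, VOCAB, (TRIG_LEN,))

def batch_loss(trig_one_hot):
    trig_emb = trig_one_hot @ embed.weight                 # (TRIG_LEN, DIM)
    full = torch.cat([trig_emb.expand(BATCH, -1, -1),      # prepend the trigger
                      embed(contexts)], dim=1)             # to every context
    return score(full).mean()

for step in range(30):
    one_hot = F.one_hot(trigger, VOCAB).float().requires_grad_(True)
    batch_loss(one_hot).backward()
    with torch.no_grad():
        # Pick, per trigger position, the replacement token predicted to
        # lower the loss the most, averaged over the whole batch.
        trigger = (-one_hot.grad).argmax(dim=1)

print("universal trigger token ids:", trigger.tolist())
```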

AutoPrompt

  • Description: Gradient-guided search that fills a handful of trigger-token slots in a fixed prompt template, scoring candidate tokens by a first-order estimate of their effect on the target output and then verifying the top candidates directly
  • Example: Filling the trigger slots of a template so that the completed prompt bypasses a content filter (see the sketch after this list)
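
A minimal sketch of slot-by-slot trigger search with first-order candidate scoring; the template, dimensions, and toy loss are illustrative assumptions rather than any reference setup.

```python
import torch

torch.manual_seed(0)
VOCAB, DIM, N_TRIGGER, TOPK = 1000, 32, 5, 10

embed = torch.nn.Embedding(VOCAB, DIM)   # toy embedding table
score = torch.nn.Linear(DIM, 1)          # toy loss for the target behavior

# Template: [fixed prefix] [T][T][T][T][T] [fixed suffix]
prefix = torch.randint(0, VOCAB, (4,))
suffix = torch.randint(0, VOCAB, (4,))
trigger = torch.randint(0, VOCAB, (N_TRIGGER,))

def template_loss(trig_tokens):
    return score(embed(torch.cat([prefix, trig_tokens, suffix]))).mean()

for step in range(30):
    trig_emb = embed(trigger).detach().requires_grad_(True)
    full = torch.cat([embed(prefix), trig_emb, embed(suffix)])
    score(full).mean().backward()
    with torch.no_grad():
        pos = step % N_TRIGGER                  # update one slot per step
        # First-order estimate of how much each vocabulary token would
        # lower the loss if placed in this slot, then verify the top few.
        estimates = embed.weight @ (-trig_emb.grad[pos])
        best_tok, best_loss = trigger[pos], template_loss(trigger).item()
        for tok in estimates.topk(TOPK).indices:
            trial = trigger.clone()
            trial[pos] = tok
            loss = template_loss(trial).item()
            if loss < best_loss:
                best_tok, best_loss = tok, loss
        trigger[pos] = best_tok

print("trigger token ids:", trigger.tolist())
```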

Stochastic Few-Shot

  • Description: Samples candidate attack prompts from an attacker LLM that is conditioned, few-shot, on a randomized subset of previously successful attacks; temperature sampling keeps the candidates diverse
  • Example: Seeding the attacker with a few prompts that are known to work, sampling new variants, and adding each new success back into the pool (see the sketch after this list)
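
A minimal sketch of the sampling loop; `attacker_generate`, `target_respond`, and `judge_success` are hypothetical stubs standing in for real model calls and a real attack-success classifier.

```python
import random

random.seed(0)

# Hypothetical stubs for real model calls and a real judge.
def attacker_generate(few_shot_prompt: str) -> str:
    return "<candidate attack prompt>"        # placeholder attacker output

def target_respond(prompt: str) -> str:
    return "<target model response>"          # placeholder target output

def judge_success(response: str) -> bool:
    return random.random() < 0.1              # placeholder success signal

successful = ["<seed attack known to work>"]  # pool of past successes

for _ in range(20):
    # Stochastic few-shot: condition the attacker on a *random* subset of
    # previously successful attacks, then sample a new candidate.
    shots = random.sample(successful, k=min(3, len(successful)))
    few_shot_prompt = "Write a prompt like these:\n" + "\n".join(shots)
    candidate = attacker_generate(few_shot_prompt)
    if judge_success(target_respond(candidate)):
        successful.append(candidate)          # grow the pool of exemplars

print(f"{len(successful) - 1} new successful attacks found")
```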

Zero-Shot

  • Description: Probes model vulnerabilities without any example prompts or optimization loop, generating or issuing attack prompts zero-shot
  • Example: Directly probing model responses to potentially harmful queries, with no prior examples of successful attacks, to see how easily vulnerabilities surface

PAIR (Prompt Automatic Iterative Refinement)

  • Description: Runs an attacker LLM against the target model in a loop: the attacker proposes a jailbreak prompt, observes the target's response and a judge's rating, and revises the prompt, typically within a budget of only a few dozen queries
  • Example: An attacker model repeatedly rewording a refused request until the judge rates the target's reply as a successful jailbreak (see the sketch after this list)
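
A minimal sketch of the attacker/target/judge loop; all three functions are hypothetical stubs, and the 1-10 judge scale and query budget are illustrative assumptions.

```python
# Hypothetical stubs for the three roles in the refinement loop.
def attacker_refine(objective: str, last_prompt: str, last_response: str) -> str:
    return f"<revised jailbreak for: {objective}>"   # placeholder attacker output

def target_respond(prompt: str) -> str:
    return "<target model response>"                 # placeholder target output

def judge_score(objective: str, response: str) -> int:
    return 3                                         # placeholder 1-10 rating

objective = "<harmful behavior being tested>"
prompt, response = objective, ""

for turn in range(5):                                # small query budget
    prompt = attacker_refine(objective, prompt, response)
    response = target_respond(prompt)
    if judge_score(objective, response) >= 10:       # judge: jailbreak succeeded
        print("jailbreak found:", prompt)
        break
```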

TAP (Tree of Attacks with Pruning)

  • Description: Extends the PAIR idea with tree-structured search: each candidate prompt is branched into several refinements, an evaluator prunes branches that drift off topic, and only the best-scoring leaves are expanded further
  • Example: Growing a shallow tree of reworded jailbreak attempts and pruning aggressively so that only promising attack lines consume target-model queries (see the sketch after this list)
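
A minimal sketch of the branch-and-prune search; the stubs, branching factor, beam width, and 1-10 scoring scale are all illustrative assumptions.

```python
# Hypothetical stubs: an attacker proposes refinements, an evaluator prunes
# off-topic prompts and scores target responses.
def attacker_branch(prompt: str, n: int = 2) -> list[str]:
    return [f"{prompt} <variant {i}>" for i in range(n)]

def on_topic(prompt: str, objective: str) -> bool:
    return True                                # placeholder pruning check

def target_respond(prompt: str) -> str:
    return "<target model response>"           # placeholder target output

def judge_score(objective: str, response: str) -> int:
    return 3                                   # placeholder 1-10 rating

objective = "<harmful behavior being tested>"
frontier = [objective]                         # root of the attack tree

for depth in range(3):
    # Branch every surviving node, prune off-topic children, then keep only
    # the best-scoring leaves for the next round of expansion.
    children = [c for p in frontier for c in attacker_branch(p)
                if on_topic(c, objective)]
    scored = [(judge_score(objective, target_respond(c)), c) for c in children]
    if any(s >= 10 for s, _ in scored):
        print("jailbreak found")
        break
    frontier = [c for _, c in sorted(scored, reverse=True)[:4]]   # beam width 4
```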

TAP (Transfer)

  • Description: Applies jailbreak prompts found by TAP against one model, unchanged, to other target models
  • Example: Using attack prompts discovered against GPT-3 to test GPT-4's vulnerabilities

AutoDAN

  • Description: Automatically evolves fluent, DAN (Do Anything Now) style jailbreak prompts with a genetic algorithm: a population seeded from handcrafted jailbreaks is improved through mutation and crossover, scored against the target model
  • Example: Breeding variations of prompts that try to convince the model to act without restrictions until one reliably succeeds (see the sketch after this list)
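
A minimal sketch of the evolutionary loop; `fitness`, `llm_rephrase`, and `crossover` are hypothetical stubs (a real run would score fitness with the target model's loss or a judge and mutate prompts with an LLM while preserving meaning).

```python
import random

random.seed(0)

def fitness(prompt: str) -> float:
    return random.random()                       # placeholder attack score

def llm_rephrase(prompt: str) -> str:
    return prompt + " <rephrased>"               # placeholder mutation

def crossover(a: str, b: str) -> str:
    return a[:len(a) // 2] + b[len(b) // 2:]     # splice two parent prompts

# Population seeded from handcrafted DAN-style jailbreaks.
population = [f"<handcrafted DAN-style seed prompt {i}>" for i in range(8)]

for generation in range(10):
    elites = sorted(population, key=fitness, reverse=True)[:4]   # keep the best
    offspring = [llm_rephrase(crossover(random.choice(elites),
                                        random.choice(elites)))
                 for _ in range(len(population) - len(elites))]
    population = elites + offspring

print("best evolved prompt:", max(population, key=fitness))
```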

PAP (Persuasive Adversarial Prompts)

  • Description: Rewrites harmful requests using techniques drawn from a taxonomy of persuasion strategies (e.g., authority endorsement, evidence-based or emotional appeals), so the request reads as a reasonable, well-argued ask rather than a blunt violation
  • Example: Recasting a plainly harmful request as an appeal to scientific evidence or expert authority and checking whether the reframed version gets through (see the sketch after this list)
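
A minimal sketch of technique-by-technique rewriting; the stubs and the particular list of techniques are illustrative assumptions.

```python
# Hypothetical stub: a real pipeline uses an LLM to rewrite the request in
# the style of each persuasion technique.
def persuasive_rewrite(request: str, technique: str) -> str:
    return f"<'{request}' rephrased using {technique}>"

def target_respond(prompt: str) -> str:
    return "<target model response>"             # placeholder target output

def judge_success(response: str) -> bool:
    return False                                 # placeholder judge

# A few techniques from a persuasion taxonomy (illustrative subset).
techniques = ["evidence-based persuasion", "authority endorsement",
              "emotional appeal", "logical appeal"]

request = "<plain harmful request being tested>"
for technique in techniques:
    candidate = persuasive_rewrite(request, technique)
    if judge_success(target_respond(candidate)):
        print("successful persuasion technique:", technique)
        break
```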

Human Jailbreaks

  • Description: Manual creation of prompts designed to bypass model safeguards
  • Example: Using creative writing or role-play scenarios to induce harmful behaviors

Direct Request

  • Description: Simply asking the model directly for harmful content
  • Example: Sending the harmful request verbatim, with no adversarial framing, to establish a baseline refusal rate