LLM Comparator: Visual Analytics Tool for Side-by-Side LLM Evaluation

https://arxiv.org/abs/2402.10524

https://ai.google.dev/responsible/docs/evaluation/llm_comparator

Overview

LLM Comparator is a visual analytics tool developed by Google Research for interactively analyzing the results of automatic side-by-side evaluations of large language models (LLMs). The tool helps researchers and engineers understand when and why one model performs better or worse than a baseline, and how the two models' responses differ qualitatively.

Key Features

1. Interactive Table

  • Displays individual prompts, responses from both models, rater scores, and rationale summaries
  • Highlights overlapping words between the two responses for easy comparison (see the sketch after this list)
  • Provides summarized rationales using LLM-generated bullet points
  • Shows detailed rating results on demand
  • Uses color coding to distinguish between models and rating decisions
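
A minimal sketch of how the overlap highlighting could be computed. The tokenizer and the shared-word criterion here are illustrative assumptions, not the tool's actual implementation:

```typescript
// Hypothetical sketch: find words shared by two responses so a UI
// layer can highlight them. Tokenization is an assumed simplification.

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

/** Returns each token of `response` tagged with whether it also occurs
 *  in `other`; a renderer could map shared=true to a highlight span. */
function markOverlap(
  response: string,
  other: string,
): { token: string; shared: boolean }[] {
  const otherVocab = new Set(tokenize(other));
  return tokenize(response).map((token) => ({
    token,
    shared: otherVocab.has(token),
  }));
}

// Example: tokens present in both responses come back with shared=true.
console.log(markOverlap('The cat sat quietly', 'A cat sat down'));
```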

2. Visualization Summary Components

Score Distribution

  • Visualizes the distribution of scores assigned by the automatic raters
  • Provides a more granular view of performance than a single aggregate win rate (a binning sketch follows this list)
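
As an illustration of the underlying aggregation, here is a minimal histogram-binning sketch. The score range and bin count are assumptions for illustration; the paper does not prescribe a particular rating scale here:

```typescript
// Buckets scores in [min, max] into `bins` equal-width intervals.
function histogram(
  scores: number[],
  min: number,
  max: number,
  bins: number,
): number[] {
  const counts = new Array<number>(bins).fill(0);
  const width = (max - min) / bins;
  for (const s of scores) {
    // Clamp so the maximum value falls into the last bin.
    const i = Math.min(bins - 1, Math.max(0, Math.floor((s - min) / width)));
    counts[i] += 1;
  }
  return counts;
}

// Example with rater scores on an assumed -1.5..1.5 preference scale.
console.log(histogram([-1.2, -0.3, 0.0, 0.4, 0.4, 1.1], -1.5, 1.5, 6));
// -> [1, 0, 1, 3, 0, 1]
```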

Win Rates by Prompt Category

  • Shows win rates broken down by prompt category
  • Enables identification of categories where a model excels or underperforms
  • Facilitates targeted analysis of specific prompt types (see the sketch after this list)
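
A sketch of the per-category aggregation, under an assumed record shape. Counting ties as half a win is one common convention, not necessarily the tool's:

```typescript
// Hypothetical record shape standing in for the tool's evaluation data.
interface RatingRecord {
  category: string;
  decision: 'A wins' | 'B wins' | 'tie';
}

/** Aggregates per-prompt decisions into a win rate for model A per category. */
function winRatesByCategory(
  records: RatingRecord[],
): Map<string, { aWinRate: number; n: number }> {
  const byCategory = new Map<string, { aWins: number; n: number }>();
  for (const r of records) {
    const agg = byCategory.get(r.category) ?? { aWins: 0, n: 0 };
    agg.n += 1;
    // Assumed convention: a tie counts as half a win for A.
    agg.aWins += r.decision === 'A wins' ? 1 : r.decision === 'tie' ? 0.5 : 0;
    byCategory.set(r.category, agg);
  }
  const result = new Map<string, { aWinRate: number; n: number }>();
  for (const [cat, { aWins, n }] of byCategory) {
    result.set(cat, { aWinRate: aWins / n, n });
  }
  return result;
}

// Example usage:
const rates = winRatesByCategory([
  { category: 'coding', decision: 'A wins' },
  { category: 'coding', decision: 'tie' },
  { category: 'creative writing', decision: 'B wins' },
]);
console.log(rates); // coding -> { aWinRate: 0.75, n: 2 }, ...
```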

Rationale Clusters

  • Uses an LLM-based approach to generate representative themes from rater rationales
  • Allows comparison of rationale frequencies between models
  • Supports dynamic addition and regeneration of clusters
  • Enables filtered analysis by prompt category (one possible assignment step is sketched below)
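
The paper describes an LLM-driven pipeline for generating and assigning clusters; one plausible assignment step, sketched below, scores each rationale against each cluster label by embedding similarity. The cosine measure, the multi-label thresholding, and the 0.8 cutoff are all assumptions for illustration:

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

/** Multi-label assignment: a rationale may match several cluster themes. */
function assignToClusters(
  rationaleVec: number[],
  clusterVecs: Map<string, number[]>,
  threshold = 0.8,
): string[] {
  return [...clusterVecs]
    .filter(([, vec]) => cosine(rationaleVec, vec) >= threshold)
    .map(([label]) => label);
}

// Toy usage with 2-D stand-in vectors (real embeddings would come from
// a separate, unspecified embedding model).
const clusters = new Map([
  ['Response is more detailed', [1, 0]],
  ['Response is more concise', [0, 1]],
]);
console.log(assignToClusters([0.9, 0.1], clusters)); // ['Response is more detailed']
```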

N-grams and Custom Functions

  • Analyzes frequently occurring phrases in model responses
  • Supports custom functions for pattern detection
  • Enables quantitative comparison of response characteristics (see the sketch after this list)
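
A sketch of both analyses under assumed data shapes: counting word n-grams across a model's responses, and applying a custom function, modeled here as a boolean predicate over a response. The regex pattern is hypothetical, not one from the paper:

```typescript
// Counts word-level n-grams across a set of responses.
function ngramCounts(texts: string[], n: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const text of texts) {
    const words = text.toLowerCase().split(/\s+/).filter(Boolean);
    for (let i = 0; i + n <= words.length; i++) {
      const gram = words.slice(i, i + n).join(' ');
      counts.set(gram, (counts.get(gram) ?? 0) + 1);
    }
  }
  return counts;
}

// A custom function as a predicate; this apology pattern is made up.
const mentionsApology = (response: string): boolean =>
  /\b(sorry|apologi[sz]e)\b/i.test(response);

// Fraction of responses for which the pattern fires, comparable across models.
function patternRate(responses: string[], fn: (s: string) => boolean): number {
  return responses.filter(fn).length / responses.length;
}

// Example usage:
const modelA = ['I am sorry, I cannot help with that', 'Sure, here you go'];
console.log(patternRate(modelA, mentionsApology)); // 0.5
console.log(ngramCounts(modelA, 2));
```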

Implementation

  • Web-based application built with TypeScript and Lit web components (a minimal component sketch follows this list)
  • Preprocesses evaluation data using LLMs for rationale summarization and clustering
  • Supports dynamic filtering, sorting, and cluster assignments
  • Deployed at Google, where 400+ users have analyzed 1,000+ evaluation experiments
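
To make the stack concrete, here is a minimal Lit web component in TypeScript. The component name, properties, and markup are invented for illustration and are not taken from LLM Comparator's source:

```typescript
import { LitElement, html } from 'lit';
import { customElement, property } from 'lit/decorators.js';

// Hypothetical component: renders one category's win rate as a bar.
@customElement('win-rate-bar')
export class WinRateBar extends LitElement {
  // Fraction of comparisons won by model A, in [0, 1].
  @property({ type: Number }) winRate = 0.5;
  @property({ type: String }) category = '';

  render() {
    const pct = Math.round(this.winRate * 100);
    return html`
      <div>
        <span>${this.category}</span>
        <div style="width: 200px; background: #eee;">
          <div style="width: ${pct * 2}px; background: #4285f4;">${pct}%</div>
        </div>
      </div>
    `;
  }
}
```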

User Study Findings

Key Usage Patterns

  1. Example-first deep dive
    • Users carefully examine individual examples
    • Form hypotheses about behavioral differences
    • Use filters to find similar patterns
  2. Prior experience-based testing
    • Users search for known undesirable patterns
    • Compare pattern occurrences across models
    • Leverage rationale summaries for verification
  3. Rationale-centric exploration
    • Enables new ways to analyze evaluation data
    • Supports discovery of interesting patterns
    • Facilitates hypothesis formation and testing

Future Opportunities

  • LLM-based custom metrics for high-level attributes
  • Pre-configured detection of undesirable patterns
  • Improved rationale clustering methods
  • Enhanced support for large-scale evaluations

References

Kahng, M., Tenney, I., Pushkarna, M., Liu, M. X., Wexler, J., Reif, E., … & Dixon, L. (2024). LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models. arXiv:2402.10524.

llm-evaluation visualization tool side-by-side-evaluation responsible-ai