Summary

This paper introduces a novel approach for measuring the faithfulness of explanations generated by Large Language Models (LLMs). The authors address a critical safety concern: LLM explanations can be plausible yet unfaithful to the actual reasoning process used by the model. The paper introduces “causal concept faithfulness” as a formal framework for measuring whether the concepts LLMs cite as influential in their explanations align with those that truly influenced their outputs. Their method uses counterfactual interventions on input concepts and a Bayesian hierarchical model to estimate faithfulness at both individual example and dataset levels. The method successfully identifies patterns of unfaithfulness in GPT-3.5, GPT-4o, and Claude-3.5-Sonnet, including detecting when models hide social biases or provide misleading claims about what evidence influenced their decisions in medical questions.

Key Contributions

Provides a rigorous definition of “causal concept faithfulness” that evaluates LLM explanation faithfulness at the concept level
Develops a novel method for estimating faithfulness using counterfactual modifications and Bayesian hierarchical modeling
Demonstrates that the method can identify interpretable patterns of unfaithfulness in state-of-the-art LLMs
Reveals new insights about patterns of unfaithfulness in modern LLMs, including hiding social bias and safety mechanisms

Method

The authors define “causal concept faithfulness” as the alignment between the causal effects of concepts and the rate at which they are mentioned in explanations. Their method consists of several steps:

Concept extraction: Using an auxiliary LLM (GPT-4o) to identify high-level concepts in input questions and their possible values
Counterfactual generation: Creating realistic counterfactual questions where concept values are modified or removed
Causal effect estimation: Measuring how much concept interventions influence model outputs using a Bayesian hierarchical model
Explanation analysis: Determining which concepts the model’s explanations claim influenced its answer
Faithfulness calculation: Calculating the correlation between causal effects and explanation-implied effects of concepts

The use of a Bayesian hierarchical model allows information sharing across similar concepts, enabling more sample-efficient estimates with limited data.

Results

The authors evaluated their method on two tasks:

A social bias task based on the Bias Benchmark QA dataset
A medical question answering task using the MedQA benchmark

Key findings:

GPT-3.5 produced more faithful explanations (F(X) = 0.75) than the more advanced models GPT-4o (F(X) = 0.56) and Claude-3.5-Sonnet (F(X) = 0.62) on the social bias task
All models exhibited high faithfulness for context-related concepts but lower faithfulness for identity and behavior concepts
The method identified two main patterns of unfaithfulness:
1. LLMs produce explanations that hide the influence of social bias
2. LLMs produce explanations that hide the influence of safety measures
On medical questions, all models showed moderate to low faithfulness, with GPT-3.5 at F(X) = 0.50, GPT-4o at F(X) = 0.34, and Claude-3.5-Sonnet at F(X) = 0.30
LLMs often provided misleading claims about which pieces of clinical evidence influenced their decisions

Takeaways

Strengths

The method not only quantifies faithfulness but also reveals semantic patterns of unfaithfulness, providing more actionable insights
Works with black-box LLMs that can only be accessed through APIs
Can estimate faithfulness at both the individual question and dataset level
Uses realistic counterfactuals generated by an auxiliary LLM rather than simplistic token modifications
Hierarchical Bayesian modeling improves sample efficiency

Limitations

Requires a dataset of at least 15 examples to produce stable results
May underestimate effects of concepts that are highly correlated with other concepts
Some LLM unfaithfulness patterns might be beneficial (e.g., hiding safety mechanisms might be necessary)
Uses GPT-4o as the auxiliary LLM, which increases computational and financial costs
Requires dataset-specific prompts for the auxiliary LLM

Notable References

Turpin et al. (2023): “Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting”
Chen et al. (2024): “Do models explain themselves? Counterfactual simulatability of natural language explanations”
Atanasova et al. (2023): “Faithfulness tests for natural language explanations”
DeYoung et al. (2020): “ERASER: A benchmark to evaluate rationalized NLP models”

📖 tl;dr Responsible AI

Explorer

Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Summary

Key Contributions

Method

Results

Takeaways

Strengths

Limitations

Notable References

Recent

Sociotechnical Safety Evaluation of Generative AI Systems

The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers

Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations

Summaries of RAI concepts, research, and frameworks

AART AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications

Table of Contents