Summary

This paper introduces a novel approach for measuring the faithfulness of explanations generated by Large Language Models (LLMs). The authors address a critical safety concern: LLM explanations can be plausible yet unfaithful to the actual reasoning process used by the model. The paper introduces โ€œcausal concept faithfulnessโ€ as a formal framework for measuring whether the concepts LLMs cite as influential in their explanations align with those that truly influenced their outputs. Their method uses counterfactual interventions on input concepts and a Bayesian hierarchical model to estimate faithfulness at both individual example and dataset levels. The method successfully identifies patterns of unfaithfulness in GPT-3.5, GPT-4o, and Claude-3.5-Sonnet, including detecting when models hide social biases or provide misleading claims about what evidence influenced their decisions in medical questions.

Key Contributions

  • Provides a rigorous definition of โ€œcausal concept faithfulnessโ€ that evaluates LLM explanation faithfulness at the concept level
  • Develops a novel method for estimating faithfulness using counterfactual modifications and Bayesian hierarchical modeling
  • Demonstrates that the method can identify interpretable patterns of unfaithfulness in state-of-the-art LLMs
  • Reveals new insights about patterns of unfaithfulness in modern LLMs, including hiding social bias and safety mechanisms

Method

The authors define โ€œcausal concept faithfulnessโ€ as the alignment between the causal effects of concepts and the rate at which they are mentioned in explanations. Their method consists of several steps:

  1. Concept extraction: Using an auxiliary LLM (GPT-4o) to identify high-level concepts in input questions and their possible values
  2. Counterfactual generation: Creating realistic counterfactual questions where concept values are modified or removed
  3. Causal effect estimation: Measuring how much concept interventions influence model outputs using a Bayesian hierarchical model
  4. Explanation analysis: Determining which concepts the modelโ€™s explanations claim influenced its answer
  5. Faithfulness calculation: Calculating the correlation between causal effects and explanation-implied effects of concepts

The use of a Bayesian hierarchical model allows information sharing across similar concepts, enabling more sample-efficient estimates with limited data.

Results

The authors evaluated their method on two tasks:

  1. A social bias task based on the Bias Benchmark QA dataset
  2. A medical question answering task using the MedQA benchmark

Key findings:

  • GPT-3.5 produced more faithful explanations (F(X) = 0.75) than the more advanced models GPT-4o (F(X) = 0.56) and Claude-3.5-Sonnet (F(X) = 0.62) on the social bias task
  • All models exhibited high faithfulness for context-related concepts but lower faithfulness for identity and behavior concepts
  • The method identified two main patterns of unfaithfulness:
    1. LLMs produce explanations that hide the influence of social bias
    2. LLMs produce explanations that hide the influence of safety measures
  • On medical questions, all models showed moderate to low faithfulness, with GPT-3.5 at F(X) = 0.50, GPT-4o at F(X) = 0.34, and Claude-3.5-Sonnet at F(X) = 0.30
  • LLMs often provided misleading claims about which pieces of clinical evidence influenced their decisions

Takeaways

Strengths

  • The method not only quantifies faithfulness but also reveals semantic patterns of unfaithfulness, providing more actionable insights
  • Works with black-box LLMs that can only be accessed through APIs
  • Can estimate faithfulness at both the individual question and dataset level
  • Uses realistic counterfactuals generated by an auxiliary LLM rather than simplistic token modifications
  • Hierarchical Bayesian modeling improves sample efficiency

Limitations

  • Requires a dataset of at least 15 examples to produce stable results
  • May underestimate effects of concepts that are highly correlated with other concepts
  • Some LLM unfaithfulness patterns might be beneficial (e.g., hiding safety mechanisms might be necessary)
  • Uses GPT-4o as the auxiliary LLM, which increases computational and financial costs
  • Requires dataset-specific prompts for the auxiliary LLM

Notable References

  • Turpin et al. (2023): โ€œLanguage models donโ€™t always say what they think: unfaithful explanations in chain-of-thought promptingโ€
  • Chen et al. (2024): โ€œDo models explain themselves? Counterfactual simulatability of natural language explanationsโ€
  • Atanasova et al. (2023): โ€œFaithfulness tests for natural language explanationsโ€
  • DeYoung et al. (2020): โ€œERASER: A benchmark to evaluate rationalized NLP modelsโ€