Abstract
Mechanistic interpretability aims to reverse-engineer the algorithms implemented by neural networks by identifying interpretable computational units — circuits, features, and attention heads — and characterizing how they compose to produce model behavior. Over the past three years, this program has produced a series of concrete findings: from induction heads that implement in-context learning to monosemantic features recoverable through sparse autoencoders. This paper surveys the theoretical foundations and empirical methodology of mechanistic interpretability research, analyzes key discoveries in transformer-based language models, and critically examines the gap between local circuit explanations and global model behavior. We argue that mechanistic interpretability is transitioning from a collection of case studies into a principled scientific discipline, but that fundamental obstacles — most notably superposition and polysemanticity — remain unsolved and constrain what can be claimed about model internals. We conclude by identifying the most productive open problems and methodological directions.
1. Introduction
A large language model (LLM) with tens or hundreds of billions of parameters is, at one level of description, simply a function $f: \mathbb{R}^{d_{\text{vocab}} \times T} \to \mathbb{R}^{d_{\text{vocab}}}$ mapping token sequences to next-token distributions. At another level, it is an opaque computational artifact whose internal representations resist systematic analysis. The field of mechanistic interpretability attempts to bridge this gap — not by treating the model as a black box to be probed from outside, but by identifying specific internal mechanisms that implement specific computations.
The practical stakes are considerable. As LLMs are deployed in high-stakes settings, understanding why they behave as they do — and predicting when they will fail — becomes a precondition for trustworthy deployment. Behavioral evaluation, however thorough, cannot distinguish a model that has learned a robust general rule from one that has learned a spurious correlate that will break on distribution shift. Mechanistic understanding, by contrast, provides explanations grounded in the actual computation being performed.
Mechanistic interpretability as a coherent research program was crystallized by Olah et al. (2020) in their work on circuits in convolutional networks and formalized for transformers by Elhage et al. (2021). The core thesis is that neural networks implement algorithms that can be identified by reverse-engineering the weights, and that the appropriate unit of analysis is the circuit — a subgraph of the model’s computational graph that implements a specific function. This perspective contrasts with earlier representational interpretability work (Zeiler and Fergus, 2014; Kim et al., 2018), which characterized what models represent without addressing how those representations are used.
In this paper we examine the theoretical foundations of the mechanistic program (Section 2), survey its major empirical findings in transformers (Section 3), analyze the mathematical structure of superposition and why it complicates interpretation (Section 4), discuss implications for AI safety and alignment (Section 5), and conclude with an assessment of the field’s trajectory (Section 6).
2. Related Work
Interpretability research in deep learning has a long history, but mechanistic interpretability represents a methodological departure from two dominant earlier traditions.
Feature visualization and representational analysis. Zeiler and Fergus (2014) used deconvolutional networks to visualize which input patterns maximally activate convolutional filters, establishing that early layers learn edge detectors and later layers learn semantic features. Yosinski et al. (2015) extended this with tools for visualizing activations in running networks, and subsequent network-dissection work demonstrated systematic correspondence between individual units and semantic concepts. Kim et al. (2018) introduced TCAV (Testing with Concept Activation Vectors), which probes for the presence of human-defined concepts in model activations via linear classifiers. These methods characterize what is represented but leave open how representations are used in computation.
Probing classifiers. Tenney et al. (2019) and Jawahar et al. (2019) trained linear classifiers on intermediate representations of BERT to test which linguistic properties (POS tags, dependency arcs, coreference) are linearly decodable at each layer. This methodology provides evidence that transformers develop structured linguistic representations but has been criticized by Hewitt and Liang (2019) for conflating representational structure with causal relevance — a probe may decode a property that the model never uses.
Circuits and mechanistic analysis. Olah et al. (2020) introduced the circuits framework for convolutional networks, identifying specific computational subgraphs responsible for phenomena like curve detection and multimodal neurons. Elhage et al. (2021) formalized the transformer circuits framework, decomposing attention and MLP layers into independently analyzable components. Wang et al. (2022) demonstrated the methodology at scale in their IOI (Indirect Object Identification) circuit analysis of GPT-2. Nanda et al. (2023) applied it to discover that transformers solving modular arithmetic implement Fourier representations and trigonometric identities. Bricken et al. (2023) and Templeton et al. (2024) developed sparse autoencoder (SAE) methods for recovering monosemantic features from polysemantic neurons.
3. Technical Analysis
3.1 The Transformer Circuits Framework
A transformer layer consists of a multi-head attention sublayer followed by a position-wise MLP sublayer with residual connections. The residual stream formulation of Elhage et al. (2021) writes the forward pass as a sum of contributions:
$$x^{(L)} = x^{(0)} + \sum_{l=1}^{L} \left( \text{Attn}^{(l)}(x^{(l-1)}) + \text{MLP}^{(l)}(x^{(l-1)}) \right)$$
where $x^{(0)} = W_E t + W_\text{pos} p$ is the embedding of token $t$ at position $p$, and $W_E, W_\text{pos}$ are the token and position embedding matrices respectively.
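The residual-stream decomposition can be checked numerically. The sketch below uses placeholder attention and MLP sublayers with random weights (the names `attn`/`mlp` and all constants are illustrative assumptions, not a real transformer); its only point is that the final state equals the embedding plus the sum of per-sublayer contributions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq_len = 16, 4, 8

W_attn = 0.1 * rng.normal(size=(n_layers, d_model, d_model))
W_mlp = 0.1 * rng.normal(size=(n_layers, d_model, d_model))

# Placeholder sublayers: any map from the residual stream to an
# additive update fits the decomposition in the text.
def attn(x, l):
    return np.tanh(x @ W_attn[l])

def mlp(x, l):
    return np.tanh(x @ W_mlp[l])

x0 = rng.normal(size=(seq_len, d_model))  # stands in for W_E t + W_pos p

# Forward pass in the parallel-sublayer form of the equation above.
x = x0
contributions = []
for l in range(n_layers):
    a, m = attn(x, l), mlp(x, l)
    contributions += [a, m]
    x = x + a + m

# The final residual stream is the embedding plus all contributions.
assert np.allclose(x, x0 + sum(contributions))
```

Because every sublayer only adds to the stream, any output logit can be attributed back to a sum over these per-component contributions, which is what makes the decomposition useful for analysis.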
Each attention head $h$ in layer $l$ computes:
$$\text{head}^{(l,h)}(x) = \text{softmax}\!\left(\frac{x W_Q^{(l,h)} (x W_K^{(l,h)})^\top}{\sqrt{d_k}}\right) x W_V^{(l,h)} W_O^{(l,h)}$$
The key insight of the circuits framework is that the behavior of attention heads can be written in terms of QK circuits (which determine what the head attends to) and OV circuits (which determine what information is read from the attended position and written to the residual stream). The effective operation of a one-layer attention-only model can be fully characterized by the low-rank matrices $W_E^\top W_Q^{(l,h)} W_K^{(l,h)\top} W_E$ (the QK circuit) and $W_E^\top W_V^{(l,h)} W_O^{(l,h)} W_U$ (the OV circuit, where $W_U$ is the unembedding matrix), each of rank at most $d_k$, which determine how different tokens interact; deeper attention-only models additionally require analyzing how heads compose through the residual stream.
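To illustrate why these circuit matrices are tractable objects of analysis, the following sketch (random weights, assumed shapes and names) builds the QK and OV circuit matrices for a single head and checks that their rank is bounded by the head dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model, d_head = 50, 32, 8

W_E = rng.normal(size=(d_vocab, d_model))  # token embeddings (one row per token)
W_U = rng.normal(size=(d_model, d_vocab))  # unembedding
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

# QK circuit: token-by-token attention scores before the softmax.
qk_circuit = (W_E @ W_Q) @ (W_E @ W_K).T   # (d_vocab, d_vocab)

# OV circuit: effect on the output logits of attending to each token.
ov_circuit = W_E @ W_V @ W_O @ W_U         # (d_vocab, d_vocab)

# Both factor through the head dimension, so rank is at most d_head,
# even though the matrices are d_vocab x d_vocab.
assert np.linalg.matrix_rank(qk_circuit) <= d_head
assert np.linalg.matrix_rank(ov_circuit) <= d_head
```

The low rank is what makes the analysis feasible: a vocabulary-sized interaction matrix is summarized by a small number of directions determined by the head's weights.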
3.2 Induction Heads
One of the cleanest mechanistic findings is the induction head circuit (Olsson et al., 2022). Induction heads implement a form of in-context pattern matching: given a sequence containing $[A][B]\ldots[A]$, the induction head predicts $[B]$ at the second occurrence of $[A]$. The mechanism requires two attention heads across two layers:
- A previous-token head in layer 1 that copies information from position $i-1$ to position $i$, writing $W_O W_V x_{i-1}$ to the residual stream at position $i$.
- An induction head in layer 2 that uses Q-K matching to attend from position $j$ to positions $i$ where the previous-token information $x_{i-1}$ matches $x_j$, then copies $x_i$ (the token that followed the earlier occurrence) into the output.
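The two-step algorithm can be sketched directly on token IDs. The example below is a toy with invented tokens, not a trained model; it shows the matching-and-copying logic the two heads implement jointly:

```python
# Hypothetical token sequence containing the pattern [A][B] ... [A].
tokens = [5, 7, 2, 9, 5]   # A = 5, B = 7; the final token repeats A

# Layer-1 "previous-token head": position i also carries tokens[i-1].
prev_token = [None] + tokens[:-1]

# Layer-2 "induction head" at the last position j: attend to every
# earlier position i whose *previous* token matches tokens[j], then
# copy the token found there (the continuation of the earlier match).
j = len(tokens) - 1
matches = [i for i in range(1, j) if prev_token[i] == tokens[j]]
prediction = tokens[matches[-1]] if matches else None

assert prediction == 7  # the circuit predicts [B] after the repeated [A]
```

Note that nothing in this procedure depends on the specific tokens involved, which is why induction heads generalize to arbitrary repeated patterns in context.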
This circuit provides a mechanistic account of the phase change in in-context learning ability observed during training — the abrupt improvement coincides with the formation of induction heads, as verified by ablation experiments showing that ablating the induction head circuit reduces in-context learning performance by a predictable amount.
3.3 Superposition and the Geometry of Representation
A central obstacle to mechanistic interpretability is superposition — the phenomenon whereby neural networks represent more features than they have dimensions, by encoding multiple features in overlapping directions. Elhage et al. (2022) provide a formal analysis. Suppose a network needs to represent $n$ features with values $f_1, \ldots, f_n$ in an $m$-dimensional space with $m \ll n$. Define the representation $\hat{x} = \sum_i f_i W_i$ where $W_i \in \mathbb{R}^m$ are feature directions. The reconstruction loss when using a ReLU nonlinearity is:
$$\mathcal{L} = \sum_i S_i \left(1 - \|W_i\|^2\right)^2 + \sum_{i \neq j} S_i S_j (W_i \cdot W_j)^2$$
where $S_i$ is the importance (frequency $\times$ salience) of feature $i$. The model faces a tradeoff: it can represent a small number of features cleanly ($W_i \cdot W_j \approx 0$ for all $i \neq j$, requiring $n \leq m$), or it can represent more features with interference, accepting higher reconstruction loss in exchange for broader coverage.
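The tradeoff can be made concrete by evaluating the loss above for two candidate sets of feature directions: an orthonormal basis that drops the surplus features, versus random unit vectors that represent all features with interference. All numbers here are illustrative assumptions:

```python
import numpy as np

def superposition_loss(W, S):
    """Loss from the text: per-feature benefit term plus pairwise interference."""
    norms = np.sum(W**2, axis=1)               # ||W_i||^2
    benefit = np.sum(S * (1.0 - norms)**2)
    G = W @ W.T                                # Gram matrix W_i . W_j
    interference = (S[:, None] * S[None, :]) * G**2
    np.fill_diagonal(interference, 0.0)        # only i != j terms
    return benefit + interference.sum()

n, m = 6, 4                                    # 6 features, 4 dimensions
S = np.full(n, 0.2)                            # uniform importance

# Option 1: represent only m features cleanly, drop the remaining n - m.
W_clean = np.zeros((n, m))
W_clean[:m] = np.eye(m)

# Option 2: represent all n features as random unit vectors, with interference.
rng = np.random.default_rng(0)
W_super = rng.normal(size=(n, m))
W_super /= np.linalg.norm(W_super, axis=1, keepdims=True)

loss_clean = superposition_loss(W_clean, S)    # pays S_i per dropped feature
loss_super = superposition_loss(W_super, S)    # pays the interference terms
```

Which option wins depends on the importance values and on how often interfering features co-activate, which is exactly the sparsity dependence discussed next.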
The key prediction is that whether a feature is represented in superposition depends on its sparsity — rare features can be superimposed because their interference terms rarely co-activate. This predicts that common features will tend to occupy dedicated dimensions while rare features will be embedded in superposition. Empirical evidence from a one-layer transformer (Bricken et al., 2023) supports this prediction: features with higher activation frequency are more monosemantic (activating for a narrower range of inputs).
3.4 Sparse Autoencoders for Feature Extraction
Superposition implies that directly reading a neuron’s activation conflates multiple features. The sparse autoencoder (SAE) approach (Cunningham et al., 2023; Bricken et al., 2023) attempts to recover monosemantic features by learning a sparse overcomplete basis. Given residual stream activations $x \in \mathbb{R}^d$, an SAE learns:
$$z = \text{ReLU}(W_\text{enc}(x - b_\text{pre}) + b_\text{enc}) \in \mathbb{R}^{d_\text{feat}}$$
$$\hat{x} = W_\text{dec} z + b_\text{pre}$$
with $d_\text{feat} \gg d$ (e.g., $d_\text{feat} = 16d$), trained with objective:
$$\mathcal{L}_{\text{SAE}} = \|x - \hat{x}\|_2^2 + \lambda \|z\|_1$$
The $L_1$ penalty encourages sparse activations, which is the proxy for monosemanticity — if each input activates few features, each active feature must carry a coherent signal. Bricken et al. (2023) applied this to a one-layer transformer and found that the resulting features are substantially more interpretable than raw neurons: they activate for coherent semantic categories (DNA-related tokens, Hebrew characters, legal language), exhibit clean frequency spectra in positional features, and have predictable causal effects when their values are patched.
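A minimal numpy sketch of the SAE forward pass and training objective follows. The shapes and the $16d$ expansion factor come from the text; the initialization scale and $\lambda$ are assumptions, and the parameters are untrained, so this only demonstrates the computation, not learned features:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64                      # residual stream width
d_feat = 16 * d             # overcomplete feature dictionary
lam = 3e-4                  # L1 sparsity coefficient (assumed value)

W_enc = 0.01 * rng.normal(size=(d, d_feat))
W_dec = 0.01 * rng.normal(size=(d_feat, d))
b_enc = np.zeros(d_feat)
b_pre = np.zeros(d)

def sae_forward(x):
    """Encoder with pre-bias subtraction, ReLU codes, linear decoder."""
    z = np.maximum((x - b_pre) @ W_enc + b_enc, 0.0)
    x_hat = z @ W_dec + b_pre
    return z, x_hat

def sae_loss(x):
    """Reconstruction error plus L1 penalty on the feature activations."""
    z, x_hat = sae_forward(x)
    return np.sum((x - x_hat)**2) + lam * np.sum(np.abs(z))

x = rng.normal(size=d)                 # one residual-stream activation
z, x_hat = sae_forward(x)
assert z.shape == (d_feat,) and x_hat.shape == (d,)
assert np.all(z >= 0)                  # ReLU guarantees nonnegative codes
```

In practice the decoder columns are typically constrained to unit norm during training so that the $L_1$ penalty on $z$ cannot be gamed by rescaling, a detail omitted from this sketch.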
3.5 Causal Mediation Analysis and Activation Patching
Interpretability claims are only meaningful if the identified mechanisms are causally relevant, not merely correlated with behavior. The standard tool for establishing causal relevance is activation patching (Vig et al., 2020; Pearl, 2001). Given two inputs $x_\text{clean}$ and $x_\text{corrupted}$ that differ in some controlled way, one patches the activation $a$ at position $(l, h)$ from the clean run into the corrupted run and measures the effect on output logits:
$$\Delta_\text{logit}(l, h) = \mathbb{E}\left[\log p\left(y \mid x_\text{corrupted}, \text{do}(a_{l,h} = a^\text{clean}_{l,h})\right) - \log p(y \mid x_\text{corrupted})\right]$$
This provides a clean operationalization of causal relevance: a high $\Delta_\text{logit}$ indicates that the patched component carries information sufficient to restore the correct behavior in the corrupted run. Wang et al. (2022) use this methodology to identify the full circuit responsible for indirect object identification in GPT-2 small, finding that the circuit consists of approximately 26 attention heads organized into functional groups (name movers, S-inhibition heads, duplicate token heads), and that the circuit alone, with the remainder of the model ablated, recovers most of the model's IOI performance.
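Activation patching is easiest to see in a toy setting. The sketch below uses a hypothetical two-stage numpy "model" (all names and weights invented) in which a single intermediate site mediates the entire output, so patching the clean activation into the corrupted run restores the clean logits exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

def model(x, patch=None):
    """Toy two-stage model; `patch` overrides the intermediate activation."""
    a = np.tanh(x @ W1)        # the activation site being studied
    if patch is not None:
        a = patch              # the intervention do(a = a_clean)
    logits = a @ W2
    return a, logits

x_clean = rng.normal(size=d)
x_corr = rng.normal(size=d)
y = 3                          # index of the "correct" answer token

a_clean, logits_clean = model(x_clean)
_, logits_corr = model(x_corr)
_, logits_patched = model(x_corr, patch=a_clean)

# The patched logit difference on the correct answer:
delta = logits_patched[y] - logits_corr[y]

# Here the site fully mediates the computation, so patching it
# reproduces the clean run's logits exactly.
assert np.allclose(logits_patched, logits_clean)
```

In a real transformer the effect is partial, and the patch is applied at one head and position via forward hooks while the rest of the corrupted run proceeds unchanged; the logic of the intervention is the same.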
4. Discussion
4.1 What Mechanistic Interpretability Has and Has Not Explained
The past four years have produced a handful of genuinely rigorous mechanistic explanations: induction heads, IOI circuits, modular arithmetic via Fourier features, and several attention head taxonomies. These are not merely correlational — they survive causal intervention tests and make testable predictions about behavior on held-out inputs. This is progress.
But it is progress on small, isolated behaviors in small models or in artificial settings. GPT-2 small, the main vehicle for circuits research, has 124M parameters and 12 layers. Frontier models are estimated at $10^{12}$ parameters or more. Whether the circuit-finding methodology scales — whether there exist clean, sparse circuits underlying behavior in frontier models — is not known. There are preliminary reasons for pessimism: as model capacity increases, the same computation can be distributed across more components, making circuits harder to isolate. Moreover, superposition likely intensifies with scale: larger models pack more features into each dimension, so individual directions become harder to interpret even as total representational capacity grows.
4.2 The Polysemanticity Problem
Even setting aside superposition, individual neurons in MLP layers tend to be polysemantic — activating for multiple, apparently unrelated input types. Elhage et al. (2022) document a neuron in GPT-2 that activates for both Python code and base64-encoded strings. This is not noise: the activation patterns are consistent and the neuron has causal effects in both contexts. The SAE approach addresses polysemanticity by finding a basis in which features are monosemantic, but it raises the question of whether the monosemantic features correspond to units that the model “thinks in” or are post-hoc mathematical decompositions that the model has no privileged access to.
4.3 Implications for AI Safety
The safety motivation for mechanistic interpretability is to identify dangerous capabilities or deceptive behaviors that behavioral evaluation misses. Conceivably, a model trained to appear aligned could implement a circuit that produces aligned behavior in observed contexts while planning to behave differently when unobserved. Detecting such a circuit would require understanding how the model represents goals and plans, which is far beyond current mechanistic capabilities.
More immediately, mechanistic methods may be useful for identifying factual knowledge circuits (Meng et al., 2022 showed that factual associations can be localized to specific MLP layers and edited via rank-1 updates), bias circuits, and capability-specific circuits that could be disabled in safety-critical deployments. The ROME and MEMIT methods for model editing (Meng et al., 2022; Meng et al., 2023) represent the most concrete safety application to date, though their reliability degrades on multi-hop factual associations and their edits can have unintended side effects on related associations.
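The flavor of a rank-1 edit can be sketched as a least-squares update that remaps one key vector to a new value vector. This is a deliberate simplification for illustration; ROME's actual update is weighted by a covariance statistic of the key distribution, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 16
W = rng.normal(size=(d_out, d_in))   # an MLP projection to be edited

k = rng.normal(size=d_in)            # key: representation of the subject
v_new = rng.normal(size=d_out)       # desired new output for that key

# Rank-1 update that makes W map k exactly to v_new, leaving the
# action of W on directions orthogonal to k unchanged.
delta = np.outer(v_new - W @ k, k) / (k @ k)
W_edited = W + delta

assert np.allclose(W_edited @ k, v_new)          # edit takes effect
assert np.linalg.matrix_rank(delta) == 1         # the change is rank-1
```

The construction makes the side-effect problem mentioned above visible: any other key with a nonzero component along $k$ is also perturbed, which is one source of the unintended changes to related associations.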
4.4 The Universality Hypothesis
A striking empirical claim in mechanistic interpretability is universality: the same circuits appear across different models and modalities. Olah et al. (2020) found that curve detectors and Gabor filters appear in multiple CNN architectures. More recently, evidence has accumulated that induction heads appear in every transformer model studied, regardless of architecture details or training data. If universality holds broadly, it suggests that specific algorithms are natural attractors in the loss landscape — that the same computational solution is discovered repeatedly because it is optimal or near-optimal for the given task. This would make mechanistic findings more generalizable and would suggest that interpretability insights transfer across model families.
5. Conclusion
Mechanistic interpretability has established a productive scientific methodology and produced genuine empirical findings about how transformers implement specific behaviors. The key advances — the circuits framework, induction head discovery, sparse autoencoder feature extraction, and causal mediation analysis — constitute a toolkit that did not exist five years ago. The IOI circuit analysis and Fourier modular arithmetic findings demonstrate that rigorous mechanistic explanations are achievable for non-trivial behaviors.
However, the field faces fundamental scaling challenges. The superposition phenomenon implies that as models grow larger and represent more features per dimension, direct neuron-level analysis becomes less tractable. SAE methods partially address this but introduce their own approximation errors and interpretive ambiguities. The circuit-finding methodology, while sound for small isolated behaviors, does not obviously scale to explaining complex multi-step reasoning in frontier models.
The most productive near-term research directions are likely: (1) scaling SAE methods to frontier models and evaluating whether the recovered features remain interpretable; (2) developing automated circuit discovery methods that reduce the manual effort currently required; (3) building a theory of universality that predicts which circuits should be universal and which architecture-specific; and (4) connecting mechanistic findings to safety-relevant model behaviors such as sycophancy, deceptive reasoning, and capability concealment.
Mechanistic interpretability will not single-handedly solve AI alignment, but it provides the only current path toward understanding what models are actually computing — which is a precondition for trusting them.
References
- Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
- Cunningham, H., Ewart, A., Riggs, L., Huben, R., & Sharkey, L. (2023). Sparse Autoencoders Find Highly Interpretable Features in Language Models. arXiv:2309.08600.
- Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A Mathematical Framework for Transformer Circuits. Transformer Circuits Thread.
- Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M., & Olah, C. (2022). Toy Models of Superposition. Transformer Circuits Thread.
- Hewitt, J., & Liang, P. (2019). Designing and Interpreting Probes with Control Tasks. EMNLP 2019.
- Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). ICML 2018.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and Editing Factual Associations in GPT. NeurIPS 2022.
- Meng, K., Sharma, A., Andonian, A., Belinkov, Y., & Bau, D. (2023). Mass-Editing Memory in a Transformer. ICLR 2023.
- Nanda, N., Chan, L., Lieberum, T., Smith, J., & Steinhardt, J. (2023). Progress Measures for Grokking via Mechanistic Interpretability. ICLR 2023.
- Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom In: An Introduction to Circuits. Distill.
- Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2022). In-context Learning and Induction Heads. Transformer Circuits Thread.
- Pearl, J. (2001). Direct and Indirect Effects. Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence.
- Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N., McDougall, C., MacDiarmid, M., Freeman, C., Sumers, T., Rees, E., Batson, J., Jermyn, A., Carter, S., Henighan, T., & Olah, C. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread.
- Tenney, I., Das, D., & Pavlick, E. (2019). BERT Rediscovers the Classical NLP Pipeline. ACL 2019.
- Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., & Shieber, S. (2020). Investigating Gender Bias in Language Models Using Causal Mediation Analysis. NeurIPS 2020.
- Wang, K., Variengien, A., Conmy, A., Shlegeris, B., & Steinhardt, J. (2022). Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small. ICLR 2023.
- Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. ECCV 2014.