Sparse Autoencoders for Mechanistic Interpretability: Feature Discovery, Superposition, and the Dictionary Learning Approach to Language Model Internals

Abstract

Understanding the internal representations of large language models (LLMs) is a central challenge in mechanistic interpretability. A dominant hypothesis holds that models represent more features than they have neurons, encoding them in superposed, polysemantic activations that resist straightforward analysis. Sparse autoencoders (SAEs) have emerged as a principled tool for decomposing these representations into interpretable, monosemantic features via dictionary learning. By training an overcomplete linear dictionary with sparsity constraints over model activations, SAEs recover features that align with human-interpretable concepts, circuit motifs, and causal mechanisms. This paper reviews the theoretical foundations of SAE-based interpretability, surveys key empirical findings, analyzes the mathematical structure of superposition and reconstruction fidelity, and discusses open problems including feature completeness, absorption artifacts, and the scalability of SAE-based auditing to frontier models. We argue that sparse autoencoders represent a foundational methodology for aligning our understanding of neural computation with the high-dimensional geometry of transformer residual streams.

1. Introduction

Modern large language models trained at scale exhibit surprising capabilities—reasoning, code synthesis, multilingual understanding—yet remain largely opaque at the level of individual computations. Mechanistic interpretability seeks to reverse-engineer these systems: to identify what computations occur, where they occur, and why they produce observed behaviors. Early work focused on attention pattern visualization (Vig, 2019) and probing classifiers (Belinkov, 2022), but these methods largely describe correlational structure rather than causal mechanisms.

A more foundational challenge was identified by Elhage et al. (2022) in their analysis of superposition: the phenomenon whereby neural networks represent more features than they have dimensions by encoding features as directions in an overcomplete basis, with interference managed through sparsity of activation. This insight implies that individual neurons are not the right unit of analysis—they are polysemantic, activating for multiple unrelated concepts. Understanding a model’s knowledge requires recovering the underlying monosemantic features, not reading off neuron activations.

Sparse autoencoders offer a data-driven solution to this problem. Trained on the activations of a model layer, an SAE learns an overcomplete dictionary $\mathbf{D} \in \mathbb{R}^{d \times m}$ with $m \gg d$, and a sparse encoder $f$ such that for any activation $\mathbf{x} \in \mathbb{R}^d$, the reconstruction $\hat{\mathbf{x}} = \mathbf{D} f(\mathbf{x})$ minimizes reconstruction error while maintaining sparsity in $f(\mathbf{x})$. Each column of $\mathbf{D}$—a dictionary atom—can be interpreted as a candidate feature direction, and the entries of $f(\mathbf{x})$ indicate which features are active for a given input.

The appeal of this approach is substantial. Unlike probing classifiers, which require labeled data and measure only pre-specified concepts, SAEs are unsupervised and can in principle discover the full set of features a model uses. Unlike activation patching, which identifies where information is stored but not what that information is, SAEs identify the content of representations. The combination of these methods—using SAEs to identify features, then patching them to verify causal roles—constitutes one of the most promising frameworks in current interpretability research.

This paper is organized as follows. Section 2 reviews related work on interpretability and dictionary learning. Section 3 provides a technical analysis of the SAE objective, superposition geometry, and feature quality metrics. Section 4 discusses empirical findings and open problems. Section 5 concludes with directions for future research.

2. Related Work

Elhage et al. (2022) — “Toy Models of Superposition” (Transformer Circuits Thread, Anthropic) established the theoretical framework for understanding superposition in neural networks. Using low-dimensional models trained on synthetic tasks, the authors demonstrated that networks can encode far more nearly orthogonal features than they have dimensions when features are sparsely active, creating a theoretical basis for dictionary-learning-based decomposition.

Bricken et al. (2023) — “Towards Monosemanticity: Decomposing Language Models With Dictionary Learning” (Anthropic) provided the first large-scale application of SAEs to a transformer language model, training on the MLP activations of a one-layer transformer. The work demonstrated recovery of features corresponding to identifiable human concepts including DNA sequences, legal terminology, and emotional tone, and introduced automated interpretability scoring to evaluate feature quality at scale.

Cunningham et al. (2023) — “Sparse Autoencoders Find Highly Interpretable Features in Language Models” (arXiv:2309.08600) extended SAE analysis to GPT-2 small and medium, showing that SAE features exhibit stronger monosemanticity, causal influence on model outputs, and coverage of known mechanistic circuits compared to raw neuron activations. This paper introduced the reconstruction-interpretability Pareto frontier as a key evaluation framework.

Templeton et al. (2024) — “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet” (Anthropic) trained SAEs on Claude 3 Sonnet with up to 34 million features, finding that scaling SAE width dramatically improved feature quality and coverage. The paper identified multimodal feature distributions and provided evidence that attention outputs and residual stream activations encode distinct feature sets.

Gao et al. (2024) — “Scaling and evaluating sparse autoencoders” (OpenAI, arXiv:2406.04093) conducted a complementary large-scale study, introducing the TopK SAE architecture that enforces exact sparsity at training time, improving training stability and enabling precise control over the reconstruction-sparsity tradeoff. The authors also introduced quantitative automated interpretability evaluation using LLM-based scoring.

Lieberum et al. (2024) — “Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2” (arXiv:2408.05147) released a comprehensive set of SAEs trained on all layers and sublayers of Gemma 2 models, providing the first truly systematic cross-layer feature analysis and enabling the research community to study feature evolution across depth.

Wright and Sharkey (2024) — “Addressing Feature Suppression in SAEs” introduced the concept of feature absorption—where one SAE feature absorbs part of another’s role—and proposed training modifications to mitigate it, highlighting that naive SAE training can produce systematically misleading feature decompositions.

3. Technical Analysis

3.1 The Superposition Hypothesis and Its Geometry

Let a neural network layer maintain an activation vector $\mathbf{x} \in \mathbb{R}^d$. Suppose the model has learned to represent $m$ binary or near-sparse features $z_1, \ldots, z_m$ with $m \gg d$. Under the superposition hypothesis (Elhage et al., 2022), the representation is approximately:

$$\mathbf{x} \approx \sum_{i=1}^{m} z_i \mathbf{f}_i$$

where $\mathbf{f}_i \in \mathbb{R}^d$ are feature vectors. When features are sparse—i.e., few $z_i$ are simultaneously nonzero—the expected interference between features is low even if $|\mathbf{f}_i^\top \mathbf{f}_j|$ is nonzero for $i \neq j$. The pairwise interference $\mathbf{f}_i^\top \mathbf{f}_j$ contributes to reconstruction error only when both $z_i$ and $z_j$ are simultaneously active.

More precisely, let each feature be active independently with probability $p \ll 1$. The expected squared interference in recovering feature $i$ from $\mathbf{x}$ is, to leading order in $p$ (cross terms contribute only at $O(p^2)$):

$$\mathbb{E}\left[\left(\sum_{j \neq i} z_j (\mathbf{f}_i^\top \mathbf{f}_j)\right)^2\right] \approx p \sum_{j \neq i} (\mathbf{f}_i^\top \mathbf{f}_j)^2$$

For a uniform nearly-orthogonal frame with $\|\mathbf{f}_i\|=1$ and $m$ features in $\mathbb{R}^d$, the average squared inner product is approximately $1/d$, giving expected interference $\approx p(m-1)/d$. This quantity can be small even when $m \gg d$, provided $p \ll d/m$. The implication is profound: the network can store vastly more features than dimensions by exploiting sparsity, with tolerable reconstruction fidelity.
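This scaling can be checked numerically. The sketch below (dimensions and activation probability chosen arbitrarily for illustration) samples random unit-norm feature directions, forms superposed activations, and compares measured interference against the $p(m-1)/d$ prediction:

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, p = 64, 1024, 0.01          # dimensions, features, activation probability
F = rng.standard_normal((m, d))
F /= np.linalg.norm(F, axis=1, keepdims=True)  # unit-norm feature directions

# Sample sparse binary feature activations and form superposed representations.
trials = 2000
Z = (rng.random((trials, m)) < p).astype(float)
X = Z @ F                          # x = sum_i z_i f_i

# Interference in recovering feature 0: f_0^T x minus the signal term z_0
# (valid because ||f_0|| = 1, so f_0^T f_0 = 1).
readout = X @ F[0]
interference = readout - Z[:, 0]
measured = np.mean(interference**2)
predicted = p * (m - 1) / d        # theoretical leading-order interference

print(measured, predicted)
```

With these settings the prediction is roughly 0.16, and the measured value should land close to it; shrinking $p$ or growing $d$ drives both toward zero even though $m \gg d$.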

3.2 The Sparse Autoencoder Objective

Given a set of model activations $\{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ at a particular layer, the SAE learns an encoder $f_\theta: \mathbb{R}^d \to \mathbb{R}^m_+$ and a dictionary $\mathbf{D} \in \mathbb{R}^{d \times m}$ by minimizing:

$$\mathcal{L}(\theta, \mathbf{D}) = \frac{1}{N}\sum_{n=1}^{N} \left( \left\| \mathbf{x}_n - \mathbf{D} f_\theta(\mathbf{x}_n) \right\|_2^2 + \lambda \left\| f_\theta(\mathbf{x}_n) \right\|_1 \right)$$

The ReLU encoder with learned bias is given by $f_\theta(\mathbf{x}) = \text{ReLU}(\mathbf{W}_\text{enc}(\mathbf{x} - \mathbf{b}_\text{pre}) + \mathbf{b}_\text{enc})$, where $\mathbf{W}_\text{enc} \in \mathbb{R}^{m \times d}$, $\mathbf{b}_\text{enc} \in \mathbb{R}^m$, and $\mathbf{b}_\text{pre} \in \mathbb{R}^d$. The $\ell_1$ penalty promotes sparsity, driving most encoder outputs to zero for any given input. The columns of $\mathbf{D}$ are constrained to unit norm to prevent the trivial solution of scaling down $\mathbf{D}$ while scaling up the encoder weights.
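A minimal NumPy sketch of this objective, following the formulas above. The initialization and the hyperparameter values are illustrative; real implementations train with SGD and renormalize dictionary columns after each step, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 32, 256                     # activation dim, dictionary size (m >> d)

# Parameters, named as in the text.
W_enc = rng.standard_normal((m, d)) * 0.1
b_enc = np.zeros(m)
b_pre = np.zeros(d)
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0, keepdims=True)   # unit-norm dictionary columns

def encode(x):
    """f_theta(x) = ReLU(W_enc (x - b_pre) + b_enc)."""
    return np.maximum(0.0, W_enc @ (x - b_pre) + b_enc)

def sae_loss(X, lam=1e-3):
    """Mean squared reconstruction error plus l1 sparsity penalty."""
    codes = np.maximum(0.0, (X - b_pre) @ W_enc.T + b_enc)
    recon = codes @ D.T
    mse = np.mean(np.sum((X - recon) ** 2, axis=1))
    l1 = np.mean(np.sum(codes, axis=1))  # codes are nonnegative, so sum = l1
    return mse + lam * l1

X = rng.standard_normal((128, d))   # stand-in for model activations
print(sae_loss(X))
```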

An alternative is the TopK SAE (Gao et al., 2024), which replaces the $\ell_1$ penalty with a hard constraint on sparsity:

$$f_\theta(\mathbf{x}) = \text{TopK}(\mathbf{W}_\text{enc}\mathbf{x} + \mathbf{b}_\text{enc}, k)$$

where $\text{TopK}$ retains only the $k$ largest pre-activations and zeros the rest. This provides exact, deterministic sparsity: the per-token sparsity level is set directly by $k$ rather than indirectly through the penalty strength $\lambda$, giving precise control over the reconstruction-sparsity tradeoff. Empirically, TopK training is more stable and produces fewer dead features (dictionary atoms that are never activated).
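The TopK activation itself is a few lines of NumPy. This sketch follows the formula above; production implementations differ in details (e.g., pre-encoder bias subtraction and batched selection):

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_encode(x, W_enc, b_enc, k):
    """Retain the k largest pre-activations, zero the rest."""
    pre = W_enc @ x + b_enc
    codes = np.zeros_like(pre)
    idx = np.argpartition(pre, -k)[-k:]   # indices of the k largest entries
    codes[idx] = pre[idx]
    return codes

d, m, k = 32, 256, 8
W_enc = rng.standard_normal((m, d))
b_enc = np.zeros(m)
x = rng.standard_normal(d)
codes = topk_encode(x, W_enc, b_enc, k)
print(np.count_nonzero(codes))   # exactly k features active
```

Note that no separate ReLU is needed when the $k$ largest of $m \gg k$ pre-activations are positive, which is the typical case.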

3.3 Feature Quality Metrics

Evaluating whether SAE features are meaningful requires multiple complementary metrics:

Reconstruction Fidelity: The fraction of variance in $\mathbf{x}$ explained by $\hat{\mathbf{x}} = \mathbf{D} f(\mathbf{x})$, measured by $1 - \|\mathbf{x} - \hat{\mathbf{x}}\|^2 / \|\mathbf{x}\|^2$. High reconstruction fidelity is necessary but not sufficient for interpretability.

L0 Sparsity: The average number of nonzero entries in $f(\mathbf{x})$, i.e., $\mathbb{E}[\|f(\mathbf{x})\|_0]$. A well-trained SAE should have $L_0 \ll m$; typical values range from 20 to 100 active features per token out of tens of thousands of dictionary atoms.

Dead Feature Rate: The fraction of dictionary atoms that have zero activation across a large evaluation set. High dead rates indicate wasted dictionary capacity and can arise from poor initialization or overly aggressive $\ell_1$ penalty.

Automated Interpretability Score: Following Bricken et al. (2023), features can be evaluated by asking a capable LLM to (1) generate a description of the top activating examples for a feature, and (2) predict which examples from a held-out set would activate the feature. The correlation between predicted and actual activations provides a quantitative measure of human-interpretable structure. Gao et al. (2024) operationalized this as a Pearson correlation score, enabling comparison across SAE configurations at scale.

Causal Faithfulness: A feature is causally faithful if intervening on its activation—adding or subtracting the corresponding dictionary column from the residual stream—predictably changes model outputs in the direction consistent with the feature’s semantic interpretation. This can be measured via activation patching experiments or steering vector interventions.
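The first three metrics are mechanical to compute. A sketch using the formulas as stated above (note that some implementations mean-center the activations before computing explained variance; the synthetic codes here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def fidelity(X, X_hat):
    """Fraction of variance explained: 1 - ||X - X_hat||^2 / ||X||^2."""
    return 1.0 - np.sum((X - X_hat) ** 2) / np.sum(X ** 2)

def l0_sparsity(codes):
    """Mean number of active (nonzero) features per input."""
    return np.count_nonzero(codes, axis=1).mean()

def dead_feature_rate(codes):
    """Fraction of dictionary atoms that never fire on the evaluation set."""
    return np.mean(codes.max(axis=0) == 0.0)

# Synthetic codes: 1000 inputs, 512-atom dictionary; ReLU(gaussian - 2)
# makes each entry active with probability ~0.023, i.e. ~12 per input.
codes = np.maximum(0.0, rng.standard_normal((1000, 512)) - 2.0)
print(l0_sparsity(codes), dead_feature_rate(codes))
```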

3.4 The Reconstruction-Sparsity Pareto Frontier

There is an inherent tension between reconstruction fidelity and sparsity. As $\lambda$ increases (or $k$ decreases), the SAE uses fewer features per input, reducing interference but also reducing reconstruction quality. This Pareto frontier can be traced by training SAEs at multiple $\lambda$ values and plotting $\mathbb{E}[\|\mathbf{x} – \hat{\mathbf{x}}\|^2]$ against $\mathbb{E}[\|f(\mathbf{x})\|_0]$.
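The shape of this frontier can be illustrated without full SAE training by sweeping the sparsity level of a simple greedy coder on a fixed random dictionary. This is a stand-in for the actual procedure (which retrains at each $\lambda$), but it exhibits the same tradeoff: reconstruction error falls as the number of allowed atoms grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 32, 128
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms

def code_topk(x, k):
    """Greedy coder: take the k atoms most correlated with x,
    then least-squares fit the coefficients on that support."""
    support = np.argsort(np.abs(D.T @ x))[-k:]
    z = np.zeros(m)
    z[support] = np.linalg.lstsq(D[:, support], x, rcond=None)[0]
    return z

X = rng.standard_normal((64, d))
frontier = []
for k in (2, 4, 8, 16, 32):
    Z = np.stack([code_topk(x, k) for x in X])
    mse = np.mean(np.sum((X - Z @ D.T) ** 2, axis=1))
    frontier.append((k, mse))   # trace out (sparsity, error) pairs
print(frontier)
```

Because the supports are nested as $k$ grows and coefficients are refit by least squares, the error is monotonically nonincreasing along the sweep.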

Cunningham et al. (2023) observed that SAE features on the Pareto frontier systematically outperform individual neurons in automated interpretability scoring—i.e., for a given sparsity level, SAE features are more interpretable than the neuron basis. This is consistent with the superposition hypothesis: neuron activations are mixtures of features, while SAE features more closely align with the model’s underlying representational primitives.

Importantly, Templeton et al. (2024) found that scaling SAE width $m$ (with fixed $d$) consistently improves both reconstruction fidelity and interpretability at a given sparsity, suggesting that a larger dictionary allows finer-grained feature decomposition without sacrificing coverage.

3.5 Theoretical Connections to Compressed Sensing

The SAE problem is closely related to sparse recovery in compressed sensing (Candès and Wakin, 2008). Given a measurement $\mathbf{x} = \mathbf{D}\mathbf{z}$ where $\mathbf{z}$ is $k$-sparse, the Restricted Isometry Property (RIP) guarantees exact recovery of $\mathbf{z}$ via $\ell_1$ minimization when $\mathbf{D}$ satisfies $\delta_{2k} < \sqrt{2} - 1$, where $\delta_{2k}$ is the RIP constant. However, the SAE setting differs in important ways: the dictionary $\mathbf{D}$ is learned from data rather than known in advance, and the ground-truth sparse code $\mathbf{z}$ is never directly observed. The learner must simultaneously identify the dictionary and recover the sparse codes, a problem known as dictionary learning or sparse coding (Olshausen and Field, 1997).
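Verifying RIP directly is computationally intractable in general, but mutual coherence, the largest absolute inner product between distinct unit-norm atoms, is a cheap proxy and yields the classical recovery bound $k < (1 + 1/\mu)/2$ (Donoho and Elad, 2003). A sketch on a random dictionary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 512
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)    # unit-norm columns

# Mutual coherence: max |<d_i, d_j>| over distinct columns.
G = np.abs(D.T @ D)
np.fill_diagonal(G, 0.0)
mu = G.max()

# Coherence-based guarantee: l1 recovery succeeds for any k-sparse code
# when k < (1 + 1/mu) / 2.
k_max = (1 + 1 / mu) / 2
print(mu, k_max)
```

For a random overcomplete dictionary at this scale, the coherence bound is notoriously pessimistic (it certifies only very small $k$), which is one reason RIP-style arguments rather than coherence dominate the compressed sensing literature.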

The convergence properties of alternating minimization for dictionary learning are well-studied under incoherence assumptions (Arora et al., 2015), but these theoretical results typically assume the generative model exactly matches the sparse coding objective. For language model activations, there is no guarantee that the true generative process is exactly sparse linear, and the degree to which SAE features correspond to true underlying features rather than approximate reconstructions remains an open question.

4. Discussion

4.1 Empirical Findings and Successes

SAE-based interpretability has produced several striking empirical results. Bricken et al. (2023) identified features in a one-layer MLP that exhibited clear conceptual specialization: one feature activated for input tokens related to legal documents, another for numerical quantities in financial contexts, a third for biological sequence notation. Critically, these features were often active for very different surface-level token patterns that shared underlying semantic content, suggesting the SAE recovered genuine abstractions rather than surface statistics.

Templeton et al. (2024) extended this to Claude 3 Sonnet and found features corresponding to emotionally charged concepts, including features associated with the model’s sense of identity. The paper’s finding of a “Golden Gate Bridge” feature—active for references to the bridge across multiple languages and modalities—became widely discussed as evidence that SAEs can recover factual world-model representations rather than purely syntactic patterns.

The Gemma Scope release (Lieberum et al., 2024) enabled cross-layer analysis, revealing that features evolve significantly across depth: early layers encode primarily syntactic and positional features, while later layers encode semantic and task-relevant features. Some features persist across many layers while others are transient, suggesting distinct processing phases consistent with the residual stream view of transformer computation.

4.2 Feature Absorption and Other Failure Modes

Despite these successes, SAE-based interpretability faces serious methodological challenges. Feature absorption (Wright and Sharkey, 2024) occurs when one feature in the dictionary absorbs activations that conceptually belong to another. For example, if the model represents “Paris” and “capital of France” as correlated features, an SAE might learn a single combined feature rather than separating them, incorrectly suggesting the model lacks the finer-grained representation. Absorption is difficult to detect automatically and may lead to systematically incorrect conclusions about model knowledge.

A related problem is feature splitting: as the dictionary size increases, previously unified features may split into subtly different variants (e.g., “positive sentiment in news text” and “positive sentiment in reviews”). Whether this reflects genuine latent structure or overfitting to distributional statistics of the training corpus is unclear. The appropriate granularity of feature decomposition is not a priori obvious and may depend on the downstream application.

There are also training stability concerns. SAE training is sensitive to initialization, learning rate scheduling, and the balance between the reconstruction and sparsity terms. Dead features can consume large fractions of dictionary capacity if sparsity is too aggressive, while insufficient sparsity produces polysemantic features that behave like individual neurons. The TopK architecture (Gao et al., 2024) mitigates some of these issues, but optimal training configurations remain empirically discovered rather than theoretically derived.

4.3 Scalability and the Frontier Model Challenge

A critical open question is whether SAE-based interpretability scales gracefully to frontier models. Templeton et al. (2024) demonstrated SAEs on Claude 3 Sonnet, a large production model, and found qualitatively similar results to smaller-scale experiments. However, exhaustive coverage of features in a frontier model may require dictionaries of billions of atoms—a scale at which the computational cost of training, storing, and reasoning over SAE features becomes formidable.

Moreover, the automated interpretability scoring methodology itself relies on capable LLMs to evaluate feature interpretability, creating a potential circularity: the interpretability of large models is assessed using similarly capable models, and systematic biases in the evaluator could produce misleadingly optimistic scores. Developing evaluator-independent measures of feature quality is an important methodological priority.

4.4 Connections to Circuit Analysis

SAEs are most powerful when combined with circuit-level analysis (Conmy et al., 2023; Wang et al., 2022). Circuits identify the computational pathways through a model—which attention heads and MLP layers contribute to a specific behavior, and via which information channels. SAEs can be applied to the activations along these pathways to identify what information is being computed, complementing circuit analysis’s identification of where computation occurs. The combination has been applied to tasks including indirect object identification, modular arithmetic, and factual recall, providing mechanistic accounts that extend from circuit topology to representational content.

A promising frontier is using SAE features as a causal graph vocabulary: representing model computations as transformations on interpretable feature activations, enabling formal analysis of information flow, feature composition, and the conditions under which specific behaviors are triggered or suppressed. This vision requires both methodological advances in SAE training and new theoretical frameworks connecting sparse feature representations to circuit-level computation.

5. Conclusion

Sparse autoencoders have established themselves as a foundational tool in mechanistic interpretability, providing a principled, scalable methodology for decomposing polysemantic neural representations into interpretable features. The theoretical basis in superposition geometry, the empirical evidence for feature quality and causal relevance, and the ability to scale to production-scale models make SAEs one of the most promising avenues for understanding what language models know and how they compute it.

Yet the field remains early-stage. Core open problems—feature absorption, appropriate dictionary size, completeness of coverage, scalability of evaluation, and the integration with circuit-level analysis—require sustained theoretical and empirical attention. The development of gold-standard benchmarks for feature quality, independent of LLM-based automated evaluation, is particularly pressing.

Looking forward, SAE-based interpretability may prove transformative for AI safety. If we can reliably identify the features that activate during dangerous reasoning patterns, encode misleading self-representations, or drive capability elicitation, we gain a concrete handle on model behavior that probing and behavioral evaluation alone cannot provide. The path from current SAE methodology to that goal is long, but the foundational work reviewed here suggests it is tractable.
