Knowledge Localization and Model Editing in Large Language Models: Causal Tracing, ROME, MEMIT, and the Geometry of Factual Memory

Abstract

Large language models (LLMs) encode vast amounts of factual knowledge in their parameters during pretraining. Understanding where and how this knowledge is stored is a fundamental question in mechanistic interpretability with significant practical consequences. This paper reviews the literature on knowledge localization and model editing in transformer-based LLMs. We examine causal tracing methodology, the Rank-One Model Editing (ROME) algorithm, its batch extension MEMIT, and more recent approaches including WISE and retrieval-augmented editing. We analyze the theoretical foundations of these methods, their empirical success and failure modes, and the open challenges that remain before model editing can serve as a reliable mechanism for knowledge updating in deployed systems.

1. Introduction

Transformer-based language models such as GPT-4, LLaMA-3, and Mistral encode billions of factual associations implicitly within their weight matrices through large-scale pretraining on web-scale corpora. Unlike symbolic knowledge bases, where facts are stored as explicit records, LLMs distribute factual knowledge across layers and heads in a manner that was, until recently, largely opaque. This creates a fundamental challenge: when a fact changes in the world — a company changes its CEO, a country changes its capital, a scientific consensus shifts — how should one update the model without full retraining?

The problem of model editing addresses precisely this question. A model editor should insert or modify a targeted fact $(s, r, o)$ (subject, relation, object) so that the model returns the new object $o'$ when queried about the subject-relation pair, without degrading performance on unrelated knowledge, and without applying the edit inconsistently (e.g., updating “The CEO of OpenAI is Sam Altman” but failing to propagate that fact when the query is paraphrased). The ideal edit is specific, generalizing, and consistent (criteria formalized by Meng et al., 2022).

This paper is organized as follows. Section 2 reviews prior work on knowledge representation and editing in neural networks. Section 3 provides a technical analysis of causal tracing and the ROME/MEMIT framework, including their mathematical formulations. Section 4 discusses real-world deployment considerations, failure modes, and emerging alternatives. Section 5 concludes with a synthesis of current understanding and open problems.

2. Related Work

The question of where knowledge lives in neural networks has been studied from multiple angles. Petroni et al. (2019) demonstrated that BERT-family models store relational knowledge accessible via cloze-style prompts, establishing the “language model as knowledge base” paradigm and motivating the need to understand and manipulate this knowledge [1].

Early model editing approaches relied on fine-tuning a small subset of parameters for the target fact. Zhu et al. (2020) proposed constrained fine-tuning, which bounds the norm of the weight change during editing to limit collateral damage [2]; later hypernetwork-based editors such as MEND (Mitchell et al., 2022) learn to convert gradient information into localized parameter updates, enabling efficient single-step edits. However, gradient-based approaches often fail at locality: fine-tuning on one fact can degrade related or unrelated knowledge.

De Cao et al. (2021) introduced KnowledgeEditor, a hypernetwork trained to predict targeted weight updates from a single edit example [3]. In a related line, Dai et al. (2022) argued that specific neurons in the feed-forward network (FFN) layers function as “knowledge neurons” that gate fact retrieval, and proposed a gradient-attribution method to identify and surgically modify these neurons, predating the more systematic causal tracing approach.

Meng et al. (2022) provided the first rigorous causal analysis, establishing that factual associations in GPT-style models are predominantly stored in the mid-layer MLP weights, and introduced ROME as a principled editing algorithm based on rank-one updates to key-value associations in the FFN [4]. This work became the foundation for the field.

Meng et al. (2023) scaled ROME to batch editing with MEMIT (Mass-Editing Memory In a Transformer), enabling thousands of simultaneous edits while preserving model coherence [5]. MEMIT distributes updates across multiple layers, addressing the rank-one bottleneck of ROME.

Huang et al. (2023) systematically evaluated 12 editing methods on factual recall, multi-hop reasoning, and consistency across paraphrases, revealing that most methods fail on at least one criterion and that no single approach dominates [6]. Their EasyEdit framework has become a standard evaluation resource.

Wang et al. (2024) introduced WISE, a routing-based approach that stores edits in a separate side-memory module, directing relevant queries to the edit store and all others to the original model [7]. This architectural separation avoids interference between edits and general capabilities.

3. Technical Analysis

3.1 Causal Tracing

The causal tracing method of Meng et al. (2022) is grounded in the framework of causal mediation analysis. Given a factual prompt $x$ (e.g., “The Eiffel Tower is located in”), the method identifies which internal computations are causally responsible for the model predicting the correct answer $o$ (e.g., “Paris”).

Formally, let $\mathbf{h}_i^{(l)}$ denote the hidden state at token position $i$ and layer $l$. A clean run processes the original prompt, while a corrupted run adds noise $\epsilon$ to the embeddings of the subject tokens $s$ to disrupt factual recall. The causal effect of restoring a single hidden state from the clean run is measured as:

$$\text{IE}(i, l) = P_{\text{restored}}[o] - P_{\text{corrupted}}[o]$$

where $P_{\text{restored}}[o]$ is the probability of the correct answer after restoring $\mathbf{h}_i^{(l)}$ to its clean value in the otherwise corrupted run. High indirect effect values pinpoint the hidden states that are both necessary and sufficient for the factual prediction.
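The restore-and-measure loop can be illustrated on a toy stand-in for a transformer. Everything below is an illustrative sketch, not the original experimental setup: the "model" is a stack of causal token averaging (a crude attention substitute) plus a per-token linear layer and tanh, the "subject" is the first two token positions, and dimensions and noise scales are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: each layer causally averages tokens,
# then applies a linear map + tanh; a softmax readout over a small
# vocabulary sits at the last position.
n_tokens, d, n_layers, vocab = 5, 16, 8, 10
W_layers = [rng.normal(0, 0.4, (d, d)) for _ in range(n_layers)]
W_out = rng.normal(0, 0.4, (vocab, d))
M = np.tril(np.ones((n_tokens, n_tokens)))
M /= M.sum(axis=1, keepdims=True)          # causal averaging matrix
emb = rng.normal(0, 1.0, (n_tokens, d))    # "clean" prompt embeddings

def run(embeddings, patch=None):
    """Forward pass; patch=(i, l, state) restores hidden state i after layer l."""
    h = embeddings.copy()
    states = []
    for l, W in enumerate(W_layers):
        h = np.tanh(M @ h @ W.T)
        if patch is not None and patch[1] == l:
            h = h.copy()
            h[patch[0]] = patch[2]
        states.append(h.copy())
    logits = W_out @ h[-1]                 # readout at the last token
    p = np.exp(logits - logits.max())
    return p / p.sum(), states

# Clean run: cache hidden states and take the model's answer as "o".
p_clean, clean_states = run(emb)
o = int(p_clean.argmax())

# Corrupted run: noise the "subject" tokens (positions 0 and 1).
emb_corr = emb.copy()
emb_corr[:2] += rng.normal(0, 3.0, (2, d))
p_corr, _ = run(emb_corr)

# Indirect effect of restoring each (token, layer) hidden state.
IE = np.zeros((n_tokens, n_layers))
for i in range(n_tokens):
    for l in range(n_layers):
        p_rest, _ = run(emb_corr, patch=(i, l, clean_states[l][i]))
        IE[i, l] = p_rest[o] - p_corr[o]
```

Scanning the resulting `IE` grid over token positions and layers is the toy analogue of the causal tracing heatmaps in Meng et al.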

Empirically, causal tracing reveals a striking pattern: for GPT-style autoregressive models, the highest indirect effects are concentrated in the MLP layers at mid-depth (roughly layers 15–20 in GPT-2 XL) when the hidden state corresponds to the last token of the subject. This finding has been replicated across model families, though the precise layer depth shifts with model scale.

3.2 The FFN as Key-Value Memory

Geva et al. (2021) provided a complementary interpretation: each FFN layer operates as a two-layer MLP with weight matrices $W_{\text{fc}}$ (keys) and $W_{\text{proj}}$ (values). Specifically, for input $\mathbf{x}$:

$$\text{FFN}(\mathbf{x}) = W_{\text{proj}} \cdot \sigma(W_{\text{fc}} \cdot \mathbf{x})$$

Each row $\mathbf{k}_j$ of $W_{\text{fc}}$ acts as a key that fires when the input matches it, and the corresponding column $\mathbf{v}_j$ of $W_{\text{proj}}$ constitutes the value that is added to the residual stream. Factual associations can thus be thought of as key-value pairs stored in these weight matrices: the key pattern corresponds to the subject-relation context, and the value encodes the object representation.
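This reading is not a metaphor but an algebraic identity: the matrix form of the FFN and the explicit sum over key-value pairs are the same computation. The dimensions and the ReLU nonlinearity below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff = 8, 32
W_fc = rng.normal(size=(d_ff, d_model))    # rows = keys k_j
W_proj = rng.normal(size=(d_model, d_ff))  # columns = values v_j
x = rng.normal(size=d_model)

sigma = lambda z: np.maximum(z, 0.0)       # ReLU stand-in for the nonlinearity

# Matrix form of the FFN.
ffn_out = W_proj @ sigma(W_fc @ x)

# Equivalent key-value view: each neuron j contributes its value vector
# v_j (column j of W_proj), weighted by how strongly its key k_j
# (row j of W_fc) matches the input.
kv_out = sum(sigma(W_fc[j] @ x) * W_proj[:, j] for j in range(d_ff))

assert np.allclose(ffn_out, kv_out)
```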

3.3 ROME: Rank-One Model Editing

ROME exploits the key-value memory structure by formulating editing as an optimization problem over the weight matrix of a single MLP layer. Given a target edit $(s, r, o \to o')$, the goal is to find an updated $\hat{W}$ such that:

$$\hat{W} \mathbf{k}^* = \mathbf{v}^*$$

where $\mathbf{k}^*$ is the expected FFN key activation at the last token of the subject $s$ in the target layer (averaged over a sample of prefixed contexts containing $s$), and $\mathbf{v}^*$ is a target value vector computed by minimizing the cross-entropy loss on the new completion $o'$.

The solution takes the form of a rank-one update (analogous to the Sherman-Morrison formula):

$$\hat{W} = W + \frac{(\mathbf{v}^* - W\mathbf{k}^*)\mathbf{k}^{*\top} C^{-1}}{\mathbf{k}^{*\top} C^{-1} \mathbf{k}^*}$$

where $C = KK^\top$ is an empirical covariance matrix computed from key vectors $K$ cached over a set of representative inputs. This covariance term acts as a regularizer that preserves the model’s behavior on the typical input distribution: directions of key space that are frequently activated are perturbed as little as possible, while the update acts fully along $\mathbf{k}^*$.

The rank-one update ensures that the edited model maps $\mathbf{k}^*$ exactly to $\mathbf{v}^*$ with minimal interference to other stored associations. However, the single-layer, rank-one construction limits ROME to one edit at a time: the weight matrix can accommodate only a limited number of independent rank-one modifications before the edited keys begin to interfere with one another.
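The exactness of the update is easy to verify numerically. The sketch below uses random stand-ins for the weight matrix, the cached keys, and the edit pair; it reproduces only the closed-form update above, not the full ROME pipeline (which derives $\mathbf{k}^*$ and $\mathbf{v}^*$ from the model itself).

```python
import numpy as np

rng = np.random.default_rng(2)
d_k, d_v = 64, 48                     # key / value dims of the edited matrix
W = rng.normal(0, 0.1, (d_v, d_k))

# Empirical key covariance C = E[k k^T], estimated from cached keys.
K_cache = rng.normal(size=(d_k, 10_000))
C = (K_cache @ K_cache.T) / K_cache.shape[1]

# Target edit: map key k* to new value v*.
k_star = rng.normal(size=d_k)
v_star = rng.normal(size=d_v)

# Rank-one update:
#   W_hat = W + (v* - W k*) (C^{-1} k*)^T / (k*^T C^{-1} k*)
Cinv_k = np.linalg.solve(C, k_star)
W_hat = W + np.outer(v_star - W @ k_star, Cinv_k) / (k_star @ Cinv_k)

assert np.allclose(W_hat @ k_star, v_star)   # the edit is exact
```

Note that the denominator cancels when the update is applied to $\mathbf{k}^*$ itself, which is why the mapping is exact regardless of $C$; the covariance only shapes how the change falls on all other keys.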

3.4 MEMIT: Mass-Editing via Distributed Updates

MEMIT extends ROME to handle $n$ simultaneous edits by distributing the updates across $r$ consecutive MLP layers $\mathcal{L} = \{l_1, \ldots, l_r\}$. The key insight is that each editing layer needs only to “contribute” a fraction of the total residual needed to route the subject representation to the correct value. Formally, MEMIT solves:

$$\min_{\{\Delta_l\}} \sum_{l \in \mathcal{L}} \|\Delta_l\|_F^2 \quad \text{s.t.} \quad \sum_{l \in \mathcal{L}} \Delta_l \mathbf{k}_i^{(l)} = \mathbf{r}_i \quad \forall i \in [n]$$

where $\mathbf{r}_i = \mathbf{v}_i^* - W_{l_r} \mathbf{k}_i^{(l_r)}$ is the total residual for edit $i$, measured at the deepest edited layer $l_r$, and $\Delta_l$ is the weight update for layer $l$. Minimizing the Frobenius norm of the per-layer deltas, with the residual split evenly across layers, yields a closed-form solution analogous to the ROME update but applied in parallel across $r$ layers:

$$\Delta_l = R_l K_l^\top (K_l K_l^\top + \lambda C_l)^{-1}$$

where $R_l$ is the matrix of residuals assigned to layer $l$ and $\lambda$ balances locality. MEMIT has demonstrated successful batch editing of up to 10,000 facts in GPT-J (6B) and GPT-NeoX (20B) while maintaining strong performance on downstream benchmarks.

3.5 Evaluation Criteria and Failure Modes

The standard evaluation of model editors uses three criteria: efficacy (the model returns the new object $o'$ on the edited prompt itself), generalization (the edit holds under paraphrases of the prompt), and specificity, also called locality (predictions about unrelated subjects and facts are unchanged).

Both ROME and MEMIT score highly on efficacy and specificity but fail on multi-hop generalization — for instance, after editing “The CEO of DeepMind is $o’$”, the model should also update the answer to “Who leads the organization founded by Demis Hassabis?” but typically does not. This failure reflects the fact that ROME operates on a single subject-relation lookup and does not update the causal graph of derived facts.

Furthermore, sequential application of many ROME edits degrades model quality significantly: each rank-one update is not truly orthogonal to previous updates because the covariance matrix $C$ is computed once and not updated, so its approximation degrades over many edits.

4. Discussion: Real-World Deployment Considerations

Model editing has attracted attention from practitioners who need to maintain deployed LLMs without the cost of full retraining. Several deployment scenarios illustrate both the promise and limitations of current methods.

4.1 Knowledge Cutoff Updating

A recurring problem with deployed LLMs is their fixed knowledge cutoff. MEMIT offers a pathway to inject post-cutoff facts at low cost, although production systems currently rely more on retrieval augmentation (RAG) than on weight editing for factual refreshes. The fundamental difficulty is verification: it is hard to enumerate in advance all the prompts that should return the new fact, making complete generalization coverage impossible to guarantee.

4.2 Detoxification and Bias Correction

Model editing has been proposed as a targeted alternative to RLHF fine-tuning for removing specific harmful associations. However, Hoelscher-Obermaier et al. (2023) demonstrated that toxicity editing via ROME does not generalize robustly — edited models can be re-elicited for toxic completions via paraphrase attacks, suggesting that the bias is distributed more widely than a single MLP layer captures.

4.3 The Retrieval-Augmented Alternative

WISE (Wang et al., 2024) represents a hybrid architectural approach. Rather than modifying existing weights, WISE trains a small “side memory” module alongside the frozen base model. At inference time, a learned router determines whether to use the base model or the side memory for a given query. This approach inherits the specificity of parametric editing while avoiding its interference problems, but requires architectural changes to the deployment infrastructure and introduces routing latency.
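The routing idea can be sketched with a toy cosine-similarity router over stored edit keys. Everything here is a hypothetical illustration (the `SideMemory` class, the embedding stand-ins, and the fixed threshold are invented for this sketch; WISE's actual routing criterion is activation-based and learned).

```python
import numpy as np

rng = np.random.default_rng(4)
d = 32

def base_model(query_vec):
    # Stand-in for the frozen base model's answer to a query.
    return "base_answer"

class SideMemory:
    """Toy side-memory editor with a cosine-similarity router.

    Stores (key, answer) pairs for edits; at inference, a query is routed
    to the side memory when its embedding is close enough to a stored edit
    key, and otherwise falls through to the frozen base model."""

    def __init__(self, threshold=0.8):
        self.keys, self.answers = [], []
        self.threshold = threshold

    def add_edit(self, key_vec, answer):
        self.keys.append(key_vec / np.linalg.norm(key_vec))
        self.answers.append(answer)

    def __call__(self, query_vec):
        q = query_vec / np.linalg.norm(query_vec)
        if self.keys:
            sims = np.array([k @ q for k in self.keys])
            j = int(sims.argmax())
            if sims[j] >= self.threshold:
                return self.answers[j]       # routed to the edit store
        return base_model(query_vec)         # routed to the base model

mem = SideMemory(threshold=0.8)
edit_key = rng.normal(size=d)
mem.add_edit(edit_key, "edited_answer")

# A query close to the edit key hits the side memory...
close = edit_key + 0.05 * rng.normal(size=d)
# ...while an unrelated query falls through to the base model.
far = rng.normal(size=d)
```

The threshold controls the same trade-off the text identifies: too loose and edits bleed into unrelated queries, too tight and paraphrases miss the edit store.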

4.4 Scaling and Compositionality

A major open question is whether model editing scales gracefully to the editing of interrelated facts. Fact triples in the real world form a dense graph: changing one node should propagate changes to downstream inferences. Current methods treat edits as isolated updates to single $(s, r, o)$ triples and have no mechanism for graph-propagation. Recent work on ripple effect editing (Cohen et al., 2023) has begun to formalize this challenge and proposes evaluation protocols that measure consistency across multi-hop chains, but no method has demonstrated robust multi-hop generalization.

4.5 Localization Critiques

The causal tracing interpretation — that factual knowledge is localized in mid-layer MLP modules — has been challenged. Henighan et al. (2023) and Hernandez et al. (2023) showed that attention layers also play a significant role in subject attribute binding, particularly for multi-token subjects. Moreover, the causal tracing methodology itself has been critiqued for conflating correlation with causality: restoring a hidden state from the clean run and observing a performance boost does not prove that state is the unique locus of knowledge storage — it only shows that the state is part of the minimal intervention set. A fuller picture of knowledge storage likely involves distributed representations across both attention and MLP components.

5. Conclusion

Knowledge localization and model editing represent a rapidly maturing research area at the intersection of mechanistic interpretability and practical AI engineering. The causal tracing methodology of Meng et al. has provided a principled framework for understanding where factual associations reside in GPT-style models, and the ROME and MEMIT algorithms have demonstrated that targeted, low-interference edits are achievable for single and batch factual updates. However, significant challenges remain: current methods do not generalize across multi-hop inference chains, sequential edits accumulate interference, and the localization thesis itself is increasingly complicated by evidence that factual recall involves distributed contributions across attention and MLP components alike.

As language models grow larger and are deployed in increasingly dynamic knowledge environments, the ability to perform precise, auditable, and generalizing updates to model knowledge will become a practical necessity. Future progress is likely to come from hybrid architectures that combine parametric editing with retrieval-augmented memory, from improved causal formalisms that account for distributed representations, and from evaluation protocols that measure the full ripple effect of factual edits across the model’s inferred knowledge graph.

References

  1. Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language models as knowledge bases? Proceedings of EMNLP 2019, 2463–2473. arXiv:1909.01066.
  2. Zhu, C., Rawlinson, D., Socher, R., & Xiong, C. (2020). Modifying memories in transformer models. arXiv preprint arXiv:2012.00363.
  3. De Cao, N., Aziz, W., & Titov, I. (2021). Editing factual knowledge in language models. Proceedings of EMNLP 2021, 6491–6506. arXiv:2104.08164.
  4. Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems (NeurIPS 2022), 35, 17359–17372. arXiv:2202.05262.
  5. Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., & Bau, D. (2023). Mass-editing memory in a transformer. International Conference on Learning Representations (ICLR 2023). arXiv:2210.07229.
  6. Huang, J., et al. (2023). EasyEdit: An easy-to-use knowledge editing framework for large language models. arXiv preprint arXiv:2308.07269.
  7. Wang, P., et al. (2024). WISE: Rethinking the knowledge memory for lifelong model editing of large language models. Advances in Neural Information Processing Systems (NeurIPS 2024). arXiv:2405.14768.