Abstract
Large language models (LLMs) trained on next-token prediction objectives exhibit a striking directional asymmetry in factual recall: a model that correctly completes “The CEO of OpenAI is [Sam Altman]” may systematically fail to complete “Sam Altman is the CEO of [OpenAI]” even when both statements appear in training data. This phenomenon, termed the Reversal Curse by Berglund et al. (2023), represents a fundamental limitation of autoregressive memorization that cannot be explained by insufficient training data or model capacity alone. In this paper, we provide a rigorous technical analysis of directional memorization in transformer-based language models, examine the statistical and architectural mechanisms underlying factual asymmetry, survey mitigation strategies including bidirectional augmentation and contrastive training objectives, and situate the reversal curse within the broader framework of compositional generalization failures. Our analysis suggests that the reversal curse is not an incidental training artifact but a structural consequence of the causal language modeling objective, with significant implications for how we evaluate and deploy knowledge-intensive LLM systems.
1. Introduction
The standard training objective for autoregressive language models is next-token prediction: given a sequence of tokens $x_1, x_2, \ldots, x_{t-1}$, the model learns to predict $x_t$. This objective is powerful, scalable, and has enabled remarkable generalization across tasks. However, its causal structure encodes an implicit directional bias: relationships between tokens are learned asymmetrically, conditioned on the order in which they appear in training text.
Consider a factual statement of the form “A is related to B”. Under causal language modeling, the model learns $P(B \mid A, \text{context})$ effectively when it encounters the forward sequence. The reverse relationship $P(A \mid B, \text{context})$ is a distinct conditional distribution that must be learned from separate training examples presenting the reverse ordering. If those examples are rare or absent, the model will fail at reverse recall despite having encoded the forward direction with high confidence.
This is not a trivial observation. It implies that even a model with perfect forward-direction memorization of every fact in its training corpus could fail spectacularly at reverse queries — not because of insufficient capacity, but because the reverse distributions were never explicitly trained. The reversal curse thus exposes a structural gap between memorization (encoding the literal training sequences) and understanding (encoding the underlying relational structure in a direction-invariant way).
The implications are significant. Knowledge-intensive applications — fact-checking, question answering, knowledge graph population, entity disambiguation — frequently require querying facts in multiple directions. Evaluation benchmarks that test only one direction of factual recall systematically overestimate model knowledge. Moreover, the reversal curse suggests that compositional generalization in LLMs is more fragile than commonly assumed: a model that learns $f(A) = B$ does not automatically learn $f^{-1}(B) = A$.
This paper proceeds as follows. Section 2 surveys related work on factual recall, compositional generalization, and memorization in LLMs. Section 3 provides a technical analysis of the mechanisms underlying directional asymmetry. Section 4 discusses implications and mitigation strategies. Section 5 concludes with open problems.
2. Related Work
The reversal curse builds on a substantial body of literature concerning knowledge representation, factual recall, and generalization in language models.
Berglund et al. (2023) first formalized the reversal curse through controlled experiments on GPT-4 and other models, demonstrating that models trained on “A is B” systematically fail to recall “B is A” even in zero-shot settings. They constructed synthetic datasets to isolate the effect from confounds such as training frequency and world knowledge.
Mallen et al. (2023), who introduced PopQA, demonstrated that LLM factual recall is highly correlated with entity popularity in training data, establishing a statistical baseline for how frequency shapes associative memory. This work is directly relevant: popular entities appear in training text with diverse phrasings and orderings, which partially mitigates directional asymmetry for high-frequency facts.
Petroni et al. (2019) introduced the LAMA benchmark, probing factual knowledge in pre-trained LMs via cloze-style queries. Their finding that factual recall is sensitive to prompt formulation is an early precursor to the reversal curse: what appears to be stable knowledge is in fact a distribution over phrasings, not a representation of the underlying fact.
Kassner and Schütze (2020) showed that LMs struggle with negated factual statements (“Paris is not the capital of Germany”), revealing that models encode pattern completion rather than logical structure. This complements the reversal curse: both findings point to shallow, surface-form associative encoding rather than semantically structured knowledge.
Elazar et al. (2021) examined consistency of factual knowledge across paraphrases using the ParaRel dataset, finding that models are often inconsistent across semantically equivalent queries — a result that generalizes directional asymmetry to a broader relational inconsistency phenomenon.
Allen-Zhu and Li (2023) provided a controlled analysis of knowledge storage in transformers, showing that gradient descent on next-token prediction objectives yields knowledge encodings that are inherently tied to the surface form and ordering of training examples. Their work provides a formal grounding for the empirical reversal curse observations.
Grosse et al. (2023) studied memorization dynamics in LLMs through the lens of influence functions, showing that factual recall can be traced to specific training examples — reinforcing the view that reversal failures stem from training data distribution rather than model capacity.
3. Technical Analysis
3.1 Formal Setup
Let $\mathcal{V}$ be a vocabulary and $\mathcal{D}$ a training corpus. An autoregressive language model parameterized by $\theta$ learns:
$$P_\theta(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P_\theta(x_t \mid x_1, \ldots, x_{t-1})$$
The training objective minimizes the negative log-likelihood:
$$\mathcal{L}(\theta) = -\mathbb{E}_{(x_1,\ldots,x_n) \sim \mathcal{D}} \left[ \sum_{t=1}^{n} \log P_\theta(x_t \mid x_1, \ldots, x_{t-1}) \right]$$
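To make the objective concrete, the following minimal PyTorch sketch computes the per-token negative log-likelihood for a batch of token sequences; the tensor shapes and function name are illustrative rather than drawn from any particular codebase.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token negative log-likelihood.

    logits: (batch, seq_len, vocab) model outputs at each position
    tokens: (batch, seq_len) input token ids
    Position t's logits are scored against the token at position t+1,
    so every prediction conditions only on the left context.
    """
    shift_logits = logits[:, :-1, :]   # predictions for positions 1..n-1
    shift_targets = tokens[:, 1:]      # the tokens actually observed next
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```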
Consider a factual relation $r(e_1, e_2)$ between entities $e_1$ and $e_2$, expressed in natural language as the sequence $s_{\text{fwd}} = (w_1, \ldots, w_k, e_1, w_{k+1}, \ldots, w_m, e_2)$ in the forward direction and $s_{\text{rev}} = (w'_1, \ldots, w'_j, e_2, w'_{j+1}, \ldots, w'_p, e_1)$ in the reverse direction.
The model learns $P_\theta(e_2 \mid w_1, \ldots, e_1, w_{k+1}, \ldots)$ from forward examples and $P_\theta(e_1 \mid w'_1, \ldots, e_2, w'_{j+1}, \ldots)$ from reverse examples. These are distinct conditional distributions with no necessary relationship enforced by the training objective. Whether the model learns both depends on the frequency and diversity of each direction in $\mathcal{D}$.
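To make the asymmetry concrete with the example from the abstract, the two directions of the same fact are literally different training strings, and each contributes gradient signal only to its own conditional (the templates below are illustrative):

```python
fact = {"relation": "ceo_of", "e1": "Sam Altman", "e2": "OpenAI"}

# Forward surface form: e1 appears first, the model is trained to predict e2.
forward_text = f"{fact['e1']} is the CEO of {fact['e2']}."   # trains P(e2 | ... e1 ...)

# Reverse surface form: e2 appears first, the model is trained to predict e1.
reverse_text = f"The CEO of {fact['e2']} is {fact['e1']}."   # trains P(e1 | ... e2 ...)

# If only forward_text occurs in the corpus, nothing in the loss couples the
# learned P(e2 | ... e1 ...) to the untrained P(e1 | ... e2 ...).
```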
3.2 Why Reverse Inference Does Not Follow
One might naively expect that a sufficiently large model with enough parameters could learn the inverse relationship implicitly. The key insight from Allen-Zhu and Li (2023) is that this requires the model to implement something equivalent to Bayes’ theorem:
$$P(e_1 \mid e_2, \text{context}) \propto P(e_2 \mid e_1, \text{context}) \cdot P(e_1)$$
But computing this requires the model to (1) retrieve the forward probability $P(e_2 \mid e_1)$, (2) have calibrated entity priors $P(e_1)$, and (3) combine them in an arithmetically precise way. There is no reason to expect gradient descent on next-token prediction to implement this computation, especially since the training signal never rewards reverse inference directly.
More formally, let $f_\theta: \mathcal{V}^* \to \Delta(\mathcal{V})$ be the model’s next-token distribution. The forward fact is learned if $f_\theta$ assigns high probability to $e_2$ after the prefix ending with $e_1$. The reverse fact is learned if $f_\theta$ assigns high probability to $e_1$ after a reverse prefix ending with $e_2$. These are independent learned functions with no structural coupling in the model architecture.
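The following sketch spells out what an explicit Bayesian inversion would require: the forward conditional evaluated for every candidate entity, a calibrated prior, and a normalization step. The entity set and probabilities are invented for illustration; the point is that next-token training never rewards this computation.

```python
def reverse_via_bayes(forward_prob, prior, candidates, e2):
    """P(e1 | e2) is proportional to P(e2 | e1) * P(e1), normalized over candidates."""
    unnormalized = {e1: forward_prob(e2, given=e1) * prior[e1] for e1 in candidates}
    z = sum(unnormalized.values())
    return {e1: p / z for e1, p in unnormalized.items()}

# Toy numbers: even a perfectly memorized forward conditional only yields the
# reverse answer after a full sweep over candidate entities plus a prior.
prior = {"Sam Altman": 0.6, "Someone Else": 0.4}
forward_prob = lambda e2, given: 0.95 if (given, e2) == ("Sam Altman", "OpenAI") else 0.01
print(reverse_via_bayes(forward_prob, prior, list(prior), "OpenAI"))
```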
3.3 Attention Mechanism and Directional Encoding
The transformer’s attention mechanism computes:
$$\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
with causal masking ensuring that position $t$ can only attend to positions $\leq t$. This means that when generating $e_2$, all positions of $e_1$ and surrounding context are available as keys and values. When generating $e_1$ in the reverse direction, $e_2$ is in the prefix and similarly available.
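A minimal NumPy sketch of causally masked single-head attention makes the directional constraint explicit (dimensions and names are illustrative):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head attention with a causal mask. Q, K, V: (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (seq_len, seq_len)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)               # no attention to future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```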
The question is whether the residual stream representations at the relevant positions encode the requisite information for reverse recall. Mechanistic interpretability studies (Meng et al., 2022, on ROME) show that factual associations in GPT-style models are localized in specific MLP layers via key-value memories. These memories are indexed by the subject entity: given subject $e_1$, certain MLP neurons activate to retrieve the object $e_2$. The reverse query — given $e_2$, retrieve $e_1$ — requires a different set of neurons to activate, and these need not have been trained.
This can be modeled as follows. Let the MLP at layer $l$ implement:
$$\text{MLP}_l(x) = W_V \cdot \sigma(W_K \cdot x)$$
where $W_K$ acts as a key matrix and $W_V$ as a value matrix. Training on forward facts tunes $W_K$ to activate on representations of $e_1$ and $W_V$ to output representations that shift the residual stream toward $e_2$. Reverse recall requires $W_K$ to activate on $e_2$ representations and $W_V$ to output toward $e_1$ — a completely different set of weight configurations that only gets trained if reverse examples appear in $\mathcal{D}$.
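A toy version of this key-value view, with a single stored fact and random entity representations (all dimensions and numbers invented for illustration), shows why the stored association is one-directional: the key pattern matches $e_1$'s representation, and nothing about the forward update makes $e_2$'s representation retrieve $e_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
e1_vec, e2_vec = rng.standard_normal(d), rng.standard_normal(d)

# "Train" one key-value memory slot on the forward fact e1 -> e2:
# the key row is tuned to e1's representation, the value column writes e2's.
W_K = e1_vec[None, :]   # (1, d)
W_V = e2_vec[:, None]   # (d, 1)

def mlp_memory(x):
    activation = np.maximum(W_K @ x, 0.0)   # key match
    return (W_V @ activation).ravel()       # value written into the residual stream

fwd = mlp_memory(e1_vec)   # strongly aligned with e2_vec
rev = mlp_memory(e2_vec)   # weak or zero: e2 does not match the stored key

cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
print(cos(fwd, e2_vec), cos(rev, e1_vec))
```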
3.4 Empirical Characteristics of the Reversal Curse
Berglund et al. (2023) report several quantitative characteristics of the reversal curse that constrain our theoretical models:
- Near-zero reverse recall: When models (GPT-3 and Llama-1 variants) are fine-tuned on synthetic facts presented in only one direction, they achieve >90% forward recall but near-random (<5%) reverse recall.
- No leakage across ordering: Even when both entity names appear in a training document, if the grammatical structure only licenses the forward reading, reverse recall does not improve.
- Scaling does not cure it: The curse persists across model scales (GPT-3.5 to GPT-4), suggesting it is a property of the objective rather than model capacity.
- Few-shot does not cure it: Providing reverse-direction examples at test time as few-shot demonstrations does not reliably improve performance, suggesting the issue is in the weights rather than context utilization.
3.5 Statistical Model of Training Data Asymmetry
Let $c_{\text{fwd}}(r, e_1, e_2)$ and $c_{\text{rev}}(r, e_1, e_2)$ denote the forward and reverse counts of relation $r$ between entities $e_1$ and $e_2$ in the training corpus. The probability of successful forward recall can be approximated as a function of $c_{\text{fwd}}$, and reverse recall as a function of $c_{\text{rev}}$.
For popular entities and common relations, Mallen et al. (2023) show that recall probability increases with the entity's frequency in the training data; this relationship can be approximated as $\sigma(\alpha \log c + \beta)$, where $\sigma$ is the sigmoid function. The key observation is that $c_{\text{fwd}}$ and $c_{\text{rev}}$ can differ by orders of magnitude: English text overwhelmingly uses subject-verb-object order, meaning that for a relation $r(e_1, e_2)$ where $e_1$ is the subject, forward examples vastly outnumber reverse examples.
This statistical asymmetry is directly reflected in the model's learned associations: minimizing the cross-entropy loss drives $P_\theta$ toward the empirical next-token distribution of the training corpus, so a conditional that is rare or absent in $\mathcal{D}$ is simply never fit, regardless of how well its inverse has been learned.
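The sketch below illustrates how an order-of-magnitude count asymmetry translates into a large recall gap under this sigmoid approximation; the coefficients and counts are invented for illustration, not fitted values from Mallen et al. (2023).

```python
import math

def recall_prob(count, alpha=0.8, beta=-3.0):
    """Approximate recall probability as sigmoid(alpha * log(count) + beta)."""
    if count == 0:
        return 0.0
    z = alpha * math.log(count) + beta
    return 1.0 / (1.0 + math.exp(-z))

c_fwd, c_rev = 5000, 12   # hypothetical forward vs. reverse mention counts
print(f"forward ~ {recall_prob(c_fwd):.2f}, reverse ~ {recall_prob(c_rev):.2f}")
```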
4. Discussion
4.1 Implications for Knowledge Evaluation
The reversal curse has immediate consequences for how we evaluate LLM factual knowledge. Standard benchmarks such as TriviaQA, NaturalQuestions, and PopQA tend to query facts in a canonical direction — typically with a known entity as the question subject and the related entity as the expected answer. A model that passes these benchmarks may nonetheless fail at a large fraction of reverse queries.
This means that benchmark scores overestimate the actual relational knowledge encoded in the model. The problem is compounded by training data contamination: if the benchmark questions appear in training data in their canonical form, the model learns those specific prompts and answers, further inflating the apparent knowledge while leaving reverse queries unlearned.
A more rigorous evaluation methodology would systematically test both directions of each factual relation and report both forward and reverse accuracy. The gap between these scores is a diagnostic for the degree of directional asymmetry in the model’s knowledge representations.
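A minimal sketch of such a protocol is given below; the fact format and scoring function are assumptions for illustration, not an existing benchmark API.

```python
def evaluate_bidirectional(facts, answer_fn):
    """facts: iterable of (fwd_prompt, fwd_answer, rev_prompt, rev_answer) tuples.
    answer_fn: callable mapping a prompt string to the model's answer string."""
    fwd_correct = rev_correct = n = 0
    for fwd_prompt, fwd_ans, rev_prompt, rev_ans in facts:
        fwd_correct += answer_fn(fwd_prompt).strip() == fwd_ans
        rev_correct += answer_fn(rev_prompt).strip() == rev_ans
        n += 1
    return {
        "forward_acc": fwd_correct / n,
        "reverse_acc": rev_correct / n,
        "directional_gap": (fwd_correct - rev_correct) / n,
    }
```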
4.2 Mitigation Strategies
Bidirectional training data augmentation is the most direct mitigation: for every factual sentence in the training corpus, automatically generate a reverse-direction paraphrase and include it in training. This has been explored in the context of relation extraction (Zhang et al., 2019) and can in principle be applied at scale using templates or seq2seq paraphrasers. The cost is additional compute and the risk of introducing paraphrase artifacts.
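A template-based sketch of this augmentation is shown below; the relation templates are hypothetical, and a production pipeline would first need relation extraction or a paraphrase model to produce (relation, e1, e2) triples.

```python
# Hypothetical per-relation templates mapping a triple to both surface orderings.
TEMPLATES = {
    "ceo_of": ("{e1} is the CEO of {e2}.", "The CEO of {e2} is {e1}."),
    "capital_of": ("{e1} is the capital of {e2}.", "The capital of {e2} is {e1}."),
}

def augment(triples):
    """Yield both surface orderings for each (relation, e1, e2) triple."""
    for relation, e1, e2 in triples:
        fwd_tpl, rev_tpl = TEMPLATES[relation]
        yield fwd_tpl.format(e1=e1, e2=e2)
        yield rev_tpl.format(e1=e1, e2=e2)

print(list(augment([("ceo_of", "Sam Altman", "OpenAI")])))
```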
Masked language modeling objectives (as in BERT; Devlin et al., 2019) are bidirectional by design, since the mask can appear at any position. However, MLMs face their own limitations in generation tasks and have largely been supplanted by causal models for open-ended reasoning. Hybrid objectives that combine causal and masked training (e.g., UL2; Tay et al., 2022) may offer a middle ground.
Contrastive or symmetric training objectives could explicitly reward the model for assigning consistent probabilities to semantically equivalent forward and reverse statements. Define a symmetric factual consistency loss:
$$\mathcal{L}_{\text{sym}} = \left( \log P_\theta(e_2 \mid \text{ctx}_{\text{fwd}}) - \log P_\theta(e_1 \mid \text{ctx}_{\text{rev}}) \right)^2$$
Minimizing $\mathcal{L}_{\text{sym}}$ encourages the model to encode symmetric relational knowledge. This approach requires paired examples and adds an auxiliary objective to the training procedure, but could be applied during supervised fine-tuning at modest cost.
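As a sketch of how the auxiliary term could be computed during fine-tuning (PyTorch; single-token answers and paired prompts are assumed, and the function name is illustrative):

```python
import torch

def symmetric_consistency_loss(fwd_logits, e2_id, rev_logits, e1_id):
    """Squared gap between log P(e2 | forward context) and log P(e1 | reverse context).

    fwd_logits, rev_logits: (vocab,) logits at the answer position of each prompt.
    e2_id, e1_id: token ids of the gold answers (single-token entities assumed).
    """
    log_p_fwd = torch.log_softmax(fwd_logits, dim=-1)[e2_id]
    log_p_rev = torch.log_softmax(rev_logits, dim=-1)[e1_id]
    return (log_p_fwd - log_p_rev) ** 2

# In practice this would be added to the standard cross-entropy loss with a
# small weight, e.g. loss = lm_loss + lambda_sym * symmetric_consistency_loss(...).
```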
Retrieval-augmented generation (RAG) sidesteps the issue at inference time by retrieving supporting documents and using them as context. If the retrieved context contains the reverse-direction statement, the model can condition on it without needing to have memorized the reverse. However, this shifts the dependency to retrieval quality and does not address the fundamental representational limitation.
4.3 Connection to Compositional Generalization
The reversal curse is one instance of a broader compositional generalization failure. Compositionality would predict that a model that learns the components of a relation (the entities and the relational schema) should be able to deploy that knowledge in any direction. The empirical evidence suggests otherwise: models learn associations over surface sequences, not over abstract relational structures.
Lake and Baroni (2018) formalized compositionality requirements for language understanding and showed that standard sequence models fail on tests of systematic generalization. The SCAN benchmark they introduced tests whether models can generalize learned primitives to novel combinations, a test that neural models consistently struggle with. The reversal curse can be seen as a minimal test of compositional generalization: can the model apply a learned relation in reverse? The answer, empirically, is often no.
This connects to debates about whether LLMs perform genuine relational reasoning or sophisticated pattern matching. The reversal curse provides a clean experimental test: if the model truly represents the relation $r(e_1, e_2)$ as a symbolic structure, it should be accessible from either direction. Failure on reverse queries is evidence against structured relational encoding and in favor of surface-form associativity.
4.4 Connections to Knowledge Editing
Knowledge editing methods (Meng et al., 2022; Mitchell et al., 2022) aim to update specific factual associations in a model without full retraining. These methods typically target specific MLP layers and update weights to change $P(e_2 \mid e_1)$. A known failure mode of knowledge editing is that the edit does not propagate to reverse queries: editing “The CEO of OpenAI is Sam Altman” to “The CEO of OpenAI is [new person]” often does not update the reverse query. This is precisely the reversal curse in the context of model editing, and it suggests that the forward and reverse associations are stored in different weight subspaces.
More recent evaluation work (Cohen et al., 2023) has begun to address this by explicitly testing whether edits propagate to semantically related queries. The reversal direction is now treated as a required test of edit success in some evaluation frameworks, indicating broader recognition of the directional asymmetry problem.
5. Conclusion
The reversal curse is a fundamental limitation of autoregressive language models that arises from the causal structure of the next-token prediction objective. We have shown that it is not reducible to insufficient model capacity or training data scale, but reflects the statistical asymmetry of natural language corpora and the directional nature of learned associative memories in transformer MLP layers.
The key theoretical contributions of this analysis are: (1) a formal demonstration that reverse inference does not follow from forward memorization under next-token prediction; (2) a mechanistic account in terms of MLP key-value memories that are indexed by subject entities; and (3) a statistical characterization relating the curse to forward/reverse count asymmetry in training data.
From a practical standpoint, the reversal curse implies that current evaluation protocols systematically overestimate factual knowledge in LLMs, that knowledge editing methods must explicitly target reverse relations, and that applications requiring bidirectional factual retrieval cannot rely on autoregressive models without targeted mitigation.
Looking forward, the most promising directions are bidirectional training augmentation and hybrid objective functions that explicitly reward relational consistency. Deeper architectural solutions — such as models that represent relations as directionally invariant structures — would require departures from the standard transformer architecture and remain an open research challenge.
Ultimately, the reversal curse serves as a diagnostic for a deeper question: do large language models learn facts as structured relational knowledge, or as directed statistical associations over surface text? Current evidence strongly favors the latter interpretation, with significant implications for how we design, evaluate, and deploy knowledge-intensive AI systems.
References
- Allen-Zhu, Z., & Li, Y. (2023). Physics of language models: Part 3.1, knowledge storage and extraction. arXiv:2309.14316.
- Berglund, L., Tong, M., Kaufmann, M., Balesni, M., Stickland, A. C., Korbak, T., & Evans, O. (2023). The reversal curse: LLMs trained on “A is B” fail to learn “B is A”. arXiv:2309.12288.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT 2019.
- Elazar, Y., Kassner, N., Ravfogel, S., Ravichander, A., Hovy, E., Schütze, H., & Goldberg, Y. (2021). Measuring and improving consistency in pretrained language models. TACL, 9, 1012–1031.
- Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., … & Bowman, S. R. (2023). Studying large language model generalization with influence functions. arXiv:2308.03296.
- Cohen, R., Biran, E., Yoran, O., Globerson, A., & Geva, M. (2023). Evaluating the ripple effects of knowledge editing in language models. arXiv:2307.12976.
- Kassner, N., & Schütze, H. (2020). Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. ACL 2020.
- Lake, B. M., & Baroni, M. (2018). Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. ICML 2018.
- Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., & Hajishirzi, H. (2023). When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. ACL 2023.
- Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual associations in GPT. NeurIPS 2022.
- Mitchell, E., Lin, C., Bosselut, A., Manning, C. D., & Finn, C. (2022). Memory-based model editing at scale. ICML 2022.
- Petroni, F., Rocktäschel, T., Lewis, P., Bakhtin, A., Wu, Y., Miller, A. H., & Riedel, S. (2019). Language models as knowledge bases? EMNLP 2019.
- Tay, Y., Dehghani, M., Tran, V. Q., Garcia, X., Bahri, D., Schuster, T., … & Metzler, D. (2022). Unifying language learning paradigms. arXiv:2205.05131.