Abstract
Large language models equipped with extended context windows exhibit systematic, position-dependent performance degradation that is not explained by perplexity alone. Two related but distinct phenomena have attracted significant empirical and theoretical attention: attention sinks, in which a disproportionate share of the softmax attention mass concentrates on initial tokens regardless of their semantic content, and the lost-in-the-middle effect, in which models fail to utilize information placed in the middle of a long context even when it is retrievable from positions near the beginning or end. This article provides a unified technical analysis of both phenomena: their empirical characterization, mechanistic hypotheses, mathematical underpinnings, and emerging mitigation strategies. We argue that these pathologies arise from the interplay of causal masking geometry, rotary positional embeddings, training distribution mismatch, and learned attention calibration failures. Understanding them is not merely an engineering concern—it constitutes a fundamental question about how autoregressive transformers represent and access information over extended sequences.
1. Introduction
The promise of long-context language models is straightforward: if a model can attend to tens or hundreds of thousands of tokens simultaneously, it should be able to integrate information across entire documents, codebases, or multi-turn conversations without the lossy compression of retrieval-augmented pipelines. Vendors have accordingly raced to extend context windows—GPT-4 to 128K tokens, Gemini 1.5 to over a million, Claude 3 to 200K—and the field has largely treated context length as a simple quality dial: longer is better.
This framing obscures a more uncomfortable reality. Models with nominally long context windows exhibit marked performance heterogeneity depending on where in the context the relevant information is located, not merely whether it is present. Liu et al. (2023) demonstrated this sharply in the multi-document question answering setting: when the answer-bearing document was placed in the middle of a long input, model accuracy fell substantially compared to placement at the beginning or end—a U-shaped performance curve they termed the lost-in-the-middle phenomenon. Independently, Xiao et al. (2023) identified attention sinks: initial tokens (particularly the first token, often a BOS or punctuation token) receive disproportionately large attention weights across layers, acting as “sink” tokens that absorb probability mass that would otherwise be diffused over the context.
These phenomena are not independent. The primacy bias encoded by attention sinks may directly cause the U-shaped performance curve: early tokens receive elevated attention through two mechanisms simultaneously—the sink effect and genuine semantic relevance near sequence start. The recency bias (elevated performance for information near the end) likely reflects the way positional embeddings interact with causal masking near the decoding position. Together, they suggest that the “effective context” of current long-context models is not the full nominal window but a heavily position-weighted subset of it.
This article proceeds as follows. Section 2 reviews the key empirical findings and related theoretical work. Section 3 provides a mathematical analysis of the mechanisms, including attention sink formation under softmax calibration and the geometry of RoPE at extended sequence lengths. Section 4 discusses implications and mitigation strategies. Section 5 concludes with open problems.
2. Related Work
Liu et al. (2023) — “Lost in the Middle: How Language Models Use Long Contexts” — conducted the foundational empirical characterization of position-dependent performance. Using multi-document question answering and key-value retrieval tasks, they tested GPT-3.5-Turbo, GPT-4, Claude, and open-source models with inputs ranging from 10 to 30 documents. All models exhibited significantly degraded performance when the relevant document was placed at input positions 10–20 (middle), with strong primacy (position 0) and recency (last position) advantages. The degradation was observed even when the model had sufficient nominal context capacity, confirming that the failure is one of utilization, not representation.
Xiao et al. (2023) — “Efficient Streaming Language Models with Attention Sinks” — identified and named the attention sink phenomenon. Through attention weight visualization across LLaMA, GPT-2, Pythia, and MPT, they observed that the first one to four tokens consistently accumulated large attention weights throughout all layers, regardless of semantic content. This observation led to StreamingLLM, a method that preserves KV cache entries for the initial sink tokens and a sliding window of recent tokens, enabling efficient streaming inference without full context recomputation. The sink mechanism also explains why simply discarding early KV cache entries causes catastrophic perplexity spikes, while preserving even non-semantic initial tokens preserves fluency.
Press et al. (2022) — “Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation” (ALiBi) — proposed a positional embedding scheme in which a fixed, non-learned bias is subtracted from attention logits proportional to the distance between query and key tokens. ALiBi models exhibit substantially better length extrapolation than absolute or RoPE-based embeddings, consistent with the hypothesis that positional encoding geometry underlies some long-context failure modes.
Su et al. (2024) — “RoFormer: Enhanced Transformer with Rotary Position Embedding” — introduced Rotary Position Embeddings (RoPE), now the dominant positional encoding in modern open-weight LLMs. RoPE encodes relative position by rotating query and key vectors in the complex plane, with the rotation angle depending on frequency bands assigned to embedding dimensions. The theoretical properties of RoPE at sequence lengths exceeding the training distribution are not well understood; position interpolation techniques (Chen et al., 2023) that rescale rotary frequencies have been proposed to address this, but they do not fully resolve sink or middle-loss pathologies.
Dettmers et al. (2022) — “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale” — while primarily a quantization paper, introduced the concept of outlier features: a small fraction of embedding dimensions that carry systematically large magnitudes and resist quantization. These same outlier dimensions appear to be mechanistically related to attention sink formation, suggesting that sinks and quantization instability share a common representational root in how transformers handle “garbage collection” of probability mass under causal softmax constraints.
Han et al. (2023) — “LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models” — analyzed the out-of-distribution nature of long sequence positions for RoPE models and proposed \(\Lambda\)-shaped attention masking to suppress distant token interactions, offering a complementary perspective to StreamingLLM on the geometric origins of long-context degradation.
3. Technical Analysis
3.1 The Softmax Calibration Problem and Attention Sink Formation
Consider the standard scaled dot-product attention for a single head:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
For an autoregressive model generating the token at position $t$, the attention weight vector $\mathbf{a}_t \in \mathbb{R}^t$ must sum to 1 by construction of the softmax. When $t$ is large, the model faces a distributional constraint: even if no previous token is semantically relevant to the current generation step, the probability mass must be allocated somewhere. The softmax cannot output a near-zero vector; it must produce a valid probability distribution.
This constraint creates a thermodynamic pressure to designate dump tokens—positions that absorb excess probability mass when no semantically meaningful token should receive high attention. The initial tokens, particularly position 0, are structurally privileged candidates for this role: every token in the sequence can attend to them (they are never masked out in causal attention), and if the model learns early in training to route “irrelevant” attention mass to position 0, this pattern is reinforced across all layers through gradient descent.
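The constraint is easy to see numerically. A minimal NumPy sketch (illustrative only; the logit values are arbitrary assumptions, not taken from any model):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Even when every key is irrelevant (all logits low), the attention
# weights still sum to 1: the mass has to land somewhere.
irrelevant = np.full(1024, -10.0)
weights = softmax(irrelevant)
print(weights.sum())   # sums to 1 (up to float error), smeared uniformly

# A modestly elevated logit at position 0 (a learned sink) absorbs
# most of the mass, regardless of how irrelevant everything else is.
with_sink = irrelevant.copy()
with_sink[0] = -2.0
sink_weights = softmax(with_sink)
print(sink_weights[0])
```

Note that only the relative gap between logits matters: shifting all logits by a constant leaves the distribution unchanged, so there is no way to signal "nothing is relevant" through uniformly low scores.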
Formally, let $\mathbf{q}_t, \mathbf{k}_i \in \mathbb{R}^{d_k}$ be the query at position $t$ and key at position $i$. The attention logit is:
$$e_{t,i} = \frac{\mathbf{q}_t \cdot \mathbf{k}_i}{\sqrt{d_k}}$$
If the model learns to set $\mathbf{k}_0$ such that $e_{t,0}$ is consistently high regardless of $\mathbf{q}_t$, position 0 becomes a universal sink. This is achievable by training $\mathbf{k}_0$ to align with the principal direction of the query distribution, which is precisely the pressure that emerges when no specific semantic target is present. Consistent with this, Xiao et al. (2023) found that pre-training with a dedicated learnable placeholder token prepended to every sequence concentrates the sink behavior on that single token, which can then be retained in the KV cache in place of multiple initial content tokens, preserving model quality during streaming inference.
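The alignment mechanism can be illustrated with a toy simulation: a key vector aligned with the mean query direction wins the logit competition against random content keys for essentially every query. All magnitudes below are arbitrary assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Queries clustered around a shared direction: a stand-in for the
# principal component of the query distribution.
mean_dir = rng.standard_normal(d_k)
mean_dir /= np.linalg.norm(mean_dir)
queries = mean_dir + 0.3 * rng.standard_normal((100, d_k))

# A "sink" key aligned with that direction vs. random content keys.
k_sink = 10.0 * mean_dir
content_keys = rng.standard_normal((16, d_k))

sink_logits = queries @ k_sink / np.sqrt(d_k)
content_logits = queries @ content_keys.T / np.sqrt(d_k)

# Fraction of (query, content-key) pairs in which the sink logit wins:
win_rate = np.mean(sink_logits[:, None] > content_logits)
print(win_rate)   # close to 1: the sink wins regardless of query content
```

The sink key needs no relationship to any query's semantic content; alignment with the bulk direction of the query distribution suffices.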
3.2 RoPE Geometry at Long Sequence Lengths
Rotary Position Embeddings encode position by rotating query and key vectors:
$$\mathbf{q}_m \leftarrow R_m \mathbf{q}, \quad \mathbf{k}_n \leftarrow R_n \mathbf{k}$$
where $R_m \in \mathbb{R}^{d \times d}$ is a block-diagonal rotation matrix with $2 \times 2$ blocks of the form:
$$R_m^{(j)} = \begin{pmatrix} \cos(m\theta_j) & -\sin(m\theta_j) \\ \sin(m\theta_j) & \cos(m\theta_j) \end{pmatrix}$$
and $\theta_j = 10000^{-2j/d}$ is the frequency for dimension pair $j$. The inner product between rotated query and key depends only on the relative position $m - n$:
$$\langle R_m \mathbf{q}, R_n \mathbf{k} \rangle = \langle R_{m-n} \mathbf{q}, \mathbf{k} \rangle$$
This relative-position property is desirable for generalization. However, for large relative positions $(m - n) \gg L_{\text{train}}$ (where $L_{\text{train}}$ is the training context length), the rotation angles $(m - n)\theta_j$ for low-frequency dimensions (large $j$, hence small $\theta_j$) enter regimes never seen during training. The resulting inner products are out-of-distribution: the dot product $\langle R_{m-n}\mathbf{q}, \mathbf{k}\rangle$ does not decay gracefully with distance but instead oscillates, producing unreliable attention scores.
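The relative-position identity can be checked numerically. The sketch below implements the $2 \times 2$-block rotation directly, using an interleaved-pair layout (real implementations differ in dimension layout but satisfy the same identity):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply the RoPE rotation R_pos to a vector x of even dimension d,
    rotating interleaved dimension pairs (2j, 2j+1) by angle pos * theta_j."""
    d = x.shape[-1]
    j = np.arange(d // 2)
    theta = base ** (-2.0 * j / d)   # theta_j = 10000^(-2j/d)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# <R_m q, R_n k> == <R_{m-n} q, k>: the score depends only on m - n.
m, n = 900, 250
lhs = rope_rotate(q, m) @ rope_rotate(k, n)
rhs = rope_rotate(q, m - n) @ k
print(abs(lhs - rhs))   # ~0

# At large relative distances the score oscillates rather than decaying:
scores = [rope_rotate(q, delta) @ k for delta in range(0, 8192, 512)]
print(min(scores), max(scores))
```

The final two lines show the point of Section 3.2: for a fixed query–key pair, the attention logit as a function of distance fluctuates instead of shrinking monotonically.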
Position interpolation (Chen et al., 2023) addresses this by rescaling positions as $m \rightarrow m \cdot L_{\text{train}} / L_{\text{target}}$, compressing the position range to values seen during training. The method reduces oscillation artifacts but requires a fine-tuning phase and does not eliminate the fundamental issue that the model’s semantic use of positional information was calibrated on shorter sequences.
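In terms of the rotation angles, position interpolation is a one-line change: scale the position index before computing angles. A sketch (the helper name and the lengths are illustrative):

```python
import numpy as np

def rope_angles(pos, d, base=10000.0, scale=1.0):
    """RoPE rotation angles for position `pos`, optionally rescaled
    (position interpolation uses scale = L_train / L_target)."""
    j = np.arange(d // 2)
    theta = base ** (-2.0 * j / d)
    return (pos * scale) * theta

d, L_train, L_target = 64, 4096, 16384
scale = L_train / L_target   # 0.25: compress 0..16384 into 0..4096

raw = rope_angles(12000, d)                  # angles for an unseen position
interp = rope_angles(12000, d, scale=scale)  # same angles as position 3000
print(np.allclose(interp, rope_angles(3000, d)))   # True
```

Every interpolated position maps onto an angle configuration the model has already seen, at the cost of finer-than-trained position granularity, which is why a fine-tuning phase is still required.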
3.3 The U-Shaped Performance Curve: A Geometric Account
Let a context of length $L$ contain a single relevant document at position $p$. Define effective attention weight $w(p)$ as the total attention probability mass directed to tokens in the document during generation. The lost-in-the-middle phenomenon corresponds to the empirical finding:
$$w(p) \approx \begin{cases} w_\text{high} & p \approx 0 \text{ (primacy)} \\ w_\text{high} & p \approx L \text{ (recency)} \\ w_\text{low} & p \approx L/2 \text{ (middle)} \end{cases}$$
The primacy advantage is partially explained by the attention sink mechanism: tokens near position 0 receive elevated baseline attention across all layers. The recency advantage reflects RoPE's short-range behavior: for a query at the current decoding position, the relative distance to recent tokens is small, keeping rotation angles within the well-calibrated training regime.
Critically, middle positions suffer from both effects simultaneously. They are too far from position 0 to benefit from sink proximity, and too far from the decoding position to benefit from low-distance RoPE stability. The softmax attention score for a middle token involves large-magnitude relative positions, which produce unreliable dot products, leading the model to de-emphasize those tokens in favor of more reliably scored sink or recent tokens.
This account predicts that the U-shape should deepen as context length increases (larger distances, more out-of-distribution positions) and should be alleviated by positional encoding schemes with better long-range properties (e.g., ALiBi), both of which are confirmed empirically.
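A toy model makes the prediction concrete. Assume, purely for illustration (the exponential forms and scale constants are assumptions, not fitted curves), that effective weight is a sink term decaying with distance from position 0 plus a recency term decaying with distance from the decoding position:

```python
import numpy as np

def w(p, L, sink_scale=0.05, recency_scale=0.05):
    """Toy effective attention weight at position p in a context of
    length L: sink-proximity term + recency (RoPE stability) term."""
    return np.exp(-sink_scale * p) + np.exp(-recency_scale * (L - p))

L = 200
positions = np.arange(L + 1)
curve = w(positions, L)

worst = positions[np.argmin(curve)]
print(worst)   # the minimum sits at the middle, L/2

# The U deepens with context length: the middle gets worse as L grows.
print(w(100, 200), w(200, 400))
```

Under these assumptions the minimum of $w(p)$ falls at $L/2$ and the middle-position weight decreases as $L$ grows, matching both empirical trends.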
3.4 Connection to Outlier Features
Dettmers et al. (2022) observed that a small set of embedding dimensions—typically fewer than 0.1% of all dimensions—exhibit magnitudes 100× larger than typical. These outlier features emerge around model scale thresholds (~6.7B parameters for autoregressive LLMs) and are hypothesized to encode token-level properties like position or syntactic role rather than semantic content.
A mechanistic connection to attention sinks follows: if $\mathbf{k}_0$ develops large-magnitude outlier components, the dot product $\mathbf{q}_t \cdot \mathbf{k}_0$ will be dominated by these outliers and will be consistently large regardless of the semantic content of $\mathbf{q}_t$. This provides a concrete implementation of the dump token mechanism at the representational level: the sink behavior is not an emergent accident but a structural consequence of the model learning to solve the softmax calibration problem using high-magnitude key dimensions.
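A small numerical sketch of the domination effect (the outlier index and the 100× magnitude are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d_k = 128
queries = rng.standard_normal((200, d_k))

# A key with one outlier dimension roughly 100x the typical magnitude.
k = rng.standard_normal(d_k)
k[7] = 100.0   # hypothetical outlier dimension

logits = queries @ k / np.sqrt(d_k)
outlier_term = queries[:, 7] * k[7] / np.sqrt(d_k)

# The single outlier coordinate essentially determines the full logit:
r = np.corrcoef(logits, outlier_term)[0, 1]
print(r)   # close to 1
```

One coordinate out of 128 accounts for nearly all the variance of the logit, which is also why such dimensions resist naive per-tensor quantization.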
4. Discussion
4.1 Implications for Long-Context Evaluation
The lost-in-the-middle phenomenon has immediate practical implications for benchmarking. Evaluations that always place the answer-bearing passage at the beginning of the context will systematically overestimate model performance on realistic retrieval tasks. Several benchmarks have been criticized on this basis: SCROLLS (Shaham et al., 2022), LONGBENCH (Bai et al., 2023), and NEEDLE-IN-A-HAYSTACK tests that use fixed insertion positions are all susceptible to position-based confounds.
Robust long-context evaluation requires averaging over insertion positions, testing the full U-curve, and reporting performance breakdowns by position band. The position-stratified evaluation protocol proposed by Liu et al. (2023) is a minimal standard that the field should adopt broadly.
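A minimal version of such a protocol can be sketched as follows. The `score_fn` interface here is hypothetical; any callable that runs the model on an ordered document list and returns a scalar score would do:

```python
def position_stratified_eval(score_fn, distractors, needle, n_positions=5):
    """Insert `needle` at several positions among `distractors`, score
    each arrangement with `score_fn(docs)`, and return the mean score
    plus the per-position breakdown."""
    L = len(distractors)
    step = max(1, L // (n_positions - 1)) if n_positions > 1 else L + 1
    positions = list(range(0, L + 1, step))[:n_positions]
    breakdown = {}
    for p in positions:
        docs = distractors[:p] + [needle] + distractors[p:]
        breakdown[p] = score_fn(docs)
    return sum(breakdown.values()) / len(breakdown), breakdown

# A dummy scorer that mimics a U-shaped model: perfect at the edges,
# weak in the middle.
def dummy_scorer(docs):
    p = docs.index("NEEDLE")
    return 1.0 if p in (0, len(docs) - 1) else 0.2

mean, breakdown = position_stratified_eval(dummy_scorer, ["d"] * 8, "NEEDLE")
print(mean, breakdown)
```

Reporting the full breakdown, not only the mean, is what exposes a U-curve that a single aggregate score would hide.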
4.2 Mitigation Strategies
StreamingLLM (Xiao et al., 2023) addresses attention sinks for inference efficiency: by retaining the KV cache for sink tokens plus a sliding window, streaming generation is possible without recomputation. However, it does not address the lost-in-the-middle problem—information outside the sliding window is simply dropped.
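The retention policy itself is simple. A sketch over cached token positions (parameter values are illustrative; Xiao et al. report that about four sink tokens suffice):

```python
def evict(cached_positions, n_sink=4, window=1024):
    """StreamingLLM-style retention: keep the first n_sink positions
    (attention sinks) plus the most recent `window` positions."""
    if len(cached_positions) <= n_sink + window:
        return list(cached_positions)
    return list(cached_positions[:n_sink]) + list(cached_positions[-window:])

kept = evict(list(range(5000)))
print(len(kept))   # 4 sinks + 1024 recent = 1028
print(kept[:5])    # [0, 1, 2, 3, 3976]: everything in between is dropped
```

The gap between position 3 and position 3976 is exactly the information that becomes unreachable, which is why this policy fixes streaming perplexity but not middle-context retrieval.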
Position interpolation and NTK-aware scaling improve out-of-distribution RoPE behavior and partially alleviate middle-context degradation, but require fine-tuning and do not eliminate the primacy/recency biases encoded in the model’s learned attention calibration.
Retrieval augmentation sidesteps long-context limitations by pre-selecting the most relevant documents and placing them at privileged positions (typically the beginning). This is effective but reintroduces retrieval latency and recall limitations.
Attention modification approaches include attention temperature scaling (raising the softmax temperature to spread mass more evenly), explicit distance-penalizing bias terms on the logits (ALiBi), and position-aware attention masking (Han et al., 2023). Each involves tradeoffs in training cost and task performance.
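Temperature scaling, for instance, is a one-line intervention at inference time: dividing the logits by a temperature $T > 1$ flattens the attention distribution (the logit values below are arbitrary, for illustration):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([4.0, 1.0, 0.5, 0.2, 0.1])
sharp = softmax(logits)        # T = 1: mass concentrates on the top key
flat = softmax(logits / 2.0)   # T = 2: mass spread more evenly
print(sharp.max(), flat.max())
```

The flattening is indiscriminate, however: it boosts middle tokens and noise tokens alike, which is the tradeoff the paragraph above refers to.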
Fine-tuning on long-context data with shuffled positions is perhaps the most principled approach: if training examples are constructed so that relevant information appears uniformly at random positions, the model is forced to learn position-invariant retrieval. Evidence from models like LongLoRA (Chen et al., 2023b) and YaRN (Peng et al., 2023) suggests this is effective, though it requires non-trivial data curation.
4.3 Broader Significance for Agentic and RAG Systems
The practical impact of these phenomena extends beyond benchmarking. In agentic AI systems that inject long tool outputs or conversation histories into context, the middle-context degradation directly translates to unreliable utilization of tool results and prior context. A model executing a multi-step reasoning plan may fail to reference critical intermediate outputs simply because they were generated many tokens ago and are now in the “lost” middle region of the context.
For RAG systems, the conventional strategy of concatenating multiple retrieved passages in decreasing relevance order may be counterproductive: the most relevant passage is placed at the beginning (primacy advantage), but if several passages are relevant and some are more relevant than others, the middle passages may be systematically underweighted. Reranking strategies that place highly relevant passages at position 0 or the final position, rather than strictly descending order, have shown empirical benefits consistent with this analysis.
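One such reordering (sometimes called a "sandwich" layout; the scheme below is one simple variant, not any specific system's implementation) alternates passages between the two edges so that the least relevant land in the middle:

```python
def edge_first_order(passages):
    """Reorder passages given in descending relevance so the most
    relevant sit at the context edges and the least relevant in the
    middle, where utilization is weakest."""
    front, back = [], []
    for i, p in enumerate(passages):
        (front if i % 2 == 0 else back).append(p)
    return front + back[::-1]

order = edge_first_order(["p1", "p2", "p3", "p4", "p5"])
print(order)   # ['p1', 'p3', 'p5', 'p4', 'p2']: p5 lands in the middle
```

The two strongest passages occupy the primacy and recency positions, spending the context's position-dependent attention budget on the passages most likely to matter.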
5. Conclusion
Attention sinks and the lost-in-the-middle phenomenon are not mere quirks of current models—they are structurally motivated by the combination of causal softmax constraints, rotary positional encoding geometry, and training distribution mismatch. The softmax’s requirement to produce a valid probability distribution over all attended positions creates persistent pressure to designate dump tokens, and the geometry of RoPE at large relative positions produces unreliable attention scores that further disadvantage middle-context tokens.
Addressing these phenomena requires interventions at multiple levels: positional encoding design, fine-tuning data construction, evaluation protocol standards, and inference-time attention manipulation. None of the current mitigations is fully satisfying, and the fundamental tension between the softmax’s normalization constraint and the need for position-invariant information access remains unresolved.
The broader lesson is that context length, as currently advertised by model vendors, measures a necessary but not sufficient condition for long-context competence. A model with a 1M token context window that exhibits severe middle-context degradation is, for practical retrieval purposes, more accurately characterized by its effective utilization window—which may be considerably shorter. Developing metrics and training methods that close the gap between nominal and effective context length is one of the central open problems in the engineering and science of large language models.
References
- Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157–173.
- Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2023). Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
- Press, O., Smith, N. A., & Lewis, M. (2022). Train short, test long: Attention with linear biases enables input length extrapolation. International Conference on Learning Representations (ICLR 2022).
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2024). RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568, 127063.
- Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems (NeurIPS 2022).
- Han, C., Wang, Q., Xiong, W., Chen, Y., Ji, H., & Wang, S. (2023). LM-Infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137.
- Chen, S., Wong, S., Chen, L., & Tian, Y. (2023). Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
- Chen, Y., Qian, S., Tang, H., Lai, X., Liu, Z., Han, S., & Jia, J. (2023b). LongLoRA: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307.
- Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2023). YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
- Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., & Li, J. (2023). LongBench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
- Shaham, U., Segal, E., Ivgi, M., Efrat, A., Yoran, O., Haviv, A., Gupta, A., Xiong, W., Geva, M., Berant, J., & Levy, O. (2022). SCROLLS: Standardized comparison over long language sequences. Proceedings of EMNLP 2022.