Abstract
In-context learning (ICL) — the capacity of large language models to adapt to new tasks given only a handful of demonstrations in the prompt — has emerged as one of the most practically consequential and theoretically puzzling phenomena in modern NLP. Despite its empirical ubiquity, a rigorous mechanistic account of why it works, and under what conditions it generalizes, remains elusive. One influential line of reasoning posits that ICL implements a form of implicit Bayesian inference: the model maintains an implicit posterior over latent task hypotheses and updates this posterior as input-output demonstrations are processed. This paper surveys the theoretical foundations of this Bayesian interpretation, examines the empirical evidence for and against it, explores competing mechanistic accounts grounded in gradient descent dynamics, and identifies the open problems that must be resolved before any unified theory can be claimed. We argue that the Bayesian framing, while incomplete, provides the most coherent organizing framework for current evidence and points toward productive directions for future interpretability research.
1. Introduction
The standard paradigm for adapting a machine learning model to a new task requires gradient updates — typically thousands of them — guided by labeled examples. In-context learning (ICL) breaks with this paradigm entirely. A large language model presented with a prompt of the form $\{(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k), x_{\text{query}}\}$ will, with remarkable reliability, output a response consistent with the task structure implied by the demonstrations, without any modification to its parameters.
This capability was first strikingly demonstrated at scale with GPT-3 (Brown et al., 2020), though precursors were observed in smaller models. Since then, ICL has become a standard tool for practitioners and a major focus of theoretical investigation. The practical appeal is obvious: zero parameter updates, arbitrary task switching at inference time, low engineering overhead. The theoretical puzzle is profound: what computational process underlies this behavior, and why does it scale so favorably with model size?
Among the proposed explanations, the Bayesian interpretation due to Xie et al. (2022) has been particularly influential. Their argument is this: during pretraining, the model learns a prior over a latent concept space; at inference time, each in-context demonstration constitutes evidence that updates an implicit posterior; predictions are then made by marginalizing over this posterior. Crucially, this framing does not require that the model explicitly represent probability distributions — it only requires that the model’s behavior is consistent with such computations.
This paper examines the Bayesian ICL hypothesis with care. We organize our discussion around three questions: (1) What does the Bayesian account actually predict, formally? (2) What empirical evidence supports or falsifies these predictions? (3) What mechanistic evidence from interpretability research bears on the question?
2. Related Work
ICL as Bayesian inference. Xie et al. (2022) provide the foundational theoretical treatment, modeling ICL under a latent concept model in which a document is generated by first sampling a concept $c$ from a prior $p(c)$ and then sampling tokens conditioned on that concept. They show that, under this generative model, next-token prediction on a prompt of concatenated demonstrations converges to the Bayes-optimal predictor for the latent concept.
ICL as gradient descent in the forward pass. Akyürek et al. (2022) and von Oswald et al. (2023) argue that transformer attention layers can implement a form of in-context gradient descent. They show that a single linear attention layer can exactly implement one step of gradient descent on a least-squares objective, and that multi-layer transformers trained on regression problems learn to implicitly run multiple gradient steps.
Induction heads and algorithmic primitives. Olsson et al. (2022) identify induction heads — specific attention head circuits that implement prefix matching — as the mechanistic basis of ICL in small transformers. Their analysis is grounded in concrete circuit analysis rather than abstract functional equivalence arguments.
ICL is label-format sensitive, not label-content sensitive. Min et al. (2022) show experimentally that random label assignment in demonstrations often has little effect on ICL performance, suggesting the model may be learning the format or task structure rather than the ground-truth input-output mapping — a finding that sits uneasily with the strong Bayesian account.
Task retrieval vs. task learning. Pan et al. (2023) introduce a taxonomy distinguishing ICL variants where demonstrations activate a previously learned task representation (retrieval) from variants where the model genuinely learns a novel function from the examples (learning). They argue most practical ICL is closer to retrieval.
3. Technical Analysis
3.1 The Latent Concept Model
We follow Xie et al. (2022) in formalizing ICL under a generative model. Suppose documents in the pretraining corpus are generated by:
$$c \sim p(c), \quad x_1, x_2, \ldots, x_T \sim p(\cdot \mid c)$$
where $c$ is a latent concept drawn from some prior, and tokens are i.i.d. given the concept. A model trained on next-token prediction learns to approximate:
$$p(x_T \mid x_1, \ldots, x_{T-1}) = \sum_c p(x_T \mid c) \cdot p(c \mid x_1, \ldots, x_{T-1})$$
This is exactly Bayesian posterior prediction: the model applies Bayes’ theorem to obtain the concept posterior $p(c \mid x_{1:T-1})$ and marginalizes over it. When the prefix contains ICL demonstrations, evaluating this sum corresponds to posterior-weighted prediction over task hypotheses.
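This marginalization is easy to make concrete in a toy setting. The sketch below (plain NumPy; the three-concept, five-token vocabulary and the prior are illustrative choices, not anything from the paper) computes the posterior-weighted predictive distribution exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent-concept model (illustrative numbers): 3 concepts, 5-token vocabulary.
# Each concept c defines a categorical distribution p(x | c) over tokens.
prior = np.array([0.5, 0.3, 0.2])                # p(c)
p_x_given_c = rng.dirichlet(np.ones(5), size=3)  # p(x | c), one row per concept

def posterior(prefix):
    """p(c | x_{1:t-1}) by Bayes' rule, treating tokens as i.i.d. given c."""
    log_post = np.log(prior) + np.log(p_x_given_c[:, prefix]).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

def predictive(prefix):
    """p(x_t | x_{1:t-1}) = sum_c p(x_t | c) * p(c | x_{1:t-1})."""
    return posterior(prefix) @ p_x_given_c

print(predictive([0, 2, 2, 4]))  # a length-5 distribution that sums to 1
```

Note that `predictive` never represents the posterior explicitly to the caller; it only needs to behave as if it did, which is precisely the sense in which the Bayesian account is a functional claim.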
The elegance of this derivation depends critically on the i.i.d. assumption within documents. If within-document examples are drawn from a consistent latent concept, the posterior concentrates quickly as $k$ (the number of demonstrations) increases, yielding:
$$p(c \mid x_{1:k}) \propto p(c) \prod_{i=1}^{k} p(x_i \mid c)$$
Under mild identifiability conditions, this posterior converges to a point mass on the true concept as $k \to \infty$, predicting that ICL performance improves monotonically with the number of demonstrations — a prediction broadly, though not uniformly, confirmed empirically.
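The concentration behavior is straightforward to simulate. Under hypothetical, well-separated concepts (random categorical distributions; all numbers below are illustrative), the posterior mass on the true concept approaches 1 as the number of demonstrations $k$ grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: 4 candidate concepts, each a categorical over 6 symbols.
concepts = rng.dirichlet(np.ones(6), size=4)  # p(x | c)
prior = np.full(4, 0.25)
true_c = 0

def posterior_mass_on_true(k):
    """Posterior p(c = true_c | k demonstrations sampled from the true concept)."""
    demos = rng.choice(6, size=k, p=concepts[true_c])
    log_post = np.log(prior) + np.log(concepts[:, demos]).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    return (post / post.sum())[true_c]

for k in (1, 5, 20, 80):
    print(k, posterior_mass_on_true(k))
# Mass on the true concept trends toward 1 as k grows (sampling noise aside),
# mirroring the monotone-improvement prediction discussed in the text.
```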
3.2 Token-Level vs. Task-Level Posteriors
A subtlety in the Bayesian formulation concerns the granularity at which the posterior is defined. The model operates on tokens; concepts are not directly observed. Two distinct Bayesian accounts are possible:
Token-level: The model computes $p(x_t \mid x_{<t})$, which can always be written as an implicit marginalization over latent variables. This is simply next-token prediction and is trivially true of any language model.
Task-level: The model explicitly represents, in some functional sense, a posterior over a structured concept space $\mathcal{C}$ that corresponds to tasks or functions. This is the stronger and more interesting claim.
The distinction matters because the token-level account provides no predictive purchase — it is true by definition. The task-level account makes testable predictions: model activations should encode something like a posterior over tasks, this representation should update with each demonstration, and it should generalize coherently to new examples from the inferred task.
3.3 The Gradient Descent Equivalence
Akyürek et al. (2022) and von Oswald et al. (2023) demonstrate that linear attention with a specific parameterization is equivalent to gradient descent on a linear regression problem. Standard attention computes:
$$\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V$$
In the linear (softmax-free) limit, with $Q = W_Q X$, $K = W_K X$, $V = W_V X$, a single layer can implement one step of gradient descent on a least-squares objective, and Akyürek et al. (2022) show that transformers trained on in-context linear regression make predictions matching the ridge regression estimator:
$$\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top Y$$
This correspondence is mathematically rigorous for linear transformers on linear regression tasks. The question is whether it provides a faithful mechanistic account of ICL in real transformers. Real transformers use softmax attention, multi-head attention, and highly nonlinear MLP layers, none of which are captured by the linear analysis. Moreover, Bai et al. (2023) show that transformers trained on non-linear regression tasks implement algorithms more complex than gradient descent — sometimes resembling Newton’s method.
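The linear-regression side of the correspondence can at least be checked numerically: iterated gradient descent on the least-squares objective (the computation the von Oswald et al. construction assigns to stacked linear attention layers, one step per layer) converges to the same estimate as the closed-form ridge solution with small $\lambda$. The data, learning rate, and step count below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.01 * rng.normal(size=n)

# Closed-form ridge estimator (small lambda, so essentially OLS):
lam = 1e-6
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Iterated gradient descent on the least-squares objective, one step per
# "layer" in the stacked-linear-attention picture:
beta = np.zeros(d)
lr = 0.01
for _ in range(5000):
    beta -= lr * (X.T @ (X @ beta - Y)) / n

print(np.abs(beta - beta_ridge).max())  # the two estimates nearly coincide
```

What this sketch cannot show, of course, is whether real softmax transformers perform anything like the loop above; that is exactly the open question the surrounding text raises.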
3.4 Bayesian vs. Gradient Descent: Reconciliation
The Bayesian and gradient descent accounts need not compete — they may be complementary descriptions at different levels of abstraction. The gradient descent account is mechanistic: it describes the computational primitive. The Bayesian account is functional: it describes the information-theoretic goal. A transformer could implement gradient descent as its primitive computational operation while that operation serves the purpose of posterior updating in a Bayesian sense.
If the pretraining task is Bayes-optimal next-token prediction, and the model achieves this via gradient-descent-like computation in the forward pass, then both accounts are simultaneously true at their respective levels of description. A version of this reconciliation is realized by Müller et al. (2022) in prior-data fitted networks (PFNs), which train transformer-like models to directly implement Bayesian inference on arbitrary priors.
3.5 Sensitivity to Demonstration Quality
A key prediction of the Bayesian account is that ICL performance should improve as demonstrations become more informative about the latent concept. Formally, if demonstrations $(x_i, y_i)$ have higher likelihood under the true concept $c^*$ than under alternatives, the posterior concentrates faster. Operationally, this predicts that:
- Demonstrations from the correct task distribution should outperform random ones
- More diverse demonstrations should yield better generalization
- Demonstrations with correct labels should outperform demonstrations with random labels
The first two predictions are robustly confirmed. The third is not — Min et al. (2022) find that random label assignments often have minimal impact on downstream performance. This is arguably the most serious empirical challenge to the strong Bayesian account.
However, Wei et al. (2023) show that this finding is model-size dependent: sufficiently large models (roughly 100B+ parameters) can override their semantic priors and track the in-context input-label mapping, including deliberately flipped labels. This suggests the weak Bayesian account (task structure recovery) may dominate in smaller models, while the strong account (genuine posterior updating over input-output mappings) requires scale.
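The concentration-speed prediction at the start of this subsection can be illustrated with two hypothetical binary concepts: when each demonstration is highly diagnostic of the true concept, far fewer demonstrations are needed for the posterior (uniform prior assumed) to saturate. The probabilities below are invented for illustration:

```python
import numpy as np

def mass_on_true(p_true, p_alt, k):
    """Posterior mass on the true concept after k best-case demonstrations,
    where the true concept assigns probability p_true to each demonstration
    and the single alternative assigns p_alt (uniform prior over the two)."""
    log_ratio = k * (np.log(p_true) - np.log(p_alt))
    return 1.0 / (1.0 + np.exp(-log_ratio))

for k in (1, 2, 4, 8):
    sharp = mass_on_true(0.9, 0.1, k)    # highly diagnostic demonstrations
    weak = mass_on_true(0.55, 0.45, k)   # barely diagnostic demonstrations
    print(k, round(sharp, 4), round(weak, 4))
# The diagnostic setting saturates within a few demonstrations; the weak
# setting concentrates far more slowly.
```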
4. Discussion
4.1 What the Bayesian Account Gets Right
The Bayesian framing correctly predicts several broad strokes of ICL behavior. Performance improves with the number of demonstrations, consistent with posterior concentration. Demonstrations representative of the target distribution help more. ICL generalizes across semantically similar task formats, consistent with a shared latent concept space. And crucially, the pretraining distribution shapes ICL capability in ways consistent with a learned prior: models trained on more diverse tasks show better few-shot transfer (Sanh et al., 2022).
The Bayesian account also provides a principled explanation for why scale matters. A model that has compressed more of the world’s task distribution into its weights has a richer prior over concepts, allowing more accurate posterior inference from fewer demonstrations. This aligns with the empirical observation that ICL capability scales more steeply with model size than base language modeling performance.
4.2 Where the Bayesian Account Struggles
Several phenomena are difficult to accommodate within a strict Bayesian framing. First, ICL is highly sensitive to demonstration format in ways that have no Bayesian explanation — specific separator tokens, capitalization patterns, or prompt templates can shift performance dramatically without changing semantic content (Lu et al., 2022). A Bayesian posterior over latent concepts should be insensitive to such surface-level formatting.
Second, ICL exhibits ordering effects: the order in which demonstrations are presented affects predictions in ways inconsistent with a fully Bayesian agent (Zhao et al., 2021). If demonstrations are i.i.d. given the concept, the likelihood is invariant to permutation, so order should not matter.
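In a toy latent-concept model this invariance is exact: permuting the demonstrations leaves the posterior unchanged, so any ordering effect in a real model is evidence against exact exchangeable Bayesian inference. A minimal check, with hypothetical concepts:

```python
import numpy as np

rng = np.random.default_rng(3)
concepts = rng.dirichlet(np.ones(4), size=3)  # p(x | c) for 3 hypothetical concepts
prior = np.full(3, 1 / 3)

def posterior(demos):
    """p(c | demos) under the i.i.d.-given-concept assumption."""
    log_post = np.log(prior) + np.log(concepts[:, demos]).sum(axis=1)
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

demos = [0, 3, 1, 1, 2]
shuffled = [1, 2, 0, 1, 3]  # same multiset of demonstrations, different order
print(np.allclose(posterior(demos), posterior(shuffled)))  # True
```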
Third, the Bayesian account implies that the model maintains a compressed representation of the posterior in its activations. Attempts to directly probe for this representation have yielded mixed results — some evidence for task-relevant information in residual stream activations, but nothing that cleanly resembles a probability distribution over concepts (Todd et al., 2023).
4.3 Implications for Interpretability
The debate over ICL mechanisms has direct implications for mechanistic interpretability. If the Bayesian account is correct, we should expect circuits whose function is recognizable as posterior updating — circuits that aggregate evidence, represent uncertainty, and integrate prior knowledge. If the gradient descent account is more accurate, we should find circuits that implement iterative optimization steps.
Current interpretability work is not yet at the resolution needed to adjudicate this question in large transformers. Induction heads (Olsson et al., 2022) are suggestive of pattern matching rather than Bayesian inference, but cannot alone explain complex task-level generalization. A more complete mechanistic account will require analysis of multi-layer circuits spanning both attention and MLP components.
4.4 Practical Implications
Understanding ICL mechanisms has direct practical consequences. If ICL is primarily task retrieval, the limiting factor is coverage of the pretraining distribution, and improvement requires richer pretraining data rather than architectural changes. If ICL implements genuine in-context optimization, architecture — particularly depth and width of attention layers — may be the primary lever.
For prompt engineering, the Bayesian account suggests demonstrations should be selected to maximize informativeness about the latent task — diverse, representative, and clearly labeled. The empirical evidence supports diversity and representativeness; the evidence on label quality is more nuanced and depends on model scale.
5. Conclusion
In-context learning is among the most surprising capabilities of large language models, and its theoretical explanation remains genuinely unsettled. The Bayesian inference account — in which ICL corresponds to implicit posterior inference over a latent concept space learned during pretraining — provides the most coherent high-level framework currently available. It correctly predicts several broad patterns: monotonic improvement with demonstration count, sensitivity to task distribution coverage, and the dependency of ICL capability on pretraining diversity.
At the same time, important empirical phenomena resist easy accommodation: format sensitivity, ordering effects, and the relative insensitivity to label correctness at moderate model scales. The competing mechanistic account — in which attention implements implicit gradient descent — is mathematically rigorous for simplified settings but unclear in its extension to real transformers.
The most likely resolution is a multi-level account in which gradient descent provides the primitive computational mechanism, and Bayesian inference provides the appropriate functional description of what goal that mechanism serves. Establishing this correspondence rigorously, and identifying the circuits that implement it in concrete transformers, is among the most important open problems in mechanistic interpretability.
The path forward requires three things: more precise formal models of the latent concept structure in real pretraining corpora; more sophisticated circuit analysis tools capable of analyzing multi-layer circuits in large transformers; and carefully designed behavioral experiments that distinguish competing mechanistic accounts. The prize is a principled theory of in-context learning — one that would not only explain why ICL works, but guide the design of models and training curricula that make it work better.
References
- Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., & Zhou, D. (2022). What learning algorithm is in-context learning? Investigations with linear models. ICLR 2023. arXiv:2211.15661.
- Bai, Y., Chen, F., Wang, H., Xiong, C., & Mei, S. (2023). Transformers as statisticians: Provable in-context learning with in-context algorithm selection. NeurIPS 2023. arXiv:2306.04637.
- Brown, T., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. NeurIPS 2020. arXiv:2005.14165.
- Lu, Y., Bartolo, M., Moore, A., Riedel, S., & Stenetorp, P. (2022). Fantastically ordered prompts and where to find them. ACL 2022. arXiv:2104.08786.
- Min, S., Lyu, X., Holtzman, A., et al. (2022). Rethinking the role of demonstrations: What makes in-context learning work? EMNLP 2022. arXiv:2202.12837.
- Müller, S., Hollmann, N., Arango, S. P., Grabocka, J., & Hutter, F. (2022). Transformers can do Bayesian inference. ICLR 2022. arXiv:2112.10510.
- Olsson, C., Elhage, N., Nanda, N., et al. (2022). In-context learning and induction heads. Transformer Circuits Thread. arXiv:2209.11895.
- Pan, X., Mao, D., Lv, Z., et al. (2023). What in-context learning learns in-context: Disentangling task recognition and task learning. ACL Findings 2023. arXiv:2305.09731.
- Sanh, V., Webson, A., Raffel, C., et al. (2022). Multitask prompted training enables zero-shot task generalization. ICLR 2022. arXiv:2110.08207.
- Todd, E., Li, M. L., Sharma, A. S., Mueller, A., Wallace, B. C., & Bau, D. (2023). Function vectors in large language models. ICLR 2024. arXiv:2310.15213.
- von Oswald, J., Niklasson, E., Randazzo, E., et al. (2023). Transformers learn in-context by gradient descent. ICML 2023. arXiv:2212.07677.
- Wei, J., Wei, J., Tay, Y., et al. (2023). Larger language models do in-context learning differently. arXiv:2303.03846.
- Xie, S. M., Raghunathan, A., Liang, P., & Ma, T. (2022). An explanation of in-context learning as implicit Bayesian inference. ICLR 2022. arXiv:2111.02080.
- Zhao, Z., Wallace, E., Feng, S., Klein, D., & Singh, S. (2021). Calibrate before use: Improving few-shot performance of language models. ICML 2021. arXiv:2102.09690.