Scaling Laws and Emergent Capabilities in Large Language Models: Mechanisms, Predictions, and the Phase Transition Hypothesis

Abstract

Scaling laws describe how the performance of neural language models varies predictably with compute, parameter count, and dataset size. The empirical regularities discovered by Kaplan et al. (2020) and refined by Hoffmann et al. (2022) have fundamentally reshaped how large language models are trained and how resources are allocated to them. Yet scaling alone does not tell the whole story: as models grow, qualitatively new capabilities appear abruptly — a phenomenon termed emergence. This paper reviews the theoretical foundations of neural scaling laws, examines the empirical evidence for emergent capabilities, critically interrogates whether emergence reflects genuine phase transitions or is an artifact of evaluation methodology, and discusses the implications for model development, safety, and capability forecasting. We argue that the binary framing of emergence as “present” or “absent” obscures a richer picture of continuous capability growth with nonlinear evaluation surfaces.

1. Introduction

One of the most consequential empirical discoveries in deep learning over the past decade is that model performance does not improve in fits and starts — it improves predictably. Given sufficient data about how a model behaves at smaller scales, one can extrapolate, with surprising accuracy, how a larger model will behave. This predictability is encoded in scaling laws: power-law relationships between loss and the three primary resources of language model training — model parameters $N$, training tokens $D$, and total floating-point operations $C$.

The practical impact has been enormous. Rather than running expensive ablations at full scale, practitioners can run experiments at reduced scale, fit power laws, and extrapolate to resource-optimal configurations. The “Chinchilla” result (Hoffmann et al., 2022) — that most large models at the time were significantly undertrained relative to their parameter counts — was arrived at precisely through this kind of scaling analysis.
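This small-scale-to-large-scale workflow can be sketched in a few lines: fit a power law to (parameters, loss) measurements by linear regression in log-log space, then extrapolate. The functional form follows Kaplan et al.; the data below are synthetic and purely illustrative, not measurements from any real model.

```python
import math

def fit_power_law(Ns, losses):
    """Fit L(N) = (Nc / N)**alpha by ordinary least squares in log-log
    space: log L = alpha*log(Nc) - alpha*log(N). Returns (alpha, Nc)."""
    xs = [math.log(n) for n in Ns]
    ys = [math.log(l) for l in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    alpha = -slope                             # loss falls as N grows
    nc = math.exp((my - slope * mx) / alpha)   # intercept = alpha*log(Nc)
    return alpha, nc

# Synthetic "small-scale" measurements drawn exactly from the law
Ns = [1e7, 1e8, 1e9, 1e10]
losses = [(8.8e13 / n) ** 0.076 for n in Ns]
alpha, nc = fit_power_law(Ns, losses)
print(alpha, nc)  # recovers alpha = 0.076, Nc = 8.8e13 (up to float error)
```

With real measurements the fit is noisy and the extrapolation carries uncertainty, which is exactly why the emergence debate in later sections matters.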

But scaling laws have a shadow: emergence. Wei et al. (2022b) documented a set of capabilities that appeared to be absent in smaller models and suddenly present in larger ones — not a gradual improvement, but what looked like a sharp phase transition. This observation has profound implications. If emergence is real, scaling laws may be insufficient for capability prediction: a model could be harmless at one scale and dangerous at the next, with no warning. Conversely, if emergence is an artifact, our evaluation frameworks may be systematically misleading us.

This paper is structured as follows. Section 2 reviews the literature on scaling laws and emergent capabilities. Section 3 provides a technical analysis of the power-law framework and the statistical arguments for and against phase transitions. Section 4 discusses implications for capability forecasting, safety, and benchmark design. Section 5 concludes.

2. Related Work

Kaplan et al. (2020) — “Scaling Laws for Neural Language Models” — established the foundational empirical regularities: test loss scales as a power law in $N$, $D$, and $C$, with exponents that appear stable across orders of magnitude. Crucially, they found that model size, not depth-to-width ratio, was the dominant factor, and that data requirements scale roughly as $D \propto N^{0.74}$.

Hoffmann et al. (2022) — “Training Compute-Optimal Large Language Models” (the Chinchilla paper) — revisited these scaling laws with a tighter experimental design and concluded that earlier large models (GPT-3, Gopher) were parameter-rich but data-poor relative to the compute-optimal frontier. Their revised laws imply that $N$ and $D$ should be scaled in equal proportion as compute grows.

Wei et al. (2022b) — “Emergent Abilities of Large Language Models” — surveyed a broad set of tasks and identified over 100 capabilities that appeared to emerge discontinuously as a function of scale. Tasks included multi-step arithmetic, chain-of-thought reasoning when prompted, and word-in-context disambiguation. The paper anchored the modern discourse on emergence.

Schaeffer et al. (2023) — “Are Emergent Abilities of Large Language Models a Mirage?” — challenged the emergence narrative by arguing that apparent phase transitions are an artifact of nonlinear or discontinuous evaluation metrics. When tasks are evaluated with smooth metrics (e.g., token-level cross-entropy rather than exact-match accuracy), emergent capabilities often resolve into smooth curves consistent with standard scaling laws.

Ganguli et al. (2022) — “Predictability and Surprise in Large Generative Models” — provided a more nuanced empirical picture, showing that some capabilities are genuinely unpredictable from smaller-scale experiments while others follow smooth extrapolation. They introduced the concept of “U-shaped” performance curves, where models first get worse at a task with scale before improving — a pattern not captured by monotone power laws.

Srivastava et al. (2022) — “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models” (BIG-Bench) — provided the largest-scale systematic evaluation of emergent capabilities across 204 tasks, with models ranging from millions to hundreds of billions of parameters. Their results show highly variable scaling behavior: some tasks improve smoothly, some flatline, and some show inflection points.

3. Technical Analysis

3.1 The Power-Law Framework

The basic scaling law for language model loss $L$ as a function of parameters $N$ takes the form:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

where $N_c$ is a characteristic parameter count and $\alpha_N \approx 0.076$ (Kaplan et al., 2020). An analogous law holds for data:

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$

with $\alpha_D \approx 0.095$. When both $N$ and $D$ are varied simultaneously under a fixed compute budget $C \approx 6ND$, the compute-optimal allocation satisfies:

$$N^* \propto C^{0.5}, \quad D^* \propto C^{0.5}$$

under Hoffmann et al.’s revised estimates. This suggests that parameter count and data should scale at equal rates — a significant departure from the earlier Kaplan et al. prescription, which favored scaling $N$ much faster than $D$ as compute grows.
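Combining the approximation $C \approx 6ND$ with the widely quoted rule of thumb of roughly 20 training tokens per parameter (a simplification of Hoffmann et al.'s fitted constants, used here only for illustration), the compute-optimal split can be computed directly:

```python
import math

def compute_optimal_split(C, tokens_per_param=20.0):
    """Given a compute budget C in FLOPs, return (N, D) satisfying
    C = 6*N*D and D = tokens_per_param * N. The 20-tokens-per-parameter
    ratio is a rough heuristic, not Hoffmann et al.'s exact fit."""
    N = math.sqrt(C / (6.0 * tokens_per_param))
    return N, tokens_per_param * N

# Chinchilla itself: 70B parameters, 1.4T tokens => C ~= 5.9e23 FLOPs
N, D = compute_optimal_split(5.88e23)
print(f"N ~= {N:.2e} params, D ~= {D:.2e} tokens")
```

Note that both outputs scale as $C^{0.5}$, matching the equal-exponent allocation above: quadrupling the budget doubles both $N^*$ and $D^*$.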

3.2 What Does Emergence Mean Formally?

Let $f(N)$ denote model performance on some task as a function of scale. Emergence, in the informal sense of Wei et al. (2022b), means that there exists a threshold $N^*$ such that:

$$f(N) \approx 0 \quad \forall N < N^*, \qquad f(N) \gg 0 \quad \forall N > N^*$$

This is precisely the signature of a phase transition in statistical physics — a discontinuous or near-discontinuous change in a macroscopic observable at a critical point. The analogy is seductive. But it raises an immediate question: is the discontinuity real, or a measurement artifact?

3.3 The Metric Dependence of Emergence

Schaeffer et al. (2023) make the following observation. Suppose true model capability $c(N)$ grows smoothly with scale — say, $c(N) = aN^\beta$ for some $\beta > 0$. Now suppose we evaluate with a threshold metric:

$$\text{acc}(N) = \mathbb{1}[c(N) > \theta]$$

This metric is zero for all $N$ below the threshold and one above. It will appear to “emerge” discontinuously even if $c(N)$ is smooth. The same argument applies to exact-match accuracy on multi-step tasks: a model that gets 9 out of 10 steps right still scores 0 on the overall task. Because exact-match accuracy behaves roughly like per-step accuracy raised to the power of the number of steps, a smooth rise in step-level accuracy produces a steep, late rise in the overall score, which looks like a jump even though nothing discontinuous has happened.
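This compounding effect is easy to simulate. In the sketch below, per-step accuracy is a hypothetical smooth function of scale (the curve and its constants are invented for illustration, not fitted to any real model), yet the exact-match score on a 5-step task looks emergent:

```python
def step_acc(N, N0=1e10):
    """Hypothetical smooth per-step accuracy vs. scale N: a sigmoid in
    log-scale crossing 0.5 at N0. Constants are illustrative only."""
    return N / (N + N0)

def exact_match(N, k=5, N0=1e10):
    """Exact-match on a k-step task: every step must be right, so the
    smooth p(N) becomes the much sharper p(N)**k."""
    return step_acc(N, N0) ** k

for N in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"N={N:.0e}  step_acc={step_acc(N):.3f}  "
          f"exact_match={exact_match(N):.6f}")
```

The step-level metric climbs gradually over four orders of magnitude of scale, while the exact-match metric stays near zero and then rises abruptly, reproducing the Schaeffer et al. argument in miniature.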

This is not merely a theoretical concern. Schaeffer et al. show empirically that several tasks listed by Wei et al. as emergent produce smooth scaling curves when re-evaluated with continuous metrics. The key insight is that emergence is a property of the (task, metric) pair, not of the model alone.

3.4 Where Emergence May Be Real

The Schaeffer et al. critique does not fully deflate the emergence phenomenon. There are at least three cases where sharp transitions resist the metric-artifact explanation:

  1. Compositional generalization: Some tasks require combining learned sub-skills in novel ways. If each sub-skill requires a minimum representational capacity, and their composition requires all of them simultaneously, there may be a genuine threshold effect — the model must pass all sub-skill thresholds before the composed capability becomes functional.
  2. Chain-of-thought prompting: Wei et al. (2022a) showed that chain-of-thought prompting improves performance on reasoning tasks only above approximately 100B parameters. This threshold is robust to metric choice because it manifests on tasks where continuous partial credit is available. The question is why CoT helps large models but not small ones.
  3. U-shaped scaling curves: Ganguli et al. (2022) document tasks where intermediate-scale models perform worse than small models, with performance recovering at large scale. This non-monotone behavior cannot be explained by smooth power-law extrapolation and suggests genuinely discontinuous dynamics in the loss landscape.
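The first case above can be made concrete with a toy model: give each sub-skill a smooth proficiency curve with its own capacity threshold, and require all of them simultaneously for the composed task. The thresholds and curve shapes below are invented for illustration; the point is only that a product of smooth curves transitions later and more sharply than any single factor.

```python
def subskill(N, N_th):
    """Hypothetical sub-skill proficiency: a smooth curve in scale N
    that crosses 0.5 at its capacity threshold N_th (illustrative)."""
    return 1.0 / (1.0 + (N_th / N) ** 2)

def composed(N, thresholds=(1e9, 5e9, 2e10)):
    """A composed capability needs every sub-skill at once, so its
    curve is the product: later and sharper than any single sub-skill."""
    p = 1.0
    for t in thresholds:
        p *= subskill(N, t)
    return p

for N in (1e8, 1e9, 1e10, 1e11):
    print(f"N={N:.0e}  composed={composed(N):.4f}")
```

Because the product is bounded above by its weakest factor, the composed capability stays near zero until the last sub-skill threshold is passed, a genuine threshold effect that survives even under continuous metrics.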

3.5 Mechanistic Hypotheses

Why might capabilities emerge? Several mechanistic hypotheses have been advanced:

Superposition and polysemanticity (Elhage et al., 2022): In models with insufficient capacity, individual neurons must represent multiple features simultaneously. As capacity grows, the model can afford to dedicate neurons to cleaner, more compositional representations — enabling new capabilities that depend on precise feature separation.

Grokking (Power et al., 2022): Neural networks can exhibit a sharp transition from memorization to generalization long after training loss has converged. This suggests that the loss metric itself can be misleading — a model may appear to have learned something without having generalized, and generalization may arrive suddenly. If grokking occurs more readily in larger models, it could produce emergent generalization at scale.

Phase transitions in the loss landscape: There is theoretical work (e.g., from statistical physics of disordered systems) suggesting that high-dimensional optimization landscapes can exhibit sharp transitions as a function of parameter count or training time. Whether these theoretical transitions correspond to the empirical emergence observations remains an open research question.

3.6 Forecasting and the Extrapolation Problem

For practitioners, the stakes of this debate are high. If emergence is real and phase transitions occur at specific scales, then:

  1. Small-scale experiments may radically underestimate capabilities at deployment scale.
  2. Safety evaluations conducted on smaller models or intermediate checkpoints may miss capabilities that appear only in larger deployed models.
  3. The compute-optimal frontier computed from scaling laws may not account for discrete jumps in task performance near critical scales.

Formally, if we write the compute-optimal loss as $L^*(C)$, smooth scaling implies:

$$L^*(C) = \left(\frac{C_0}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.05$$

But if there are emergent capabilities $E_1, E_2, \ldots$ that appear at discrete compute thresholds $C_1 < C_2 < \ldots$, then aggregate task performance is not well described by any single power law — it is a piecewise function with discontinuities at each $C_i$. Forecasting in this regime requires identifying the locations of the discontinuities in advance, which is precisely what small-scale experiments cannot do reliably.
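A minimal sketch of such a piecewise regime makes the fitting problem vivid. The smooth component and the jump thresholds below are invented for illustration; the point is that no single power law can describe a curve with unit discontinuities.

```python
def aggregate_score(C, thresholds=(1e20, 1e22, 1e24), alpha=0.05, C0=1e18):
    """Hypothetical aggregate benchmark score: a smooth power-law term
    plus a unit bonus per emergent capability whose compute threshold
    C_i has been crossed. All constants are illustrative; assumes C > C0."""
    smooth = 1.0 - (C0 / C) ** alpha
    jumps = sum(1.0 for Ci in thresholds if C >= Ci)
    return smooth + jumps

# The score is discontinuous at each threshold C_i
below, above = aggregate_score(0.99e22), aggregate_score(1.01e22)
print(below, above)  # differ by ~1.0 across C = 1e22
```

An extrapolation fitted only to budgets below $C_1$ sees just the smooth term and has no information about where the jumps sit, which is the extrapolation problem in its starkest form.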

4. Discussion

4.1 Implications for Benchmark Design

The Schaeffer et al. critique implies that benchmark designers bear significant responsibility for whether emergence appears in evaluation results. Benchmarks built around exact-match accuracy on multi-step tasks are structurally likely to produce apparent emergence. This creates a subtle bias: tasks that look most impressive as demonstrations of emergent AI capability are precisely the tasks that exhibit this measurement artifact most strongly.

A more principled approach would award partial credit or score intermediate reasoning steps wherever possible, producing continuous curves that support proper statistical tests of whether an inflection point is present or a smooth power law fits the data. BIG-Bench Hard (Suzgun et al., 2022) partially addresses this by focusing on tasks where models below a threshold truly do perform near chance, but even here, careful metric design matters.
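As a concrete contrast between the two evaluation styles, the toy scorer below awards step-level partial credit alongside exact match. Representing a solution as a list of step strings is a hypothetical simplification; real benchmarks need task-specific step extraction and matching.

```python
def exact_match_score(pred_steps, gold_steps):
    """All-or-nothing: 1.0 only if every step matches exactly."""
    return 1.0 if pred_steps == gold_steps else 0.0

def partial_credit_score(pred_steps, gold_steps):
    """Fraction of reasoning steps matched: a continuous signal that
    distinguishes a near-miss from a total failure."""
    hits = sum(p == g for p, g in zip(pred_steps, gold_steps))
    return hits / len(gold_steps)

gold = ["17 + 25 = 42", "42 * 2 = 84", "84 - 4 = 80"]
pred = ["17 + 25 = 42", "42 * 2 = 84", "84 - 4 = 79"]  # last step wrong
print(exact_match_score(pred, gold))     # 0.0
print(partial_credit_score(pred, gold))  # 0.666...
```

Under exact match, this near-miss is indistinguishable from a model that produces nonsense; under partial credit, the two are far apart, which is precisely the resolution needed to test for inflection points.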

4.2 Implications for AI Safety

The emergence debate has direct implications for AI safety and capability forecasting. The “sharp left turn” scenario — in which a future AI system rapidly acquires dangerous capabilities at a scale discontinuity — is structurally similar to phase-transition emergence. If emergence is largely a metric artifact, this scenario becomes less likely: capability growth would be smooth, giving more time for monitoring and intervention. If genuine phase transitions exist, the opposite conclusion follows.

Currently, empirical evidence is ambiguous. Some capabilities (particularly those requiring multi-step composition or precise factual retrieval) show signatures of genuine threshold effects even under continuous evaluation. Others resolve into smooth curves. A conservative safety stance treats unknown capabilities as potentially emergent until proven otherwise.

4.3 The Data-Centric View

An underexplored angle is the role of data distribution in shaping scaling behavior. Most scaling laws are measured on models trained on similar data distributions (web text). If a capability requires rare compositional patterns in the training data, its emergence may reflect not a phase transition in the model but a threshold at which the effective sample count of relevant training examples becomes sufficient to support generalization. This data-centric view predicts that emergence should be more pronounced for capabilities with sparse data support — which is broadly consistent with observations, though it has not been rigorously tested.

4.4 Beyond Cross-Entropy Loss

A persistent limitation of scaling law research is its focus on cross-entropy loss on held-out web text. This is a convenient, smooth, and reliable proxy for many downstream capabilities. But it is known to be an imperfect predictor of task-specific performance, particularly for tasks that require structured reasoning, factual retrieval, or multi-step composition. Future scaling research would benefit from developing alternative aggregate metrics that better capture the capabilities of practical interest — though this is methodologically difficult given the diversity of tasks and the cost of large-scale evaluations.

5. Conclusion

Scaling laws have proven to be one of the most practically useful discoveries in deep learning. They enable compute-optimal resource allocation, rough capability forecasting, and principled comparison of model architectures across scales. The Chinchilla analysis, grounded in careful scaling experiments, has already reshaped how frontier models are trained.

Emergent capabilities present a harder problem. The strongest version of the emergence claim — that capabilities appear discontinuously, unpredictably, and in ways fundamentally inaccessible to small-scale experiments — is not fully supported by current evidence. The Schaeffer et al. critique correctly identifies metric dependence as a confound, and many reported emergence phenomena resolve under continuous evaluation. But genuine phase transitions likely exist for a subset of capabilities, particularly those requiring compositional generalization across multiple sub-skills.

The right response is neither to dismiss emergence as illusion nor to treat all scaling transitions as phase changes. It is to invest in better evaluation methodology — continuous metrics, partial credit, step-level decomposition — and to treat the presence or absence of emergence as an empirical question on a per-capability, per-metric basis. For safety-critical applications, erring on the side of treating unknown capabilities as potentially emergent remains prudent until the mechanistic picture is clearer.

The scaling paradigm has given us powerful predictive tools. Its limits, and the phenomena that lie beyond those limits, are now the frontier.
