Grokking and Delayed Generalization in Neural Networks: Phase Transitions, Weight Norm Dynamics, and the Mechanisms of Late-Stage Representation Learning

Abstract

Grokking — the phenomenon whereby neural networks first overfit to training data and only later, after extensive additional training, transition to genuine generalization — poses a fundamental challenge to conventional learning theory. First documented systematically by Power et al. (2022) in the context of modular arithmetic tasks, grokking reveals a striking disconnect between training loss convergence and test accuracy, separated by orders of magnitude in gradient steps. This paper provides a rigorous technical analysis of the grokking phenomenon: we examine the mechanistic underpinnings of the delayed generalization transition, characterize the role of weight norm regularization and implicit bias in precipitating the phase transition, and survey the growing body of theoretical explanations including the slow generalization hypothesis, representation compression, and circuit formation. We further discuss the implications of grokking for standard early stopping heuristics, the relationship between grokking and double descent, and what the phenomenon reveals about the geometry of neural network optimization. Understanding grokking is not merely an academic curiosity — it illuminates the gap between memorization and understanding in learned representations.

1. Introduction

A central assumption of modern deep learning practice is that once training loss has converged and performance on a held-out validation set stabilizes, training may safely be halted. Early stopping, which rests on exactly this assumption, is one of the most widely deployed regularization strategies in neural network training. The discovery of grokking challenges this assumption at its core.

Grokking refers to a delayed generalization transition: a network first achieves near-zero training loss — memorizing its training set — while test accuracy remains near chance. Then, after a period of apparent stagnation, test accuracy sharply increases, eventually matching or approaching training accuracy. The transition from memorization to generalization can require 10×, 100×, or even 1000× more gradient steps than the initial convergence to low training loss.

The phenomenon was formally characterized by Power et al. (2022) in experiments on algorithmic tasks, particularly modular addition: given inputs $a, b \in \mathbb{Z}_p$, the network learns to predict $(a + b) \mod p$. On small datasets with mild weight decay, the network memorizes the training set quickly but generalizes only after extended training. The sharpness and robustness of this transition across architectures, tasks, and hyperparameter configurations suggests it reflects something deep about the geometry of neural network learning, rather than a quirk of a particular setup.
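As a concrete illustration, the full input space for modular addition contains $p^2$ pairs, and the grokking experiments train on a random fraction of them. A minimal sketch of such a dataset split (function name, split fraction, and seed are illustrative, not taken from Power et al.):

```python
import random

def modular_addition_dataset(p=113, train_frac=0.3, seed=0):
    """Enumerate all p^2 input pairs for (a + b) mod p and split train/test."""
    pairs = [(a, b) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    train = [((a, b), (a + b) % p) for a, b in pairs[:n_train]]
    test = [((a, b), (a + b) % p) for a, b in pairs[n_train:]]
    return train, test

train, test = modular_addition_dataset()
# 113^2 = 12769 labeled pairs in total; no pair appears in both splits
```

The small training fraction relative to model capacity is exactly what makes memorization easy and generalization initially unnecessary.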

Grokking matters for several interconnected reasons. First, it reveals that test loss and training loss can decouple dramatically — a trained network may be far closer to a generalizing solution than its test performance suggests. Second, it raises questions about what is actually being optimized during the post-memorization phase: the loss is flat, yet the network is changing. Third, grokking demonstrates that implicit biases in gradient descent — particularly the tendency toward low-weight-norm solutions — may be the primary driver of generalization in overparameterized settings, rather than the loss function itself.

This paper is structured as follows. Section 2 surveys prior work on related phenomena including double descent, implicit regularization, and phase transitions in learning. Section 3 provides a technical analysis of the mechanisms underlying grokking, including weight norm dynamics, circuit formation, and information-theoretic perspectives. Section 4 discusses broader implications and open questions. Section 5 concludes.

2. Related Work

Grokking does not exist in isolation; it is part of a broader constellation of findings that challenge classical bias-variance intuitions about neural network generalization.

Double descent. Belkin et al. (2019) documented the double descent phenomenon: as model capacity increases, test error first decreases (classical regime), then increases (overfitting), then decreases again to match or exceed the best classical model (modern interpolating regime). This non-monotonic behavior demonstrates that interpolation of training data — traditionally associated with overfitting — can be compatible with strong generalization. Grokking extends this picture by demonstrating a similar non-monotonicity in the temporal domain: the evolution of a single model over training time, rather than across model capacities.

Implicit regularization in gradient descent. Soudry et al. (2018) proved that gradient descent on linearly separable data converges in direction to the maximum-margin classifier, even in the absence of explicit regularization. Gunasekar et al. (2017) established analogous implicit bias results for matrix factorization problems. These results establish that the choice of optimizer imposes geometric biases on the solution found, independently of the loss function. Grokking appears to be a manifestation of exactly this implicit bias: extended training allows gradient descent’s bias toward minimum-norm solutions to eventually dominate, forcing a transition to a generalizing solution.

Phase transitions in learning. Saxe et al. (2014) showed that learning in deep linear networks proceeds as a cascade of stage-like transitions, with each singular mode of the input-output correlation matrix learned on a timescale set by its singular value. This work established that phase-transition-like behavior — where qualitatively different learned representations emerge discretely rather than continuously — is a structural feature of deep network training dynamics, not an anomaly.

Mechanistic interpretability and circuit discovery. Olah et al. (2020) and subsequent work from the circuits thread of mechanistic interpretability demonstrated that neural networks implement specific algorithms — circuits — through identifiable subsets of weights and activation patterns. Nanda et al. (2023) conducted a detailed mechanistic analysis of grokking in modular arithmetic, demonstrating that the generalizing network implements a specific Fourier-based algorithm for modular addition, and that this circuit forms gradually during the post-memorization training phase.

Representation compression and the information bottleneck. Shwartz-Ziv and Tishby (2017) proposed that training in deep networks proceeds in two phases: first, fitting the labels (increasing mutual information between representations and outputs); second, compressing representations (reducing mutual information between representations and inputs). While subsequent work by Saxe et al. (2018) showed that this compression phase is not universal, the framework provides one lens through which to interpret the delayed generalization in grokking as a compression event.

3. Technical Analysis

3.1 Formalizing the Grokking Transition

Let $f_\theta : \mathcal{X} \to \mathcal{Y}$ denote a neural network with parameters $\theta \in \mathbb{R}^d$, trained on a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ sampled from a distribution $P_{XY}$. Define the training loss $\mathcal{L}_{\text{train}}(\theta)$ and population loss $\mathcal{L}_{\text{pop}}(\theta)$. Grokking is characterized by the existence of times $T_m \ll T_g$ such that:

$$\mathcal{L}_{\text{train}}(\theta_{T_m}) \approx 0, \quad \mathcal{L}_{\text{pop}}(\theta_{T_m}) \approx \mathcal{L}_{\text{chance}}$$
$$\mathcal{L}_{\text{train}}(\theta_{T_g}) \approx 0, \quad \mathcal{L}_{\text{pop}}(\theta_{T_g}) \approx \mathcal{L}_{\text{opt}}$$

where $T_m$ is the memorization time, $T_g$ is the generalization time, $\mathcal{L}_{\text{chance}}$ is the loss of a random predictor, and $\mathcal{L}_{\text{opt}}$ is near-optimal loss. The grokking ratio $T_g / T_m$ can be large — often $10^2$ to $10^3$ in empirical settings — and depends strongly on dataset size, weight decay coefficient, and task structure.
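Given logged accuracy curves, the two times and the grokking ratio can be read off mechanically. A minimal sketch (the saturation threshold and the synthetic curves are illustrative):

```python
def grokking_times(train_acc, test_acc, thresh=0.99):
    """First steps at which train and test accuracy cross `thresh`."""
    T_m = next(t for t, a in enumerate(train_acc) if a >= thresh)
    T_g = next(t for t, a in enumerate(test_acc) if a >= thresh)
    return T_m, T_g, T_g / T_m

# synthetic curves: training accuracy saturates early, test accuracy
# stays near chance until a sharp transition much later
train_acc = [min(1.0, t / 100) for t in range(20_000)]
test_acc = [0.01 if t < 10_000 else 1.0 for t in range(20_000)]
T_m, T_g, ratio = grokking_times(train_acc, test_acc)
```

On real curves one would smooth the accuracies before thresholding, but the ratio $T_g / T_m$ is the same diagnostic.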

3.2 Weight Norm as an Order Parameter

A key empirical observation is that the weight norm $\|\theta\|_2$ serves as a near-perfect predictor of the generalization transition. During the post-memorization phase, weight decay continues to reduce $\|\theta\|_2$ even as the loss gradient is approximately zero. When $\|\theta\|_2$ crosses a critical threshold $\|\theta_c\|_2$, generalization occurs.

More precisely, let $\lambda$ be the weight decay coefficient in the objective:

$$\mathcal{L}_{\text{wd}}(\theta) = \mathcal{L}_{\text{train}}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$$

After memorization, $\nabla_\theta \mathcal{L}_{\text{train}}(\theta) \approx 0$, so parameter updates are dominated by the weight decay term: $\Delta \theta \approx -\eta \lambda \theta$. This yields exponential decay of the weight norm:

$$\|\theta_t\|_2 \approx \|\theta_{T_m}\|_2 \cdot e^{-\eta \lambda (t - T_m)}$$

The generalization time is thus approximately:

$$T_g - T_m \approx \frac{1}{\eta \lambda} \ln\left(\frac{\|\theta_{T_m}\|_2}{\|\theta_c\|_2}\right)$$

This formula explains the strong dependence of grokking on $\lambda$: larger weight decay accelerates the transition, smaller weight decay delays it indefinitely, and with no weight decay, grokking may never occur in finite training time. It also explains why grokking is less pronounced in smaller models: small models have lower-norm memorizing solutions, so $\|\theta_{T_m}\|_2 / \|\theta_c\|_2$ is smaller.
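The closed-form estimate can be checked against a direct simulation of the decay-dominated phase. A sketch with hypothetical values for $\eta$, $\lambda$, and the two norms (none of these numbers come from the grokking literature):

```python
import math

eta, lam = 1e-3, 1.0             # hypothetical learning rate and weight decay
norm_mem, norm_crit = 40.0, 8.0  # hypothetical norms at T_m and at the threshold

# closed form: (1 / (eta * lam)) * ln(norm_mem / norm_crit)
predicted = math.log(norm_mem / norm_crit) / (eta * lam)

# simulate the decay-dominated updates theta <- theta - eta * lam * theta,
# tracking only the norm, until it crosses the critical threshold
norm, steps = norm_mem, 0
while norm > norm_crit:
    norm *= (1 - eta * lam)
    steps += 1
```

The discrete simulation and the continuous-time formula agree to well under one percent at these step sizes, and both scale as $1/(\eta\lambda)$: halving the weight decay doubles the wait.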

3.3 Circuit Formation and Representation Change

The weight norm perspective explains the timing of grokking, but not its mechanism. What actually changes in the network representation during the post-memorization phase? Nanda et al. (2023) provide the most detailed mechanistic account to date, analyzing a transformer trained on modular addition with $p = 113$.

Their analysis reveals that the generalizing network implements a specific algorithm based on Fourier transforms over $\mathbb{Z}_p$. Specifically, the network learns to represent the embedding of input $a$ as a superposition of Fourier components:

$$\text{embed}(a) \approx \sum_{k \in K} \left[ \alpha_k \cos\left(\frac{2\pi k a}{p}\right) + \beta_k \sin\left(\frac{2\pi k a}{p}\right) \right]$$

for a sparse set of frequencies $K$, and then uses the attention mechanism to compute the product of Fourier components of $a$ and $b$, recovering $(a+b) \mod p$ via the convolution theorem. This algorithm is exact and fully general — it does not memorize specific input pairs but rather implements the underlying group structure.
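The correctness of this scheme is easy to verify outside a network: for any set of nonzero frequencies, the score $\sum_{k \in K} \cos(2\pi k (a + b - c)/p)$ is maximized exactly at $c = (a + b) \bmod p$ because $p$ is prime. A sketch (the frequency set here is arbitrary, not one a trained network actually selects):

```python
import math

p = 113
K = [3, 7, 12]  # arbitrary nonzero frequencies; a trained network learns its own sparse set

def predict(a, b):
    # score each candidate c by sum_k cos(2*pi*k*(a+b-c)/p); product-to-sum
    # identities let a network build cos(k(a+b)) from per-input cos/sin features
    scores = [sum(math.cos(2 * math.pi * k * (a + b - c) / p) for k in K)
              for c in range(p)]
    return max(range(p), key=lambda c: scores[c])
```

Every term hits its maximum of 1 simultaneously only when $a + b - c \equiv 0 \pmod p$, so the argmax recovers the answer for all input pairs, not just those seen in training.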

The transition to this Fourier-based circuit is gradual during the post-memorization phase. Nanda et al. introduce the concept of “excluded loss” — the loss attributable to the memorizing component of the network versus the generalizing component — and show that these two computational strategies coexist in the network for much of the grokking transition, with the generalizing Fourier circuit gradually coming to dominate as weights are compressed.

3.4 The Slow Generalization Hypothesis

Liu et al. (2022) propose the “slow generalization” hypothesis: grokking reflects the competition between two simultaneously present solutions — a memorizing solution and a generalizing solution — where the generalizing solution is simpler (lower effective description length) but takes longer to find. Weight decay acts as a pressure that eventually selects for the simpler solution.

This can be formalized through the lens of minimum description length (MDL) or Kolmogorov complexity. Let $C_m$ and $C_g$ denote the description lengths of the memorizing and generalizing solutions respectively. For structured tasks, $C_g \ll C_m$, but gradient descent finds the memorizing solution faster because it lies closer to the random initialization in parameter space.

The effective regularization implicit in the weight decay term provides a pressure toward $C_g$. The magnitude of this pressure is $\lambda \|\theta\|_2^2 / 2$, which is a proxy for model complexity. When this term is sufficient to distinguish the two solutions — i.e., when $\mathcal{L}_{\text{wd}}(\theta_m) > \mathcal{L}_{\text{wd}}(\theta_g)$ despite $\mathcal{L}_{\text{train}}(\theta_m) \approx \mathcal{L}_{\text{train}}(\theta_g) \approx 0$ — gradient descent transitions to $\theta_g$.

3.5 Relationship to Sharpness and Flat Minima

An alternative but complementary account links grokking to the sharpness of the loss landscape minimum. Foret et al. (2021) introduced sharpness-aware minimization (SAM), demonstrating that flatter minima generalize better. During grokking, the memorizing solution typically occupies a sharper minimum than the generalizing solution — it is finely tuned to the specific training examples and is not robust to perturbations.

Define the sharpness of a solution $\theta$ as:

$$S(\theta) = \max_{\|\epsilon\|_2 \leq \rho} \left[ \mathcal{L}_{\text{train}}(\theta + \epsilon) - \mathcal{L}_{\text{train}}(\theta) \right]$$

Empirically, the sharpness of $\theta$ decreases monotonically through the grokking transition, with the generalizing solution occupying a significantly flatter region of the loss landscape. The progressive weight norm reduction during the post-memorization phase can be understood as a gradient flow toward flatter regions, since flat minima are more compatible with the global geometry of the weight decay objective.
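The max in this definition is intractable exactly, but sampling perturbations on the $\rho$-sphere gives a usable lower bound. A sketch on toy quadratic losses (the losses and all constants are illustrative, not a SAM implementation):

```python
import math
import random

def sharpness(loss, theta, rho=0.05, n_samples=2000, seed=0):
    # Monte Carlo lower bound on S(theta): max over sampled perturbations
    # with ||eps||_2 = rho of loss(theta + eps) - loss(theta)
    rng = random.Random(seed)
    base = loss(theta)
    worst = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        scale = rho / math.sqrt(sum(e * e for e in eps))
        worst = max(worst, loss([t + scale * e for t, e in zip(theta, eps)]) - base)
    return worst

# isotropic quadratic bowls: higher curvature means larger estimated sharpness
sharp_bowl = lambda t: 50.0 * sum(x * x for x in t)
flat_bowl = lambda t: 0.5 * sum(x * x for x in t)
```

For an isotropic quadratic centered at the minimum, every point on the $\rho$-sphere gives the same increase ($50\rho^2$ vs. $0.5\rho^2$ here), so the estimator separates the two bowls cleanly.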

3.6 Grokking Without Weight Decay

Recent work by Varma et al. (2023) demonstrates that grokking can occur without explicit weight decay, through other forms of implicit or explicit regularization. Dropout, data augmentation, and even simple gradient noise can trigger delayed generalization transitions. This suggests that weight decay is one specific mechanism for imposing the pressure toward simpler solutions, but not the only one.

In the dropout setting, the implicit regularization effect can be characterized through the expected loss under parameter perturbations:

$$\mathbb{E}_{m}\left[\mathcal{L}\left(\theta \odot \tfrac{m}{1-p}\right)\right] \approx \mathcal{L}(\theta) + \frac{p}{2(1-p)} \sum_i \theta_i^2 \, \frac{\partial^2 \mathcal{L}}{\partial \theta_i^2}$$

where $m$ is a Bernoulli mask with keep probability $1-p$ and the $1/(1-p)$ factor is the standard inverted-dropout rescaling. The surviving second-order term penalizes each $\theta_i^2$ in proportion to the local curvature: a curvature-weighted weight-norm penalty, which explains the qualitative similarity to explicit weight decay grokking.
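On a quadratic loss the second-order Taylor expansion is exact, which makes the regularizer easy to verify by enumerating all masks. A sketch assuming the inverted-dropout rescaling by $1/(1-p)$, under which the penalty coefficient is $p/(2(1-p))$ times the curvature-weighted squared weights (all numbers below are illustrative):

```python
import itertools

p_drop = 0.3
h = [1.0, 4.0, 0.25]      # diagonal curvature of a toy quadratic loss
theta = [2.0, -1.0, 3.0]

def loss(t):
    return 0.5 * sum(hi * ti * ti for hi, ti in zip(h, t))

# exact expectation of the loss over all 2^d inverted-dropout masks
exp_loss = 0.0
for mask in itertools.product([0, 1], repeat=len(theta)):
    prob = 1.0
    for bit in mask:
        prob *= (1 - p_drop) if bit else p_drop
    scaled = [ti * bit / (1 - p_drop) for ti, bit in zip(theta, mask)]
    exp_loss += prob * loss(scaled)

# second-order prediction: loss(theta) + p/(2(1-p)) * sum_i theta_i^2 * d2L/dtheta_i^2
penalty = p_drop / (2 * (1 - p_drop)) * sum(hi * ti * ti for hi, ti in zip(h, theta))
pred = loss(theta) + penalty
```

The enumerated expectation and the closed-form prediction coincide to machine precision here; for non-quadratic losses the expansion is only a local approximation.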

4. Discussion

4.1 Implications for Early Stopping

The most immediate practical implication of grokking is that early stopping based on validation loss plateau is fundamentally unreliable when grokking-like dynamics are possible. A network that has memorized its training set and shows flat validation performance may be 100,000 gradient steps away from excellent generalization. This is particularly concerning for small dataset regimes, where grokking is most pronounced.

One response is to use weight norm as a stopping criterion rather than validation loss. Empirically, the weight norm trajectory is a more reliable predictor of the proximity to the generalization transition. Alternatively, aggressive weight decay can dramatically accelerate the transition at the cost of potential underfitting.
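One way to operationalize this is to gate stopping on the weight norm trajectory rather than the validation curve. A hypothetical heuristic (the function, window size, and tolerance are illustrative, not from the literature):

```python
def weight_norm_plateaued(norms, window=1000, tol=1e-4):
    """Signal stopping only once the weight norm itself has stopped shrinking."""
    if len(norms) < 2 * window:
        return False
    recent = sum(norms[-window:]) / window
    earlier = sum(norms[-2 * window:-window]) / window
    # relative decrease between consecutive windows below tol => plateau
    return (earlier - recent) / max(earlier, 1e-12) < tol

# while the norm is still decaying exponentially, keep training;
# once it flattens, a grokking transition driven by decay is no longer pending
decaying = [40.0 * (0.999 ** t) for t in range(5000)]
flat = decaying + [decaying[-1]] * 5000
```

The rationale follows directly from the decay analysis in Section 3.2: a still-shrinking norm means the implicit pressure toward the critical threshold has not yet run its course.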

More broadly, grokking suggests that the evaluation of neural networks should include not just their test performance at the end of training, but some assessment of whether the learning dynamics are consistent with genuine generalization or memorization. The distinction between the two may require analysis of the learned representations — e.g., their Fourier structure for algorithmic tasks — rather than just loss metrics.

4.2 Grokking in Large Language Models

Whether grokking occurs in large language model pretraining is an open and important question. The conditions for grokking — structured tasks, small datasets relative to model capacity, explicit or implicit regularization — are in some ways very different from LLM pretraining on hundreds of billions of tokens. However, several lines of evidence suggest grokking-like dynamics may be relevant.

First, the emergence of in-context learning and reasoning capabilities after long pretraining has some structural similarity to grokking: a capability that appears discontinuously after a period of apparent stability. Wei et al. (2022) document such emergent abilities in large models, and the mechanistic account is still incomplete. Second, analytic tasks in LLMs — arithmetic, formal logic, code execution — show patterns consistent with circuit formation of the kind documented by Nanda et al. (2023). Third, the implicit bias of Adam toward low-norm solutions parallels the explicit weight decay mechanism identified in grokking.

Identifying and characterizing grokking-like transitions in LLMs would be significant both scientifically and practically. If important capabilities are gated behind such transitions, then standard training curves underestimate the potential of extended training, and stopping rules calibrated to loss convergence may be missing substantial gains.

4.3 The Role of Data Efficiency

Grokking exhibits a striking dependence on dataset size. As the training set grows, the weight norm needed to memorize it grows with it, while the norm of the generalizing solution stays roughly fixed; once memorization is no cheaper in norm than generalization, gradient descent finds the generalizing solution directly and the delayed transition collapses. The grokking phenomenon is most extreme in the small-data, large-model regime — precisely the regime that is of interest for few-shot learning and data-efficient adaptation.

This suggests a connection between grokking and the data efficiency of fine-tuning methods. Low-rank adaptation (LoRA) and other parameter-efficient fine-tuning approaches constrain the effective dimensionality of the weight update, which may be interpretable as an implicit weight norm constraint on the delta weights. Whether PEFT methods induce or suppress grokking-like transitions during fine-tuning is an empirically open question with practical implications for fine-tuning heuristics.

4.4 Open Problems

Several important questions remain open. First, the theoretical conditions under which grokking occurs — the precise relationship between task structure, dataset size, model capacity, and optimizer hyperparameters — are not fully characterized. The weight norm account and the circuit formation account are complementary but not yet unified into a single quantitative theory. Second, the relationship between grokking and double descent is suggestive but not precise: both involve non-monotonic generalization dynamics, but the axes (training time vs. model capacity) are different, and it is unclear whether they share a common mechanism. Third, the occurrence of grokking in practical large-scale training — for both LLMs and other large models — has not been systematically studied. Fourth, the implications for sample complexity theory are not fully worked out: grokking suggests that the effective sample complexity of a task depends not just on the complexity of the target function, but on the implicit regularization dynamics of the optimizer.

5. Conclusion

Grokking is a striking empirical phenomenon that reveals deep tensions in our understanding of neural network learning. By demonstrating that memorization and generalization can be separated by orders of magnitude in training time, it forces a reconsideration of standard training heuristics and theoretical frameworks. The mechanistic analysis of grokking — through the lens of weight norm dynamics, circuit formation, and implicit bias — provides a coherent if incomplete account of why and when delayed generalization occurs.

The key insight is that generalization in neural networks is not merely a consequence of fitting training data, but of fitting training data with a solution of appropriate simplicity. The tension between the speed of memorization and the pressure toward simple solutions is what gives rise to the grokking transition. Weight decay, dropout, and other regularization mechanisms are not merely guards against overfitting in the traditional sense; they are the forces that ultimately drive the learning system toward genuinely generalizing representations.

For practitioners, the principal takeaway is that early stopping calibrated to validation loss plateau is insufficient in small-data regimes. For theorists, grokking presents a challenge to any account of generalization that focuses only on the final trained model without considering the training dynamics. And for both, grokking is a reminder that understanding what neural networks learn requires looking inside the learned representations — not just at loss curves — and that the circuits implementing genuine generalization may take a very long time to emerge.
