Abstract
Layer normalization (LayerNorm) is a cornerstone component of modern transformer architectures, yet its placement relative to attention and feed-forward sublayers has profound consequences for training dynamics, gradient flow, and ultimately model convergence. The two dominant variants — Post-LN, as introduced in the original Transformer (Vaswani et al., 2017), and Pre-LN, popularized through GPT-2 and subsequent large language models — differ not merely in where a normalization operation is inserted, but in the topology of the computational graph through which gradients must flow. This paper provides a rigorous technical analysis of both configurations. We derive gradient propagation equations for each variant, characterize their stability properties under deep stacking, and survey the empirical record on training behavior across model scales. We further examine recent generalizations including RMSNorm, QK-Norm, and adaptive normalization schemes used in diffusion-based and hybrid architectures. Our analysis demonstrates that Pre-LN’s superior training stability stems from bounded gradient norms across depth, while Post-LN’s capacity for higher final performance — when training succeeds — arises from its richer representational gradient signal. The trade-off is not merely practical but is grounded in measurable differences in loss landscape curvature and Lipschitz constants of the residual stream.
1. Introduction
The introduction of the Transformer architecture by Vaswani et al. (2017) brought with it a particular normalization convention that would, paradoxically, become one of the most quietly consequential design choices in deep learning: the placement of LayerNorm after each sublayer’s residual addition (Post-LN). This placement was largely inherited from earlier sequence modeling work without deep theoretical justification. Within a few years, practitioners discovered that this configuration, while theoretically sound, was notoriously difficult to train at scale without careful learning rate warmup. The community gradually migrated toward Pre-LN configurations — placing normalization before each sublayer’s computation — primarily for empirical stability reasons.
Yet this pragmatic transition left open a set of important theoretical questions. Why exactly does Post-LN destabilize training at depth? What formal guarantees can we provide for Pre-LN’s stability? And is this stability bought at the cost of representational power? These questions matter not merely as historical curiosities but because normalization design continues to actively shape frontier model architecture decisions. The choice between Pre-LN, Post-LN, and hybrid schemes (such as the sandwich norm used in some recent models) affects learning dynamics at billion-parameter scale, where re-running ablations is prohibitively expensive.
This paper provides a systematic treatment of LayerNorm placement from both theoretical and empirical perspectives. We begin with the formal definitions of both configurations, then derive the gradient flow through each, analyze the implications for training stability, and survey how the broader research community has quantified these differences. We close with a discussion of modern normalization variants and the open questions that remain in normalization theory for deep networks.
2. Related Work
Layer normalization was introduced by Ba et al. (2016) as an alternative to batch normalization that operates independently per sample, making it well-suited to variable-length sequence models. Unlike batch normalization, LayerNorm computes statistics over the feature dimension rather than the batch dimension, enabling stable training with small batch sizes and auto-regressive generation settings where batch statistics are ill-defined.
The original Transformer paper (Vaswani et al., 2017) adopted Post-LN, applying normalization after the residual addition: $x_{l+1} = \text{LN}(x_l + F_l(x_l))$. This formulation was employed in BERT (Devlin et al., 2019) and early sequence-to-sequence work, but required careful learning rate warmup schedules to avoid training instability in deep configurations.
Xiong et al. (2020) provided the first systematic theoretical analysis of Pre-LN versus Post-LN, demonstrating that the gradient norm in Post-LN grows exponentially with depth at initialization, while Pre-LN maintains bounded gradient norms. This work formalized the empirical intuition that Pre-LN networks can be trained without warmup and are generally more stable. Their main result showed that for Post-LN with $L$ layers, the expected gradient norm satisfies $\mathbb{E}[\|\nabla_{x_0} \mathcal{L}\|] \sim O(e^{L})$ under certain initialization assumptions, while Pre-LN achieves $O(\sqrt{L})$ or better.
Liu et al. (2020) empirically demonstrated that Post-LN’s instability manifests as exploding gradients at the lower layers early in training, and that warmup effectively acts as a gradient clipping mechanism that compensates for this initialization pathology. Their analysis of the Adam optimizer’s interaction with LayerNorm placement revealed that the adaptive learning rate in Adam partially — but not fully — mitigates Post-LN instability.
Zhang and Sennrich (2019) introduced RMSNorm, a simplified normalization that omits the mean subtraction step and rescales by root-mean-square alone: $\text{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2}$, followed by learnable scaling. This variant was motivated by the observation that re-centering contributes less to training stability than re-scaling, and reduces the computational overhead of normalization by roughly 40%. RMSNorm has been adopted in LLaMA (Touvron et al., 2023) and many subsequent open models.
Shleifer et al. (2021) proposed NormFormer, which adds additional normalization operations beyond the standard single-normalization-per-sublayer convention, finding improvements in both stability and final performance. Henry et al. (2020) analyzed Query-Key normalization (QK-Norm), where the query and key projections in attention are independently normalized before their dot product, addressing a distinct pathology of attention score magnitude growth at long context lengths.
3. Technical Analysis
3.1 Formal Definitions
Let $x_l \in \mathbb{R}^d$ denote the residual stream at layer $l$, and let $F_l: \mathbb{R}^d \to \mathbb{R}^d$ denote the sublayer function (attention or feed-forward). LayerNorm is defined as:
$$\text{LN}(x) = \gamma \odot \frac{x - \mu(x)}{\sigma(x) + \epsilon} + \beta$$
where $\mu(x) = \frac{1}{d}\sum_i x_i$, $\sigma(x) = \sqrt{\frac{1}{d}\sum_i (x_i - \mu(x))^2}$, and $\gamma, \beta \in \mathbb{R}^d$ are learned affine parameters. The two normalization placement conventions then define:
Post-LN: $x_{l+1} = \text{LN}(x_l + F_l(x_l))$
Pre-LN: $x_{l+1} = x_l + F_l(\text{LN}(x_l))$
These are subtly but critically different computational graphs. In Post-LN, the normalization acts on the sum of the residual and the sublayer output, placing the normalization outside the residual connection. In Pre-LN, the residual path $x_l \to x_{l+1}$ passes through no normalization — the normalization only affects the inputs to $F_l$, and the output of $F_l$ is added directly to the unnormalized residual.
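The difference is a single line when the two blocks are written as code. The following NumPy sketch (the function names and the toy linear sublayer are our own, purely illustrative choices) mirrors the definitions above:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """LayerNorm over the feature dimension (affine parameters omitted)."""
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())
    return (x - mu) / (sigma + eps)

def post_ln_block(x, F):
    # Post-LN: normalization acts on the sum of residual and sublayer output.
    return layer_norm(x + F(x))

def pre_ln_block(x, F):
    # Pre-LN: normalization only touches the sublayer input;
    # the residual path x -> x_{l+1} is the identity.
    return x + F(layer_norm(x))

rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.1, size=(d, d))  # toy linear sublayer F(x) = W x
F = lambda z: W @ z
x = rng.normal(size=d)
```

Note that in `pre_ln_block` the unnormalized $x$ survives into the output unchanged, whereas `post_ln_block` re-standardizes the entire stream at every layer.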
3.2 Gradient Flow Analysis
To understand stability differences, we analyze the Jacobian of the loss with respect to the residual stream at lower layers. For a network with $L$ sublayers, define $J_l = \frac{\partial x_{l+1}}{\partial x_l}$. The gradient at layer $0$ is:
$$\frac{\partial \mathcal{L}}{\partial x_0} = \left(\prod_{l=0}^{L-1} J_l\right)^\top \frac{\partial \mathcal{L}}{\partial x_L}$$
For Post-LN, expanding the Jacobian:
$$J_l^{\text{Post}} = \frac{\partial \text{LN}(x_l + F_l(x_l))}{\partial x_l} = J_{\text{LN}} \cdot (I + J_{F_l})$$
At initialization, $J_{F_l} \approx 0$ (small weight magnitudes) and $J_{\text{LN}}$ has a specific structure. The key insight from Xiong et al. (2020) is that $J_{\text{LN}}$ is not the identity — it is a projection operator that removes the mean component and rescales, with operator norm that depends on the variance of $x_l + F_l(x_l)$. At early training, this creates non-trivial gradient amplification or attenuation depending on the activation statistics at each layer.
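One structural property of $J_{\text{LN}}$ noted above, removal of the mean component, follows from LayerNorm's invariance to adding a constant, and can be checked with a finite-difference directional derivative (a small NumPy sketch under our own naming):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())
    return (x - mu) / (sigma + eps)

d = 8
rng = np.random.default_rng(0)
x = rng.normal(size=d)

# LN(x + t*1) = LN(x) for any shift t, because mu absorbs the shift and
# sigma is unchanged. Hence the directional derivative of LN along the
# all-ones vector is exactly zero: J_LN annihilates the mean direction.
h = 1e-6
ones = np.ones(d)
dirderiv = (layer_norm(x + h * ones) - layer_norm(x - h * ones)) / (2 * h)
```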
For Pre-LN, the Jacobian decomposes cleanly:
$$J_l^{\text{Pre}} = I + J_{F_l} \cdot J_{\text{LN}}$$
The identity matrix in the sum is the direct path through the residual connection, which is entirely free of normalization. This means that even if $J_{F_l} \cdot J_{\text{LN}}$ is poorly conditioned, the full Jacobian remains a bounded perturbation of the identity: its smallest singular value is at least $1 - \|J_{F_l} \cdot J_{\text{LN}}\|$, which stays close to 1 when the sublayer Jacobian is small at initialization. Gradients flow backward through the residual path without distortion. The product over $L$ layers thus satisfies:
$$\left\|\prod_{l=0}^{L-1} J_l^{\text{Pre}}\right\| \leq \prod_{l=0}^{L-1} (1 + \|J_{F_l}\| \cdot \|J_{\text{LN}}\|)$$
At initialization, where each $\|J_{F_l}\|$ is small, this product is bounded by $\exp\left(\sum_l \|J_{F_l}\| \cdot \|J_{\text{LN}}\|\right)$. Because the Pre-LN residual stream norm grows with depth, $\|J_{\text{LN}}\|$ shrinks in deeper layers, keeping this exponent small; under the initialization assumptions of Xiong et al. (2020), the product grows at most polynomially with $L$ rather than exponentially. This polynomial-versus-exponential scaling is precisely the theoretical basis for Pre-LN's superior training stability.
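The Pre-LN Jacobian decomposition $J_l^{\text{Pre}} = I + J_{F_l} \cdot J_{\text{LN}}$ can be verified numerically with finite differences. The sketch below assumes a toy linear sublayer $F(x) = Wx$ (so that $J_F = W$); all names are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())
    return (x - mu) / (sigma + eps)

def jacobian(f, x, h=1e-6):
    """Central finite-difference Jacobian of f at x."""
    d = x.size
    J = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        J[:, i] = (f(x + e) - f(x - e)) / (2 * h)
    return J

d = 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(d, d))  # toy linear sublayer: J_F = W
x = rng.normal(size=d)

pre_block = lambda z: z + W @ layer_norm(z)
J_pre = jacobian(pre_block, x)
J_ln = jacobian(layer_norm, x)
```

The measured block Jacobian should match $I + W \cdot J_{\text{LN}}$ to finite-difference accuracy, confirming that the identity term is untouched by normalization.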
3.3 Mean Field Theory Perspective
A complementary perspective comes from mean field theory applied to deep networks. At initialization, the forward pass through a Post-LN network progressively attenuates the identity path: because normalization is applied to the full sum at every layer, the input representation's share of the deep residual stream shrinks with depth. The normalization at each layer resets the signal statistics, but in doing so couples information across depth in a way that saturates at scale.
Specifically, for Post-LN, the effective depth of the gradient signal as seen from the loss diminishes rapidly. Define the effective gradient depth as the expected number of layers that contribute non-negligible gradient signal to the first layer. For Post-LN, this quantity saturates at a depth much smaller than $L$ when $L$ is large, effectively truncating the usable gradient signal. For Pre-LN, the identity residual path ensures that all $L$ layers receive comparable gradient signal magnitude, making the full depth of the network accessible for optimization.
3.4 Implications for Training Dynamics
The gradient norm difference has measurable consequences for training dynamics. In Post-LN networks, the lower layers (closer to the input embedding) receive vanishingly small gradients at initialization, effectively freezing their weights while upper layers train. The learning rate warmup schedule works around this by starting with a very small learning rate (where the large upper-layer gradients do not cause instability) and gradually increasing it (so that lower layers can eventually receive sufficient gradient signal as upper layers partially saturate).
This dynamic creates a fundamentally different optimization trajectory compared to Pre-LN. In Pre-LN, all layers begin training simultaneously from initialization, which leads to faster early-epoch progress but can also lead to the upper layers becoming overspecialized before lower layers have adapted. This has been observed as a slight but consistent final performance gap: when trained to convergence, Post-LN models often achieve marginally better perplexity on language modeling tasks, presumably because the more uniform gradient signal of Pre-LN does not allow the same degree of layer specialization that Post-LN’s more heterogeneous training enables.
3.5 RMSNorm and Simplified Variants
RMSNorm removes the mean-subtraction step, defining:
$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\|x\|_2 / \sqrt{d} + \epsilon}$$
The Jacobian of RMSNorm with respect to $x$ is:
$$\frac{\partial \text{RMSNorm}(x)}{\partial x} = \text{diag}(\gamma) \cdot \frac{\sqrt{d}}{\|x\|_2} \left(I - \frac{xx^\top}{\|x\|_2^2}\right)$$
neglecting the small $\epsilon$ term; note that $\text{diag}(\gamma)$ multiplies on the left, since each output coordinate is scaled by $\gamma_i$ after normalization.
The central factor is a projection onto the hyperplane orthogonal to $x$, scaled by $\gamma$ and by the inverse RMS. Like LayerNorm, the Jacobian is not the identity, but the absence of mean subtraction removes one source of coupling between features and slightly simplifies the gradient computation. In practice, RMSNorm has been found to perform comparably to LayerNorm while requiring fewer FLOPs, and its adoption in LLaMA and Mistral architectures has made it a de facto standard in recent open-source models.
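This Jacobian can be checked against finite differences. The NumPy sketch below implements RMSNorm as defined above (including the $\epsilon$ term in the analytic Jacobian, so the comparison is exact); dimensions and values are illustrative:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by the inverse root-mean-square, then by gamma."""
    r = np.linalg.norm(x) / np.sqrt(x.size)  # RMS(x)
    return gamma * x / (r + eps)

d = 8
eps = 1e-6
rng = np.random.default_rng(1)
x = rng.normal(size=d)
gamma = rng.normal(size=d)

# Analytic Jacobian with the eps term retained:
# diag(gamma) @ [ I/(r+eps) - x x^T / ((r+eps)^2 * sqrt(d) * ||x||) ]
r = np.linalg.norm(x) / np.sqrt(d)
J_analytic = np.diag(gamma) @ (
    np.eye(d) / (r + eps)
    - np.outer(x, x) / ((r + eps) ** 2 * np.sqrt(d) * np.linalg.norm(x))
)

# Central finite differences
h = 1e-6
J_fd = np.zeros((d, d))
for i in range(d):
    e = np.zeros(d)
    e[i] = h
    J_fd[:, i] = (rms_norm(x + e, gamma, eps) - rms_norm(x - e, gamma, eps)) / (2 * h)
```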
3.6 Query-Key Normalization
A distinct pathology arises in long-context attention that LayerNorm placement cannot address: the magnitude of attention logits $q_i^\top k_j$ can grow proportionally to the embedding dimension and the magnitudes of the weight matrices, causing attention entropy collapse where all probability mass concentrates on a single position. QK-Norm addresses this by applying independent $L_2$ normalization to queries and keys before their dot product:
$$\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{\hat{Q}\hat{K}^\top}{\tau}\right)V, \quad \hat{q}_i = \frac{q_i}{\|q_i\|_2}, \quad \hat{k}_j = \frac{k_j}{\|k_j\|_2}$$
where normalization is applied per query and key vector and $\tau$ is a learnable temperature parameter. Because each entry of $\hat{Q}\hat{K}^\top$ is a cosine similarity, the pre-softmax logits are bounded in magnitude by $1/\tau$, decoupling attention concentration from the scale of the learned projections. QK-Norm is increasingly used in long-context and high-resolution models (e.g., in diffusion transformer architectures) where attention logit growth is a first-order training concern.
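A minimal single-head sketch of QK-normalized attention in NumPy follows; the fixed `tau` stands in for the learnable temperature, and all names and shapes are our own illustrative choices:

```python
import numpy as np

def qk_norm_attention(Q, K, V, tau=0.5, eps=1e-6):
    """Single-head attention with per-vector L2-normalized queries and keys.

    Q, K, V are (n, d) arrays; tau plays the role of the learnable
    temperature but is a fixed constant in this sketch.
    """
    Q_hat = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    K_hat = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    logits = Q_hat @ K_hat.T / tau  # cosine similarities scaled by 1/tau
    logits -= logits.max(axis=-1, keepdims=True)  # softmax stabilization
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = qk_norm_attention(Q, K, V)
```

Each output row is a convex combination of the rows of `V`, so the output magnitude is bounded by the value magnitudes regardless of how large the raw query and key projections grow.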
4. Discussion
4.1 The Stability-Performance Trade-off
The empirical record presents a consistent picture: Pre-LN networks train more reliably and converge with minimal hyperparameter tuning, while Post-LN networks, when successfully trained, tend to achieve marginally better final performance. Xiong et al. (2020) observed this pattern in machine translation; subsequent work in language modeling has largely confirmed it.
The mechanistic explanation for Post-LN’s performance advantage, when it can be obtained, likely relates to the richer gradient signal it provides. Because Post-LN applies normalization outside the residual connection, the gradient flows through both the residual path and the normalization’s Jacobian, providing the optimizer with curvature information about the combined signal rather than just the sublayer output. This richer signal may facilitate better weight space exploration near the loss minimum, enabling slightly sharper convergence.
However, this advantage is increasingly difficult to realize at scale. As models grow to tens or hundreds of billions of parameters, the cost of training instability is enormous — a divergence event wastes weeks of GPU time. The practical community has overwhelmingly adopted Pre-LN (or RMSNorm-based Pre-LN) for this reason, accepting the small performance trade-off in exchange for reliable training.
4.2 Hybrid Configurations
Several recent works have proposed hybrid normalization schemes that attempt to capture the stability of Pre-LN while recovering the performance of Post-LN. The sandwich norm (Ding et al., 2021) applies normalization both before the sublayer (as in Pre-LN) and after the sublayer output (before the residual addition), effectively normalizing both paths:
$$x_{l+1} = x_l + \text{LN}(F_l(\text{LN}(x_l)))$$
This configuration has theoretical properties intermediate between Pre-LN and Post-LN. The pre-sublayer normalization ensures stable input statistics to $F_l$, while the post-sublayer normalization bounds the magnitude of the added residual increment, reducing the risk of the residual stream growing unboundedly across depth. Some ablations have found that sandwich norm matches or slightly exceeds Pre-LN performance while retaining most of its stability benefits.
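As code, the sandwich block adds one normalization to the Pre-LN block (a NumPy sketch; the toy linear sublayer is an illustrative assumption):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean())
    return (x - mu) / (sigma + eps)

def sandwich_block(x, F):
    # Pre-LN on the sublayer input, plus a second normalization on the
    # sublayer output, so the increment added to the residual stream has
    # bounded, unit-scale statistics regardless of F's output magnitude.
    return x + layer_norm(F(layer_norm(x)))

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))  # toy linear sublayer, illustrative only
x = rng.normal(size=d)
out = sandwich_block(x, lambda z: W @ z)
```

The residual increment `out - x` is itself a LayerNorm output, so it has zero mean and approximately unit standard deviation by construction, which is the mechanism that prevents unbounded residual stream growth.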
4.3 Normalization and the Residual Stream Hypothesis
The mechanistic interpretability literature has developed the residual stream hypothesis — the view that transformer layers can be understood as performing parallel read-write operations on a shared residual stream, with each attention head and feed-forward layer reading from and writing to this stream additively (Elhage et al., 2021). Under this framework, normalization placement has direct implications for the residual stream’s properties as a communication channel.
In Post-LN, the residual stream is periodically re-standardized, ensuring that its statistics remain stable across depth. This supports the residual stream as a stable communication channel but means that individual layer contributions are normalized out of the stream before downstream layers can read them. In Pre-LN, the residual stream accumulates layer contributions without renormalization, which can cause its norm to grow as $O(\sqrt{L})$ across $L$ layers. This growth is generally benign — it is precisely what the pre-sublayer normalization compensates for — but it means that the absolute magnitudes of individual layer contributions become relatively less significant as depth increases, potentially causing the later layers to dominate the residual stream signal.
4.4 Practical Guidance
For practitioners training transformer-based models today, the following guidance emerges from the theoretical and empirical analysis:
First, default to Pre-LN or Pre-RMSNorm for training stability, particularly at scale. The gradient norm analysis provides strong theoretical backing for this choice, and the empirical record at billion-parameter scale uniformly supports it. Second, if pursuing the highest possible final performance on a fixed compute budget and training is reliable, Post-LN with careful warmup scheduling remains a viable option for medium-scale models where training failures can be tolerated. Third, consider QK-Norm independently of Pre/Post-LN placement for long-context applications; it addresses a distinct pathology and is largely orthogonal to the stability concerns analyzed here. Fourth, RMSNorm should be preferred over full LayerNorm in most modern settings; its computational savings are non-trivial at scale and its empirical performance is comparable.
5. Conclusion
Layer normalization placement is not a cosmetic architectural choice — it fundamentally determines the geometry of gradient flow through deep transformer networks. We have shown through formal Jacobian analysis that Post-LN’s placement of normalization outside the residual connection creates exponentially growing gradient norms at depth, requiring compensatory training procedures, while Pre-LN’s identity residual path guarantees at most polynomial gradient norm growth and enables stable training without warmup. The stability-performance trade-off between these configurations is real but increasingly asymmetric: as models scale, the cost of training instability overwhelms the marginal performance benefit of Post-LN, explaining the community’s pragmatic adoption of Pre-LN and RMSNorm-based variants.
Beyond this primary analysis, we have examined QK-Norm as a complementary mechanism addressing attention logit growth, and sandwich norm as a hybrid approach attempting to combine the stability of Pre-LN with the potential performance of Post-LN. The residual stream hypothesis from mechanistic interpretability provides an additional lens through which normalization placement affects a model’s computational structure.
Open questions remain. The theoretical gap between Pre-LN and Post-LN final performance lacks a fully satisfying mechanistic explanation. The optimal normalization strategy for very long context lengths (where residual stream norm growth across millions of tokens becomes relevant) is underexplored. And the interaction between normalization placement and the increasingly common practice of weight tying, shared layers, and other parameter efficiency techniques deserves systematic study. Normalization, often treated as a solved component, continues to offer nontrivial research questions at the frontier of large-scale model training.
References
- Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019.
- Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., & Liu, T. Y. (2020). On layer normalization in the transformer architecture. Proceedings of ICML 2020.
- Liu, L., Liu, X., Gao, J., Chen, W., & Han, J. (2020). Understanding the difficulty of training transformers. Proceedings of EMNLP 2020.
- Zhang, B., & Sennrich, R. (2019). Root mean square layer normalization. Advances in Neural Information Processing Systems, 32.
- Shleifer, S., Press, O., & Wolf, T. (2021). NormFormer: Improved transformer pretraining with extra normalization. arXiv preprint arXiv:2110.09456.
- Henry, A., Dachapally, P. R., Pawar, S., & Chen, Y. (2020). Query-key normalization for transformers. Findings of EMNLP 2020.
- Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., & Olah, C. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread.
- Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao, Z., Yang, H., & Tang, J. (2021). CogView: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34.