Abstract
The Lottery Ticket Hypothesis (LTH), introduced by Frankle and Carbin (2019), proposes that dense neural networks contain sparse subnetworks—termed “winning tickets”—that, when trained in isolation from their original initialization, can match the performance of the full network in at most a comparable number of training iterations. This hypothesis has profound implications for our understanding of neural network optimization, generalization, and the structural priors encoded by standard training procedures. In this post, I analyze the theoretical foundations of the LTH, examine the conditions under which winning tickets exist and transfer across tasks, survey the mechanistic explanations proposed in the literature, and discuss how the hypothesis interfaces with broader questions of network pruning, neural tangent kernel theory, and the implicit bias of stochastic gradient descent. I also address recent challenges to the hypothesis, including the lottery ticket’s dependence on rewinding schedules and the computational cost of the iterative magnitude pruning procedure used to find tickets.
1. Introduction
A central puzzle in deep learning is the relationship between overparameterization and generalization. Neural networks used in practice are vastly overparameterized relative to the number of training examples—yet they generalize remarkably well, resist overfitting under certain training regimes, and exhibit surprising robustness to weight perturbations. Classical statistical learning theory, which bounds generalization error in terms of hypothesis class complexity, offers limited explanatory power in this regime.
One productive lens through which to examine this puzzle is network pruning. It has long been known that trained neural networks can be significantly compressed without loss of accuracy: many weights are near-zero, redundant, or otherwise unnecessary for correct classification. Post-training magnitude pruning—removing weights with the smallest absolute values—achieves high sparsity (often 90% or more) with minimal accuracy degradation across a wide range of architectures and tasks (Han et al., 2015; LeCun et al., 1990).
The Lottery Ticket Hypothesis sharpens and inverts this observation. Rather than pruning after training, Frankle and Carbin (2019) ask: can we identify a sparse subnetwork before or early in training that, if trained from scratch with its original initialization, achieves the same accuracy as the full network? Their empirical answer is yes, but with a critical caveat: the sparse subnetwork must be initialized to the same values it had at the beginning of training, not to a new random initialization. This finding distinguishes winning tickets from arbitrary sparse networks and points to a deep interaction between architecture (sparsity pattern) and initialization (specific weight values).
The implications are substantial. If winning tickets exist broadly, it suggests that the generalization success of overparameterized networks may not require all of their parameters. Instead, a small subset of parameters—the winning ticket—may carry the representational and optimization burden, while the rest serve as a search space that gradient descent must navigate to find the winning subnetwork. Understanding why this is so, and when it holds, connects directly to questions about the implicit regularization of SGD, the geometry of loss landscapes, and the foundations of efficient neural network design.
2. Related Work
The LTH sits at the intersection of several active research threads. Understanding its context requires reviewing foundational and recent work across network pruning, initialization theory, and the theoretical study of overparameterization.
Han et al. (2015) demonstrated that large convolutional and recurrent networks could be pruned to less than 10% of their original parameter count with negligible accuracy loss, provided pruning is interleaved with fine-tuning. This established that sparse solutions to the training objective exist and are accessible via post-hoc pruning, motivating the LTH’s question about pre-training identifiability.
LeCun et al. (1990) introduced Optimal Brain Damage, an early second-order pruning approach using the Hessian of the loss to identify unimportant weights. This foundational work established the idea that network architecture is malleable post-training, and that salience—rather than magnitude alone—is the theoretically correct pruning criterion.
Frankle and Carbin (2019) formalized the LTH and introduced Iterative Magnitude Pruning (IMP) as the procedure for finding winning tickets. Their empirical results showed that winning tickets exist in fully connected networks on MNIST and CNNs on CIFAR-10, but that naive sparse training from random initializations fails—the original initialization is essential.
Frankle et al. (2020) extended the LTH to large-scale settings (ResNets on ImageNet) by introducing “late rewinding”: rather than resetting pruned weights to their initialization at step 0, they rewind to values at a small number of early training steps (e.g., step 500 out of 90,000). This modification was necessary for winning tickets to exist at scale, and raises important questions about what the early training phase computes.
Zhou et al. (2019) dissected which ingredients of a winning ticket actually matter. They found that the signs of the original initialization, not its exact magnitudes, are the critical component: resetting surviving weights to a shared constant magnitude while keeping their original signs preserves the ticket property, whereas randomizing the signs destroys it. They also identified “supermasks”, pruning masks that yield accuracy far above chance when applied to the untrained random initialization, with no weight training at all. This suggests that the mask itself encodes much of the solution and connects the LTH to the broader question of random weight networks.
Evci et al. (2020) studied the gradient flow of pruned networks through the lens of “Rigging the Lottery,” showing that dynamic sparse training—where sparsity patterns change throughout training—can match dense network performance while maintaining constant sparsity. This suggests that the fixed-mask assumption of the LTH may be overly restrictive.
Morcos et al. (2019) demonstrated that winning tickets found on one dataset transfer to related datasets and across optimizers, suggesting that winning tickets capture task-agnostic structural priors about how gradient descent organizes representations. This transferability finding has since been replicated and extended to NLP settings with BERT and related architectures (Chen et al., 2020).
3. Technical Analysis
Let $f(x; \theta)$ denote a neural network with parameters $\theta \in \mathbb{R}^d$ trained to minimize a loss $\mathcal{L}(\theta)$ over a dataset $\mathcal{D}$. Let $\theta_0$ denote the random initialization and $\theta_T$ the parameters after $T$ steps of SGD. A binary mask $m \in \{0,1\}^d$ defines a sparse subnetwork with parameters $m \odot \theta$, where $\odot$ is elementwise multiplication. The sparsity level is $s = 1 - \|m\|_0 / d$.
Formal Statement of the LTH. Frankle and Carbin (2019) state: for any network $f(x; \theta)$ trained to accuracy $a$ in $T$ steps, there exists a mask $m$ such that the sparse subnetwork $f(x; m \odot \theta)$, trained from the initialization $m \odot \theta_0$ with the same optimizer, achieves accuracy $\geq a$ in $\leq T$ steps, with $\|m\|_0 \ll d$.
Iterative Magnitude Pruning (IMP). The standard procedure for finding winning tickets is:
1. Initialize $\theta = \theta_0$ randomly.
2. Train for $T$ steps to obtain $\theta_T$.
3. Prune the $p\%$ of surviving weights with the smallest magnitudes $|\theta_T^i|$, updating the mask $m$.
4. Reset unpruned weights to their initialization: $\theta \leftarrow m \odot \theta_0$.
5. Repeat from step 2 until the target sparsity is reached.
The key operation is step 4: resetting to $\theta_0$ (or to $\theta_k$ for some small $k$ in the late rewinding variant). This is the step that distinguishes winning tickets from arbitrary sparse networks.
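As a concrete illustration, here is a minimal numpy sketch of the IMP loop on a toy sparse least-squares problem. The data, model, and 50%-per-round pruning rate are illustrative assumptions, not values from the original papers; the point is the train → prune → rewind mechanics.

```python
import numpy as np

# Toy problem: y = X @ theta_true, where only a few weights matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
theta_true = np.zeros(50)
theta_true[:5] = rng.normal(size=5)
y = X @ theta_true

def train(theta, mask, steps=500, lr=0.01):
    """Gradient descent on the masked model; pruned weights stay zero."""
    for _ in range(steps):
        grad = X.T @ (X @ (mask * theta) - y) / len(y)
        theta = theta - lr * (mask * grad)
    return mask * theta

theta_0 = rng.normal(scale=0.1, size=50)       # record the initialization
mask = np.ones(50)
for _ in range(4):                             # four IMP rounds
    theta_T = train(theta_0.copy(), mask)      # train to theta_T
    alive = np.flatnonzero(mask)
    # Keep the largest-magnitude half of the surviving weights.
    keep = alive[np.argsort(np.abs(theta_T[alive]))[-(len(alive) // 2):]]
    mask = np.zeros(50)
    mask[keep] = 1.0
    # Rewinding to theta_0 happens via theta_0.copy() on the next round.
ticket = train(theta_0.copy(), mask)           # train the final ticket
sparsity = 1 - mask.sum() / mask.size          # 50 -> 25 -> 12 -> 6 -> 3 weights
```

Four rounds of 50% pruning leave 3 of 50 weights, i.e. 94% sparsity; the rewind in step 4 is exactly the `theta_0.copy()` passed back into `train`.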
Why Does Initialization Matter? The necessity of the original initialization has been explained through several complementary mechanisms:
Signal propagation theory. A well-known line of work (Poole et al., 2016; Schoenholz et al., 2017) shows that random initializations near the “edge of chaos”—where the mean squared singular value of the input-output Jacobian is one—enable effective signal propagation through deep networks. Winning ticket initializations may encode particularly favorable signal propagation properties.
Implicit gradient alignment. The gradients at initialization determine the direction of initial parameter updates. For the sparse subnetwork to train efficiently, its gradients at $m \odot \theta_0$ must be well-aligned with the task loss. Winning ticket initializations may achieve this alignment by virtue of the IMP selection process, which implicitly selects weights whose gradients accumulate effectively toward the task objective.
Neural Tangent Kernel perspective. In the infinite-width limit, neural networks trained with gradient descent converge to kernel regression with the Neural Tangent Kernel (NTK) $\Theta(x, x') = \nabla_\theta f(x; \theta_0)^\top \nabla_\theta f(x'; \theta_0)$ (Jacot et al., 2018). The sparse subnetwork has a restricted NTK $\Theta_m(x, x') = \nabla_{m \odot \theta} f(x; m \odot \theta_0)^\top \nabla_{m \odot \theta} f(x'; m \odot \theta_0)$. For the winning ticket to match full network performance, $\Theta_m$ must preserve the eigenspectrum of $\Theta$ sufficiently. The original initialization $\theta_0$ determines $\Theta_m$, explaining why initialization matters.
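The restricted kernel can be estimated empirically at finite width. The sketch below uses a small two-layer tanh network and finite-difference gradients, both illustrative assumptions; because the network is evaluated at the masked parameters, masked-out coordinates contribute zero gradient, so the Gram matrix is exactly the restricted kernel $\Theta_m$.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 3, 16
n_params = d_h * d_in + d_h
theta0 = rng.normal(size=n_params)            # [W flattened, v]

def f(x, theta, mask):
    """Two-layer tanh network evaluated at the masked parameters m * theta."""
    t = mask * theta
    W = t[: d_h * d_in].reshape(d_h, d_in)
    v = t[d_h * d_in:]
    return v @ np.tanh(W @ x) / np.sqrt(d_h)

def grad_f(x, theta, mask, eps=1e-5):
    """Central finite differences for the gradient of f w.r.t. theta."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(x, theta + e, mask) - f(x, theta - e, mask)) / (2 * eps)
    return g

def empirical_ntk(xs, theta, mask):
    """Gram matrix of per-example gradients: Theta(x, x') = grad_f(x) . grad_f(x')."""
    G = np.stack([grad_f(x, theta, mask) for x in xs])
    return G @ G.T

xs = [rng.normal(size=d_in) for _ in range(4)]
full = empirical_ntk(xs, theta0, np.ones(n_params))
mask = (np.abs(theta0) > np.median(np.abs(theta0))).astype(float)  # ~50% sparse
sparse = empirical_ntk(xs, theta0, mask)
```

Comparing the eigenvalues of `full` and `sparse` gives a direct, if crude, check of how much of the kernel's spectrum a given mask preserves.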
Late Rewinding and the Role of Early Training. For large networks like ResNet-50 on ImageNet, winning tickets only emerge when weights are rewound to step $k \approx 500$ rather than step 0. This finding, due to Frankle et al. (2020), suggests that early training performs a “warm-up” function that establishes a stable optimization trajectory. Formally, let $\theta_k$ denote parameters at step $k \ll T$. The late-rewound winning ticket is $m \odot \theta_k$, and the ticket property holds for this initialization but not for $m \odot \theta_0$.
This has been interpreted through the lens of the “linear mode connectivity” analysis. Frankle et al. (2020) show that $\theta_k$ and $\theta_T$ are connected by a linear path in parameter space with no loss barrier (up to small perturbations), while $\theta_0$ and $\theta_T$ are not. Early training thus stabilizes the network into a “basin” of the loss landscape where sparse subnetworks can find equivalent solutions.
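The barrier measurement itself is simple to state in code. The sketch below evaluates a loss along the linear path between two parameter vectors; the convex bowl and the 1-D double well are stand-ins for a trained network's loss surface, chosen so that the connected and disconnected cases are both visible.

```python
import numpy as np

def loss_barrier(loss, theta_a, theta_b, n_points=25):
    """Max loss along the linear path minus the worse endpoint loss.
    A value near zero indicates linear mode connectivity."""
    alphas = np.linspace(0.0, 1.0, n_points)
    path = [loss((1 - a) * theta_a + a * theta_b) for a in alphas]
    return max(path) - max(loss(theta_a), loss(theta_b))

# A convex bowl: any two points are linearly connected (barrier = 0).
bowl = lambda t: float(np.sum(t ** 2))
barrier = loss_barrier(bowl, np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# A 1-D double well (t^2 - 1)^2: the minima at -1 and +1 are separated
# by a barrier of height 1 at the midpoint t = 0.
double_well = lambda t: float((t[0] ** 2 - 1) ** 2)
barrier2 = loss_barrier(double_well, np.array([-1.0]), np.array([1.0]))
```

In the linear-mode-connectivity experiments, `theta_a` and `theta_b` would be the two trained copies spawned from $\theta_k$, and `loss` the network's training or test loss.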
Sparsity and Generalization. An important theoretical question is whether winning tickets generalize better than arbitrary sparse networks of the same density. Empirically, they do—random sparse masks at the same sparsity level typically fail to train effectively, while winning ticket masks succeed. This suggests that the IMP process discovers masks that preserve the inductive bias of the dense network’s architecture.
A formal explanation connects to the Minimum Description Length (MDL) principle. If we view the pruning mask $m$ as part of the model description, sparser models have shorter descriptions: specifying the support of $m$ costs roughly $\log d$ bits per surviving weight. The lottery ticket selection process can be seen as finding a model that achieves low training loss with a short description, consistent with MDL-style generalization bounds of the schematic form:
$$\mathcal{L}_{\text{gen}}(m \odot \theta) \leq \mathcal{L}_{\text{train}}(m \odot \theta) + \sqrt{\frac{\|m\|_0 \log d + \log(1/\delta)}{2n}}$$
where $n$ is the number of training examples and $\delta$ is the failure probability. Sparser masks $m$ tighten this bound, providing a generalization-theoretic rationale for why winning tickets—which are sparse—generalize well.
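To see how the complexity term behaves, the right-hand side of the bound can be evaluated numerically. The parameter count, sample size, and confidence level below are hypothetical, chosen only to show how the gap shrinks with sparsity.

```python
import math

def complexity_term(nnz, d, n, delta=0.05):
    """The bound's gap term: sqrt((||m||_0 log d + log(1/delta)) / (2n))."""
    return math.sqrt((nnz * math.log(d) + math.log(1 / delta)) / (2 * n))

d, n = 1_000_000, 50_000          # hypothetical parameter and sample counts
for sparsity in (0.0, 0.9, 0.99):
    nnz = int((1 - sparsity) * d)
    print(f"sparsity {sparsity:4.2f}: generalization gap bound ~ "
          f"{complexity_term(nnz, d, n):.3f}")
```

Since the gap scales as $\sqrt{\|m\|_0}$, moving from a dense mask to 99% sparsity shrinks the bound by a factor of ten, which is the quantitative content of the MDL argument.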
Supermasks and Structural Priors. Zhou et al. (2019) probed which properties of $\theta_0$ a winning ticket actually depends on. They showed that replacing each surviving weight with a constant of the same sign, $\theta_0' = \alpha \cdot \operatorname{sign}(\theta_0)$ for a constant $\alpha > 0$, preserves the ticket property, while randomizing the signs destroys it: the signs, not the exact magnitudes, carry the essential information. They further exhibited “supermasks”: masks $m$ for which $f(x; m \odot \theta_0)$ achieves nontrivial accuracy with no gradient updates to the weights at all. These results suggest that a winning ticket is largely an architectural discovery, a sparsity pattern plus a sign pattern, rather than a discovery of precise initialization values, and they connect the LTH to random feature learning.
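The initialization variants studied in this line of work are straightforward to construct. The sketch below builds a sign-preserving constant-magnitude variant and a random-sign variant of a masked initialization; the choice of the constant (here the empirical standard deviation of $\theta_0$) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0 = rng.normal(scale=0.1, size=8)                   # original initialization
mask = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=float)   # example pruning mask

alpha = theta0.std()                                     # illustrative constant
# Variant A: keep each surviving weight's original sign, constant magnitude.
sign_constant = mask * np.sign(theta0) * alpha
# Variant B: random signs, same constant magnitude.
random_sign = mask * rng.choice([-1.0, 1.0], size=8) * alpha
```

Running a winning-ticket mask with each variant in place of $m \odot \theta_0$ is the experiment that isolates how much of the ticket property lives in the signs versus the magnitudes.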
4. Discussion
The LTH, while empirically compelling, raises several open questions that are actively debated in the community.
Computational Cost of Finding Tickets. The iterative magnitude pruning procedure requires training the full network multiple times (typically 15–20 IMP rounds to reach 99% sparsity), making the cost of finding a winning ticket far higher than simply training the full network once. From a practical standpoint, this is a significant limitation: if the goal is efficient training, IMP is not the answer. The value of the LTH is therefore primarily theoretical—it establishes that sparse trainable subnetworks exist—rather than providing a practical algorithm for sparse training from scratch.
Dynamic sparse training methods like RigL (Evci et al., 2020) address this by maintaining a sparse network throughout training, periodically updating the sparsity mask. While these methods do not find “winning tickets” in the strict LTH sense, they achieve competitive accuracy at constant sparsity without the full-network pretraining step, making them more practically viable.
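One RigL-style drop-and-grow step can be sketched as follows. The drop fraction and tie-breaking are simplifications of the actual algorithm, which also anneals the update fraction over training; the toy tensors stand in for real weights and gradients.

```python
import numpy as np

def rigl_update(theta, grad, mask, drop_frac=0.3):
    """One drop-and-grow step: prune the weakest surviving weights and
    reactivate the same number of pruned weights with the largest
    gradient magnitude, keeping the sparsity level constant."""
    alive = np.flatnonzero(mask)
    dead = np.flatnonzero(mask == 0)
    k = min(int(drop_frac * len(alive)), len(dead))
    if k == 0:
        return theta.copy(), mask.copy()
    drop = alive[np.argsort(np.abs(theta[alive]))[:k]]   # weakest survivors
    grow = dead[np.argsort(np.abs(grad[dead]))[-k:]]     # strongest gradients
    new_mask = mask.copy()
    new_mask[drop] = 0.0
    new_mask[grow] = 1.0
    new_theta = theta * new_mask
    new_theta[grow] = 0.0                                # grown weights start at zero
    return new_theta, new_mask

rng = np.random.default_rng(3)
theta = rng.normal(size=20)
mask = (np.arange(20) % 2 == 0).astype(float)            # fixed 50% sparsity
theta *= mask
grad = rng.normal(size=20)                               # stand-in for a real gradient
theta2, mask2 = rigl_update(theta, grad, mask)
```

Because exactly `k` weights are dropped and `k` grown, the number of active weights is invariant across updates, which is what distinguishes this family of methods from the train-prune-rewind cycle of IMP.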
Transferability and Universality. Morcos et al. (2019) found that winning tickets transfer across related datasets (CIFAR-10 to CIFAR-100, SVHN), and subsequent work has extended this to NLP (Chen et al., 2020), where BERT winning tickets transfer across GLUE tasks. This transferability is theoretically interesting: it suggests that winning tickets encode general-purpose representational structure rather than task-specific solutions. If true, this would imply that the combinatorial lottery over initializations and architectures is not task-specific, and that a single winning ticket could serve as a foundation model for a family of related tasks—a connection to the contemporary paradigm of foundation models and transfer learning.
Challenges at Scale. The original LTH results were demonstrated on relatively small networks (LeNet, small ResNets). At scale—GPT-scale language models, ViT-scale vision transformers—the late rewinding finding suggests that early training is essential, and the rewind point must be chosen carefully. Furthermore, recent work (Frankle et al., 2021) questions whether IMP finds truly optimal sparse subnetworks or merely locally competitive ones. The optimization landscape of sparse networks is combinatorial and non-convex, and IMP’s greedy magnitude-based pruning is not guaranteed to find global optima.
Connection to Overparameterization Theory. The LTH connects to a broader theoretical program aimed at understanding why overparameterized networks train to low training loss (Allen-Zhu et al., 2019; Du et al., 2019). These works show, under assumptions of sufficient width, that SGD converges to global minima of the training loss. The LTH adds a structural dimension: not only does overparameterization enable global convergence, but it also creates a diverse enough landscape that sparse trainable subnetworks exist within the full network’s architecture. The overparameterized network serves as a search distribution over sparse architectures, and training (via IMP) is a procedure for sampling from this distribution.
Biological Parallels. The LTH has attracted attention for its potential parallels to synaptic pruning in biological neural development. Infant brains begin with an overabundance of synaptic connections, which are pruned during development based on activity patterns. If biological neural networks also operate via a lottery ticket mechanism—with genetic initialization providing the initial connectivity and experience-dependent activity selecting the winning subnetwork—this would suggest deep connections between artificial and biological learning. While these parallels are speculative and should be treated cautiously, they highlight the LTH as a potentially unifying principle across learning systems.
5. Conclusion
The Lottery Ticket Hypothesis has significantly advanced our understanding of neural network training, sparsity, and the role of initialization. Its central claim—that dense networks contain sparse winning tickets that can train as effectively as the full network—has been empirically validated across architectures and tasks, extended to large-scale settings via late rewinding, and theoretically connected to NTK theory, MDL-based generalization bounds, and the implicit bias of gradient descent.
The most important theoretical insight of the LTH is that overparameterization is not merely a convenient technical condition for convergence proofs: it actively creates a rich landscape of sparse trainable subnetworks from which gradient descent (via IMP) can extract a winning ticket. This reframes the question of why overparameterized networks generalize well—not because all weights contribute equally, but because the network’s overparameterized search space makes it likely that a compact, well-initialized subnetwork exists within it.
Key open problems include: (1) efficient algorithms for finding winning tickets without multi-round IMP; (2) theoretical characterization of which initializations produce winning tickets; (3) whether late rewinding truly identifies winning tickets or merely a local variant; (4) extending LTH analysis to attention-based architectures like transformers, where the notion of “weight” and “pruning” is less straightforward than in feedforward or convolutional networks.
As language models and vision transformers continue to scale, understanding how to extract maximally compact trainable subnetworks will have direct implications for efficient inference, continual learning, and the design of foundation models. The LTH provides a foundational framework for this inquiry, and resolving its open questions remains one of the most productive frontiers in deep learning theory.
References
- Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A convergence theory for deep learning via over-parameterization. ICML 2019.
- Chen, T., Frankle, J., Chang, S., Liu, S., Zhang, Y., Wang, Z., & Carbin, M. (2020). The lottery ticket hypothesis for pre-trained BERT networks. NeurIPS 2020.
- Du, S., Lee, J., Li, H., Wang, L., & Zhai, X. (2019). Gradient descent finds global minima of deep neural networks. ICML 2019.
- Evci, U., Gale, T., Menick, J., Castro, P. S., & Elsen, E. (2020). Rigging the lottery: Making all tickets winners. ICML 2020.
- Frankle, J., & Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. ICLR 2019.
- Frankle, J., Dziugaite, G. K., Roy, D. M., & Carbin, M. (2020). Linear mode connectivity and the lottery ticket hypothesis. ICML 2020.
- Han, S., Pool, J., Tran, J., & Dally, W. J. (2015). Learning both weights and connections for efficient neural networks. NeurIPS 2015.
- Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. NeurIPS 2018.
- LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. NeurIPS 1990.
- Morcos, A., Yu, H., Paganini, M., & Tian, Y. (2019). One ticket to win them all: Generalizing lottery ticket initializations across datasets and optimizers. NeurIPS 2019.
- Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., & Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. NeurIPS 2016.
- Schoenholz, S. S., Gilmer, J., Ganguli, S., & Sohl-Dickstein, J. (2017). Deep information propagation. ICLR 2017.
- Zhou, H., Lan, J., Liu, R., & Yosinski, J. (2019). Deconstructing lottery tickets: Zeros, signs, and the supermask. NeurIPS 2019.