Abstract
Continual learning — the capacity of a model to sequentially acquire new knowledge without destroying previously learned representations — remains one of the most fundamental open problems in machine learning. The central obstacle is catastrophic forgetting: a neural network trained on a new task rapidly overwrites weights encoding prior tasks, causing performance on those tasks to collapse. This phenomenon, first documented in connectionist systems in the late 1980s, has resisted clean theoretical resolution despite decades of algorithmic progress. In this post, I examine the problem from multiple angles: the stability-plasticity dilemma framing, the geometric mechanisms by which gradient descent drives forgetting, three dominant algorithmic paradigms (regularization, replay, and architectural expansion), and the theoretical gaps that still leave continual learning far from solved in realistic deployment settings. I also survey recent empirical benchmarks and identify the conditions under which different approaches are and are not effective.
1. Introduction
The standard machine learning pipeline assumes that all training data is available simultaneously. This assumption is violated in virtually every realistic deployment scenario. A clinical NLP system must incorporate new terminology as medical language evolves. A robotic agent must adapt to new environments without forgetting how to navigate old ones. A language model serving millions of users should incorporate new factual knowledge without catastrophic regression on established capabilities.
The biological analogy here is seductive but should be handled carefully. Humans clearly exhibit some form of continual learning — we learn new skills without forgetting how to ride bicycles — but the mechanisms are quite different from gradient-based artificial neural networks. Neuroscience identifies complementary learning systems: a fast hippocampal system for rapid episodic encoding and a slower neocortical system for gradual consolidation (McClelland et al., 1995). Artificial systems have no such architectural division by default.
Catastrophic forgetting in neural networks arises from a specific property of gradient descent: parameter updates are not localized. When a network is fine-tuned on Task B, the gradient signal from Task B loss updates weights that are also critical for Task A. Because there is no mechanism preventing Task A-relevant weight configurations from being overwritten, Task A performance degrades rapidly — often to near-chance levels after only a few gradient steps on Task B. McCloskey and Cohen (1989) first described this phenomenon systematically, and it has been reproduced across virtually every architecture and task domain studied since.
The challenge is that plasticity and stability are fundamentally in tension. A network that is maximally plastic — one that updates all its weights freely in response to new data — will forget everything it has learned. A network that is maximally stable — one that freezes all its weights — cannot learn anything new. Productive continual learning requires navigating this dilemma, allocating plasticity where it is needed without sacrificing stability where it has been earned.
This post organizes the technical landscape as follows. Section 2 surveys the key prior work spanning regularization, replay, and architectural approaches. Section 3 provides a technical analysis of the geometric mechanisms of forgetting and the formal properties of leading algorithms. Section 4 discusses unresolved questions, including the benchmark validity problem and the tension between task-incremental and class-incremental settings. Section 5 concludes with a synthetic assessment of where the field stands.
2. Related Work
The modern continual learning literature has converged on a handful of canonical methods that serve as baselines and conceptual anchors.
Elastic Weight Consolidation (EWC). Kirkpatrick et al. (2017) introduced EWC as an explicit formalization of selective forgetting prevention. The central idea is to estimate, for each parameter $\theta_i$, an importance weight $F_i$ corresponding to its contribution to the Task A solution, and then penalize large deviations from the Task A parameter values during Task B training. The penalty term takes the form:
$$\mathcal{L}_{EWC}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^*)^2$$
where $\theta_i^*$ are the parameters after Task A training and $F_i$ is the $i$-th diagonal of the Fisher information matrix, computed on Task A data. EWC was influential and sparked a large body of follow-on work, but it has well-documented limitations: the Fisher diagonal is a coarse approximation to the true curvature, and the approach does not scale gracefully to many sequential tasks.
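To make the penalty concrete, here is a minimal NumPy sketch. It uses the common squared-gradient estimate of the diagonal Fisher; the function names and the assumption that per-example gradients are available as an array are mine, not from the original paper.

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    """Estimate the Fisher diagonal as the mean squared per-example gradient.
    per_example_grads: array of shape (num_examples, num_params)."""
    return np.mean(np.square(per_example_grads), axis=0)

def ewc_loss(task_b_loss, theta, theta_star, fisher_diag, lam=1.0):
    """Task B loss plus the EWC quadratic penalty anchored at theta_star,
    the parameters found at the end of Task A training."""
    penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_star) ** 2)
    return task_b_loss + penalty
```

In practice the Fisher is estimated once at the end of Task A training and held fixed; only `theta` changes during Task B optimization.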
Progressive Neural Networks. Rusu et al. (2016) took a radically different approach: rather than constraining how existing weights change, they freeze all weights from previous tasks entirely and add new network columns for each new task, with lateral connections allowing the new column to leverage prior representations. This fully eliminates forgetting by construction, at the cost of growing model size linearly with task count. The approach works well for a small number of tasks in controlled settings but is impractical for open-ended deployment.
Gradient Episodic Memory (GEM) and A-GEM. Lopez-Paz and Ranzato (2017) proposed storing a small episodic memory buffer from each previous task and enforcing the constraint that gradient updates on the current task do not increase loss on stored examples. This is formalized as a quadratic programming problem at each step, ensuring that parameter updates are projected into a feasible cone. Chaudhry et al. (2019) introduced A-GEM, which relaxes the per-task constraint to an average constraint, yielding substantial computational savings at comparable empirical performance.
Experience Replay and Dark Experience Replay. Simple experience replay — maintaining a buffer of past examples and mixing them into current training batches — is surprisingly competitive when tuned well. Buzzega et al. (2020) introduced Dark Experience Replay (DER), which augments replay with knowledge distillation: the buffer stores not just (input, label) pairs but also the logit outputs of the model at the time the example was first processed. Training then includes a distillation loss that encourages current model outputs on buffered examples to match those stored logits, preserving the functional behavior of the network at the time of original training rather than merely its class assignments.
PackNet and Supermask. Mallya and Lazebnik (2018) introduced PackNet, which applies iterative pruning to identify redundant parameters after each task and reserves them for future tasks. Wortsman et al. (2020) generalized this with the Supermask approach, learning binary masks over weights for each task via differentiable relaxation. Both approaches represent a line of architectural methods that avoid forgetting by partitioning model capacity across tasks, with theoretical guarantees on Task A performance at the cost of bounded capacity for future learning.
3. Technical Analysis
3.1 The Geometry of Forgetting
To understand why catastrophic forgetting occurs so readily, it is useful to think about the loss landscape geometry. Let $\mathcal{L}_A(\theta)$ and $\mathcal{L}_B(\theta)$ denote the loss functions for Tasks A and B respectively. After training on Task A, the model resides in a region $\Theta_A^*$ that minimizes $\mathcal{L}_A$. The critical question is: what is the geometry of $\mathcal{L}_A$ in the vicinity of $\Theta_A^*$?
For overparameterized networks, there are typically many solutions — the task-A minimum is not a point but a manifold. Empirically, the loss landscape around a well-trained neural network is often relatively flat in many directions (Goodfellow et al., 2015), meaning that small perturbations to the weights in certain directions have negligible effect on Task A performance. The directions of high curvature (large eigenvalues of the Hessian) correspond to parameters that are critical for Task A; the flat directions correspond to parameters that could be changed without significant cost.
Catastrophic forgetting occurs because gradient descent on Task B is not curvature-aware with respect to Task A. The update $\Delta\theta = -\eta \nabla_{\theta} \mathcal{L}_B(\theta)$ moves parameters along the gradient of $\mathcal{L}_B$, which need not respect the high-curvature directions of $\mathcal{L}_A$. EWC’s Fisher-based penalty is an attempt to approximate this curvature and penalize movement in high-curvature Task A directions, but the diagonal approximation to the Fisher misses cross-parameter interactions that can be critical.
A natural extension is to work with the full Fisher or the Kronecker-factored approximation (K-FAC). Ritter et al. (2018) developed Laplace approximation-based continual learning using K-FAC to model the posterior over weights after each task, yielding better-calibrated importance estimates. However, even K-FAC is an approximation, and for large models the computational cost of maintaining a meaningful curvature estimate per task is prohibitive.
3.2 Task Boundaries and the Class-Incremental Problem
A critical but often underemphasized distinction in continual learning evaluation is between task-incremental and class-incremental settings. In the task-incremental setting, the model is given task identity at test time and can apply task-specific output heads or masking. In the class-incremental setting, the model must classify across all classes seen so far without access to task identity.
Van de Ven and Tolias (2019) demonstrated empirically that methods which perform well in task-incremental settings often fail catastrophically in class-incremental settings. This matters because class-incremental is the realistic deployment scenario: a classifier that sees new classes over time must be able to decide whether a test input belongs to any of the classes it has ever seen, not just the most recently trained subset. The presence of task identity at test time is an oracle that dramatically simplifies the problem.
In class-incremental learning, replay-based methods hold a significant practical advantage. By maintaining samples from old classes and replaying them during new task training, these methods directly address the data imbalance that drives forgetting. Without replay, models trained with cross-entropy loss are biased toward recently seen classes because the gradient signal from old classes has been entirely removed from the training distribution.
3.3 Replay Buffer Strategies
A fundamental constraint in replay-based methods is buffer budget. In realistic deployments, storing large fractions of prior training data is often infeasible for privacy, storage, or computational reasons. This motivates a significant subproblem: given a budget of $M$ stored examples distributed across $T$ tasks, what selection and retrieval strategies maximize continual learning performance?
Reservoir sampling (Vitter, 1985) provides a principled baseline: maintain a uniform random sample of all data seen so far by replacing stored examples with probability $M / n$ when the $n$-th example arrives. This preserves unbiased coverage of the joint training distribution. However, uniform sampling ignores difficulty and representativeness structure.
More recent work has explored difficulty-aware selection (store examples near the decision boundary), coverage-maximizing selection (maximize diversity in feature space), and gradient-alignment selection (store examples whose gradients conflict most with current-task gradients, as in the original GEM paper). Empirical comparisons suggest that no single selection strategy dominates across all settings, and the relative advantage of sophisticated strategies over plain reservoir sampling is often modest (Hayes et al., 2020).
Generative replay (Shin et al., 2017) replaces stored examples with samples from a generative model trained on prior tasks. This sidesteps storage constraints but introduces a new source of error: the generative model itself must be trained continually and is itself subject to forgetting. In practice, generative replay tends to underperform memory replay in all but the simplest benchmark settings, though recent work with high-quality diffusion models has revived interest in the approach.
3.4 Regularization Methods: Formal Properties
Returning to regularization methods, it is worth examining their formal properties more carefully. EWC and its variants add a penalty of the form:
$$\mathcal{L}_{reg}(\theta) = \mathcal{L}_B(\theta) + \Omega(\theta, \theta^*, F)$$
where $\Omega$ measures deviation from previous task solutions weighted by parameter importance. For EWC, $\Omega = \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^*)^2$. A fundamental limitation of this formulation is that it anchors to a fixed point $\theta^*$, which becomes problematic as the number of tasks grows. After $T$ tasks, the regularization term accumulates over all previous tasks:
$$\mathcal{L}_{\text{EWC-all}}(\theta) = \mathcal{L}_T(\theta) + \sum_{k=1}^{T-1} \frac{\lambda_k}{2} \sum_i F_i^{(k)} (\theta_i - \theta_i^{(k)*})^2$$
The sum of importance weights across all tasks eventually covers most of the parameter space, leaving no room for plasticity. This is the well-documented “rigidity” problem of EWC at scale. Heuristic solutions include annealing $\lambda_k$ for older tasks or maintaining only the most recent anchor point, but these approaches sacrifice theoretical guarantees.
An important alternative is Online EWC (Schwarz et al., 2018), which maintains a running average of the Fisher information across tasks rather than summing all individual task Fishers. This provides a scalable approximation but blurs the task-specific importance information that makes EWC work in the first place.
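The online variant replaces the per-task sum with a single decayed accumulator and a single anchor point; a minimal sketch (the decay parameter `gamma` and function names are mine, standing in for the paper's running-average scheme):

```python
import numpy as np

def online_fisher_update(running_fisher, task_fisher, gamma=0.95):
    """Online EWC: decay the accumulated importance estimate and fold in
    the newest task's Fisher diagonal, so only one importance vector is
    kept regardless of how many tasks have been seen."""
    return gamma * running_fisher + task_fisher

def online_ewc_penalty(theta, theta_anchor, running_fisher, lam=1.0):
    """Single-anchor quadratic penalty: theta_anchor is the parameter
    vector after the most recent task, not one anchor per past task."""
    return 0.5 * lam * np.sum(running_fisher * (theta - theta_anchor) ** 2)
```

Memory cost is constant in the number of tasks, which is the point; the price is that the accumulator can no longer distinguish which past task made a parameter important.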
4. Discussion
4.1 Benchmark Validity and the Split-MNIST Problem
The continual learning literature has a benchmark validity problem. A large fraction of published results are reported on Split-MNIST (dividing MNIST into five binary classification tasks), Split-CIFAR-10, or Permuted-MNIST. These benchmarks have become so well-studied that methods can be tuned to them specifically, and performance on them may not generalize to more realistic settings.
Farquhar and Gal (2018) made this critique explicitly, arguing that many continual learning benchmarks are too simple, and that strong baselines, including plain fine-tuning, fare better against elaborate continual learning methods than the literature acknowledges. This should be taken seriously: if the benchmark tasks share low-level feature representations, inter-task interference is naturally lower and the challenge is artificial.
More recently, benchmarks like Split-CIFAR-100 (with 20 tasks of 5 classes each), CORe50 (with continuous object recognition), and the CL-Benchmark suite (De Lange et al., 2022) have been proposed to provide more challenging and realistic evaluation. On these harder benchmarks, the relative rankings of methods often change substantially from their Split-MNIST rankings.
4.2 The Role of Pre-training
The continual learning landscape has been significantly altered by the ubiquity of large pre-trained models. When a model is initialized from a strong pre-trained backbone (e.g., a CLIP vision encoder or a pre-trained language model), the feature representations are already rich and general. Fine-tuning only the task-specific head — or a small adapter module — dramatically reduces inter-task interference because the shared backbone is largely unchanged.
Mehta et al. (2023) showed that simple fine-tuning with a frozen pre-trained backbone outperforms most specialized continual learning algorithms on standard benchmarks, questioning whether the field’s elaborate machinery is solving the right problem. This is not an argument against continual learning research; rather, it reframes the challenge as adapting within or extending a strong pre-trained model, rather than learning from scratch sequentially.
From a theoretical perspective, this makes sense: a pre-trained model that generalizes well has learned representations that are close to a rich function class, and the local loss landscape around this initialization is qualitatively different from random initialization. The curvature structure tends to be flatter in more directions, providing more room for task-specific adaptation without disrupting shared representations.
4.3 Forward Transfer and Backward Transfer Metrics
Evaluation in continual learning requires metrics beyond simple accuracy on the final task. The field has converged on two complementary measures:
- Backward Transfer (BWT): The average change in performance on previous tasks after new task training. Negative BWT indicates forgetting; positive BWT (backward positive transfer) indicates that new learning improved performance on old tasks.
- Forward Transfer (FWT): The average gain in performance on future tasks from learning prior tasks, relative to a random-initialization baseline. Positive FWT indicates that prior learning facilitates new learning.
Many published methods optimize only BWT (forgetting prevention) while ignoring FWT. But an ideal continual learner should exhibit positive forward transfer — using accumulated knowledge to learn new things faster and better. This is perhaps the more interesting and practically important property, and it is substantially less studied in the literature.
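Both metrics are conventionally computed from a $T \times T$ accuracy matrix $R$, where $R_{i,j}$ is accuracy on task $j$ after training through task $i$. A minimal sketch following the GEM paper's definitions, with a made-up accuracy matrix in the test:

```python
import numpy as np

def backward_transfer(R):
    """BWT: over the first T-1 tasks, average the final accuracy minus the
    accuracy measured immediately after that task was learned.
    R[i, j] = accuracy on task j after training through task i."""
    T = R.shape[0]
    return np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)])

def forward_transfer(R, baseline):
    """FWT: over tasks 2..T, average the accuracy on a task measured just
    before it is trained, minus a random-initialization baseline accuracy
    for that task."""
    T = R.shape[0]
    return np.mean([R[j - 1, j] - baseline[j] for j in range(1, T)])
```

A method with zero BWT and zero FWT neither forgets nor transfers; the aspirational regime is positive values of both.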
4.4 Connections to Meta-Learning and Bayesian Inference
There are natural connections between continual learning and meta-learning that deserve more explicit attention. A meta-learning system trained to quickly adapt to new tasks with minimal forgetting is, in a sense, solving continual learning by amortization. MAML-based approaches (Finn et al., 2017) find parameter initializations that can be rapidly fine-tuned to new tasks, but the original MAML formulation does not explicitly prevent forgetting of prior tasks when applied sequentially.
Bayesian continual learning provides a principled framework: maintain a posterior over parameters after each task, and use it as the prior for the next task. EWC can be derived as an approximation to this Bayesian update where the posterior is approximated as a diagonal Gaussian with mean at the MAP estimate and variance given by the inverse Fisher. Variational continual learning (Nguyen et al., 2018) extends this with a proper variational inference treatment, yielding better-calibrated uncertainty and empirically improved performance at the cost of significantly higher computational complexity.
5. Conclusion
Continual learning and catastrophic forgetting represent a genuinely hard problem that has resisted clean algorithmic resolution for nearly four decades. The stability-plasticity dilemma is not merely an engineering challenge but reflects fundamental tensions in gradient-based learning: parameter updates must be both informative and conservative simultaneously, a requirement that standard gradient descent cannot satisfy without additional structure.
The field’s most robust empirical findings are that: (1) replay-based methods are consistently among the most effective, especially in class-incremental settings; (2) the class-incremental setting is substantially harder than task-incremental and more representative of real deployment; (3) strong pre-trained initializations reduce the severity of catastrophic forgetting significantly and invalidate some of the complexity of specialized algorithms; and (4) current benchmarks are insufficient to characterize the full challenge and may reward methods that are not generally robust.
Looking forward, the most interesting research directions involve understanding continual learning in the regime of large pre-trained models: when is it safe to fine-tune in place versus isolate task-specific parameters in adapters? How should buffer selection be redesigned when the model has rich pre-trained representations? And how do we define and measure forward transfer in a way that reflects genuine generalization rather than dataset-specific overfitting?
The theoretical gap remains substantial. We lack a complete characterization of when and why continual learning is possible, what the fundamental limits on backward transfer are given a fixed parameter budget and task sequence, and how to optimally trade off plasticity and stability as a function of task similarity and buffer size. Addressing these theoretical questions will likely require new connections between the learning theory, optimization, and neuroscience literatures — connections that have been gestured at but not yet rigorously established.
References
- Buzzega, P., Boschini, M., Porrello, A., Abati, D., & Calderara, S. (2020). Dark experience for general continual learning: a strong, simple baseline. NeurIPS 2020.
- Chaudhry, A., Ranzato, M., Rohrbach, M., & Elhoseiny, M. (2019). Efficient lifelong learning with A-GEM. ICLR 2019.
- De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2022). A continual learning survey: Defying forgetting in classification tasks. IEEE TPAMI.
- Farquhar, S., & Gal, Y. (2018). Towards robust evaluations of continual learning. arXiv:1805.09733.
- Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. ICML 2017.
- Goodfellow, I., Vinyals, O., & Saxe, A. (2015). Qualitatively characterizing neural network optimization problems. ICLR 2015.
- Hayes, T. L., Kafle, K., Shrestha, R., Acharya, M., & Kanan, C. (2020). REMIND your neural network to prevent catastrophic forgetting. ECCV 2020.
- Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., … & Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. PNAS, 114(13).
- Lopez-Paz, D., & Ranzato, M. (2017). Gradient episodic memory for continual learning. NeurIPS 2017.
- Mallya, A., & Lazebnik, S. (2018). PackNet: Adding multiple tasks to a single network by iterative pruning. CVPR 2018.
- McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24.
- McClelland, J. L., McNaughton, B. L., & O’Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex. Psychological Review, 102(3).
- Mehta, S., Patil, D., Chandar, S., & Strubell, E. (2023). An empirical investigation of the role of pre-training in lifelong learning. JMLR, 24(214).
- Nguyen, C. V., Li, Y., Bui, T. D., & Turner, R. E. (2018). Variational continual learning. ICLR 2018.
- Ritter, H., Botev, A., & Barber, D. (2018). Online structured Laplace approximations for overcoming catastrophic forgetting. NeurIPS 2018.
- Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., … & Hadsell, R. (2016). Progressive neural networks. arXiv:1606.04671.
- Schwarz, J., Czarnecki, W., Luketina, J., Grabska-Barwinska, A., Teh, Y. W., Pascanu, R., & Hadsell, R. (2018). Progress & compress: A scalable framework for continual learning. ICML 2018.
- Shin, H., Lee, J. K., Kim, J., & Kim, J. (2017). Continual learning with deep generative replay. NeurIPS 2017.
- Van de Ven, G. M., & Tolias, A. S. (2019). Three scenarios for continual learning. arXiv:1904.07734.
- Vitter, J. S. (1985). Random sampling with a reservoir. ACM TOMS, 11(1).
- Wortsman, M., Ramanujan, V., Liu, R., Kembhavi, A., Rastegari, M., Yosinski, J., & Farhadi, A. (2020). Supermasks in superposition. NeurIPS 2020.