Abstract
Knowledge distillation (KD) has emerged as a foundational technique for compressing large neural networks into smaller, deployment-ready student models without catastrophic performance degradation. The choice of loss function governs the fidelity of the knowledge transfer and determines which aspects of the teacher’s learned representations are preserved. This paper presents a rigorous comparative analysis of the major distillation objectives: the classical Kullback–Leibler divergence on soft logits (Hinton et al., 2015), feature-matching losses operating on intermediate activations (Romero et al., 2015), attention transfer (Zagoruyko & Komodakis, 2017), relational knowledge distillation (Park et al., 2019), and recent contrastive and task-adaptive variants. We examine the theoretical motivations, gradient dynamics, and practical tradeoffs of each approach. We further analyze how temperature scaling modulates the information content of soft targets and characterize regimes in which each objective class dominates. Our synthesis provides a principled guide for practitioners selecting or composing distillation objectives.
1. Introduction
The deployment of deep neural networks in resource-constrained environments—mobile devices, edge hardware, latency-sensitive APIs—creates a fundamental tension between model capacity and computational budget. Knowledge distillation addresses this tension by transferring the generalization properties of a large teacher model $T$ into a smaller student model $S$, typically through a training objective that encourages $S$ to mimic aspects of $T$’s behavior beyond mere label agreement.
The original formulation of Hinton et al. (2015) distills the teacher’s output distribution, or soft targets, into the student via a temperature-scaled KL divergence. The intuition is that soft targets encode rich inter-class similarity structure: the probability mass assigned to incorrect classes is not noise but signal, reflecting how the teacher has organized the input manifold. A model that assigns 0.001 probability to class cat when processing a dog image is communicating more than a hard label ever could.
Yet logit-level distillation is only one point in a large design space. The teacher’s intermediate representations—activations, attention maps, Gram matrices, relational distances—each capture different facets of learned structure. A student that merely mimics output distributions may fail to acquire useful internal representations for transfer to new tasks. This observation has motivated a rich literature of alternative and complementary objectives.
The loss function is not merely a training detail; it is an inductive bias over what knowledge is transferable. The central question this paper addresses is: what information does each class of distillation loss actually transfer, and under what conditions is that information most valuable?
We organize the analysis as follows. Section 2 surveys the primary strands of prior work. Section 3 provides technical analysis of the major loss classes, including gradient characterizations and information-theoretic interpretations. Section 4 discusses empirical findings and practical considerations. Section 5 concludes with open problems.
2. Related Work
The landscape of distillation objectives spans more than a decade of work, with motivations ranging from model compression to data-free training and cross-modal transfer.
Hinton et al. (2015) introduced the soft-target KL loss alongside a linear combination with the hard cross-entropy, establishing the canonical two-component objective that most subsequent work inherits. Their temperature parameter $\tau$ remains a central hyperparameter across virtually all output-based methods.
Romero et al. (2015) introduced FitNets, extending distillation to intermediate feature maps through a regression loss between teacher and student activations, with an additional learned linear projection to align differing dimensionalities. This work demonstrated that intermediate supervision accelerates training and enables students that are thinner but deeper than their teachers.
Zagoruyko & Komodakis (2017) proposed Attention Transfer (AT), arguing that the spatial distribution of activation magnitudes—rather than the activations themselves—constitutes a transferable summary of where the network focuses. Their attention maps $A(F) = \sum_c |F_c|^p$ (summed over channels, raised to power $p$) are matched between teacher and student layers via an $\ell_2$ loss, providing a richer geometric signal than scalar feature regression.
Park et al. (2019) introduced Relational Knowledge Distillation (RKD), shifting focus from individual instance representations to the relational structure of mini-batches. Distance-wise and angle-wise losses penalize differences in pairwise distances and triplet angles between teacher and student embeddings, enforcing structural consistency rather than pointwise fidelity. RKD is particularly effective when teacher and student have mismatched architectures.
Tian et al. (2020) proposed Contrastive Representation Distillation (CRD), framing distillation as a mutual information maximization problem between teacher and student representations. Using a contrastive objective over a memory buffer of negatives, CRD consistently outperforms prior methods on cross-architecture distillation benchmarks, providing an information-theoretic grounding for feature-level transfer.
Mirzadeh et al. (2020) identified a capacity gap problem: when the teacher is much larger than the student, direct distillation degrades, and they proposed using intermediate-sized teacher assistants (TAKD) to bridge the gap. This work highlights that the effectiveness of a given loss function interacts with the teacher-student capacity ratio in non-trivial ways.
Touvron et al. (2021) introduced the DeiT training recipe incorporating a distillation token in Vision Transformers, enabling hard-label distillation from a CNN teacher into a ViT student—demonstrating cross-architecture distillation at scale and yielding state-of-the-art efficiency on ImageNet.
3. Technical Analysis
3.1 Output Distribution Matching: KL Divergence with Temperature
Let $z^T \in \mathbb{R}^C$ and $z^S \in \mathbb{R}^C$ denote the logit vectors of teacher and student respectively for a $C$-class problem. Define the temperature-scaled softmax:
$$p^T_c(\tau) = \frac{\exp(z^T_c / \tau)}{\sum_{c'} \exp(z^T_{c'} / \tau)}$$
The Hinton distillation loss is:
$$\mathcal{L}_{\text{KD}} = \tau^2 \cdot D_{\text{KL}}\left(p^T(\tau) \,\|\, p^S(\tau)\right) = \tau^2 \sum_c p^T_c(\tau) \log \frac{p^T_c(\tau)}{p^S_c(\tau)}$$
The $\tau^2$ factor compensates for the gradient suppression introduced by temperature: the gradient of the unscaled KL term is $\partial D_{\text{KL}} / \partial z^S_c = \tau^{-1}(p^S_c(\tau) - p^T_c(\tau))$, and since the probability gap itself shrinks as $\mathcal{O}(\tau^{-1})$ for large $\tau$, the unscaled gradient scales as $\mathcal{O}(\tau^{-2})$. Multiplying by $\tau^2$ restores gradient magnitudes to $\mathcal{O}(1)$ in $\tau$, enabling stable combination with the hard-label cross-entropy $\mathcal{L}_{\text{CE}}(y, p^S(1))$.
The combined loss is:
$$\mathcal{L} = (1-\alpha) \mathcal{L}_{\text{CE}} + \alpha \mathcal{L}_{\text{KD}}$$
Crucially, as $\tau \to \infty$, the soft targets approach a uniform distribution, washing out all inter-class information. As $\tau \to 0$, they collapse to one-hot vectors, recovering hard labels. The informative regime is intermediate: $\tau$ must be large enough to reveal inter-class similarities but not so large as to flatten them away. Empirically, $\tau \in [3, 6]$ is commonly effective.
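The temperature mechanics above can be illustrated with a minimal NumPy sketch (the four-class logits and class names are hypothetical, chosen to echo the dog/cat example from the introduction):

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-scaled softmax; larger tau flattens the distribution."""
    z = np.asarray(z, dtype=float) / tau
    z -= z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kd_loss(z_teacher, z_student, tau=4.0):
    """Hinton soft-target loss: tau^2 * KL(p_T(tau) || p_S(tau))."""
    p_t = softmax(z_teacher, tau)
    p_s = softmax(z_student, tau)
    return tau**2 * np.sum(p_t * np.log(p_t / p_s))

# Illustrative teacher/student logits for classes [dog, cat, car, plane].
z_t = np.array([9.0, 6.5, 1.0, 0.5])
z_s = np.array([7.0, 3.0, 2.0, 1.0])

# At tau=1 the teacher posterior is nearly one-hot; at tau=4 the
# dog/cat similarity structure becomes visible in the soft targets.
print(softmax(z_t, tau=1.0).round(3))
print(softmax(z_t, tau=4.0).round(3))
print(kd_loss(z_t, z_s, tau=4.0))
```

In a real training loop this term would be combined with the hard-label cross-entropy using the weight $\alpha$ of the combined loss above.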
Information-theoretic interpretation. The entropy of soft targets $H(p^T(\tau))$ grows with $\tau$. The mutual information $I(X; Y^T(\tau))$ between inputs and temperature-scaled teacher outputs decreases with $\tau$. There is therefore an optimal $\tau$ that balances label entropy (richer dark knowledge) against information retention—a tradeoff that depends on the capacity gap and dataset structure.
3.2 Feature-Level Regression: FitNets and Variants
Let $f^T_\ell \in \mathbb{R}^{H \times W \times C_T}$ and $f^S_\ell \in \mathbb{R}^{H \times W \times C_S}$ be intermediate feature maps at a chosen layer $\ell$. Since $C_T \neq C_S$ in general, a learned regressor $r_\phi: \mathbb{R}^{C_S} \to \mathbb{R}^{C_T}$ is introduced:
$$\mathcal{L}_{\text{Hint}} = \frac{1}{HWC_T} \left\| f^T_\ell - r_\phi(f^S_\ell) \right\|_F^2$$
This objective imposes a pointwise constraint: each spatial location in the student must, after projection, match the teacher. The Frobenius norm treats all feature dimensions symmetrically, which can be problematic when teacher features have high-variance directions that dominate the loss without being the most semantically meaningful.
Gradient dynamics. The gradient of $\mathcal{L}_{\text{Hint}}$ with respect to $f^S_\ell$ propagates through $r_\phi$: $\nabla_{f^S} \mathcal{L}_{\text{Hint}} = \frac{2}{HWC_T} J_{r_\phi}^\top (r_\phi(f^S) - f^T)$. Early in training, $r_\phi$ is random, so gradients are noisy, motivating the FitNets two-stage training protocol in which the hint layers are pre-trained before the full distillation objective is applied.
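A minimal NumPy sketch of the hint loss, using a random per-location linear map (a 1x1 convolution) as a stand-in for the learned regressor $r_\phi$; all shapes and values here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: teacher layer with C_T = 16 channels, student
# layer with C_S = 8, over an H x W = 4 x 4 spatial grid.
H, W, C_T, C_S = 4, 4, 16, 8
f_t = rng.normal(size=(H, W, C_T))       # stand-in teacher features
f_s = rng.normal(size=(H, W, C_S))       # stand-in student features

# Regressor r_phi as a per-location linear map; in FitNets its
# weights are learned jointly with the student.
W_phi = rng.normal(size=(C_S, C_T)) * 0.1

def hint_loss(f_t, f_s, W_phi):
    """FitNets hint loss: squared Frobenius distance between teacher
    features and projected student features, averaged over H*W*C_T."""
    f_s_proj = f_s @ W_phi               # shape (H, W, C_T)
    return np.mean((f_t - f_s_proj) ** 2)

print(hint_loss(f_t, f_s, W_phi))
```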
3.3 Attention Transfer
Attention maps summarize the spatial response of a layer as:
$$A(F) = \left\| \sum_{c=1}^C |F_c|^p \right\|_2^{-1} \sum_{c=1}^C |F_c|^p \in \mathbb{R}^{H \times W}$$
normalized to unit $\ell_2$ norm. The AT loss over a set of paired layers $\mathcal{P}$ is:
$$\mathcal{L}_{\text{AT}} = \frac{\beta}{2} \sum_{(\ell_T, \ell_S) \in \mathcal{P}} \left\| A(f^T_{\ell_T}) – A(f^S_{\ell_S}) \right\|_2^2$$
AT discards magnitude information (via normalization) and channel-level structure (via summation), retaining only the relative spatial focus pattern. This is advantageous when the teacher and student have different numbers of channels or different activation scales, as it avoids the dimensionality-matching problem of FitNets while still encoding spatial inductive biases about where to look.
The exponent $p$ controls attention sharpness: $p=1$ yields a diffuse map sensitive to weak activations; $p=2$ (commonly used) emphasizes dominant spatial regions; larger $p$ approaches an argmax over spatial locations.
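The map and loss above admit a short NumPy sketch. Note that teacher and student layers may have different channel counts, and the unit normalization makes the loss invariant to activation scale (shapes and values are illustrative):

```python
import numpy as np

def attention_map(F, p=2):
    """Spatial attention map: sum of |F_c|^p over channels (last axis),
    flattened and normalized to unit l2 norm, discarding magnitude."""
    a = np.sum(np.abs(F) ** p, axis=-1).reshape(-1)
    return a / np.linalg.norm(a)

def at_loss(f_t, f_s, p=2, beta=1.0):
    """Attention-transfer loss for one paired teacher/student layer."""
    return beta / 2 * np.sum((attention_map(f_t, p) - attention_map(f_s, p)) ** 2)

rng = np.random.default_rng(1)
f_t = rng.normal(size=(4, 4, 32))   # teacher layer: 32 channels
f_s = rng.normal(size=(4, 4, 8))    # student layer: 8 channels, same H x W
print(at_loss(f_t, f_s))
```

Because only the normalized spatial pattern survives, rescaling either network's activations leaves the loss unchanged, which is the dimensionality- and scale-invariance the method relies on.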
3.4 Relational Knowledge Distillation
Rather than matching instance representations, RKD operates on the geometry of embedded batches. For a mini-batch $\{x_i\}_{i=1}^N$ with teacher embeddings $\{t_i\}$ and student embeddings $\{s_i\}$, the distance-wise loss is:
$$\mathcal{L}_{\text{RKD-D}} = \frac{1}{N^2} \sum_{i,j} \ell_\delta\left(\tilde{d}^T_{ij} - \tilde{d}^S_{ij}\right)$$
where $\tilde{d}_{ij} = d_{ij} / \mu_d$ normalizes pairwise distances $d_{ij} = \|e_i – e_j\|_2$ by the mean distance $\mu_d$ within the batch, and $\ell_\delta$ is the Huber (smooth $\ell_1$) loss for robustness to outliers. The angle-wise loss similarly enforces consistency of triplet angles:
$$\mathcal{L}_{\text{RKD-A}} = \frac{1}{N^3} \sum_{i,j,k} \ell_\delta\left(\cos\angle(t_i, t_j, t_k) - \cos\angle(s_i, s_j, s_k)\right)$$
RKD’s structural objective is invariant to isometric transformations of the embedding space, making it robust to architectural differences. However, it requires larger batch sizes to estimate reliable pairwise statistics, and its $\mathcal{O}(N^2)$ (or $\mathcal{O}(N^3)$ for angle-wise) complexity is non-trivial at large batch sizes.
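The distance-wise term can be sketched in a few lines of NumPy; the isometry invariance noted above follows because orthogonal transformations of either embedding space leave all pairwise distances unchanged (batch size and embedding dimensions are illustrative):

```python
import numpy as np

def huber(x, delta=1.0):
    """Elementwise Huber (smooth l1) penalty."""
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x**2, delta * (a - 0.5 * delta))

def normalized_pdist(e):
    """Pairwise l2 distances, normalized by their in-batch mean."""
    d = np.linalg.norm(e[:, None, :] - e[None, :, :], axis=-1)
    return d / d[d > 0].mean()       # mean over off-diagonal pairs

def rkd_distance_loss(t, s):
    """RKD distance-wise loss over a batch of embeddings."""
    n = t.shape[0]
    return huber(normalized_pdist(t) - normalized_pdist(s)).sum() / n**2

rng = np.random.default_rng(2)
t = rng.normal(size=(16, 64))   # teacher embeddings (dim 64)
s = rng.normal(size=(16, 32))   # student embeddings (dim 32): dims may differ
print(rkd_distance_loss(t, s))
```

Rotating the teacher embeddings by any orthogonal matrix leaves the loss at zero against the originals, which is the structural (rather than pointwise) consistency RKD enforces.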
3.5 Contrastive Representation Distillation (CRD)
CRD frames distillation as maximizing the mutual information $I(T; S)$ between teacher and student representations. Using a contrastive lower bound (Oord et al., 2018):
$$\mathcal{L}_{\text{CRD}} = -\mathbb{E}_{(t,s)^+} \left[\log \frac{h(t, s)}{h(t, s) + \frac{N_-}{N} \sum_{s^-} h(t, s^-)}\right]$$
where $h(t, s) = \exp\left(\frac{t^\top s}{\|t\|\,\|s\|\,\phi}\right)$ is a temperature-scaled cosine similarity (with $\phi$ acting as the temperature), $(t, s)^+$ denotes a positive pair from the same input, and $\{s^-\}$ are negatives drawn from a memory bank. CRD provides a tighter MI lower bound as $N_-$ grows, at the cost of memory bank maintenance. Empirically, it is the strongest single feature-level objective on cross-architecture distillation benchmarks.
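A heavily simplified NumPy sketch of the contrastive term, dropping CRD's memory bank and the $N_-/N$ reweighting in favor of an explicit list of negatives (the embeddings and the temperature value are illustrative):

```python
import numpy as np

def cosine(t, s):
    """Cosine similarity between two embedding vectors."""
    return float(t @ s / (np.linalg.norm(t) * np.linalg.norm(s)))

def contrastive_kd_loss(t, s_pos, s_negs, phi=0.1):
    """NCE-style term: pull the student embedding of the same input
    toward its teacher embedding, push negative students away."""
    h_pos = np.exp(cosine(t, s_pos) / phi)
    h_neg = sum(np.exp(cosine(t, s_n) / phi) for s_n in s_negs)
    return -np.log(h_pos / (h_pos + h_neg))

t = np.array([1.0, 0.0])                        # teacher embedding
s_pos = np.array([0.9, 0.1])                    # student, same input
s_negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.2])]
print(contrastive_kd_loss(t, s_pos, s_negs))
```

Because the critic is a ratio over positives and negatives, the loss falls as the positive pair aligns and the negatives decorrelate, which is the mutual-information-maximizing pressure described above.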
3.6 Composite Objectives and Teacher Assistants
In practice, the best results are obtained by composing multiple loss terms. A general framework is:
$$\mathcal{L}_{\text{total}} = \alpha \mathcal{L}_{\text{CE}} + \beta \mathcal{L}_{\text{KD}} + \gamma \mathcal{L}_{\text{feat}}$$
where $\mathcal{L}_{\text{feat}}$ can be AT, RKD, CRD, or a combination. The relative weights $\alpha, \beta, \gamma$ introduce a hyperparameter search space, but heuristics exist: $\beta \mathcal{L}_{\text{KD}}$ dominates when the teacher is well-calibrated; $\gamma \mathcal{L}_{\text{feat}}$ becomes more important for cross-architecture or cross-modal distillation where output distribution matching is insufficient.
Teacher-assistant knowledge distillation (TAKD; Mirzadeh et al., 2020) inserts teacher assistants of intermediate size $T = T_0 \succ T_1 \succ \cdots \succ S$ (where $\succ$ denotes parameter count ordering), each distilled from the previous. This effectively performs curriculum distillation through a chain of capacity-matched objectives, and is particularly important when the teacher-student capacity ratio exceeds roughly $10\times$.
4. Discussion
4.1 What Each Loss Actually Transfers
The KL divergence objective transfers the teacher’s classification boundary geometry: which classes are similar to which, as encoded in the posterior over classes. It is highly effective when the teacher’s calibration is good and when inter-class relationships are semantically meaningful (e.g., fine-grained visual classification). It fails to transfer information about how the teacher processes inputs—the representational strategy encoded in intermediate layers.
Feature regression (FitNets) transfers representational content at specific layers but is sensitive to the layer choice and to the quality of the learned projection. Attention transfer is more robust due to its dimensionality-invariant formulation but discards magnitude and channel information. RKD transfers the metric structure of the embedding space, which is particularly valuable for metric learning tasks. CRD maximizes a lower bound on the mutual information between teacher and student representations, transferring the teacher's representation as a whole.
4.2 The Temperature Hyperparameter: A Deeper Look
Temperature scaling is conceptually simple but its effect on the loss landscape is subtle. At $\tau = 1$, the KD loss is dominated by the highest-confidence classes; at large $\tau$, it becomes a softer, more distributed signal. Crucially, the gradient of the KD loss with respect to student logits is:
$$\frac{\partial \mathcal{L}_{\text{KD}}}{\partial z^S_c} = \tau (p^S_c(\tau) – p^T_c(\tau))$$
At large $\tau$, the soft probabilities are close to uniform, so gradient magnitudes are small for all classes—the $\tau^2$ compensation factor ensures this does not vanish entirely, but the effective signal-to-noise ratio decreases. The optimal $\tau$ is therefore a function of the teacher’s prediction confidence and the task difficulty.
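This gradient expression is easy to verify numerically; the following sketch checks the analytic form against central finite differences on arbitrary three-class logits:

```python
import numpy as np

def softmax(z, tau):
    e = np.exp(z / tau - np.max(z / tau))
    return e / e.sum()

def kd_loss(z_t, z_s, tau):
    """tau^2-scaled KL between teacher and student soft targets."""
    p_t, p_s = softmax(z_t, tau), softmax(z_s, tau)
    return tau**2 * np.sum(p_t * np.log(p_t / p_s))

tau = 4.0
z_t = np.array([5.0, 2.0, 0.5])   # illustrative teacher logits
z_s = np.array([3.0, 1.0, 1.0])   # illustrative student logits

# Analytic gradient from the text: tau * (p_S(tau) - p_T(tau))
analytic = tau * (softmax(z_s, tau) - softmax(z_t, tau))

# Central finite differences on the student logits
eps = 1e-6
numeric = np.array([
    (kd_loss(z_t, z_s + eps * np.eye(3)[c], tau)
     - kd_loss(z_t, z_s - eps * np.eye(3)[c], tau)) / (2 * eps)
    for c in range(3)
])
print(np.max(np.abs(analytic - numeric)))   # should be near zero
```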
4.3 Cross-Architecture Distillation
When teacher and student differ substantially in architecture (e.g., ResNet teacher, MobileNet student; or CNN teacher, ViT student), output-based KD remains applicable but feature-level losses require architectural alignment strategies. The DeiT distillation token is an elegant solution for ViT students: rather than adding a feature regression loss, a separate classification token trained exclusively on the teacher’s hard predictions is incorporated into the ViT attention mechanism, allowing the student to attend to teacher knowledge as part of its own processing.
4.4 Data-Free Distillation
A practically important variant is data-free distillation, where the original training data is unavailable (e.g., due to privacy constraints). Here, a generator is trained to produce synthetic inputs that maximize disagreement between teacher and student, and distillation is applied on these synthetic inputs. The choice of loss function interacts strongly with generator quality: KD on poor-quality synthetic data can transfer teacher miscalibration artifacts, while relational losses are more robust because they enforce structural properties without requiring absolute activation fidelity.
4.5 Failure Modes
Several failure modes are well-documented. First, oversmoothing: when $\tau$ is too high, the KD signal carries little information, and the student converges to a suboptimal solution indistinguishable from standard training. Second, capacity mismatch: a student forced to match a much larger teacher’s feature representations may experience conflicting gradients between the distillation loss and the task loss, degrading performance below the baseline of training without distillation—the regime identified by Mirzadeh et al. (2020). Third, layer selection sensitivity: FitNets and AT losses are sensitive to which teacher-student layer pairs are matched; mismatched semantic level (e.g., matching a shallow student layer to a deep teacher layer) can transfer spurious low-level statistics that do not generalize.
5. Conclusion
Knowledge distillation is not a single technique but a family of training objectives each encoding a different inductive bias about what is worth transferring from teacher to student. The KL divergence on soft targets transfers classification boundary geometry with a temperature-modulated information-theoretic signal. Feature regression transfers representational content but requires architectural alignment. Attention transfer provides a spatially informative, dimensionality-invariant alternative. Relational objectives preserve the metric structure of the embedding space. Contrastive objectives maximize the information bottleneck between teacher and student representations.
The practical implication is that no single loss dominates across all settings. For standard same-architecture compression, KD alone is often sufficient. For cross-architecture transfer or when intermediate representations are semantically important (e.g., for downstream tasks), CRD or RKD provide consistent gains. For very large capacity gaps, intermediate teacher assistants are necessary regardless of the loss choice. For data-free settings, relational losses are more robust.
The theoretical understanding of why distillation works—and why certain loss functions work better than others—remains incomplete. The connection between temperature scaling and Bayesian inference, the relationship between intermediate feature matching and learned inductive biases, and the information-theoretic limits of distillation are all active areas of research. A complete theory of knowledge distillation would have significant implications not just for model compression but for our understanding of what neural networks learn and how that knowledge can be transferred.
References
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. NIPS Deep Learning Workshop. arXiv:1503.02531.
- Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2015). FitNets: Hints for thin deep nets. ICLR 2015. arXiv:1412.6550.
- Zagoruyko, S., & Komodakis, N. (2017). Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. ICLR 2017. arXiv:1612.03928.
- Park, W., Kim, D., Lu, Y., & Cho, M. (2019). Relational knowledge distillation. CVPR 2019. arXiv:1904.05068.
- Tian, Y., Krishnan, D., & Isola, P. (2020). Contrastive representation distillation. ICLR 2020. arXiv:1910.10699.
- Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., & Ghasemzadeh, H. (2020). Improved knowledge distillation via teacher assistant. AAAI 2020. arXiv:1902.03393.
- Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. ICML 2021. arXiv:2012.12877.
- Oord, A. v. d., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv:1807.03748.
- Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2021). Knowledge distillation: A survey. International Journal of Computer Vision, 129(6), 1789–1819.
- Cho, J. H., & Hariharan, B. (2019). On the efficacy of knowledge distillation. ICCV 2019. arXiv:1910.01348.