Speculative Decoding for Large Language Model Inference: Mechanisms, Theory, and Empirical Tradeoffs
Abstract
Auto-regressive decoding in large language models (LLMs) is inherently sequential: each token must be sampled before the next can be generated, binding inference latency tightly to model size. Speculative decoding (SD) breaks this bottleneck by using a lightweight draft model to propose multiple tokens in parallel, which the target model then verifies in a single forward pass. Provided the draft’s acceptance rate is sufficiently high, wall-clock decoding throughput increases without any change to the target distribution. This paper provides a rigorous treatment of speculative decoding: we derive the token-rejection–correction procedure, analyze acceptance rate as a function of distribution divergence, examine memory and compute tradeoffs, survey key variants including self-speculative and tree-based methods, and synthesize empirical results across model families. We find that speculative decoding is one of the most practically impactful inference optimization techniques available today, yet its benefits are highly sensitive to draft–target KL divergence and hardware memory bandwidth characteristics.
1. Introduction
The widespread deployment of LLMs has placed inference latency under intense scrutiny. While training is a one-time cost distributed across many GPUs, inference must satisfy real-time latency budgets for interactive applications. The core difficulty is that transformer-based models decode auto-regressively: the probability distribution over the next token, $p(x_t \mid x_{<t})$, is conditioned on all previously generated tokens, requiring a full forward pass per token step.
For a model with $d_{\text{model}}$ hidden dimensions and $L$ layers, each forward pass involves $O(L \cdot d_{\text{model}}^2)$ multiply-accumulate operations per token. At inference time with batch size 1, modern accelerators are memory-bandwidth bound rather than compute bound: the bottleneck is streaming model weights from HBM into on-chip SRAM, not arithmetic throughput. As a result, a forward pass over several tokens costs nearly the same wall-clock time as a pass over one, since the weights are loaded once and reused across positions; the extra FLOPs are effectively free, provided all those tokens can be processed in a single forward pass.
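To make the bandwidth argument concrete, a back-of-envelope roofline sketch (illustrative figures, not measurements; the parameter count and bandwidth below are assumptions) shows that weight streaming alone sets a per-token latency floor:

```python
def min_latency_per_token_ms(params_billions: float,
                             bytes_per_param: float = 2.0,      # FP16/BF16
                             hbm_bandwidth_gb_s: float = 2000.0) -> float:
    """Lower bound on batch-1 decode latency: each decoded token requires
    streaming all model weights from HBM, so weight bytes / bandwidth
    bounds the step time regardless of arithmetic throughput."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return weight_bytes / (hbm_bandwidth_gb_s * 1e9) * 1e3

# Hypothetical 70B-parameter model on a ~2 TB/s accelerator: a ~70 ms/token
# floor, independent of how many positions the pass computes logits for --
# which is why verifying several draft tokens per pass is nearly free.
floor_ms = min_latency_per_token_ms(70.0)
```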
Speculative decoding exploits this asymmetry. First introduced by Leviathan et al. (2023) and independently by Chen et al. (2023) under the name speculative sampling, the method uses a small draft model $q$ to auto-regressively generate $\gamma$ candidate tokens, then uses the large target model $p$ to evaluate all $\gamma$ positions in a single batched forward pass. Accepted draft tokens cost essentially nothing in latency terms; only the final correction step requires target-model inference.
The remainder of this paper is organized as follows. Section 2 surveys related work on inference optimization and the speculative decoding family. Section 3 develops the technical analysis in depth — the rejection-correction algorithm, acceptance rate theory, and compute tradeoffs. Section 4 discusses practical considerations and failure modes. Section 5 concludes with open problems.
2. Related Work
Speculative decoding sits within a broader landscape of LLM inference optimization. We briefly survey the most relevant prior work.
2.1 Inference-time Compute Optimization
Quantization reduces weight precision from FP16/BF16 to INT8 or INT4, shrinking the memory bandwidth burden (Dettmers et al., 2022; Frantar et al., 2022). While effective, quantization degrades model quality in ways that are hard to bound tightly, and it reduces weight-load time without fundamentally changing the sequential decoding bottleneck.
Continuous batching (Yu et al., 2022) and PagedAttention (Kwon et al., 2023) increase throughput by dynamically scheduling requests, exploiting GPU parallelism across concurrent users. These techniques improve server-side throughput but do not reduce per-request latency for single-user interactive settings.
Flash Attention (Dao et al., 2022) and its successors reduce attention’s memory footprint via kernel fusion, but again do not address the serial token generation bottleneck.
2.2 Parallel Decoding Precursors
Block-wise parallel decoding (Stern et al., 2018) trained auxiliary heads to predict future tokens, then verified them with the primary model. Jacobi decoding (Song et al., 2021) treats decoding as solving a fixed-point equation and iterates all tokens simultaneously. These approaches either require architectural changes or sacrifice exact target-distribution fidelity.
2.3 Speculative Decoding Proper
Leviathan et al. (2023) and Chen et al. (2023) independently proposed speculative decoding with a provable guarantee: if the draft model $q$ and target model $p$ satisfy certain conditions, the output distribution is identical to sampling directly from $p$. This losslessness property distinguishes SD from heuristic acceleration methods.
Subsequent work explored tree-structured speculation (Miao et al., 2024; Cai et al., 2024), self-speculative decoding using the target model’s own early layers as the draft (Zhang et al., 2023), and Medusa heads — multiple parallel prediction heads trained on top of the target model (Cai et al., 2024). Each variant trades engineering complexity for acceptance rate improvements in specific deployment scenarios.
3. Technical Analysis
3.1 The Rejection-Correction Algorithm
Let $p(\cdot \mid x_{<t})$ denote the target model’s distribution and $q(\cdot \mid x_{<t})$ the draft model’s distribution at position $t$. The speculative decoding procedure for a draft length $\gamma$ proceeds as follows:
Draft phase: Run the draft model auto-regressively to generate $\gamma$ candidate tokens $\tilde{x}_1, \dots, \tilde{x}_\gamma$.
Verification phase: Run the target model on the full sequence $[x_{<t}, \tilde{x}_1, \dots, \tilde{x}_\gamma]$ in a single forward pass to obtain the $\gamma + 1$ distributions $p(\cdot \mid x_{<t}),\ p(\cdot \mid x_{<t}, \tilde{x}_1),\ \dots,\ p(\cdot \mid x_{<t}, \tilde{x}_1, \dots, \tilde{x}_{\gamma})$, where the final distribution supplies a bonus token when every draft is accepted.
Rejection step: For each position $i = 1, \dots, \gamma$ in order, accept $\tilde{x}_i$ with probability:
$$\alpha_i = \min\!\left(1,\ \frac{p(\tilde{x}_i \mid x_{<t},\, \tilde{x}_{<i})}{q(\tilde{x}_i \mid x_{<t},\, \tilde{x}_{<i})}\right)$$
If $\tilde{x}_i$ is accepted, proceed to $i+1$. If rejected at position $i$, sample the correction token from:
$$p'(x) = \text{normalize}\!\left(\max\!\left(0,\ p(x \mid \cdot) - q(x \mid \cdot)\right)\right)$$
This correction distribution is well-defined and ensures the final output token distribution is exactly $p$: the probability of emitting token $x$ at a given position is $q(x)\min\!\left(1, \frac{p(x)}{q(x)}\right) + \left(1 - \sum_{x'}\min(p(x'), q(x'))\right) p'(x) = \min(p(x), q(x)) + \max(0,\ p(x) - q(x)) = p(x)$, which is the standard rejection-sampling identity.
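The rejection-correction loop above fits in a few lines. The following is an illustrative NumPy sketch under simplifying assumptions: `p_dists[i]` and `q_dists[i]` are precomputed next-token distributions at drafted position $i$, and the bonus sample taken when all drafts are accepted is omitted.

```python
import numpy as np

def speculative_step(p_dists, q_dists, draft_tokens, rng=np.random.default_rng(0)):
    """One chain speculative decoding step. Returns the emitted tokens:
    accepted drafts, plus one correction token at the first rejection."""
    out = []
    for i, tok in enumerate(draft_tokens):
        p_i, q_i = np.asarray(p_dists[i]), np.asarray(q_dists[i])
        if rng.random() < min(1.0, p_i[tok] / q_i[tok]):   # accept w.p. min(1, p/q)
            out.append(tok)
        else:
            # Correction: sample from normalize(max(0, p - q)).
            residual = np.maximum(p_i - q_i, 0.0)
            out.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            break   # everything after the first rejection is discarded
    return out
```

When draft and target agree exactly at a position, the acceptance ratio is 1 and the draft token always survives; when their supports are disjoint, the draft is always rejected and replaced by a sample from the residual.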
3.2 Expected Tokens Per Step
Let $\alpha = \mathbb{E}[\alpha_i]$ denote the mean acceptance probability across positions (assumed i.i.d. for analysis). The expected number of tokens accepted in one speculative step of length $\gamma$ is:
$$\mathbb{E}[\text{accepted}] = \frac{1 – \alpha^{\gamma+1}}{1 – \alpha}$$
This is because exactly the first $k$ drafts are accepted, followed by one correction token, with probability $\alpha^k(1-\alpha)$ for $k < \gamma$ (yielding $k+1$ tokens), while all $\gamma$ drafts are accepted with probability $\alpha^\gamma$, in which case one additional token is sampled from the target's distribution at position $\gamma+1$ (yielding $\gamma+1$ tokens). Summing:
$$\mathbb{E} = \sum_{k=0}^{\gamma-1} (k+1)\alpha^k(1-\alpha) + (\gamma+1)\alpha^\gamma = \frac{1-\alpha^{\gamma+1}}{1-\alpha}$$
When $\alpha \to 1$ (perfect draft), this approaches $\gamma + 1$, giving a $(\gamma + 1)\times$ speedup. When $\alpha \to 0$ (useless draft), we get 1 token per step: no fewer tokens than naive decoding, although the wasted drafting compute still makes each step slower in wall-clock terms.
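As a quick numerical check, the closed form agrees with the case-by-case sum from the derivation (a small self-contained sketch):

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Closed form: E[tokens per step] = (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return gamma + 1.0          # limit as alpha -> 1
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def expected_tokens_by_cases(alpha: float, gamma: int) -> float:
    """Same expectation summed case by case, mirroring the derivation:
    k accepted then a correction (k < gamma), or all gamma accepted."""
    partial = sum((k + 1) * alpha**k * (1 - alpha) for k in range(gamma))
    return partial + (gamma + 1) * alpha**gamma
```

For $\alpha = 0.7$ and $\gamma = 5$, both give roughly 2.94 tokens per speculative step.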
3.3 Speedup as a Function of Acceptance Rate
Let $c$ be the ratio of draft model cost to target model cost. Each speculative step costs approximately $\gamma \cdot c + 1$ target-model equivalents of compute, but produces $\mathbb{E}[\text{accepted}]$ tokens. The speedup over naive decoding is:
$$S = \frac{(1 – \alpha^{\gamma+1})/(1-\alpha)}{\gamma c + 1}$$
For typical values — $\alpha \approx 0.7$, $c \approx 0.05$ — the optimal $\gamma$ is around 4–6 tokens, consistent with empirical findings from Leviathan et al. (2023).
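Sweeping $\gamma$ under this cost model (a sketch using the illustrative $\alpha$ and $c$ values above, not measured quantities) reproduces a shallow optimum in the mid single digits:

```python
def speedup(alpha: float, gamma: int, c: float) -> float:
    """S = E[accepted tokens] / (gamma * c + 1), per the formula above."""
    expected = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return expected / (gamma * c + 1.0)

# For alpha = 0.7, c = 0.05 the speedup curve is nearly flat between
# gamma = 4 and 7, so the exact choice of draft length matters little here.
best_gamma = max(range(1, 16), key=lambda g: speedup(0.7, g, 0.05))
```

The flatness of the optimum is itself useful in practice: a slightly mis-tuned $\gamma$ costs little.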
3.4 Acceptance Rate and Distribution Divergence
The acceptance rate $\alpha$ is closely related to the total variation distance between $p$ and $q$:
$$\mathbb{E}_{x \sim q}\!\left[\min\!\left(1,\ \frac{p(x)}{q(x)}\right)\right] = 1 – d_{\text{TV}}(p, q)$$
where $d_{\text{TV}}(p, q) = \frac{1}{2}\sum_x |p(x) - q(x)|$. A draft model that closely mimics the target in distribution achieves high acceptance rates. Importantly, KL divergence bounds $d_{\text{TV}}$ only from one side (Pinsker's inequality gives $d_{\text{TV}} \le \sqrt{\mathrm{KL}/2}$), so perplexity differences between draft and target do not reliably predict acceptance rates.
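The identity can be verified numerically on any pair of discrete distributions (a minimal sketch; the two distributions below are hypothetical):

```python
import numpy as np

def acceptance_rate(p: np.ndarray, q: np.ndarray) -> float:
    """E_{x~q}[min(1, p(x)/q(x))] = sum_x q(x) * min(1, p(x)/q(x))
    = sum_x min(p(x), q(x))."""
    return float(np.minimum(p, q).sum())

def tv_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Total variation distance: 0.5 * sum_x |p(x) - q(x)|."""
    return 0.5 * float(np.abs(p - q).sum())

p = np.array([0.6, 0.3, 0.1])   # hypothetical target next-token distribution
q = np.array([0.4, 0.4, 0.2])   # hypothetical draft distribution
# acceptance_rate(p, q) equals 1 - tv_distance(p, q); here both are 0.8.
```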
3.5 Tree-Based Speculative Decoding
Tree-based approaches (Miao et al., 2024) generate a tree of draft sequences: at each position, $b$ candidate tokens are proposed, yielding a tree of branching factor $b$ and depth $d$ with up to $b^d$ leaves. The target model evaluates all tree nodes in one batched forward pass using tree attention masks. Since the draft proposes every non-root node, the approximate speedup is:
$$S_{\text{tree}} \approx \frac{\mathbb{E}[\text{length of deepest accepted path}] + 1}{N \cdot c + 1}, \qquad N = \sum_{i=1}^{d} b^i = \frac{b^{d+1} - b}{b - 1},$$
where $N$ is the number of drafted tree nodes; the chain formula's $\gamma c$ draft-cost term generalizes to $N c$.
In practice, tree decoding achieves 20–40% additional speedup over chain speculative decoding.
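The tree attention mask that lets the target verify all nodes in one pass can be built from parent pointers: node $i$ attends to exactly the nodes on its root-to-node path. This is an illustrative sketch; production systems fuse the mask into the attention kernel rather than materializing it.

```python
import numpy as np

def tree_attention_mask(parent):
    """Causal mask over draft-tree nodes: position i may attend to position j
    iff j is i itself or an ancestor of i. parent[i] is node i's parent index,
    with -1 marking roots (direct children of the committed prefix)."""
    n = len(parent)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:            # walk up to the root, marking the path
            mask[i, j] = True
            j = parent[j]
    return mask

# Depth-2 binary tree: candidates 0 and 1 at the first drafted position,
# nodes 2,3 children of node 0, and nodes 4,5 children of node 1.
mask = tree_attention_mask([-1, -1, 0, 0, 1, 1])
```

Each root-to-leaf path in the tree behaves like an independent chain draft, but shared prefixes are verified only once.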
3.6 Self-Speculative Decoding
Zhang et al. (2023) proposed self-speculative decoding, which uses the target model’s early transformer layers as the draft model. Given an $L$-layer transformer, the first $\ell$ layers produce a hidden state from which a lightweight prediction head samples draft tokens. The full $L$-layer model then verifies these. Empirically, Zhang et al. (2023) report $1.7$–$2.0\times$ speedup on LLaMA-2 70B without any separate model.
4. Discussion
4.1 Hardware Considerations
Speculative decoding’s value is strongest in memory-bandwidth-bound regimes — specifically, single-batch or low-batch inference on modern A100/H100 GPUs. In high-throughput server settings with large batches (e.g., $B = 256$), the GPU becomes compute-bound and the speculative decoding advantage diminishes.
4.2 Draft Model Selection
Selecting an appropriate draft model requires satisfying competing requirements: (1) low latency/cost; (2) high acceptance rate; and (3) architectural compatibility. Common practice is to use a model from the same family as the target — e.g., LLaMA-3-8B drafting for LLaMA-3-70B. Cross-family drafting typically yields lower acceptance rates despite similar benchmark perplexity.
4.3 Failure Modes and Limitations
Low-acceptance tasks: Tasks requiring precise stylistic alignment or specialized domains can exhibit $\alpha < 0.5$, making SD slower than naive decoding.
Temperature sensitivity: At temperature $T = 0$ (greedy decoding), acceptance reduces to a deterministic comparison: $\tilde{x}_i$ is accepted iff $\arg\max p = \arg\max q$ at that position, so the acceptance rate is exactly the top-1 agreement rate between the two models.
Streaming UX: Speculative decoding delivers tokens in bursts rather than at a uniform cadence, which can produce a visually jerky experience in streaming interfaces.
4.4 Medusa and Parallel Prediction Heads
Cai et al. (2024) propose Medusa — a set of $k$ additional prediction heads attached to the LLM's final hidden state, with the $i$-th head trained to predict the token $i$ positions beyond the next token. Medusa achieves $1.8$–$2.8\times$ speedup on tasks where training data for the heads is available, without requiring a separate draft model.
5. Conclusion
Speculative decoding is a theoretically principled and empirically effective technique for reducing LLM inference latency without any change to output quality. The core insight — that memory-bandwidth-bound inference on modern hardware can verify multiple tokens as cheaply as verifying one — leads to a clean algorithm with well-understood guarantees. The expected tokens per step, $\frac{1-\alpha^{\gamma+1}}{1-\alpha}$, captures the essential tradeoff between draft quality and draft length.
Open problems remain: optimal dynamic draft length scheduling, the interaction between SD and quantized models, and how hardware generations will shift these tradeoffs. For practitioners, speculative decoding represents one of the highest-ROI inference optimizations available — significant latency reduction at zero quality cost, provided the right draft model is chosen.
References
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. ICML 2023.
- Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., & Jumper, J. (2023). Accelerating large language model decoding with speculative sampling. arXiv:2302.01318.
- Miao, X., et al. (2024). SpecInfer: Accelerating generative LLM serving with tree-based speculative inference and verification. ASPLOS 2024.
- Cai, T., et al. (2024). Medusa: Simple LLM inference acceleration framework with multiple decoding heads. arXiv:2401.10774.
- Zhang, J., et al. (2023). Draft & Verify: Lossless large language model acceleration via self-speculative decoding. arXiv:2309.08168.
- Kwon, W., et al. (2023). Efficient memory management for large language model serving with PagedAttention. SOSP 2023.
- Dao, T., et al. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. NeurIPS 2022.
- Dettmers, T., et al. (2022). LLM.int8(): 8-bit matrix multiplication for transformers at scale. NeurIPS 2022.
- Frantar, E., et al. (2022). GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323.
- Stern, M., Shazeer, N., & Uszkoreit, J. (2018). Blockwise parallel decoding for deep autoregressive models. NeurIPS 2018.
- Yu, G.-I., et al. (2022). Orca: A distributed serving system for transformer-based generative models. OSDI 2022.
- Song, Y., et al. (2021). Accelerating feedforward computation via parallel nonlinear equation solving. ICML 2021.