Abstract
Autoregressive language models have achieved remarkable performance across natural language tasks, yet their sequential generation process imposes fundamental latency constraints and prevents direct optimization over arbitrary output structures. Diffusion models, originally developed for continuous data such as images and audio, offer an alternative generative paradigm based on iterative denoising. Extending diffusion to discrete text is non-trivial: the core score-matching objective assumes a continuous data manifold, and naive discretization destroys the theoretical guarantees that make diffusion tractable. This paper reviews the theoretical landscape of diffusion language models, covering continuous relaxations, absorbing-state masked diffusion, and score-entropy objectives. We analyze the forward and reverse processes for discrete sequences, compare sampling strategies, and evaluate the empirical evidence for and against diffusion as a competitive alternative to autoregressive generation. We identify the open problems that currently prevent diffusion language models from matching autoregressive quality at scale, and outline the theoretical conditions under which the gap might close.
1. Introduction
The dominant paradigm for language generation is autoregressive modeling: given a sequence $x_1, x_2, \ldots, x_n$, a model learns the factorization
$$p(x_1, \ldots, x_n) = \prod_{t=1}^{n} p(x_t \mid x_1, \ldots, x_{t-1})$$
Each token is sampled left-to-right, conditioned on all prior tokens. This factorization is exact, covers the full distribution over sequences, and admits efficient maximum likelihood training via cross-entropy. Modern large language models — GPT-4, LLaMA, Gemini — are all fundamentally autoregressive, and their success is well-documented.
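As a concrete illustration of the chain-rule factorization, consider a toy bigram model; the probabilities below are invented purely for illustration:

```python
import math

# Toy bigram "language model": p(next | prev) over a 3-token vocabulary.
# (None, j) entries give the unconditional first-token distribution p(x_1).
cond = {
    (None, 0): 0.5, (None, 1): 0.3, (None, 2): 0.2,
    (0, 0): 0.2, (0, 1): 0.6, (0, 2): 0.2,
    (1, 0): 0.2, (1, 1): 0.1, (1, 2): 0.7,
}

def seq_log_prob(tokens):
    """log p(x_1..x_n) = sum_t log p(x_t | x_{t-1}) for a bigram model."""
    total, prev = 0.0, None
    for tok in tokens:
        total += math.log(cond[(prev, tok)])
        prev = tok
    return total

lp = seq_log_prob([0, 1, 2])  # log(0.5) + log(0.6) + log(0.7)
```

Training by maximum likelihood simply maximizes this sum of conditional log-probabilities over the corpus, which is the cross-entropy objective mentioned above.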
Despite this success, autoregressive generation has structural limitations. Generation is strictly sequential: each forward pass of the network produces exactly one token, so generating a 1000-token response requires 1000 serial network evaluations. Parallelism is limited to the batching dimension. Furthermore, the left-to-right factorization is asymmetric: the model conditions the $t$-th token on all prior context, but never the reverse. This makes tasks that require global coherence — filling in the middle, constrained generation, structured output — awkward to model directly.
Diffusion models for images (Ho et al., 2020; Song et al., 2021) solve generation through iterative denoising. Starting from Gaussian noise, the model performs $T$ denoising steps, each a small correction guided by a learned score function. The process is inherently parallel within each step, and inference can be accelerated by reducing the number of steps. Critically, diffusion allows bidirectional conditioning: the denoiser attends to the entire partially-denoised sequence at each step, not just a prefix.
The challenge of applying diffusion to language is that text is discrete. Score functions are defined as gradients of log-density with respect to continuous inputs; on a discrete token vocabulary, such gradients do not exist in the usual sense. Several approaches have been proposed: embedding-space diffusion that operates in continuous representations (Li et al., 2022), score-entropy objectives that extend score matching to discrete spaces (Lou et al., 2023), and masked diffusion models that use absorbing-state Markov chains (Austin et al., 2021; Sahoo et al., 2024).
This paper provides a technical review of these approaches. Section 2 reviews the continuous diffusion framework and prior work on non-autoregressive language models. Section 3 analyzes the mathematical extensions required to handle discrete sequences, covering masked diffusion, score-entropy, and continuous relaxations. Section 4 discusses empirical results, practical tradeoffs, and real-world deployment considerations. Section 5 concludes with open problems and theoretical conditions for progress.
2. Related Work
Non-autoregressive sequence generation predates diffusion models. The Levenshtein Transformer (Gu et al., 2019) introduced deletion and insertion operations for iterative refinement of token sequences. Masked non-autoregressive transformers (Ghazvininejad et al., 2019) — the MASK-PREDICT family — generate all tokens in parallel and iteratively refine masked positions, finding application in machine translation. These models demonstrated that parallel generation is achievable at modest quality cost, but the refinement procedure is heuristic rather than grounded in a principled probabilistic framework.
Score-based generative models (Song and Ermon, 2019) and denoising diffusion probabilistic models (Ho et al., 2020) established the theoretical foundation for modern diffusion. The forward process adds Gaussian noise to data over $T$ steps; the reverse process learns to denoise, parameterized via a noise prediction network. Song et al. (2021) unified these frameworks via stochastic differential equations (SDEs), expressing the forward process as $d\mathbf{x} = f(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w}$ and the reverse as an SDE driven by the score $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$.
For discrete data, Austin et al. (2021) proposed the D3PM (Discrete Denoising Diffusion Probabilistic Models) framework, which generalizes the forward process to absorbing (masked), uniform, and token-distance-based transition matrices. The absorbing-state process masks tokens with increasing probability, yielding a process analogous to BERT’s masked language modeling objective at each timestep. D3PM showed that the BERT objective emerges as a special case of the variational lower bound for absorbing-state diffusion.
Continuous embedding-space diffusion for text was introduced by Li et al. (2022) in Diffusion-LM. The approach embeds discrete tokens into a continuous space, applies standard Gaussian diffusion, and decodes via a learned rounding function. This recovers the standard diffusion machinery but introduces a train-test mismatch: during inference, embeddings must be rounded to the nearest vocabulary token at each step, introducing cascading errors. Gong et al. (2023) proposed DiffuSeq for conditional text generation using a similar embedding-space approach.
Score entropy for discrete diffusion was introduced by Lou et al. (2023), who noted that the denoising score matching objective does not extend directly to discrete spaces because scores are gradients that require continuity. They defined a score-entropy loss — a discrete analog of score matching — and showed it provides a principled training objective for discrete diffusion models. Their SEDD (Score Entropy Discrete Diffusion) model achieved competitive perplexity on language modeling benchmarks.
Masked diffusion models with simplified objectives were investigated by Sahoo et al. (2024) and Shi et al. (2024), who derived MDLM (Masked Diffusion Language Model). MDLM showed that, under the absorbing-state forward process, the ELBO reduces to a simple weighted masked language modeling loss, eliminating the need for complex score-entropy formulations. Concurrent work by Ou et al. (2024) on MDLM variants explored continuous-time limits. The MDLM line of work brought diffusion language model training complexity close to standard language model training, while maintaining the theoretical guarantees of the diffusion framework.
Speculative decoding connections were explored by Chen et al. (2024), who noted that diffusion language models can serve as draft models for autoregressive verification. Meanwhile, Nie et al. (2024) studied the relationship between masked diffusion and the BERT family of models, showing that BERT pretraining is a degenerate limit of masked diffusion with a uniform noise schedule.
3. Technical Analysis
3.1 Continuous Diffusion: Background
Let $\mathbf{x}_0 \in \mathbb{R}^d$ be a data point. The forward process defines a Markov chain:
$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1 - \bar{\alpha}_t)\mathbf{I})$$
where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$ and $\{\alpha_t\}$ is a noise schedule satisfying $\alpha_t \in (0,1)$, $\bar{\alpha}_T \approx 0$. The reverse process is learned as:
$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1};\, \mu_\theta(\mathbf{x}_t, t),\, \sigma_t^2 \mathbf{I})$$
Training minimizes the variational lower bound (ELBO), which decomposes into a reconstruction term and KL divergences at each timestep. Ho et al. (2020) showed this simplifies to a noise prediction objective:
$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \|^2 \right]$$
where $\boldsymbol{\epsilon} \sim \mathcal{N}(0, \mathbf{I})$ and $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}$.
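These two equations are easy to exercise numerically. The sketch below (plain NumPy, illustrative only) samples $\mathbf{x}_t$ from the forward marginal and confirms that an oracle noise predictor attains zero loss and exact reconstruction of $\mathbf{x}_0$:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_sample(x0, alpha_bar_t, rng):
    """x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps

x0 = rng.standard_normal(4)
abar = 0.3
xt, eps = forward_sample(x0, abar, rng)

# An oracle predictor that returns eps exactly drives L_simple to zero,
# and inverting the forward map recovers x0.
loss_oracle = float(np.sum((eps - eps) ** 2))
x0_rec = (xt - np.sqrt(1.0 - abar) * eps) / np.sqrt(abar)
```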
3.2 Discrete Forward Processes
For a vocabulary $\mathcal{V}$ with $|\mathcal{V}| = V$, let $x_0 \in \{1, \ldots, V\}$ be a token. The discrete forward process is defined by a transition matrix $\mathbf{Q}_t \in \mathbb{R}^{V \times V}$ where entry $[\mathbf{Q}_t]_{ij} = q(x_t = j \mid x_{t-1} = i)$. The marginal at timestep $t$ is:
$$q(x_t \mid x_0) = \mathbf{e}_{x_0}^\top \bar{\mathbf{Q}}_t, \quad \bar{\mathbf{Q}}_t = \mathbf{Q}_1 \mathbf{Q}_2 \cdots \mathbf{Q}_t$$
where $\mathbf{e}_{x_0}$ is the one-hot vector. Austin et al. (2021) proposed three structured choices for $\mathbf{Q}_t$:
- Uniform diffusion: $\mathbf{Q}_t = (1 - \beta_t) \mathbf{I} + \beta_t / V \cdot \mathbf{1}\mathbf{1}^\top$, which corrupts tokens to uniformly random tokens.
- Absorbing (masked) diffusion: $\mathbf{Q}_t = (1 - \beta_t) \mathbf{I} + \beta_t \mathbf{1}\mathbf{e}_{[\text{MASK}]}^\top$, which transitions each token to a special [MASK] token with probability $\beta_t$ (under the convention $[\mathbf{Q}_t]_{ij} = q(x_t = j \mid x_{t-1} = i)$, the probability mass must land in the [MASK] column).
- Token distance diffusion: Transitions prefer nearby tokens in an embedding space, using a softmax kernel over token distances.
The absorbing process is particularly tractable. At timestep $t$, each token has independently either been kept as $x_0$ (with probability $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$) or replaced by [MASK]. The posterior $q(x_{t-1} \mid x_t, x_0)$ has a closed form:
$$q(x_{t-1} \mid x_t = [\text{MASK}], x_0) = \frac{(1 - \bar{\alpha}_{t-1})\,\mathbf{e}_{[\text{MASK}]} + (\bar{\alpha}_{t-1} - \bar{\alpha}_t)\,\mathbf{e}_{x_0}}{1 - \bar{\alpha}_t}$$
$$q(x_{t-1} \mid x_t = x_0, x_0) = \mathbf{e}_{x_0}$$
This means the reverse step either unmasks a token (revealing $x_0$, with probability $(\bar{\alpha}_{t-1} - \bar{\alpha}_t)/(1 - \bar{\alpha}_t)$) or keeps it masked; a token that has been unmasked is never re-masked.
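As an illustrative NumPy sketch (not from any cited codebase), the absorbing forward corruption and the probability that a reverse step reveals a masked token can both be written in terms of the cumulative survival probabilities $\bar{\alpha}_t$:

```python
import numpy as np

MASK = -1  # stand-in id for the [MASK] token
rng = np.random.default_rng(1)

def absorbing_forward(x0, alpha_bar_t, rng):
    """Each token independently survives with prob alpha_bar_t, else -> MASK."""
    keep = rng.random(x0.shape) < alpha_bar_t
    return np.where(keep, x0, MASK)

def posterior_unmask_prob(alpha_bar_t, alpha_bar_tm1):
    """Probability the reverse step reveals x0 at a masked position:
    (abar_{t-1} - abar_t) / (1 - abar_t); remaining mass stays MASK."""
    return (alpha_bar_tm1 - alpha_bar_t) / (1.0 - alpha_bar_t)

x0 = np.array([5, 2, 9, 9])
xt = absorbing_forward(x0, alpha_bar_t=0.5, rng=rng)
p_reveal = posterior_unmask_prob(alpha_bar_t=0.25, alpha_bar_tm1=0.5)
```

Note that corrupted positions only ever become [MASK]; every surviving position still equals its clean token, which is what makes the posterior a two-point distribution.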
3.3 The MDLM Training Objective
Sahoo et al. (2024) showed that for the absorbing-state process, with $\bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)$ denoting the probability that a token survives unmasked through timestep $t$, the ELBO reduces to:
$$\mathcal{L}_{\text{MDLM}} = \mathbb{E}_{t, x_0} \left[ w(t) \sum_{i=1}^{n} \mathbf{1}[x_t^{(i)} = [\text{MASK}]] \cdot \left( -\log p_\theta(x_0^{(i)} \mid x_t) \right) \right]$$
where $w(t) = -\dot{\bar{\alpha}}_t / (1 - \bar{\alpha}_t)$ is a time-dependent weight derived from the noise schedule (positive, since $\bar{\alpha}_t$ is decreasing), $n$ is the sequence length, and the sum runs over masked positions. The key insight is that this is exactly a weighted masked language modeling (MLM) objective: the model predicts the original token at each masked position, summed over masked positions, with a schedule-dependent weight. Standard MLM (as in BERT) corresponds to setting $w(t) = \text{const}$ and sampling the masking rate uniformly, which is precisely the degenerate limit identified by Nie et al. (2024).
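A minimal NumPy rendering of this weighted MLM loss (the schedule term is folded into a positive weight `w_t` supplied by the caller; all names are ours, not from the MDLM codebase):

```python
import numpy as np

def mdlm_loss(logits, x0, mask, w_t):
    """Weighted masked-LM loss: w_t * sum over masked positions of
    -log p_theta(x0_i | x_t). logits: (n, V); x0: (n,); mask: (n,) bool."""
    logp = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    nll = -logp[np.arange(len(x0)), x0]      # per-position -log p(x0_i | x_t)
    return w_t * float(np.sum(nll[mask]))    # only masked positions contribute

# Toy check: a confident, correct prediction at the single masked position
# yields a small positive loss; unmasked positions are ignored entirely.
logits = np.array([[10.0, 0.0, 0.0],
                   [0.0, 10.0, 0.0]])
x0 = np.array([0, 1])
mask = np.array([True, False])
loss = mdlm_loss(logits, x0, mask, w_t=1.0)
```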
3.4 Score Entropy for Discrete Diffusion
Lou et al. (2023) took a different approach. The score of a continuous distribution $p$ is $\nabla_x \log p(x)$. For a discrete distribution $p$ over $\{1, \ldots, V\}$, the analog at a point $x$ is the vector of probability ratios $s(x)_j = p(j) / p(x)$ for $j \neq x$. The score-entropy loss trains a model $s_\theta$ to match these ratios:
$$\mathcal{L}_{\text{SE}} = \mathbb{E}_{x_t \sim q_t} \left[ \sum_{j \neq x_t} Q_t(j \mid x_t) \left( s_\theta(x_t, t)_j - \frac{q_t(j)}{q_t(x_t)} \log s_\theta(x_t, t)_j + K\!\left( \frac{q_t(j)}{q_t(x_t)} \right) \right) \right]$$
where $Q_t(j \mid x_t)$ is the off-diagonal rate of the continuous-time transition kernel, and $K(a) = a(\log a - 1)$ is a constant in $\theta$ that makes each term nonnegative and zero exactly when $s_\theta$ equals the true ratio. Because the true ratios are unknown, the objective is converted to a tractable denoising form by conditioning on $x_0$:
$$\mathcal{L}_{\text{DSM}} = \mathbb{E}_{x_0, x_t} \left[ \sum_{j \neq x_t} Q_t(j \mid x_t) \left( s_\theta(x_t, t)_j - \frac{q(x_t = j \mid x_0)}{q(x_t \mid x_0)} \log s_\theta(x_t, t)_j \right) \right]$$
up to a constant independent of $\theta$.
Lou et al. (2023) showed that the minimizer of this objective recovers the true marginal ratios $q_t(j)/q_t(x_t)$ (which, for the absorbing process, Ou et al. (2024) later connected to conditional distributions of the clean data), and used it to train SEDD models on language data, achieving perplexities competitive with GPT-2 on text8 and OpenWebText benchmarks while enabling non-autoregressive generation.
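The per-transition score-entropy term can be checked numerically: it is nonnegative and minimized exactly at the true ratio. A small sketch (our notation, not SEDD's implementation; `a` plays the role of the true ratio $q_t(j)/q_t(x_t)$):

```python
import numpy as np

def score_entropy_term(s, a):
    """One transition's score-entropy term: s - a*log(s) + a*(log(a) - 1).
    Nonnegative, and zero iff the model ratio s equals the true ratio a."""
    return s - a * np.log(s) + a * np.log(a) - a

grid = np.linspace(0.05, 5.0, 200)
a_true = 0.7
vals = score_entropy_term(grid, a_true)
best = grid[np.argmin(vals)]                   # grid point nearest a_true
at_truth = score_entropy_term(a_true, a_true)  # zero at the true ratio
```

Setting the derivative $1 - a/s$ to zero confirms the unique minimum at $s = a$, which is what makes the loss a consistent training signal for the ratios.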
3.5 Continuous Relaxation Approaches
Li et al. (2022) embedded tokens via a learned embedding matrix $\mathbf{E} \in \mathbb{R}^{V \times d}$, mapping $x_0 \mapsto \mathbf{e}(x_0) = \mathbf{E}[x_0]$, and applied continuous Gaussian diffusion in $\mathbb{R}^d$. The reverse process generates a continuous embedding $\hat{\mathbf{e}}_0$, which is then rounded to the nearest token:
$$\hat{x}_0 = \arg\max_{v \in \mathcal{V}} \, \mathbf{e}(v)^\top \hat{\mathbf{e}}_0$$
The embedding space is regularized so that clean embeddings $\mathbf{e}(x_0)$ remain close to the vocabulary simplex. However, at each denoising step $t$, the intermediate $\mathbf{x}_t$ is a linear combination of the clean embedding and noise, and the network must operate in a space where token identity is entangled with noise magnitude. This embedding-space approach avoids the discrete score problem but introduces a rounding bottleneck that limits generation quality. Gong et al. (2023) extended this to conditional generation with cross-attention over a source sequence.
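A sketch of the rounding step, using nearest-neighbor decoding (equivalent to the inner-product rule above when embedding rows share a common norm); the random embedding matrix is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 5, 8
E = rng.standard_normal((V, d))  # hypothetical embedding matrix

def round_to_token(e_hat, E):
    """Decode a denoised embedding to the nearest vocabulary token."""
    return int(np.argmin(np.sum((E - e_hat) ** 2, axis=1)))

# A lightly noised copy of token 3's embedding rounds back to token 3;
# rounding errors appear as the residual noise in e_hat grows.
e_hat = E[3] + 0.01 * rng.standard_normal(d)
tok = round_to_token(e_hat, E)
```

The cascading-error problem described above arises precisely when the residual noise is large enough to flip this argmin to a different token mid-trajectory.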
3.6 Sampling and Inference
For masked diffusion models, inference proceeds by sampling $x_T$ (all tokens masked), then applying the reverse process $T$ times. At each step $t$, the denoiser predicts $p_\theta(x_0 \mid x_t)$, from which the posterior $p_\theta(x_{t-1} \mid x_t)$ is computed. This can be done in parallel across all sequence positions, since each position is conditionally independent given $x_t$. The generation cost is therefore $T$ forward passes of the denoiser, each of which attends to the full (partially unmasked) sequence via bidirectional attention.
Inference efficiency relative to autoregressive models depends on the relationship between $T$ (diffusion steps) and sequence length $n$. An autoregressive model with key-value caching performs $n$ serial forward passes; each pass attends to the growing prefix, so total attention cost is $O(n^2)$ spread over $n$ sequential steps. Masked diffusion performs $T$ forward passes, each with full bidirectional self-attention at cost $O(n^2)$, for a total of $O(T n^2)$. Diffusion therefore spends more total compute whenever $T > 1$, but only $T$ serial steps, so it wins on wall-clock latency when $T \ll n$ and the hardware is not compute-bound. This motivates few-step sampling via distillation (analogous to consistency models in image generation).
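The reverse procedure can be sketched end to end. In the sketch below (illustrative only), the denoiser is a uniform-random stand-in for a trained network, the linear survival schedule is an arbitrary choice, and the reveal probability follows the absorbing-state posterior:

```python
import numpy as np

rng = np.random.default_rng(3)
MASK, V, n, T = -1, 10, 8, 4

# Survival schedule: abar[0] = 1 (all clean) down to abar[T] = 0 (all masked).
abar = np.linspace(1.0, 0.0, T + 1)

def dummy_denoiser(x_t, rng):
    """Stand-in for p_theta(x0 | x_t): uniform over the vocabulary here.
    A trained network would predict every position in one parallel pass."""
    return rng.integers(0, V, size=x_t.shape)

x = np.full(n, MASK)
for t in range(T, 0, -1):
    x0_hat = dummy_denoiser(x, rng)
    # Reveal each still-masked position with prob (abar[t-1]-abar[t])/(1-abar[t]);
    # at t = 1 this probability is 1, so no position stays masked.
    p_reveal = (abar[t - 1] - abar[t]) / (1.0 - abar[t])
    reveal = (x == MASK) & (rng.random(n) < p_reveal)
    x = np.where(reveal, x0_hat, x)
```

All $n$ positions are updated in each of the $T$ iterations, which is the parallelism the section describes; revealed tokens are never re-masked.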
4. Discussion
4.1 Empirical Comparisons
Lou et al. (2023) reported that SEDD-medium (169M parameters) reaches a perplexity of 31.7 on OpenWebText using 1000 generation steps, versus 25.4 for GPT-2 medium (345M parameters). Even accounting for the parameter-count difference, the gap reflects a structural disadvantage: a diffusion model must amortize its capacity across denoising at all noise levels, while an autoregressive model devotes all capacity to a single left-to-right prediction task. Sahoo et al. (2024) reported similar gaps for MDLM at matched parameter counts.
The quality gap widens significantly on tasks requiring long-form coherence. Diffusion models generate all tokens in parallel with shared context, which theoretically should enable better global consistency, but in practice, the denoising network must predict token identity from limited context at early timesteps when most tokens are masked, leading to inconsistency. Approaches like MDLM partially address this by using continuous-time masking rates that concentrate early steps near $t = T$ (mostly masked), but the fundamental tension between local and global consistency remains.
4.2 Conditional Generation and Control
One theoretical advantage of diffusion language models is the ease of conditioning. In autoregressive models, conditional generation (infilling, constrained decoding) requires careful modification of the sampling procedure, and exact constrained decoding is intractable in general (see, e.g., Khalifa et al., 2021). In masked diffusion, conditioning on observed tokens is straightforward: simply keep those tokens unmasked throughout the reverse process and exclude them from the denoising loss. Li et al. (2022) demonstrated this for text infilling, showing that Diffusion-LM can condition on arbitrary token subsets without retraining.
SEDD and MDLM inherit this property. Shi et al. (2024) demonstrated MDLM on protein sequence generation, where the ability to condition on known subsequences (e.g., scaffold regions) is directly useful. Similarly, molecules represented as token sequences admit natural infilling under masked diffusion. These structured generation settings represent a genuine advantage over autoregressive models.
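A minimal sketch of this clamping scheme, again with a uniform-random stand-in for the trained denoiser and a linear schedule (all names hypothetical): observed positions are fixed from the start and never resampled, while the remaining positions are filled in by the reverse process.

```python
import numpy as np

rng = np.random.default_rng(4)
MASK, V = -1, 10

def infill(observed, rng, T=4):
    """Masked-diffusion infilling sketch: entries of `observed` that hold a
    token id stay clamped; None entries are generated by the reverse process."""
    n = len(observed)
    clamp = np.array([o is not None for o in observed])
    x = np.array([o if o is not None else MASK for o in observed])
    abar = np.linspace(1.0, 0.0, T + 1)
    for t in range(T, 0, -1):
        x0_hat = rng.integers(0, V, size=n)   # dummy p_theta(x0 | x_t)
        p = (abar[t - 1] - abar[t]) / (1.0 - abar[t])
        reveal = (x == MASK) & ~clamp & (rng.random(n) < p)
        x = np.where(reveal, x0_hat, x)
    return x

out = infill([7, None, None, 3, None], rng)  # clamp positions 0 and 3
```

Because the denoiser attends to the clamped tokens at every step, the generated positions are conditioned on them bidirectionally, which is exactly the scaffold-conditioning use case described above.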
4.3 Scaling Behavior
A critical open question is whether diffusion language models exhibit similar scaling laws to autoregressive models. Hoffmann et al. (2022) established Chinchilla scaling laws for autoregressive LLMs: compute-optimal training requires scaling model parameters and training tokens proportionally, with perplexity following a power law in compute. No equivalent scaling analysis exists for diffusion language models. Preliminary evidence from MDLM suggests scaling follows similar trends, but the absolute perplexity values at matched compute remain worse than autoregressive equivalents. Whether this gap closes, narrows, or widens at scale is unknown.
4.4 Real-World Deployments
As of early 2025, no large-scale commercial language model uses discrete diffusion as the primary generation mechanism. Stability AI and Hugging Face have released small diffusion language model checkpoints for research use. The primary deployment contexts for diffusion-based text generation are structured domains: protein sequence design (Luo et al., 2022), small molecule generation (Hoogeboom et al., 2022), and code infilling in constrained settings. These domains benefit from the bidirectional conditioning and parallel generation properties while being relatively insensitive to the perplexity gap at scale.
Latency improvements have been demonstrated in academic settings. Chang et al. (2022) showed that masked generative image transformers (MaskGIT), which are closely related to masked diffusion, can generate images in 8 parallel decoding steps versus hundreds of sequential autoregressive steps, with competitive quality. Analogous results for language remain limited, partly because text generation quality is more sensitive to small local errors than image generation, where perceptual smoothing is possible.
5. Conclusion
Diffusion language models constitute a theoretically principled alternative to autoregressive generation, grounded in discrete Markov chain diffusion, score-entropy objectives, or continuous relaxations. The masked diffusion framework — particularly MDLM — reduces to a weighted masked language modeling objective that is simple to implement and theoretically grounded. Despite this theoretical appeal, diffusion language models currently underperform autoregressive models at matched parameter and compute budgets on standard language modeling benchmarks, with perplexity gaps of 5–10 points at the 100M–1B scale. The most significant open problems are: (1) establishing scaling laws for discrete diffusion to determine whether the quality gap persists at scale, (2) developing few-step sampling methods analogous to consistency distillation that reduce generation cost without quality loss, and (3) characterizing the inductive biases introduced by different noise schedules and their downstream effects on generated text quality.
References
Austin, J., Johnson, D., Ho, J., Tarlow, D., and van den Berg, R. (2021). Structured Denoising Diffusion Models in Discrete State-Spaces. NeurIPS 2021.
Chang, H., Zhang, H., Jiang, L., Liu, C., and Freeman, W. T. (2022). MaskGIT: Masked Generative Image Transformer. CVPR 2022.
Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. (2024). Accelerating Large Language Model Decoding with Speculative Sampling. arXiv:2302.01318.
Ghazvininejad, M., Levy, O., Liu, Y., and Zettlemoyer, L. (2019). Mask-Predict: Parallel Decoding of Conditional Masked Language Models. EMNLP 2019.
Gong, S., Li, M., Feng, J., Wu, Z., and Kong, L. (2023). DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models. ICLR 2023.
Gu, J., Wang, C., and Zhao, J. (2019). Levenshtein Transformer. NeurIPS 2019.
Ho, J., Jain, A., and Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020.
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskiy, E., Cai, T., Rutherford, E., de Wiele, T. V., Hendricks, L. A., Welbl, J., Clark, A., Cassirer, A., Henning, J., Latysheva, E., and Sifre, L. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556.
Hoogeboom, E., Satorras, V. G., Vignac, C., and Welling, M. (2022). Equivariant Diffusion for Molecule Generation in 3D. ICML 2022.
Khalifa, M., Elsahar, H., and Dymetman, M. (2021). A Distributional Approach to Controlled Text Generation. ICLR 2021.
Li, X. L., Thickstun, J., Gulrajani, I., Liang, P., and Hashimoto, T. (2022). Diffusion-LM Improves Controllable Text Generation. NeurIPS 2022.
Lou, A., Meng, C., and Ermon, S. (2023). Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution. arXiv:2310.16834.
Luo, S., Su, Y., Peng, X., Wang, S., Peng, J., and Ma, J. (2022). Antigen-Specific Antibody Design and Optimization with Diffusion-Based Generative Models. NeurIPS 2022.
Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., and Anandkumar, A. (2024). A Blessing of Randomness: BERT Is Not Only A Masked Language Model But Also A Noise-Corrupted Text Model. ICLR 2024.
Ou, J., Shi, J., Kaiser, L., Song, Y., Tang, J., and Yu, Y. (2024). Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data. arXiv:2406.03736.
Sahoo, S. S., Arriola, M., Schiff, Y., Gokaslan, A., Marroquin, E., Chiu, J. T., Rush, A., and Kuleshov, V. (2024). Simple and Effective Masked Diffusion Language Models. NeurIPS 2024.
Shi, J., Han, K., Wang, Z., Doucet, A., and Titsias, M. (2024). Simplified and Generalized Masked Diffusion for Discrete Data. NeurIPS 2024.
Song, Y. and Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.