Abstract
Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning large language models with human values. Its standard pipeline requires training an explicit reward model on preference data, then optimizing the language model against that reward signal via PPO. Direct Preference Optimization (DPO) (Rafailov et al., 2023) proposes a radical simplification: it shows that under the Bradley-Terry preference model with a KL-divergence regularizer, the optimal policy can be extracted in closed form from preference data alone, eliminating the reward model entirely. This paper examines DPO’s mathematical derivation, its relationship to the RLHF objective, practical training dynamics, and the empirical evidence for and against its adoption. We analyze theoretical limitations including distributional shift vulnerability, the implicit reward’s geometry, and recent extensions such as IPO, KTO, SimPO, and ORPO that address DPO’s known failure modes. The analysis positions DPO not as a universal replacement for RLHF but as a theoretically principled simplification whose practical limits inform the broader trajectory of preference-based alignment.
1. Introduction
Training large language models (LLMs) to follow instructions and behave helpfully requires more than next-token prediction on internet corpora. The resulting models, absent targeted intervention, exhibit a range of undesirable behaviors: hallucination, sycophancy, harmful content generation, and poor instruction adherence. Reinforcement Learning from Human Feedback (Christiano et al., 2017; Ouyang et al., 2022) addresses this by fine-tuning models against a learned reward signal derived from human pairwise preferences.
The canonical RLHF pipeline proceeds in three stages. First, a supervised fine-tuning (SFT) stage adapts the pretrained model to the target domain. Second, a reward model $r_\phi(x, y)$ is trained by maximum likelihood on preference pairs $(y_w \succ y_l | x)$, where $y_w$ is the preferred completion and $y_l$ the rejected one. Third, the SFT model is optimized against this reward signal subject to a KL-divergence constraint to prevent reward hacking and maintain fluency:
$$\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(y|x)} \left[ r_\phi(x, y) \right] - \beta \cdot D_{\mathrm{KL}} \left[ \pi_\theta(y|x) \| \pi_{\mathrm{ref}}(y|x) \right]$$
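As a concrete reference point, this objective admits a simple Monte Carlo estimate when completions are sampled from $\pi_\theta$; the following is a minimal PyTorch sketch (function and variable names are illustrative, not drawn from any particular RLHF library), using the standard single-sample estimator $\log \pi_\theta - \log \pi_{\mathrm{ref}}$ for the KL term:

```python
import torch

def rlhf_objective(reward, policy_logprob, ref_logprob, beta=0.1):
    """Monte Carlo estimate of the KL-constrained RLHF objective.

    reward:         r_phi(x, y) for completions sampled from pi_theta, shape (batch,)
    policy_logprob: sum_t log pi_theta(y_t | x, y_<t), shape (batch,)
    ref_logprob:    sum_t log pi_ref(y_t | x, y_<t), shape (batch,)
    """
    # For y ~ pi_theta, log(pi_theta/pi_ref) is an unbiased single-sample
    # estimate of the KL divergence term.
    kl_estimate = policy_logprob - ref_logprob
    return (reward - beta * kl_estimate).mean()
```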
This pipeline has produced aligned models of remarkable capability (Bai et al., 2022; OpenAI, 2023). However, it suffers from significant engineering complexity: reward model training introduces a separate optimization loop, PPO’s on-policy sampling is computationally expensive, and the combination of reward hacking, mode collapse, and training instability demands careful hyperparameter tuning.
Rafailov et al. (2023) demonstrated that under certain assumptions this three-stage pipeline can be collapsed into a single stage. By observing that the KL-constrained reward maximization objective has a closed-form optimal solution, and that this solution implicitly defines the reward in terms of the policy ratio, one can re-parameterize the reward model training directly in terms of the policy being optimized. The result is a simple binary cross-entropy objective on preference pairs, operating directly on the language model without ever instantiating an explicit reward model or running reinforcement learning.
This paper provides a systematic technical examination of DPO: its derivation, the assumptions it depends on, the geometry of the implicit reward, empirical performance evidence, and the growing family of DPO variants that respond to its limitations.
2. Related Work
Christiano et al. (2017) introduced RLHF in the context of robotic locomotion and Atari environments, establishing the paradigm of learning reward models from human comparisons. Their work demonstrated that scalar reward functions could capture nuanced human preferences better than hand-designed objectives.
Ouyang et al. (2022) scaled RLHF to large language models, producing InstructGPT. They showed that models trained with RLHF on human preference data significantly outperformed models an order of magnitude larger on instruction-following tasks as judged by human evaluators, establishing the template for alignment fine-tuning.
Stiennon et al. (2020) applied RLHF to abstractive summarization, demonstrating that preference-trained models outperformed supervised fine-tuning baselines on human quality judgments even when trained on substantially less human annotation, highlighting the data efficiency of preference learning.
Rafailov et al. (2023) introduced DPO, proving mathematically that the RLHF KL-constrained objective implicitly defines the reward as a function of the policy ratio, enabling direct optimization of the language model on preference data. Their empirical results showed competitive or superior performance to PPO-based RLHF on summarization and dialogue tasks.
Azar et al. (2023) identified that DPO’s implicit reward is unbounded: the model can reduce the loss toward zero by assigning arbitrarily high relative probability to preferred responses regardless of their absolute quality. They proposed Identity Preference Optimization (IPO), which replaces DPO’s logistic loss with a squared loss that regresses the preference margin toward a fixed target, keeping the implicit reward bounded and better calibrated.
Ethayarajh et al. (2024) proposed KTO (Kahneman-Tversky Optimization), grounding preference optimization in prospect theory. Unlike DPO, KTO operates on individual responses labeled as desirable or undesirable without requiring paired comparisons, substantially reducing annotation cost.
Meng et al. (2024) introduced SimPO, replacing DPO’s policy ratio reward with a length-normalized reward derived directly from the policy’s log-probabilities, finding better calibration and reduced length bias in model outputs.
3. Technical Analysis
3.1 The Closed-Form Optimal Policy
The key theoretical insight of DPO begins with recognizing that the KL-constrained RLHF objective has a known optimal solution. Consider the optimization problem:
$$\max_{\pi} \mathbb{E}_{x, y \sim \pi} \left[ r(x, y) \right] - \beta \cdot D_{\mathrm{KL}} \left[ \pi(y|x) \| \pi_{\mathrm{ref}}(y|x) \right]$$
This is a variational problem in the space of probability distributions. Writing the KL divergence explicitly and applying calculus of variations, one finds that the unique optimal policy takes the form:
$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\mathrm{ref}}(y|x) \exp\left( \frac{r(x,y)}{\beta} \right)$$
where $Z(x) = \sum_y \pi_{\mathrm{ref}}(y|x) \exp(r(x,y)/\beta)$ is the partition function. This is the Gibbs distribution induced by tilting the reference policy by the reward, with temperature $\beta$.
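The tilting relationship is easy to verify numerically on a toy discrete problem; the following NumPy sketch (illustrative, with a six-element "vocabulary" standing in for the completion space) checks that the Gibbs-tilted policy beats randomly sampled alternatives on the objective:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, n = 0.5, 6
pi_ref = rng.dirichlet(np.ones(n))    # reference policy over a toy 6-way "vocabulary"
r = rng.normal(size=n)                # arbitrary reward for each completion

# Gibbs tilt: pi*(y) proportional to pi_ref(y) * exp(r(y) / beta).
tilted = pi_ref * np.exp(r / beta)
pi_star = tilted / tilted.sum()       # normalize by the partition function Z(x)

def objective(pi):
    kl = np.sum(pi * np.log(pi / pi_ref))
    return np.sum(pi * r) - beta * kl

# The tilted policy should beat any other distribution on E[r] - beta * KL.
for _ in range(1000):
    assert objective(pi_star) >= objective(rng.dirichlet(np.ones(n))) - 1e-9
```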
This expression can be inverted to express the reward in terms of the policy:
$$r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta \log Z(x)$$
Since $\log Z(x)$ does not depend on $y$, it cancels in pairwise comparisons. Under the Bradley-Terry preference model:
$$p(y_w \succ y_l | x) = \sigma(r(x, y_w) - r(x, y_l))$$
the reward difference becomes:
$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}$$
Substituting into the Bradley-Terry likelihood and maximizing over the policy $\pi_\theta$ in place of $\pi^*$ yields the DPO objective:
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} \right) \right]$$
This is a standard binary cross-entropy loss on the difference of log-probability ratios. No reward model is required; no RL loop runs. The implicit reward at any point during training is:
$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$
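The loss translates directly into a few lines of code. Below is a minimal PyTorch sketch, assuming the per-sequence log-probabilities (summed over tokens) have been computed elsewhere and that the reference values carry no gradient; names are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss from per-sequence log-probabilities (summed over tokens)."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_lp - ref_rejected_lp  # log pi/pi_ref for y_l
    margin = beta * (chosen_logratio - rejected_logratio)     # h_theta in Sec. 3.2
    loss = -F.logsigmoid(margin).mean()
    # Implicit rewards, useful for monitoring the dynamics discussed in Sec. 3.2.
    return loss, beta * chosen_logratio.detach(), beta * rejected_logratio.detach()
```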
3.2 Gradient Analysis and Training Dynamics
The gradient of the DPO loss with respect to policy parameters $\theta$ is informative. Let $h_\theta = \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}$. Then:
$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta (1 - \sigma(h_\theta)) \left[ \nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x) \right]$$
The term $(1 - \sigma(h_\theta))$ acts as an adaptive weight: when the model already correctly ranks the pair (large positive $h_\theta$), the gradient contribution is small; when the pair is misranked, it is large. This mirrors the weighting in standard margin classifiers and concentrates the learning signal on examples the model still gets wrong.
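This weighting can be checked directly against autograd on a toy single-token policy. In the sketch below (illustrative; softmax logits $\theta$ stand in for the full model parameters), the analytic expression matches the computed gradient:

```python
import torch
import torch.nn.functional as F

beta, vocab = 0.2, 8
theta = torch.randn(vocab, requires_grad=True)            # toy one-step policy logits
ref_logp = torch.log_softmax(torch.randn(vocab), dim=-1)  # frozen reference
w, l = 2, 5                                               # token ids of y_w and y_l

logp = torch.log_softmax(theta, dim=-1)
h = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
loss = -F.logsigmoid(h)
loss.backward()

# Analytic form: -beta * (1 - sigmoid(h)) * (grad log pi(y_w) - grad log pi(y_l)).
# For softmax logits, grad_theta log pi(y) = onehot(y) - pi.
with torch.no_grad():
    pi = logp.exp()
    e_w, e_l = torch.eye(vocab)[w], torch.eye(vocab)[l]
    analytic = -beta * (1 - torch.sigmoid(h)) * ((e_w - pi) - (e_l - pi))
    assert torch.allclose(theta.grad, analytic, atol=1e-6)
```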
The gradient updates simultaneously increase $\log \pi_\theta(y_w|x)$ and decrease $\log \pi_\theta(y_l|x)$. In practice, Rafailov et al. noted that training tends to decrease $\log \pi_\theta(y_l|x)$ more aggressively than it increases $\log \pi_\theta(y_w|x)$ — a phenomenon sometimes called the suppression imbalance — which can lead to the model becoming overly conservative or generating degenerate distributions for rejected completions.
3.3 Distributional Shift and the Off-Policy Problem
DPO’s derivation assumes that preferences are collected under the optimal (or current) policy. In practice, preference datasets are typically collected under a fixed reference policy $\pi_{\mathrm{ref}}$, often the SFT model. The optimization then proceeds on this fixed offline dataset rather than on-policy samples.
This creates a distributional shift problem. The implicit reward $\hat{r}_\theta(x, y)$ is evaluated on pairs $(y_w, y_l)$ drawn from $\pi_{\mathrm{ref}}$, but the goal is to align $\pi_\theta$, which may deviate substantially from $\pi_{\mathrm{ref}}$ as training progresses. Completions to which $\pi_\theta$ assigns high probability may lie outside the support of the training distribution, receiving no gradient signal regardless of their quality.
Formally, if $\pi_\theta$ places significant mass on completion $y$ but $p_\mathcal{D}(y|x) \approx 0$, the loss provides no learning signal for those continuations. This is the standard limitation of offline RL methods and is precisely what on-policy PPO sampling addresses by continually refreshing the training distribution.
Huang et al. (2024) quantified this by showing that DPO’s effective sample efficiency degrades sharply as the model moves away from the data distribution, and proposed iterative DPO approaches where preference data is collected from the current policy checkpoint periodically — at the cost of recovering some of the pipeline complexity DPO was meant to eliminate.
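In outline, one iteration of such a scheme looks as follows. This is schematic pseudocode under the general description above; every helper named here is hypothetical and stands in for project-specific infrastructure, not an API from Huang et al. (2024) or any particular library:

```python
# Schematic pseudocode for iterative DPO; all helpers below are hypothetical.
policy = load_sft_checkpoint()
for round in range(num_rounds):
    prompts = sample_prompts(prompt_pool)
    # Refresh the training distribution with on-policy samples...
    candidates = [policy.generate(p, num_samples=2) for p in prompts]
    # ...and label them (human annotators or a fixed judge model).
    preferences = collect_preferences(prompts, candidates)
    # Re-anchor the reference model to the current checkpoint,
    # then run one round of standard offline DPO.
    reference = freeze(copy(policy))
    policy = train_dpo(policy, reference, preferences, beta=0.1)
```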
3.4 The Unbounded Reward Problem and IPO
A more subtle issue is that DPO’s implicit reward is not inherently bounded. The loss function reaches its minimum when:
$$\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} \to +\infty$$
This can be achieved by driving $\pi_\theta(y_l|x) \to 0$, i.e., by assigning vanishingly small probability to all rejected completions — irrespective of whether they have any meaningful absolute quality. The model effectively memorizes which completions were rejected rather than learning underlying quality signals.
IPO (Azar et al., 2023) addresses this by replacing the logistic loss with a squared loss that regresses the preference margin toward a fixed target rather than pushing it toward infinity. Writing the unscaled margin as $h_\theta / \beta$ (the difference of log-probability ratios without the $\beta$ factor), the objective is:
$$\mathcal{L}_{\mathrm{IPO}} = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \left( \frac{h_\theta}{\beta} - \frac{1}{2\beta} \right)^2 \right]$$
Because the loss is minimized at a finite margin of $\frac{1}{2\beta}$ and increases on either side, the optimization cannot drive the implicit reward to infinity, and the learned preference representation remains more calibrated.
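The contrast between the two losses is visible numerically; a small NumPy sketch (illustrative, with $\beta = 0.1$ so the IPO target margin is $1/(2\beta) = 5$):

```python
import numpy as np

beta = 0.1
margin = np.array([-2.0, 0.0, 5.0, 15.0, 30.0])  # unscaled margin h_theta / beta

dpo_loss = np.logaddexp(0.0, -beta * margin)     # -log sigmoid(beta * margin)
ipo_loss = (margin - 1 / (2 * beta)) ** 2        # squared distance to target 5

for m, d, i in zip(margin, dpo_loss, ipo_loss):
    print(f"margin {m:5.1f}   DPO loss {d:7.4f}   IPO loss {i:8.2f}")
# The DPO loss keeps shrinking as the margin grows without bound, so driving
# pi(y_l) -> 0 always reduces it; IPO is minimized at exactly 1/(2*beta) = 5
# and penalizes overshoot, which bounds the implicit reward.
```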
3.5 Length Bias and SimPO
A practical failure mode of DPO is length exploitation. Since $\log \pi_\theta(y|x) = \sum_t \log \pi_\theta(y_t | x, y_{<t})$, the implicit reward is a sum of per-token log-ratios whose magnitude grows with sequence length; the optimization can therefore widen the preference margin by producing longer completions rather than better ones, and DPO-trained models frequently drift toward verbosity.

SimPO (Meng et al., 2024) addresses this by normalizing log-probabilities by sequence length before computing the reward signal, and by removing the reference model from the reward computation entirely:

$$\hat{r}_{\mathrm{SimPO}}(x, y) = \frac{\beta}{|y|} \sum_{t=1}^{|y|} \log \pi_\theta(y_t | x, y_{<t})$$
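The effect of the normalization is easy to see on synthetic log-probabilities; a NumPy sketch (illustrative) comparing the two rewards for a short and a long completion of identical per-token quality:

```python
import numpy as np

beta = 2.0

def dpo_implicit_reward(logp_tokens, ref_logp_tokens):
    # beta * sum_t log(pi/pi_ref): magnitude grows with sequence length.
    return beta * (logp_tokens.sum() - ref_logp_tokens.sum())

def simpo_reward(logp_tokens):
    # (beta/|y|) * sum_t log pi: length-normalized, no reference model.
    return (beta / len(logp_tokens)) * logp_tokens.sum()

rng = np.random.default_rng(0)
for n in (10, 100):
    ref = rng.uniform(-2.0, -1.0, size=n)   # per-token reference log-probs
    pol = ref + 0.05                        # policy beats reference by 0.05 nats/token
    print(n, round(dpo_implicit_reward(pol, ref), 2), round(simpo_reward(pol), 3))
# The DPO reward is 10x larger for the 100-token completion despite identical
# per-token quality (0.05 * n * beta); SimPO's reward tracks the average instead.
```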
4. Discussion

4.1 When DPO Works and When It Doesn’t

Empirically, DPO performs competitively with PPO-based RLHF on tasks where the preference data distribution closely matches the target distribution at inference time, and where the SFT model already generates plausible completions. Under these conditions, the closed-form derivation’s assumptions hold reasonably well, and the simplicity of the DPO objective allows stable, efficient training.

The picture changes in two regimes. First, when the task requires exploration — generating novel completions not well-represented in the preference dataset — DPO’s off-policy nature becomes a liability. PPO’s on-policy sampling naturally discovers high-reward regions of response space even when they are absent from the initial dataset; DPO cannot bootstrap such discovery from fixed preference data. This is why on-policy RLHF continues to dominate in settings like mathematical reasoning (Lightman et al., 2023; Shao et al., 2024), where the reward signal needs to guide exploration toward qualitatively different solution strategies.

Second, DPO is sensitive to the quality of the preference dataset. Because its training signal is purely contrastive, it can overfit to noise in pairwise judgments. A reward model trained on the same data can smooth over that noise by averaging across many predictions, yielding a less jagged reward landscape. The indirect nature of DPO’s reward also makes it harder to diagnose when training goes wrong.

4.2 The Annotation Efficiency Argument

One often-cited practical advantage of DPO over PPO is reduced annotation burden: once a preference dataset exists, no further human annotation or reward model queries are needed during training. KTO (Ethayarajh et al., 2024) extends this further by operating on pointwise rather than pairwise annotations, reducing the combinatorial annotation burden from $O(n^2)$ pairs to $O(n)$ single-response judgments. In settings where annotation is expensive, this represents a genuine practical advantage, independent of performance considerations.

4.3 DPO in the Modern Alignment Stack

In production systems, DPO and its variants have increasingly been used as a fine-tuning stage applied after initial RLHF training, rather than as a wholesale replacement. This hybrid approach — using PPO for exploration-heavy initial alignment, then refining with DPO on accumulated preference data — attempts to capture the complementary strengths of each method. The result is a more modular alignment pipeline in which the choice of preference optimization algorithm is treated as a design parameter rather than a categorical commitment.

The emergence of ORPO (Hong et al., 2024), which folds preference optimization directly into the SFT cross-entropy objective through a log-odds penalty term, represents a further step toward integrating alignment into the core training loop rather than treating it as a separate post-processing stage. This trajectory suggests that the field is converging toward alignment methods that operate continuously during training rather than episodically during post-training.

4.4 Theoretical Gaps and Open Questions

Several theoretical questions remain unresolved. The Bradley-Terry model assumes transitivity of preferences — that if $A \succ B$ and $B \succ C$, then $A \succ C$ — but human preferences are frequently intransitive, particularly across contexts. How preference optimization methods behave under systematic violations of this assumption is not well understood. The relationship between DPO’s implicit reward geometry and the semantic structure of the response space also remains opaque: it is unclear whether the implicit rewards learned by DPO correspond to coherent quality dimensions or to statistical artifacts of the preference data collection process.

5. Conclusion

Direct Preference Optimization represents a genuinely elegant theoretical contribution: the observation that the optimal policy of the KL-constrained RLHF objective implicitly defines the reward model, enabling a reduction of the full three-stage pipeline to a single cross-entropy loss on preference pairs. This derivation is mathematically correct and practically consequential — DPO and its variants have been widely adopted in both research and production settings.

However, the elegance of the derivation should not obscure its assumptions. The off-policy training regime, the unbounded implicit reward, and the length exploitation problem are structural limitations that require either extensions (IPO, SimPO, KTO) or hybrid approaches combining DPO with on-policy methods. The practical choice between DPO and PPO-based RLHF is not one of correctness but of appropriateness to the task: for distribution-close fine-tuning with fixed preference data, DPO offers substantial engineering simplification; for exploration-heavy alignment in complex reasoning domains, on-policy methods retain decisive advantages.

The larger lesson may be about the nature of alignment research itself: the theoretical idealization that produces the DPO derivation — well-specified preferences, covered data distribution, a tractable Bradley-Terry model — captures only part of the problem. Progress in alignment will require methods that are robust to the systematic deviations from these ideals that characterize real-world preference data collection and deployment.
References