Abstract
Reinforcement learning from verifiable rewards has emerged as a powerful paradigm for training large language models (LLMs) to reason accurately and reliably. Group Relative Policy Optimization (GRPO), introduced by DeepSeek-AI, addresses a core inefficiency of prior policy gradient methods by eliminating the critic network and instead estimating advantages through intra-group normalization. This approach dramatically reduces memory overhead while preserving the credit assignment fidelity needed for complex multi-step reasoning tasks. In this post, we examine the theoretical foundations of GRPO, trace its lineage through Proximal Policy Optimization (PPO) and the broader RLHF literature, and analyze how verifiable reward signals — such as correctness checks on mathematical solutions — differ qualitatively from learned reward models. We discuss the conditions under which GRPO outperforms supervised fine-tuning and RLHF with neural reward models, and identify open problems including reward sparsity, format constraints, and the tension between generalization and reward exploitation. The evidence suggests GRPO represents a significant step toward scalable, verifiable self-improvement in language models, with implications extending beyond mathematics to code, logic, and scientific reasoning.
1. Introduction
Training large language models to reason — not just to retrieve or pattern-match — is one of the central challenges in modern machine learning. While pretraining on large corpora endows models with broad linguistic competence, it does not in itself instill reliable logical inference, multi-step mathematical deduction, or structured problem-solving. Supervised fine-tuning (SFT) on curated reasoning traces provides one approach, but it is fundamentally limited by the quality and coverage of available demonstrations: a model trained to imitate correct reasoning chains will fail on problems outside its training distribution and has no mechanism to self-correct.
Reinforcement learning offers an alternative. Rather than teaching a model what to output, RL teaches a model what to achieve. In the context of LLMs, this means optimizing a policy — the distribution over token sequences — to maximize some notion of correctness or quality. The key insight motivating recent work is that for structured domains like mathematics and code, verifiable rewards are available: a mathematical answer is objectively correct or not, a program either passes test cases or fails. This verification does not require a learned reward model; it is grounded in external truth.
Proximal Policy Optimization (PPO), the dominant RL algorithm in RLHF pipelines (Schulman et al., 2017), was not designed for this regime. PPO requires a value function (critic) trained concurrently with the policy, doubling memory requirements and introducing a potentially miscalibrated baseline. DeepSeek-R1 and the accompanying GRPO algorithm (Shao et al., 2024) address this directly by replacing the critic with intra-group advantage normalization, yielding a simpler, more memory-efficient algorithm that achieves state-of-the-art performance on mathematical reasoning benchmarks.
This post provides a rigorous treatment of GRPO: its derivation from first principles, its relationship to prior policy gradient methods, and its empirical behavior on reasoning tasks. We also examine the broader implications for the field — including what it means for a language model to “discover” reasoning strategies through reinforcement alone.
2. Related Work
Schulman et al. (2017) introduced Proximal Policy Optimization, which stabilizes policy gradient training through a clipped surrogate objective. PPO has become the standard algorithm in RLHF pipelines for LLMs (Ouyang et al., 2022), but its reliance on a value network is a significant engineering burden at scale.
Ouyang et al. (2022) — InstructGPT — demonstrated that RLHF with PPO dramatically improves the alignment and instruction-following capabilities of GPT-3. This work established the modern RLHF pipeline: supervised fine-tuning → reward model training → RL fine-tuning. However, InstructGPT relied on human-preference reward models, which are susceptible to reward hacking and do not provide verifiable correctness signals.
Lightman et al. (2023) explored process reward models (PRMs) as an alternative to outcome reward models, providing step-level feedback for mathematical reasoning. PRMs substantially improve performance on MATH and GSM8K benchmarks, but they require annotated intermediate steps — a costly and potentially inconsistent annotation process.
Wei et al. (2022) established chain-of-thought prompting as a mechanism for eliciting multi-step reasoning in LLMs. Their work showed that generating reasoning traces improves accuracy on arithmetic, commonsense, and symbolic reasoning tasks. GRPO can be viewed as an RL method for learning to generate effective reasoning traces rather than imitating exemplar chains.
Shao et al. (2024) — the DeepSeekMath and GRPO paper — introduced GRPO as a variant of PPO that estimates baselines from group-level outputs, eliminating the need for a critic model. Applied to mathematical reasoning with verifiable rewards, GRPO achieves 51.7% on the MATH benchmark with a 7B parameter model, competitive with much larger systems.
Ziegler et al. (2019) provided foundational analysis of fine-tuning language models from human preferences, establishing the KL-regularized reward objective that GRPO (and PPO-based RLHF) inherits. Their framework motivates the KL penalty against the reference policy used in GRPO to prevent distribution collapse.
3. Technical Analysis
3.1 Policy Gradient Foundations
Let $\pi_\theta$ denote the policy parameterized by $\theta$, and let $q$ denote a question sampled from the training distribution. The policy generates a response $o = (o_1, o_2, \ldots, o_T)$ token by token according to:

$$\pi_\theta(o \mid q) = \prod_{t=1}^{T} \pi_\theta(o_t \mid q, o_{<t})$$

The standard policy gradient objective is to maximize the expected reward:

$$J(\theta) = \mathbb{E}_{q \sim \mathcal{D},\, o \sim \pi_\theta(\cdot \mid q)}\left[r(q, o)\right]$$

where $r(q, o)$ is the reward signal. The REINFORCE estimator of the gradient is:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=1}^T \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})\, A_t\right]$$

where $A_t$ is an advantage estimate at step $t$. High variance in $A_t$ is the central challenge: without a good baseline, gradient estimates are noisy and convergence is slow.

3.2 PPO and the Critic Bottleneck

PPO addresses variance through a learned value function $V_\phi(s_t)$ that estimates the expected future return from state $s_t$. The advantage is estimated as:

$$A_t^{\text{PPO}} = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$

and the policy is updated by maximizing the clipped surrogate objective:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t A_t,\ \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) A_t\right)\right]$$

where $\rho_t = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}$ is the importance ratio between the current and behavior policies.

At LLM scale, maintaining a value network $V_\phi$ with the same parameter count as the policy imposes significant memory overhead: two full model copies plus optimizer states are typically required. For a 70B model with Adam, the combined actor-critic setup can exceed 1 TB of GPU memory.

3.3 GRPO: Group Relative Advantage Estimation

GRPO eliminates the critic by sampling a group of $G$ responses $\{o_1, o_2, \ldots, o_G\}$ from the current policy for each question $q$, evaluating their rewards $\{r_1, r_2, \ldots, r_G\}$, and computing advantages by normalizing against the group statistics:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1,\ldots,r_G\})}{\text{std}(\{r_1,\ldots,r_G\})}$$

This is a simple but elegant idea: the group of sampled responses for the same question provides a natural baseline for credit assignment. A response that achieves a reward higher than the group average receives a positive advantage; one that falls below receives a negative advantage. No learned value function is needed; the baseline emerges from the empirical distribution of the policy’s own outputs.

The GRPO policy loss is then:

$$\mathcal{L}^{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left[\min\left(\rho_{i,t} \hat{A}_i,\ \text{clip}(\rho_{i,t}, 1-\epsilon, 1+\epsilon)\hat{A}_i\right) - \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})\right]$$

where $\pi_{\text{ref}}$ is the reference policy (typically the SFT model) and $\beta$ controls the strength of the KL penalty. The KL term prevents the policy from drifting too far from the reference, which would risk incoherent or degenerate outputs.
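The group-relative advantage is simple enough to sketch in a few lines. The following Python is a minimal illustration, assuming binary per-response rewards; the function name and the small `eps` guard are assumptions of this sketch, not taken from any released implementation.

```python
# Minimal sketch of GRPO's group-relative advantage (names are illustrative).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own group's mean and std.

    Rewards above the group mean get positive advantages, rewards below
    get negative ones: the baseline comes from the group itself, so no
    learned critic is needed. eps guards against all-equal reward groups.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: sqrt((1/G) * sum((r - mu)^2))
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of G = 4 sampled responses: two correct (r = 1), two incorrect (r = 0).
# Correct responses receive advantage close to +1, incorrect ones close to -1.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Note that the advantages within a group always sum to (approximately) zero: the group can only redistribute credit among its own members, which is exactly what makes it a baseline rather than an absolute value estimate.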
3.4 Verifiable Rewards vs. Learned Reward Models

GRPO’s effectiveness on mathematical reasoning is largely attributable to the quality of the reward signal. For a mathematical problem with a unique numerical answer, the reward function is:

$$r(q, o) = \begin{cases} 1 & \text{if the answer extracted from } o \text{ equals the ground truth} \\ 0 & \text{otherwise} \end{cases}$$

This binary outcome reward is sparse but unambiguous. In practice, DeepSeek augments it with a format reward that checks whether the model correctly uses structured delimiters (e.g., placing the final answer in a $\boxed{\cdot}$ expression). The total reward is:

$$r_{\text{total}} = r_{\text{accuracy}} + \lambda \cdot r_{\text{format}}$$

Critically, neither reward component requires a neural network to evaluate. This eliminates reward hacking against a learned reward model, a well-documented failure mode in RLHF (Skalse et al., 2022): there is no reward model to differentiate through, and nothing to game by searching for adversarial inputs that score highly while being semantically incorrect.

3.5 Variance Analysis and Sample Efficiency

The variance of the GRPO advantage estimator can be analyzed as follows. For a fixed question $q$, the group rewards $r_1, \ldots, r_G$ are i.i.d. draws from some distribution with mean $\mu$ and variance $\sigma^2$. The normalized advantage for response $i$ is:

$$\hat{A}_i = \frac{r_i - \bar{r}}{s}$$

where $\bar{r} = \frac{1}{G}\sum_j r_j$ and $s = \sqrt{\frac{1}{G}\sum_j (r_j - \bar{r})^2}$. As $G \to \infty$, $\bar{r} \to \mu$ and $s \to \sigma$, so $\hat{A}_i \to \frac{r_i - \mu}{\sigma}$, the standardized reward. The variance contributed by the baseline term decreases as $O(1/G)$, making larger group sizes beneficial but with diminishing returns. In practice, $G = 8$ to $G = 64$ is used, balancing sample efficiency against computational cost.

3.6 Emergence of Reasoning Behaviors

A striking observation from DeepSeek-R1 training is that GRPO induces reasoning behaviors that were not explicitly supervised. Models trained with GRPO spontaneously develop:

- self-verification: re-checking intermediate steps before committing to a final answer;
- backtracking: abandoning an unpromising solution path and trying another;
- extended computation: allocating longer reasoning traces to harder problems.

These behaviors emerge because they correlate with higher reward: a model that self-verifies is less likely to submit an incorrect answer. GRPO directly optimizes for this correlation without any explicit supervision of the reasoning process itself.
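To make the contrast with learned reward models concrete, a verifiable reward can be sketched as plain string checking. This is a minimal illustration with assumed details: the answer-extraction regex, the helper names, and the choice $\lambda = 0.1$ are placeholders of this sketch, and production graders normalize answers far more carefully (fractions, units, equivalent forms).

```python
# Sketch of a verifiable outcome + format reward (illustrative names;
# the regex and lam = 0.1 are assumptions of this sketch).
import re

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 iff the \\boxed{} answer string equals the ground truth."""
    m = BOXED.search(response)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(response: str) -> float:
    """1.0 iff the response contains exactly one \\boxed{} answer."""
    return 1.0 if len(BOXED.findall(response)) == 1 else 0.0

def total_reward(response: str, ground_truth: str, lam: float = 0.1) -> float:
    # r_total = r_accuracy + lambda * r_format -- no neural network involved.
    return accuracy_reward(response, ground_truth) + lam * format_reward(response)

r_correct = total_reward(r"Thus the answer is \boxed{42}.", "42")  # 1.0 + 0.1 = 1.1
r_wrong = total_reward(r"Thus the answer is \boxed{41}.", "42")    # 0.0 + 0.1 = 0.1
```

Because the grader is a fixed program rather than a trained model, the only way to increase reward is to actually produce the right answer in the right format.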
This is a remarkable illustration of Sutton’s “bitter lesson”: given sufficient search and a good reward signal, learned strategies outperform hand-engineered ones.

4. Discussion

4.1 The Role of the Reference Policy

The KL penalty $\beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ plays a subtle but important role. Without it, the policy may converge to degenerate solutions: outputting only a boxed answer without any reasoning trace, for example, or collapsing to a single high-reward format that fails to generalize. The reference policy provides a regularization anchor that preserves linguistic coherence and format diversity.

However, the appropriate value of $\beta$ is sensitive. Too large a penalty and the RL signal is dominated by the KL term, preventing meaningful policy improvement; too small and the policy drifts toward reward exploitation. DeepSeek reports using $\beta = 0.04$ in their experiments, with a scheduled warmup. This hyperparameter is likely to require careful tuning in new applications.

4.2 Limitations: Sparse Rewards and Non-Decomposable Correctness

GRPO’s dependence on verifiable rewards limits its direct applicability to domains with clear correctness criteria. Mathematical problems and code have this property; open-ended writing, summarization, and dialogue generally do not. Extensions to these domains would require either learned reward models (reintroducing reward hacking risk) or process-level verifiers (requiring intermediate annotations).

Even within mathematics, reward sparsity is a challenge. For highly difficult problems, the probability that any sampled response is correct may be near zero, yielding near-zero advantage estimates and negligible gradient signal. Curriculum strategies, which begin training with easier problems and increase difficulty as the policy improves, are essential for practical GRPO training.
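The sparse-reward failure mode can be read directly off the advantage formula: a group whose rewards are all identical has zero standard deviation, so every advantage is zero and the group contributes no gradient. The sketch below, with illustrative names, shows the kind of group filtering a curriculum-style sampler might apply; it is an assumption of this post, not a procedure from the GRPO paper.

```python
# Sketch of the sparse-reward failure mode (illustrative, not from the paper).
from statistics import pstdev

def has_gradient_signal(rewards, eps=1e-8):
    """A group only drives a GRPO update if its rewards actually differ:
    identical rewards give zero std, hence zero advantage for every member."""
    return pstdev(rewards) > eps

def filter_informative_groups(groups):
    # Drop all-wrong (too hard) and all-correct (already solved) groups.
    return [g for g in groups if has_gradient_signal(g)]

groups = [
    [0.0, 0.0, 0.0, 0.0],  # too hard: no correct sample in the group
    [1.0, 0.0, 1.0, 0.0],  # informative: mixed outcomes
    [1.0, 1.0, 1.0, 1.0],  # saturated: every sample already correct
]
informative = filter_informative_groups(groups)  # keeps only the mixed group
```

This is one way to see why curriculum ordering matters: training is most efficient when problem difficulty keeps groups in the mixed-outcome regime where the advantage estimator retains signal.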
4.3 Comparison with SFT on Reasoning Traces

A natural question is whether GRPO provides benefits beyond SFT on high-quality reasoning traces. Empirically, the answer appears to be yes, particularly under distribution shift: models trained with GRPO tend to generalize better to novel problem structures, possibly because RL training explores a broader distribution of solution paths. SFT, by contrast, may cause models to overfit to the specific reasoning styles present in the training data without developing the underlying problem-solving strategy.

However, the comparison is complicated by the fact that effective GRPO training typically requires an SFT-initialized model. Starting from a pretrained-only model, the initial policy rarely produces correct solutions, making the reward signal too sparse to drive learning. SFT and GRPO are therefore complementary: SFT establishes a starting distribution with reasonable format and some problem-solving capability; GRPO then fine-tunes this distribution toward correctness.

4.4 Scaling Properties

An important empirical finding from DeepSeek-R1 is that GRPO’s benefits scale with both model size and training compute. Larger models start from better initial policies (more problems fall in the nonzero-reward regime) and learn more efficiently. This creates a feedback loop: the benefits of RL training are largest precisely for models that are already strong, potentially widening the capability gap between frontier models.

There are also intriguing interactions with test-time compute. GRPO-trained models, by virtue of their extended reasoning behavior, effectively perform a form of implicit best-of-N sampling within a single generation. The effective compute budget at inference therefore scales with the length of the reasoning trace, and GRPO encourages the model to calibrate that length to problem difficulty.

5. Conclusion

Group Relative Policy Optimization represents a principled and practically impactful advance in training large language models through reinforcement learning.
By replacing the critic network with intra-group advantage normalization, GRPO dramatically reduces the engineering complexity of RL training while maintaining competitive gradient quality. Applied to verifiable reward signals from mathematical and code domains, GRPO produces models that exhibit emergent reasoning behaviors (self-verification, backtracking, extended computation) that were not explicitly supervised.

The theoretical picture that emerges is compelling: GRPO works because verifiable rewards provide an unbiased, non-hackable training signal, and group normalization provides a low-variance baseline that is free, accurate, and automatically calibrated to the current policy’s performance distribution. Together, these properties create a stable and efficient RL training regime.

Looking forward, the key open problems are: (1) extending GRPO to domains without verifiable rewards without reintroducing reward hacking risk; (2) addressing reward sparsity on hard problems through curriculum design and hierarchical reward shaping; and (3) understanding the theoretical guarantees of GRPO convergence in the non-tabular, high-dimensional setting of LLM token spaces. The success of DeepSeek-R1 and related systems suggests that reward-driven self-improvement may be one of the most promising paths toward reliably reasoning language models, and GRPO is currently the cleanest algorithm for realizing it.
References

Lightman, H., et al. (2023). Let’s Verify Step by Step. arXiv:2305.20050.

Ouyang, L., et al. (2022). Training Language Models to Follow Instructions with Human Feedback. arXiv:2203.02155.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300.

Skalse, J., et al. (2022). Defining and Characterizing Reward Hacking. arXiv:2209.13085.

Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.

Ziegler, D., et al. (2019). Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.