Fine-Tuning Methods for LLMs (SFT and RL): Explanations, Objectives, and Gradients
Based on Appendix A in the GRPO paper: https://arxiv.org/pdf/2402.03300
Sometimes a scientific paper appendix is just an appendix - but sometimes it is really important. The GRPO paper (DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models) belongs to the latter case.
In recent years, fine-tuning LLMs via reinforcement learning (RL) and reward modeling has become the cornerstone of aligning models with human preferences and external rules. While discussions often focus on algorithmic innovation or empirical performance, this post takes a closer look at the gradient structure of each method: the data source, the reward signal, and the gradient coefficient (the term that multiplies the gradient of the policy log-probability) across various alignment strategies. We begin with the simplest baseline and build up to increasingly RL-heavy methods.
Supervised Fine-tuning (SFT)
Supervised Fine-tuning is the canonical pre-RL phase. Its objective is to maximize the log-likelihood of reference completions conditioned on prompts:
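In the notation of the paper's appendix, with D_SFT the supervised dataset of (q, o) pairs and |o| the number of tokens in the output, this reads roughly as

$$
\mathcal{J}_{SFT}(\theta) = \mathbb{E}_{(q,\,o)\sim \mathcal{D}_{SFT}}\left[\frac{1}{|o|}\sum_{t=1}^{|o|}\log \pi_\theta(o_t \mid q, o_{<t})\right]
$$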
Its gradient is simply:
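Differentiating the token-averaged log-likelihood term by term gives

$$
\nabla_\theta \mathcal{J}_{SFT}(\theta) = \mathbb{E}_{(q,\,o)\sim \mathcal{D}_{SFT}}\left[\frac{1}{|o|}\sum_{t=1}^{|o|}\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})\right]
$$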
Here q and o denote a question drawn from the supervised dataset and its paired reference output, respectively.
Data Source: Human-annotated pairs from the supervised dataset D_{SFT}.
Reward Function: Implicit—defined by reference completions.
Gradient Coefficient: Always 1 (i.e., uniform weighting).
Rejection Sampling Fine-tuning (RFT)
RFT introduces a basic but powerful feedback loop as a first step toward the reinforcement learning paradigm. The key idea is to sample multiple responses from the SFT model, evaluate each against a set of correctness criteria, and then fine-tune the model only on the responses that meet those criteria. This selective process ensures that the model learns only from high-quality, verified outputs, keeping noise from incorrect completions out of the training updates.
Workflow:
Sampling: Multiple candidate responses are generated from the SFT model for each query.
Selection: The sampled completions are then assessed based on predefined criteria, such as factual accuracy, coherence, or other task-specific measures.
Fine-Tuning: Only the responses that satisfy the correctness criteria are used to fine-tune the model, enhancing its performance on relevant tasks.
RFT is simpler than full reinforcement learning setups: it rewards correctness directly rather than continuously exploring new policies. This makes it a lightweight, sample-and-filter method while ensuring high-quality training updates; a minimal sketch of the loop is given below.
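As a concrete illustration of the sample-filter-finetune loop, here is a minimal, self-contained Python sketch. The helpers sample_completion, is_correct, and collect_rft_data are hypothetical stand-ins (not a real API): in practice the first would decode from the frozen SFT model and the second would run the rule-based correctness check.

```python
import random
from typing import List, Tuple

# Hypothetical stand-in for decoding from the frozen SFT model.
def sample_completion(question: str) -> str:
    return random.choice(["The answer is 4", "The answer is 5"])

# Hypothetical rule-based correctness check, I(o): keep completions whose
# final answer matches the reference answer.
def is_correct(completion: str, reference_answer: str) -> bool:
    return completion.strip().endswith(reference_answer)

def collect_rft_data(
    questions: List[Tuple[str, str]],  # (question, reference_answer) pairs
    num_samples: int = 8,
) -> List[Tuple[str, str]]:
    """Sample num_samples completions per question and keep only the correct ones."""
    kept = []
    for question, reference_answer in questions:
        for _ in range(num_samples):
            completion = sample_completion(question)
            if is_correct(completion, reference_answer):  # I(o) == 1
                kept.append((question, completion))
    return kept

if __name__ == "__main__":
    data = collect_rft_data([("What is 2 + 2?", "4")])
    # `data` would then feed a standard SFT update (cross-entropy on the kept pairs).
    print(f"kept {len(data)} correct completions")
```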
Objective:
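With π_sft the frozen SFT policy and I(o) the binary correctness indicator, the objective has roughly the form

$$
\mathcal{J}_{RFT}(\theta) = \mathbb{E}_{q\sim \mathcal{D}_{SFT},\ o\sim \pi_{sft}(O\mid q)}\left[\frac{1}{|o|}\sum_{t=1}^{|o|} I(o)\,\log \pi_\theta(o_t \mid q, o_{<t})\right]
$$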
Gradient:
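The gradient has the same structure as SFT, but every token of a completion is gated by the correctness indicator:

$$
\nabla_\theta \mathcal{J}_{RFT}(\theta) = \mathbb{E}_{q\sim \mathcal{D}_{SFT},\ o\sim \pi_{sft}(O\mid q)}\left[\frac{1}{|o|}\sum_{t=1}^{|o|} I(o)\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})\right]
$$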
where:
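I(o) is the rule-based reward: it equals 1 if the completion o passes the correctness check (for example, its final answer matches the reference) and 0 otherwise.

$$
I(o) = \begin{cases} 1 & \text{if } o \text{ is judged correct} \\ 0 & \text{otherwise} \end{cases}
$$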
Data Source: Prompts from SFT dataset; completions sampled from the SFT model.
Reward Function: Binary filter based on correctness.
Gradient Coefficient: I(o)
RFT is the first step where the reward becomes explicit and selective training is introduced.
Online Rejection Sampling Fine-tuning (Online RFT)
Online Rejection Sampling Fine-Tuning (Online RFT) extends the idea of Rejection Sampling Fine-Tuning (RFT) but introduces a key modification: instead of using a frozen SFT policy, Online RFT dynamically samples outputs from the current policy π_θ during training. This change introduces on-policy sampling, creating a tighter, more adaptive feedback loop where the model continuously learns from its own outputs.
Workflow:
Sampling: Just like RFT, multiple candidate responses are generated for each query. However, instead of relying on a frozen SFT model, Online RFT uses the current policy π_θ to generate responses in real time.
Selection: The sampled responses are evaluated using a set of correctness criteria, such as human-defined rules or domain-specific validation checks.
Fine-Tuning: Only the responses that meet the correctness criteria are used to fine-tune the policy, creating a direct, real-time learning loop based on the policy's current performance.
Gradient:
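The only change from RFT is the sampling distribution, which is now the current policy π_θ itself:

$$
\nabla_\theta \mathcal{J}_{OnRFT}(\theta) = \mathbb{E}_{q\sim \mathcal{D}_{SFT},\ o\sim \pi_\theta(O\mid q)}\left[\frac{1}{|o|}\sum_{t=1}^{|o|} I(o)\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})\right]
$$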
Data Source: Prompts from SFT, outputs sampled from the current model.
Reward Function: Binary rule-based selection.
Gradient Coefficient: I(o)
This is often treated as a stepping stone toward full RL training (e.g., PPO), yet it already embodies the essential structure: on-policy sampling + binary reward shaping. The gradient coefficient modulates the learning signal. In SFT, it's always one—each token in each completion gets equal treatment. But starting from RFT, we see this uniformity replaced by selective reinforcement based on outcome quality. This selective reinforcement is what enables alignment beyond imitation.
Online RFT represents a pivotal turning point: the model learns to self-correct via real-time sampling, even before full-scale reward modeling or policy optimization.
Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) presents a novel approach in reinforcement learning by bypassing the need for an explicit reward model. Traditionally, reinforcement learning relies on a predefined reward function to guide learning. However, in DPO, the model directly learns from human-labeled comparisons between two responses: o+ (preferred) and o− (rejected). Both responses are sampled from the Supervised Fine-Tuning (SFT) model, but their relative preference is determined by humans or a set of domain-specific rules.
How Does DPO Work?
Sampling Responses: For each query q, two candidate responses are sampled from the SFT model; one will be labeled preferred (o+) and the other rejected (o−). These responses reflect the model's current understanding of the task, and the goal is to learn which one better satisfies the query.
Human-Labeled Comparisons: The core of DPO lies in human labeling of these two responses. A human (or a set of rules) compares the responses, labeling one as preferred and the other as rejected. This preference-based labeling eliminates the need for a traditional reward model, which is often complex and time-consuming to define.
Optimization Objective: The model's objective in DPO is to maximize the likelihood of the preferred response while minimizing the likelihood of the rejected response. This is done by optimizing a log-likelihood loss based on the relative preference between the two sampled responses.
The optimization objective is formulated as a binary logistic regression:
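Following the length-normalized form used in the paper's appendix (the original DPO formulation omits the 1/|o| factors), the objective is roughly

$$
\mathcal{J}_{DPO}(\theta) = \mathbb{E}_{q\sim \mathcal{D}_{SFT},\ o^+,\,o^-\sim \pi_{sft}(O\mid q)}\left[\log \sigma\!\left(\beta\,\frac{1}{|o^+|}\sum_{t=1}^{|o^+|}\log \frac{\pi_\theta(o^+_t \mid q, o^+_{<t})}{\pi_{ref}(o^+_t \mid q, o^+_{<t})} \;-\; \beta\,\frac{1}{|o^-|}\sum_{t=1}^{|o^-|}\log \frac{\pi_\theta(o^-_t \mid q, o^-_{<t})}{\pi_{ref}(o^-_t \mid q, o^-_{<t})}\right)\right]
$$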
This expression computes a per-token log-ratio between the current policy and a reference model (typically SFT), applying a sigmoid over the aggregate difference in log-likelihood ratios.
Gradient Flow: The gradient splits over positive and negative samples:
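Writing GC_DPO for the scalar coefficient shared by every token of the pair (defined just below), differentiating the objective gives a positive-sample term and a negative-sample term:

$$
\nabla_\theta \mathcal{J}_{DPO}(\theta) = \mathbb{E}\left[\frac{1}{|o^+|}\sum_{t=1}^{|o^+|} GC_{DPO}\,\nabla_\theta \log \pi_\theta(o^+_t \mid q, o^+_{<t}) \;-\; \frac{1}{|o^-|}\sum_{t=1}^{|o^-|} GC_{DPO}\,\nabla_\theta \log \pi_\theta(o^-_t \mid q, o^-_{<t})\right]
$$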
Gradient Coefficient:
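Applying the chain rule to the log-sigmoid objective, the coefficient is the sigmoid of the negated argument above; the leading factor β also comes from the chain rule and is constant across tokens.

$$
GC_{DPO}(q, o^+, o^-) = \beta\,\sigma\!\left(\beta\,\frac{1}{|o^-|}\sum_{t=1}^{|o^-|}\log \frac{\pi_\theta(o^-_t \mid q, o^-_{<t})}{\pi_{ref}(o^-_t \mid q, o^-_{<t})} \;-\; \beta\,\frac{1}{|o^+|}\sum_{t=1}^{|o^+|}\log \frac{\pi_\theta(o^+_t \mid q, o^+_{<t})}{\pi_{ref}(o^+_t \mid q, o^+_{<t})}\right)
$$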
In essence, DPO aligns the current model to favor the preferred sample while penalizing the dispreferred one, guided by a relative likelihood ratio against a static reference.
Proximal Policy Optimization (PPO)
PPO, the workhorse of RLHF, operates by estimating advantages A_t and bounding policy updates through a clipped surrogate objective (the regularization term penalizing divergence from the reference policy π_ref, e.g., the SFT model, is dropped here for simplicity):
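Here ε is the clipping range and π_old is the frozen policy that generated the rollouts.

$$
\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{q\sim \mathcal{D}_{SFT},\ o\sim \pi_{old}(O\mid q)}\left[\frac{1}{|o|}\sum_{t=1}^{|o|}\min\!\left(\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{old}(o_t \mid q, o_{<t})}\,A_t,\ \operatorname{clip}\!\left(\frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{old}(o_t \mid q, o_{<t})},\,1-\varepsilon,\,1+\varepsilon\right)A_t\right)\right]
$$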
where
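A_t is the per-token advantage estimate, computed (as described below) with Generalized Advantage Estimation from a learned value function V_ψ. The symbols γ, λ, δ_t, and s_t are the usual GAE quantities, i.e. the discount factor, the GAE parameter, the TD residual, and the state (the prompt plus the tokens generated so far); they are not otherwise used in this post.

$$
A_t = \sum_{l\ge 0} (\gamma\lambda)^l\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma\,V_\psi(s_{t+1}) - V_\psi(s_t)
$$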
Under the simplifying assumption of a single policy update per rollout stage (in practice, PPO performs T > 1 updates on each batch of rollouts), we have π_old = π_θ, so the importance ratio equals one and the clipping becomes inactive.
Gradient:
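In the unified notation used above,

$$
\nabla_\theta \mathcal{J}_{PPO}(\theta) = \mathbb{E}_{q\sim \mathcal{D}_{SFT},\ o\sim \pi_\theta(O\mid q)}\left[\frac{1}{|o|}\sum_{t=1}^{|o|} A_t\,\nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})\right]
$$

so the gradient coefficient for PPO is the advantage A_t itself.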
Reward Model: The policy relies on a learned reward model to compute A_t via Generalized Advantage Estimation (GAE), incorporating a value function V_ψ and downstream rewards r_t.
Group Relative Policy Optimization (GRPO)
Group Relative Policy Optimization (GRPO) extends Proximal Policy Optimization (PPO) by generating and scoring multiple trajectories for a single query. It incorporates relative-entropy (KL) regularization, which ensures the current policy does not deviate significantly from a reference model during training. This regularization helps maintain stability and prevents the policy from drifting too far from the reference, even when learning from many trajectories at once.
GRPO averages the objective across the trajectories in a group, which stabilizes policy updates by reducing variance. Like PPO, it weights each update by an advantage, but the advantage is computed relative to the other responses in the group rather than from a learned value function. Considering multiple outputs for each input also makes efficient use of the data, and the approach is particularly useful when reward signals are noisy and policy stability is essential.
The GRPO objective is:
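Omitting the clipping for readability (the same single-update simplification as in the PPO section), the objective is roughly

$$
\mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q\sim \mathcal{D}_{SFT},\ \{o_i\}_{i=1}^{G}\sim \pi_{old}(O\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left(\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{old}(o_{i,t} \mid q, o_{i,<t})}\,\hat{A}_{i,t} \;-\; \beta\,\mathbb{D}_{KL}\!\left[\pi_\theta \,\Vert\, \pi_{ref}\right]\right)\right]
$$

with the per-token KL estimator

$$
\mathbb{D}_{KL}\!\left[\pi_\theta \,\Vert\, \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} \;-\; \log\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} \;-\; 1
$$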
This KL-regularized objective encourages updates that improve returns while keeping the updated policy close to the reference, implicitly imposing an information-theoretic trust region. The regularization term (weighted by β) is a low-variance estimator of the KL divergence between π_θ and π_ref. Note that the inner summation runs over the tokens of each output o_i, while the outer one runs over the G responses in the group.
Gradient of GRPO:
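With π_old = π_θ at sampling time, the gradient takes the unified policy-gradient form:

$$
\nabla_\theta \mathcal{J}_{GRPO}(\theta) = \mathbb{E}_{q\sim \mathcal{D}_{SFT},\ \{o_i\}_{i=1}^{G}\sim \pi_\theta(O\mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} GC_{GRPO}(q, o_i, t)\,\nabla_\theta \log \pi_\theta(o_{i,t} \mid q, o_{i,<t})\right]
$$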
Gradient Coefficient:
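Differentiating both the advantage term and the KL estimator yields

$$
GC_{GRPO}(q, o_i, t) = \hat{A}_{i,t} + \beta\left(\frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1\right)
$$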
where the advantage Â_{i,t} is computed from the group's reward scores rather than from a learned value function, which lets GRPO retain stability while comparing multiple trajectories in each update.
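For outcome (answer-level) rewards, the paper standardizes each response's reward r_i against the rest of its group:

$$
\hat{A}_{i,t} = \frac{r_i - \operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}
$$

so a response is reinforced only to the extent that it beats its own group's average.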
Summary
This post examines the gradient structures of various LLM alignment strategies, from Supervised Fine-tuning (SFT) through Reinforcement Learning (RL) methods such as Rejection Sampling Fine-tuning (RFT), Direct Preference Optimization (DPO), Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO).
It analyzes how each technique utilizes different data sources (human annotations, model samples) and reward signals (implicit, binary correctness, preferences, learned rewards). The core comparison highlights how the gradient coefficient evolves from uniform weighting in SFT to selective, reward-based weighting in RL approaches, enabling alignment beyond simple imitation.
This selective reinforcement, guided by advantages or preference likelihoods, is key to optimizing models based on desired outcomes rather than just mimicking reference data.