Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
Deep Learning Paper Review, Mike's Daily Paper - 24.01.25
Short Paper Summary:
The paper revisits the use of Reinforcement Learning from Human Feedback (RLHF) for optimizing Large Language Models (LLMs). It challenges the dominance of Proximal Policy Optimization (PPO) as the de facto reinforcement learning (RL) method in this context, highlighting its computational inefficiency and unnecessary complexity. Instead, the authors propose returning to simpler REINFORCE-style methods, specifically Vanilla Policy Gradient (REINFORCE) and its multi-sample extension, REINFORCE Leave-One-Out (RLOO). These methods are shown to outperform PPO in terms of computational cost, sample efficiency, and reward optimization across multiple datasets and LLM architectures. The findings emphasize that aligning LLMs with human preferences can be achieved with more straightforward optimization strategies tailored to the specifics of RLHF.
Important points:
Theoretical Simplification:
The authors demonstrate that many components of PPO (e.g., clipping, value functions, and token-level modeling) are unnecessary for RLHF, given the stable initialization (a.k.a. warm start) of pre-trained LLMs. By modeling the entire generated sequence as a single action, REINFORCE avoids the complexity of token-level state-value functions, making the problem akin to a contextual bandit.
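A minimal sketch of what the sequence-as-single-action view looks like in practice, assuming PyTorch and a completion already sampled from the policy (the function name, tensor shapes, and toy values are illustrative, not from the paper):

```python
import torch

def reinforce_loss(token_logprobs: torch.Tensor, reward: float, baseline: float = 0.0) -> torch.Tensor:
    """REINFORCE loss with the whole completion treated as one action.

    token_logprobs: log pi(y_t | x, y_<t) for each generated token, shape (T,).
    Summing them gives log pi(y | x); the (reward - baseline) factor is a
    constant w.r.t. the policy parameters, so no value network is needed.
    """
    seq_logprob = token_logprobs.sum()   # log-probability of the full sequence
    advantage = reward - baseline        # scalar score from the reward model, minus a baseline
    return -advantage * seq_logprob      # minimizing this ascends the expected reward

# Toy usage: 5 generated tokens over a 10-token vocabulary.
logits = torch.randn(5, 10, requires_grad=True)
token_logprobs = torch.log_softmax(logits, dim=-1)[torch.arange(5), 0]
reinforce_loss(token_logprobs, reward=0.7, baseline=0.4).backward()
```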
Practical Efficiency:
RLOO uses all generated samples for baseline construction, achieving higher sample efficiency than RAFT, which discards all but the top-ranked sample. This leads to significant computational savings and better utilization of available data. The approach also simplifies RLHF pipelines by reducing reliance on sensitive hyperparameters such as the clipping ratio and those governing advantage estimation.
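A sketch of the leave-one-out baseline that makes this possible, assuming k completions sampled for the same prompt and PyTorch (names and values are illustrative):

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages for k samples of the same prompt.

    Each sample is scored against the mean reward of the other k-1 samples,
    so every generation serves both as a learning signal and as a baseline.
    rewards: reward-model scores, shape (k,).
    """
    k = rewards.shape[0]
    loo_baseline = (rewards.sum() - rewards) / (k - 1)  # mean of the other k-1 rewards
    return rewards - loo_baseline

print(rloo_advantages(torch.tensor([0.9, 0.2, 0.5, 0.4])))
```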
Robustness:
RLOO demonstrates robustness to noisy reward signals and higher KL penalties, outperforming methods like RAFT that are more sensitive to them.
Theoretical Insights:
Variance-Bias Tradeoff and Unbiased Gradient Estimation:
PPO relies on state-value functions and Generalized Advantage Estimation (GAE) to reduce variance in gradient estimation at the cost of introducing bias. The paper argues that in RLHF, the strong initialization (warm start) of LLMs makes variance reduction less critical, allowing unbiased estimators like REINFORCE to perform well despite their nominally higher variance. Empirically, the paper shows that REINFORCE achieves better reward optimization than PPO, even under conditions that are, in theory, high-variance.
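For reference, the estimator in question is the standard REINFORCE policy gradient with a baseline b; as long as b does not depend on the sampled completion y, the estimator remains unbiased:

\nabla_\theta J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[\left(R(y, x) - b\right)\,\nabla_\theta \log \pi_\theta(y \mid x)\right]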
Full-Trajectory Modeling vs. Token-Level Modeling:
PPO models each token as an action, creating a Markov Decision Process (MDP) where partial sequences are states. However, RLHF attributes rewards only to entire sequences, rendering intermediate states irrelevant. By modeling the entire sequence as a single action, REINFORCE simplifies the problem into a bandit-like setup, aligning directly with the reward structure. Empirical results confirm that this approach outperforms token-level modeling in both efficiency and performance.
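Nothing is lost by this collapse because of the chain-rule factorization of the sequence probability: the log-probability of the full completion is simply the sum of the per-token log-probabilities, so the "single action" gradient is computed from exactly the same quantities the token-level MDP uses:

\log \pi_\theta(y \mid x) = \sum_{t=1}^{T} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right)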
Clipping and Stability of Policy Updates:
PPO uses a clipping mechanism to prevent large policy updates that could destabilize learning. The authors show that this is unnecessary for RLHF, as the optimization landscape is stable due to the strong initialization of pre-trained LLMs. Removing clipping in PPO or avoiding it altogether with REINFORCE results in better performance, indicating that RLHF does not require this level of stabilization.
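For reference, this is the standard PPO clipped surrogate being stripped away, with probability ratio r_t, advantage estimate \hat{A}_t, and clipping hyperparameter \epsilon:

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}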
Tradeoff Between Variance Reduction and Bias Introduction:
PPO’s advantage estimator trades off variance and bias, controlled by the hyperparameter λ. Higher λ values (closer to 1) reduce bias but increase variance. The authors demonstrate that in RLHF, higher λ values consistently lead to better rewards, supporting the use of unbiased estimators like REINFORCE.
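The estimator in question is Generalized Advantage Estimation: λ = 1 recovers the unbiased (higher-variance) Monte-Carlo return, while λ < 1 bootstraps on the learned value function and introduces bias:

\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)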
Robustness of Multi-Sample Baseline Estimators:
RLOO uses multiple generated samples to construct baselines for variance reduction. This approach maintains the unbiased nature of REINFORCE while significantly improving sample efficiency and robustness to noise. Unlike RAFT, which discards lower-ranked samples, RLOO leverages all generated samples, leading to consistent improvements across datasets and conditions.
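Putting the pieces together, the RLOO gradient estimate for k completions y_1, …, y_k sampled for the same prompt x is:

\frac{1}{k} \sum_{i=1}^{k} \left( R(y_i, x) - \frac{1}{k-1} \sum_{j \neq i} R(y_j, x) \right) \nabla_\theta \log \pi_\theta(y_i \mid x)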
KL Regularization and Alignment:
KL divergence penalizes deviations from the reference policy, helping ensure alignment. The authors show that RLOO handles high KL penalties better than RAFT, which struggles due to its reliance on top-ranked samples. RLOO achieves a balance between alignment (reward optimization) and diversity, avoiding excessive overfitting to the reward model.
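In practice this penalty is folded into the sequence-level reward that the policy-gradient methods optimize; a minimal sketch, assuming summed token log-probabilities from the current policy and the frozen reference model are available (the function name and the β value are illustrative):

```python
def kl_shaped_reward(rm_score: float, logp_policy: float, logp_ref: float, beta: float = 0.05) -> float:
    """Sequence-level reward with a KL-style penalty toward the reference (SFT) policy.

    logp_policy / logp_ref: summed token log-probs of the sampled completion under the
    current policy and the frozen reference model. A larger beta keeps the policy closer
    to the reference at the cost of lower raw reward-model scores.
    """
    return rm_score - beta * (logp_policy - logp_ref)

print(kl_shaped_reward(rm_score=0.8, logp_policy=-42.0, logp_ref=-40.0))
```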
Simplification of RLHF Objectives:
By eliminating components like clipping, token-level modeling, and GAE, the REINFORCE framework reduces the number of hyperparameters, simplifying the RLHF pipeline. This makes it more accessible to non-RL specialists while maintaining or improving performance.
Limitations and Future Directions:
Reward Over-Optimization:
The study does not address reward model over-optimization, where the policy exploits biases in the reward function at the expense of generalization. This remains an open challenge for RLHF.
Human Evaluation:
While simulated win-rates using GPT-4 serve as a proxy for human preferences, direct human evaluations would provide stronger evidence for the alignment quality.
Scalability:
The scalability of REINFORCE and RLOO to larger models and diverse datasets, particularly under resource constraints, warrants further investigation.
Conclusion:
The paper presents a compelling argument for revisiting REINFORCE-style methods in RLHF, challenging the dominance of PPO. By leveraging the specific characteristics of RLHF—such as stable pre-trained initialization and sequence-level rewards—the authors demonstrate that simpler methods like REINFORCE and RLOO can outperform more complex alternatives like PPO and RAFT in terms of reward optimization, sample efficiency, and robustness.
https://arxiv.org/abs/2402.14740