DeepSeek: Are we at the brink of a reasoning training revolution for LLMs?
A short summary of an excellent post by Nathan Lambert: https://substack.com/@natolambert/p-155269286
DeepSeek AI: Pioneering Open-Weight Reasoning Models
On January 20, DeepSeek AI made a significant announcement by releasing its first fully open-source reasoning model, marking a turning point in reasoning model research. This release includes:
R1: The Flagship Model
R1 is a reasoning language model trained through a four-stage process emphasizing reinforcement learning (RL). It is open-source, allowing researchers and companies to build upon it.
R1-Zero: A Foundation for R1
R1-Zero is a purely RL-trained reasoning model based on DeepSeek’s V3 base model. It generates training data for R1 and provides a baseline for reinforcement learning research.
Fine-Tuned Models
DeepSeek also released a suite of smaller, fine-tuned models using supervised data derived from R1. These models provide practical options for researchers who may not need the full capability of R1.
The models are available on DeepSeek’s platform, chat.deepseek.com, and through their app. DeepSeek-V3, a Mixture-of-Experts (MoE) LM with 671B total parameters, of which 37B are activated for each token, was introduced last December and serves as the base model for training the R1 models.
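To put the MoE figure in perspective, the arithmetic below (a plain calculation on the numbers quoted above, not DeepSeek code) shows what fraction of the network is active for any given token:

```python
# Back-of-the-envelope: fraction of DeepSeek-V3's parameters active per token.
total_params = 671e9   # total parameters (671B) reported for DeepSeek-V3
active_params = 37e9   # parameters activated per token (37B)

active_fraction = active_params / total_params
print(f"Active per token: {active_fraction:.1%} of all parameters")
# -> roughly 5.5%; this sparsity is what keeps inference cost far below a dense 671B model
```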
R1: A Fresh Approach for Training Reasoning Models
The release of R1 marks an important milestone in reasoning model research. Unlike pre-training and post-training, which were shaped by landmark works such as GPT-2 and InstructGPT, reasoning training has long lacked a defining framework. DeepSeek’s R1 bridges this gap, establishing a concrete methodology that promises to accelerate advancements in reasoning models throughout 2025 and beyond.
One of the most striking aspects of this development is the cost structure of reasoning models. DeepSeek’s R1 is priced at a fraction of the cost of comparable models like OpenAI’s o1, with R1 offering a rate of $0.55 per million input tokens compared to o1’s $15. This pricing shift lowers the barrier to entry, making advanced reasoning models more accessible to a wider audience.
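Taking the quoted input-token prices at face value, the gap is straightforward to quantify; the workload size below is an arbitrary example, not a benchmark:

```python
# Rough input-token cost comparison at the list prices quoted above (USD per 1M input tokens).
r1_price_per_m = 0.55   # DeepSeek R1
o1_price_per_m = 15.00  # OpenAI o1

tokens = 100e6  # hypothetical workload of 100M input tokens
print(f"R1 cost: ${r1_price_per_m * tokens / 1e6:,.2f}")
print(f"o1 cost: ${o1_price_per_m * tokens / 1e6:,.2f}")
print(f"Price ratio (o1 / R1): {o1_price_per_m / r1_price_per_m:.0f}x")
# -> R1 comes out roughly 27x cheaper per input token at these prices
```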
Equally significant is the release of R1 under a permissive open-source license, marking a return to community-driven innovation in AI. For the first time since Stable Diffusion, a groundbreaking AI model with immense potential has been openly shared, fostering broader adoption, experimentation, and the potential for rapid advancements across the field.
The R1 Training: A 4-Stage Approach
DeepSeek’s R1 was developed using a carefully designed four-stage training process:
Supervised Fine-Tuning (SFT) on R1-Zero Data
R1’s journey begins with supervised fine-tuning on synthetic reasoning data generated by R1-Zero. This so-called “cold start” stage helps establish a foundation for reasoning abilities while addressing the quirks of R1-Zero, such as language switching during reasoning tasks.
Large-Scale Reinforcement Learning
The second stage involves extensive RL training focused on reasoning problems. This stage iteratively improves the model’s accuracy and reasoning depth by rewarding correct answers and penalizing errors.
Rejection Sampling for General Capabilities
The third stage introduces rejection sampling, a technique that filters and selects high-quality responses to fine-tune the model further. This stage incorporates a mix of reasoning and general chat data, broadening the model’s capabilities.
Final RL Fine-Tuning for General Use
The final stage refines the model’s helpfulness, harmlessness, and reasoning capabilities through a mix of verifiable reasoning prompts and standard RLHF (Reinforcement Learning from Human Feedback) prompts. This step ensures the model is user-friendly while maintaining its core reasoning strengths. A simplified sketch of the full pipeline follows below.
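The code below is a deliberately simplified sketch of that four-stage recipe. Every function is a hypothetical placeholder (DeepSeek’s training code is not public), so treat it as a reading aid for the stages above rather than an implementation:

```python
# Illustrative sketch of the four-stage R1 recipe described above.
# All helpers are hypothetical placeholders, not DeepSeek's actual training API.

def sft(model, dataset):
    """Supervised fine-tuning: fit the model to (prompt, response) pairs."""
    return f"{model} + sft({len(dataset)} examples)"

def rl_train(model, prompts, reward_fn):
    """Reinforcement learning: optimize the model against a reward function."""
    return f"{model} + rl({len(prompts)} prompts)"

def rejection_sample(model, prompts, keep_fn):
    """Generate candidate responses and keep only those passing a quality filter."""
    return [p for p in prompts if keep_fn(p)]

def reasoning_reward(sample):   # stage-2 style accuracy/format rewards
    return 1.0

def preference_reward(sample):  # stage-4 style helpfulness/harmlessness reward
    return 1.0

# Stage 0: start from the V3 base model and R1-Zero's synthetic reasoning traces.
base_model = "deepseek-v3-base"
cold_start_data = ["<curated reasoning trace from R1-Zero>"]

# Stage 1: "cold start" supervised fine-tuning on the R1-Zero data.
model = sft(base_model, cold_start_data)

# Stage 2: large-scale RL on verifiable reasoning problems.
reasoning_prompts = ["<math or code problem>"]
model = rl_train(model, reasoning_prompts, reasoning_reward)

# Stage 3: rejection sampling to build a broader SFT mix (reasoning + general chat).
mixed_prompts = reasoning_prompts + ["<general chat prompt>"]
curated = rejection_sample(model, mixed_prompts, keep_fn=lambda s: True)
model = sft(model, curated)

# Stage 4: final RL pass mixing verifiable reasoning rewards with RLHF-style preferences.
model = rl_train(model, mixed_prompts, preference_reward)
print(model)
```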
Insights and Open Questions from R1’s Training
DeepSeek’s report offers valuable insights into the nuances of RL training for reasoning models. Some highlights include:
Scaling RL Training:
The extensive RL training process, measured in thousands of steps (i.e., model updates), demonstrates the scalability of reinforcement learning for reasoning tasks. However, the optimal balance between training data and reward signals remains an open question.
Reward Structure:
DeepSeek employed a mix of rewards to train R1, including accuracy rewards, format rewards, and language consistency rewards. These rewards ensured the model adhered to specific formatting and language requirements while prioritizing correct answers; a toy sketch of such a combined reward appears after this list.
Base Model Selection:
The importance of a strong base model with long-context capabilities was emphasized. While R1-Zero was instrumental in initializing R1, its limitations (e.g., incoherent text that jumps between languages) highlighted the need for robust base models in RL training.
Rejection Sampling and Distillation:
DeepSeek’s use of rejection sampling to introduce general capabilities is a promising approach, but the details of the underlying datasets and reward models remain sparse. The distillation of reasoning traces from R1 into smaller models also raises questions about the scalability of reasoning capabilities in smaller architectures.
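As a rough illustration of the reward mix described under “Reward Structure”, here is a toy combined reward function. The weights, the <think>-tag format check, and the language heuristic are assumptions made for this example, not details confirmed by DeepSeek’s report:

```python
import re

def accuracy_reward(answer: str, reference: str) -> float:
    """1.0 if the final answer matches the reference exactly (works for verifiable tasks)."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def format_reward(completion: str) -> float:
    """Reward well-formed output; here we assume reasoning must sit inside <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def language_consistency_reward(completion: str) -> float:
    """Crude proxy for language consistency when the target language is English:
    the fraction of ASCII characters, penalizing mid-reasoning language switches."""
    if not completion:
        return 0.0
    return sum(ch.isascii() for ch in completion) / len(completion)

def combined_reward(completion: str, answer: str, reference: str,
                    w_acc: float = 1.0, w_fmt: float = 0.2, w_lang: float = 0.2) -> float:
    """Weighted sum of the three signals; the weights are illustrative guesses."""
    return (w_acc * accuracy_reward(answer, reference)
            + w_fmt * format_reward(completion)
            + w_lang * language_consistency_reward(completion))

# Example: a completion with proper tags and a correct final answer scores 1.0 + 0.2 + 0.2 = 1.4.
sample = "<think>2 + 2 equals 4 because ...</think> The answer is 4."
print(combined_reward(sample, answer="4", reference="4"))
```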
Implications for the Future of Reasoning Models
DeepSeek’s work underscores the growing importance of reasoning models in AI research. As the field matures, several key trends and questions emerge:
Open-Source Collaboration:
The release of R1 with open weights and detailed documentation sets a precedent for transparency and collaboration in reasoning research. This openness is expected to accelerate advancements in the field.
Economics of AI Models:
The pricing of R1 highlights the potential for affordable reasoning models to democratize access to advanced AI capabilities. However, it also raises questions about the sustainability of such pricing in the long term.
Applications of Reasoning Models:
While R1 excels in code and math tasks, its broader applications remain uncertain. Future research will likely explore its utility in diverse domains, from scientific discovery to creative writing.
Scaling Laws and Model Size:
The relationship between model size and reasoning capabilities remains a critical area of inquiry. While larger models tend to exhibit better reasoning abilities, the possibility of achieving similar results with smaller, fine-tuned models is an exciting avenue for exploration.
Conclusion: A New Era for Reasoning AI
The release of DeepSeek R1 marks a pivotal moment in AI research, providing a roadmap for reasoning model development and highlighting the potential of open-source innovation. By combining advanced RL techniques with transparency and affordability, DeepSeek has set a new standard for the industry. As researchers and developers build on this foundation, the coming years promise unprecedented progress in reasoning AI, unlocking new possibilities for applications across disciplines.