Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t
Yaniv’s and Mike’s Daily Deep Learning Paper: 03.06.2025
Joint work with Yaniv Hassidoff
Large language models (LLMs) like OpenAI's o1 have revolutionized reasoning tasks—from solving complex math problems to writing elegant code—thanks to massive computing resources and vast datasets used during reinforcement learning training. But can smaller, budget-friendly models achieve similar feats? A recent paper titled "Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn’t" by Quy-Anh Dang and Chris Ngo explores exactly this question.
Why This Matters
Big models can reason well but demand huge computational resources, making them costly and impractical for widespread use. Smaller models (around 1 to 2 billion parameters) are cheaper and easier to deploy, but currently lag behind in complex reasoning performance. Dang and Ngo’s goal is ambitious yet practical: boost the reasoning performance of small models, with minimal resources and minimal cost.
The Approach: Group Relative Policy Optimization (GRPO)
The researchers chose a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, and fine-tuned it using GRPO, the current go-to algorithm for reasoning fine-tuning. Previously proven effective in massive models, GRPO is applied here at a much smaller scale.
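For intuition, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO (the function name and toy reward are illustrative, not the authors' code): for each prompt, the policy samples a group of completions, and each completion's advantage is its reward standardized against the group's mean and standard deviation, so no separate value network is needed.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each completion's reward
    against the mean/std of its own group (no value network needed)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: one prompt, a group of 4 sampled completions,
# rewarded 1.0 if the final answer is correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
# -> [ 1. -1. -1.  1.]  correct completions get positive advantage, wrong ones negative

# Each token of completion i is then reinforced with that advantage inside
# a PPO-style clipped objective plus a KL penalty to the reference model.
```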
To keep costs low, the training was tightly constrained in terms of resources: it used only 4 NVIDIA A40 GPUs, was limited to a 24-hour time window, and relied on a modest dataset of just 7,000 carefully selected math questions.
Surprising Results
Despite these extreme constraints, the performance improvements were remarkable:
AMC23 benchmark accuracy jumped from 63% to 80%. AIME24 benchmark scores reached 46.7%, notably beating OpenAI’s powerful o1-preview model (44.6%).
Perhaps most impressively, the entire training run cost only about $42, several orders of magnitude cheaper than typical state-of-the-art methods. Amazing how far reasoning has come in such a short time.
What Exactly Did They Do?
They conducted three insightful experiments:
Experiment 1: Trained with challenging, high-quality math problems. The model rapidly improved, but then quickly degraded due to unstable optimization and language drift.
Experiment 2: Mixed easier problems with difficult ones, achieving higher initial stability and impressive peak performance, though still eventually spiralling towards instability.
Experiment 3: Used a cosine reward to encourage shorter answers, achieving improved stability and impressive performance (a sketch of such a reward follows below). This echoes findings in the “Dr. GRPO” paper, uploaded to arXiv just a few days earlier, which diagnoses a bias toward longer answers in GRPO.
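As a rough illustration of the idea behind a cosine length reward (the constants and function below are my own reading of the general technique, not the paper's exact code): the reward interpolates along a cosine curve between a "short" value and a "long" value of the generation length, so correct answers earn more when they are concise, while incorrect answers are penalized less harshly as they grow longer.

```python
import math

def cosine_length_reward(is_correct, gen_len, max_len=3584,
                         r_correct_short=2.0, r_correct_long=1.0,
                         r_wrong_short=-10.0, r_wrong_long=0.0):
    """Illustrative cosine length-scaled reward: interpolate between a
    'short' and a 'long' reward value along a cosine of the length."""
    progress = min(gen_len, max_len) / max_len   # 0 (short) .. 1 (long)
    cos_term = math.cos(progress * math.pi)      # 1 (short) .. -1 (long)
    if is_correct:
        r_short, r_long = r_correct_short, r_correct_long
    else:
        r_short, r_long = r_wrong_short, r_wrong_long
    return r_long + 0.5 * (r_short - r_long) * (1.0 + cos_term)

print(cosine_length_reward(True, 200))    # short correct answer: ~2.0
print(cosine_length_reward(True, 3500))   # long correct answer: ~1.0
```

The net effect is the same pressure the authors describe: the model is nudged away from ever-growing chains of thought, which in turn keeps optimization more stable.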
Limitations and Open Questions
Why Their Model Works So Well: One of the paper’s most surprising findings is how a small model, trained briefly on limited data, achieves such strong reasoning performance. The authors offer little intuition or analysis to explain why this works so well, which leaves open questions about how robust these gains are, particularly on tasks beyond the narrow math benchmarks tested.
Language Drift: The multilingual base model eventually produces non-English outputs, causing training instability for all variants.
Domain specificity: The evaluation was limited strictly to mathematical reasoning, leaving open whether this approach can be applied to broader reasoning tasks like science or coding.
What’s Next?
This study demonstrates that powerful reasoning doesn't have to be expensive. The surprising success of such a small model suggests future work could explore the benefit of GRPO variants like Dr. GRPO in resource-constrained settings, as well as broader domains, evaluating performance across diverse reasoning tasks to see whether small models can become universally capable. Hyperparameter tuning also seems like a necessary next step, as a higher KL-divergence penalty could improve training stability.
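To make that last knob concrete, here is a rough sketch (names and coefficient values are illustrative, not taken from the paper) of where the KL penalty sits in a GRPO-style loss: the clipped surrogate term is offset by a KL term to the frozen reference model, weighted by a coefficient beta; raising beta pulls the policy back toward the reference, trading some peak reward for stability.

```python
import torch

def grpo_step_loss(logp_new, logp_old, logp_ref, advantages,
                   clip_eps=0.2, beta=0.04):
    """Sketch of a per-token GRPO-style loss: PPO clipped surrogate plus
    a KL penalty to the reference policy, weighted by beta.
    All inputs are per-token tensors of log-probabilities / advantages."""
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    )
    # k3 estimator of KL(pi_new || pi_ref), commonly used in GRPO implementations
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - beta * kl).mean()
```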
Bottom Line
Dang and Ngo’s paper reveals a promising pathway toward affordable, reasoning-capable language models. While the "why" remains elusive, the practical implications are profound: democratizing powerful AI reasoning to smaller research labs and organizations everywhere.
Explore their code; it holds significant promise for enabling reasoning within your specific domain with minimal investment.