S1: Simple Test-Time Scaling: Is it possible to beat OpenAI o1 with 1K Samples?
Deep Learning Paper Review
A research paper from Stanford University titled "S1: Simple Test-Time Scaling" explores a new method for improving the reasoning abilities of large language models (LLMs) through test-time scaling.
The paper aims to create a straightforward approach to test-time scaling that rivals the performance of OpenAI's closed-source model, o1. The S1 approach relies on two key ingredients. First, the researchers curated a thousand-sample dataset called S1K. Second, they developed a novel test-time scaling method.
The creation of the S1K dataset involved several stages. Initially, the researchers collected 59,000 samples from various sources, primarily in mathematical domains, drawing on datasets such as NuminaMATH, AIME, and OlympicArena, the last of which covers science problems in biology, chemistry, and physics. They also incorporated probability questions from Stanford exams and quantitative trading interview questions. Each sample consists of a question, a reasoning trace, and a solution, with the traces and solutions generated using Google's Gemini Flash Thinking API.
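For illustration, here is a minimal sketch of how such a sample pool might be assembled, assuming the google-generativeai client; the model name, the prompt handling, and the "Final Answer:" split are placeholders rather than the paper's actual pipeline:

```python
from dataclasses import dataclass

import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key="YOUR_API_KEY")
# Placeholder model name; the paper only says Google's Gemini Flash Thinking API.
generator = genai.GenerativeModel("gemini-2.0-flash-thinking-exp")

@dataclass
class Sample:
    question: str         # the problem statement
    reasoning_trace: str  # step-by-step thinking generated by Gemini
    solution: str         # the final answer

def generate_sample(question: str) -> Sample:
    """Solve a question with Gemini and keep the reasoning next to the answer."""
    text = generator.generate_content(question).text
    # Placeholder split: assume the response ends with a "Final Answer:" line.
    trace, _, answer = text.rpartition("Final Answer:")
    return Sample(question, trace.strip(), answer.strip())
```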
The initial dataset underwent a three-stage filtering process. The first stage focused on quality, removing samples with formatting issues and reducing the set to 54,000 samples. The second, more interesting stage filtered by difficulty. Each question was attempted by two LLMs (Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct), while a third model (Claude 3.5 Sonnet) graded the correctness of their responses. Only questions that both LLMs answered incorrectly were retained, ensuring that the selected samples were genuinely challenging. This stage reduced the pool to approximately 25,000 samples.
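A sketch of what this difficulty filter could look like; `ask` and `graded_correct` are hypothetical helpers standing in for the actual model calls:

```python
def difficulty_filter(samples, ask, graded_correct):
    """Keep only questions that BOTH reference models answer incorrectly.

    `ask(model_name, question)` queries an LLM and returns its answer;
    `graded_correct(question, answer)` has Claude 3.5 Sonnet judge whether
    the answer matches the reference solution. Both are hypothetical helpers.
    """
    hard = []
    for sample in samples:
        attempts = [ask("Qwen2.5-7B-Instruct", sample.question),
                    ask("Qwen2.5-32B-Instruct", sample.question)]
        # A question either model can already solve is too easy -- drop it.
        if not any(graded_correct(sample.question, a) for a in attempts):
            hard.append(sample)
    return hard
```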
The third stage concentrated on diversity. The samples were clustered into topic categories using an LLM, spanning many areas of mathematics as well as other scientific domains. The researchers then drew a comparable number of samples from each domain, prioritizing those with longer reasoning traces, since these were considered more complex. This process culminated in the final S1K dataset: 1,000 samples, each with a question, answer, and reasoning trace.
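The sampling procedure might look roughly like the following sketch, where `classify` is a hypothetical stand-in for the LLM-based topic labeling:

```python
import random
from collections import defaultdict

def diversity_sample(samples, classify, k=1000, seed=0):
    """Pick k samples spread across domains, favoring longer reasoning traces."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for s in samples:
        by_domain[classify(s.question)].append(s)

    picked = []
    while len(picked) < k and by_domain:
        domain = rng.choice(list(by_domain))  # uniform over remaining domains
        pool = by_domain[domain]
        # Weight candidates by trace length so samples with longer (presumably
        # harder) reasoning are chosen more often.
        weights = [len(s.reasoning_trace) for s in pool]
        choice = rng.choices(pool, weights=weights, k=1)[0]
        picked.append(choice)
        pool.remove(choice)
        if not pool:
            del by_domain[domain]
    return picked
```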
The S1 model was then created by supervised fine-tuning of the Qwen2.5-32B-Instruct model on the S1K dataset. This approach aligns with the "less is more for alignment" hypothesis: a model's core knowledge resides in its pre-training, and a small, high-quality dataset is enough to activate and refine its reasoning abilities.
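A minimal fine-tuning sketch, assuming Hugging Face's trl library; the text template, the thinking delimiters, and the hyperparameters are illustrative placeholders, not the paper's exact recipe:

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

s1k = []  # the 1,000 Sample records from the sketches above

def to_text(s):
    # Placeholder template: concatenate question, trace, and answer.
    return (f"Question: {s.question}\n"
            f"<|think|>{s.reasoning_trace}<|/think|>\n"
            f"Final Answer: {s.solution}")

train_set = Dataset.from_dict({"text": [to_text(s) for s in s1k]})

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-32B-Instruct",  # base model fine-tuned in the paper
    train_dataset=train_set,
    args=SFTConfig(output_dir="s1-sft",
                   num_train_epochs=3,             # placeholder values
                   learning_rate=1e-5,
                   per_device_train_batch_size=1,
                   bf16=True),
)
trainer.train()
```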
The second key ingredient of the S1 approach is the novel test-time scaling method, which the researchers call "budget forcing." It controls how much reasoning the model performs at test time by setting limits on the number of "thinking tokens" the model may generate. To enforce a maximum, the model is forced to exit early: an end-of-thinking token delimiter is appended, followed by the phrase "Final Answer:". To enforce a minimum, whenever the model is about to emit the end-of-thinking token, that token is suppressed and the word "Wait" is appended to the reasoning trace instead, encouraging the model to reflect further before answering.
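Budget forcing can be sketched as a plain greedy decoding loop. The "</think>" delimiter, the single-token stop check, and the model name below are simplifying assumptions; the paper's actual implementation may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "Qwen/Qwen2.5-32B-Instruct"  # stand-in for the fine-tuned S1 model
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModelForCausalLM.from_pretrained(NAME)

END = tok.encode("</think>", add_special_tokens=False)   # placeholder delimiter
WAIT = tok.encode("Wait", add_special_tokens=False)
FINAL = tok.encode("</think>\nFinal Answer:", add_special_tokens=False)

@torch.no_grad()
def budget_force(prompt: str, min_think: int = 0, max_think: int = 2048):
    ids = tok(prompt, return_tensors="pt").input_ids
    spent = 0
    while spent < max_think:
        next_id = model(ids).logits[0, -1].argmax().item()  # greedy decoding
        if next_id == END[0]:  # simplification: check only the first token
            if spent >= min_think:
                break  # the model may stop thinking now
            # Below the minimum budget: suppress end-of-thinking and append
            # "Wait" so the model keeps reflecting.
            ids = torch.cat([ids, torch.tensor([WAIT])], dim=1)
            spent += len(WAIT)
            continue
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
        spent += 1
    # Budget exhausted (or model stopped): force the transition to answering.
    ids = torch.cat([ids, torch.tensor([FINAL])], dim=1)
    return tok.decode(ids[0])  # the answer itself is generated from here on
```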
The results demonstrate the S1 model's effectiveness. It achieves impressive accuracy on the MATH500 benchmark while being significantly more sample-efficient than competing models. For instance, R1-Distill, which uses the same base model, achieves better results but requires 800 times more training samples. Budget forcing also shows a clear scaling trend: as test-time compute (thinking time) increases, so does accuracy. However, suppressing the end-of-thinking token too many times can push the model into repetitive loops rather than genuinely continued reasoning, and the context window of the underlying LLM caps how far thinking time can be extended.
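Measuring this trend amounts to sweeping the minimum thinking budget and recording accuracy, roughly as in this sketch; `solve` (wrapping budget forcing plus answer decoding) and `grade` are hypothetical helpers:

```python
def scaling_curve(benchmark, solve, grade, budgets=(500, 1000, 2000, 4000)):
    """Accuracy as a function of the minimum thinking budget.

    `benchmark` is a list of (question, reference_answer) pairs.
    """
    curve = {}
    for budget in budgets:
        correct = sum(grade(solve(q, min_think=budget), ref)
                      for q, ref in benchmark)
        curve[budget] = correct / len(benchmark)
    return curve
```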
All in all, data quality rules the world!