Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
Mike’s Daily Paper: 07.07.25
LLMs continue to astound us with their problem-solving abilities, yet a nagging question persists: do they genuinely "understand," or are they simply sophisticated parrots of their training data? The reviewed paper offers a compelling new perspective: rather than relying on a traditional train-test split, it investigates which pretraining data LLMs draw on when they reason. The result is a fresh look at the generalization strategies LLMs employ, particularly for reasoning tasks.
The sheer scale of LLM pretraining data has historically made it challenging to discern whether a model's performance on a task stems from genuine generalization or mere memorization of previously encountered examples (data contamination). The paper tackles this with influence functions, a technique from robust statistics, to estimate which specific pretraining documents affect a model's output for a given query. The approach is notable for tracing behavior back to the pretraining data rather than only interpreting model weights, which provides a distinct lens on the learning process.
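To make the influence-function idea concrete, here is a minimal sketch on a one-parameter toy model. It uses the classical estimate (minus the query gradient, times the inverse Hessian of the training loss, times the training-example gradient). At LLM scale the Hessian cannot be inverted exactly, so the paper necessarily works with an approximation; the function names and data below are hypothetical and are not the paper's estimator.

```python
# Toy influence-function sketch for a one-parameter model y ≈ theta * x.
# Everything here (names, data) is illustrative, not the paper's method.

def fit_theta(train):
    """Closed-form least-squares fit of theta for y ≈ theta * x."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / sxx

def grad(theta, point):
    """Gradient of the loss 0.5 * (theta * x - y)**2 with respect to theta."""
    x, y = point
    return (theta * x - y) * x

def hessian(train):
    """Second derivative of the mean training loss (a scalar for this model)."""
    return sum(x * x for x, _ in train) / len(train)

def influence_on_query_loss(train, query):
    """Estimate how upweighting each training point changes the query loss.

    A negative score means upweighting that point is predicted to lower
    the query loss; a positive score means it is predicted to raise it.
    """
    theta = fit_theta(train)
    h_inv = 1.0 / hessian(train)
    g_query = grad(theta, query)
    return [-g_query * h_inv * grad(theta, z) for z in train]

train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.0)]
query = (5.0, 10.0)  # this query is better served by a larger theta
scores = influence_on_query_loss(train, query)
for point, score in zip(train, scores):
    print(point, round(score, 4))
```

In this toy, the point (3.0, 6.2) pulls theta upward and gets a negative (helpful) score, while (4.0, 7.0) pulls it downward and gets a positive one. Ranking pretraining documents by such scores is the basic move behind the analysis discussed here.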
Here's a breakdown of the key novel findings that set this paper apart:
Procedural Knowledge, Not Just Facts, Drives Reasoning: The most significant revelation is that for reasoning tasks (specifically, mathematical problems like arithmetic, calculating slopes, and solving linear equations), the influence of pretraining documents is highly correlated across different queries within the same task. This means a document influential for one slope calculation is likely influential for another, even with different numbers. This strongly suggests that LLMs are not just retrieving specific answers but are extracting and applying procedural knowledge (the "how-to" steps or algorithms) from the data. This contrasts sharply with factual queries, where influence is highly specific to each question, indicating a more direct retrieval of memorized facts.
Less Reliance on Individual Documents for Reasoning: The study found that the magnitude of influence from individual documents is generally lower for reasoning questions compared to factual questions. Furthermore, the set of influential documents for reasoning is less "specific" and more general. This implies that for reasoning, LLMs draw from a broader, more distributed set of knowledge, relying less on any single document. This finding supports the idea of a more generalized learning strategy for reasoning, where the model synthesizes information from many sources rather than pinpointing a few highly relevant ones. The effect is even more pronounced in larger models (35B vs. 7B), suggesting greater data efficiency in generalization.
Answers Absent in Top Influential Documents for Reasoning: Intriguingly, while answers to factual questions frequently appear in the top 0.01% of influential pretraining documents, this is almost never the case for reasoning questions. Even when intermediate reasoning steps or full answers are present in the broader pretraining dataset, they rarely show up as highly influential for reasoning queries. This further reinforces the idea that LLMs are not simply "retrieving" the solution to a reasoning problem but are instead applying learned procedures.
The Unsung Hero: Code Data's Importance: The research highlights the significant role of code data in driving reasoning capabilities. Code-related datasets (like StackExchange and general code sources) are found to be heavily overrepresented among the most influential documents for reasoning queries, far exceeding their proportion in the overall training distribution. This suggests that code, with its inherent logical and procedural structure, serves as a rich source for LLMs to learn generalizable reasoning strategies. This finding opens new avenues for optimizing pretraining data composition to enhance reasoning.
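Three of the measurements behind the findings above can be sketched on toy data: the correlation of per-document influence scores between two queries, a check for whether an answer appears among the top-ranked documents, and the overrepresentation of a data source in the top-ranked set. The helper names, documents, and scores below are all hypothetical and chosen only to illustrate the pattern; they are not results from the paper, and the plain substring match is a stand-in for whatever answer search the authors actually performed.

```python
# Illustrative helpers on made-up influence scores (not data from the paper).
from collections import Counter
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def top_k(scores, k):
    """Indices of the k documents with the highest influence scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def answer_in_top_k(docs, scores, answer, k):
    """True if the answer string occurs in any of the top-k documents."""
    return any(answer in docs[i] for i in top_k(scores, k))

def overrepresentation(sources, scores, k):
    """Each source's share of the top-k divided by its share of all documents."""
    overall = Counter(sources)
    top = Counter(sources[i] for i in top_k(scores, k))
    n = len(sources)
    return {s: (top[s] / k) / (overall[s] / n) for s in overall}

docs = [
    "def slope(p, q): return (q[1] - p[1]) / (q[0] - p[0])",
    "Top ten hiking trails",
    "Recipe for lentil soup",
    "rise = y2 - y1; run = x2 - x1; slope = rise / run",
    "Paris is the capital of France.",
    "Weather report for Tuesday",
    "The line through (0, 0) and (2, 5) has slope 2.5",
    "Local sports results",
]
sources = ["code", "web", "web", "code", "web", "web", "web", "web"]

# Made-up influence scores of each document for three queries.
slope_q1 = [0.9, 0.1, 0.2, 0.8, 0.0, 0.1, 0.3, 0.2]    # slope query, answer "2.5"
slope_q2 = [0.8, 0.2, 0.1, 0.9, 0.1, 0.0, 0.2, 0.3]    # another slope query
fact_q   = [0.0, 0.1, 0.0, 0.1, 0.95, 0.0, 0.1, 0.0]   # factual query, answer "Paris"

same_task_r = pearson(slope_q1, slope_q2)    # high: shared procedural documents
cross_task_r = pearson(slope_q1, fact_q)     # low: unrelated documents
fact_hit = answer_in_top_k(docs, fact_q, "Paris", k=1)
reasoning_hit = answer_in_top_k(docs, slope_q1, "2.5", k=2)
ratios = overrepresentation(sources, slope_q1, k=2)
print(same_task_r, cross_task_r, fact_hit, reasoning_hit, ratios)
```

The toy scores are constructed so that the two slope queries share their top documents (both code), the factual answer sits in the single most influential document, and the reasoning answer exists in the corpus but outside the top-ranked set, mirroring the qualitative pattern the post describes.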
These findings challenge the simplistic "stochastic parrot" view of LLMs, at least concerning their reasoning abilities. Instead of merely regurgitating information, the models appear to learn abstract procedures and apply them to novel problems. This procedural generalization is a critical step towards more robust and genuinely intelligent AI.
The implications for future LLM development are profound:
Data Curation Focus: Instead of trying to cover every possible instance of a problem, pretraining strategies could focus on high-quality data that explicitly demonstrates procedures and problem-solving methodologies across diverse reasoning tasks.
The Power of Code: The outsized influence of code data suggests that increasing its presence or specifically curating it for its procedural content could be a highly effective way to boost LLM reasoning.
Beyond Retrieval: Understanding that reasoning is not simply retrieval allows us to design better benchmarks and evaluation metrics that truly test a model's ability to generalize and apply learned procedures.
While the study acknowledges limitations (e.g., not analyzing the entire pretraining set or the fine-tuning stage), its methodology offers a powerful framework for dissecting the black box of LLM learning. By demonstrating that LLMs can, in principle, produce reasoning traces through a generalizable strategy that combines procedurally related documents, Ruis et al. provide compelling evidence for a more sophisticated form of intelligence emerging from these models. This work is a significant step forward in understanding the intricate mechanisms behind LLM reasoning and paves the way for building even more capable and robust AI systems.
https://arxiv.org/abs/2411.12580


