Finetuning is a Memory Wipe. This Is How You Stop It
Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection, Mike’s Daily Paper: 19.08.25
The 1% Rule for Curing AI Amnesia: A Deep Dive
If you've ever finetuned a powerful language model, you know the painful tradeoff. You specialize it for a new task, and in the process it develops a form of amnesia, forgetting the general knowledge that was so expensive to acquire. This "catastrophic forgetting" is a fundamental challenge. A common remedy is to mix in a small amount of the original pretraining data during finetuning, but this has always felt more like a folk remedy than a science.
A paper, "Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection," elevates this trick into a predictable science. The authors go far beyond just saying "data injection helps." They present a precise, predictive model that describes the complex dance between model size, the amount of finetuning data, and the percentage of injected data. While the headline is that a mere 1% injection can halt forgetting, the paper's true novelty is the underlying mathematical framework that explains it all.
The Novelty: A Predictive Model for Forgetting
The core innovation is a new scaling law designed to predict the final pretraining loss, which is a direct proxy for how much the model has forgotten. Instead of a simple formula, think of it as a relationship between competing forces.
The model is elegantly structured. It starts with a baseline: the model's initial pretraining loss before finetuning even begins. It then adds a second term that calculates the magnitude of the forgetting that will occur. This forgetting term is a fraction, with factors that worsen forgetting in the numerator and factors that prevent it in the denominator.
What makes forgetting worse? In the numerator, we find a term representing the amount of unique finetuning data. This reveals a fascinating insight: the more you finetune a model on new data, the more it forgets its old knowledge. This is because more training steps cause the model's parameters to drift further away from their original, generalist state.
What fights forgetting? In the denominator, we find the mitigating factors. The first is the model's size (its parameter count). This confirms the intuition that larger models have more capacity to learn new information without overwriting existing knowledge.
The "Magic" Ingredient: Here’s the most clever part of the model. The injection of pretraining data is modeled as a powerful multiplier on the model's effective size. When the model sees even a small percentage of pretraining data, it behaves as if it has a much larger parameter count for the purpose of remembering its original training. A special coefficient, which the paper calls "Parameter Relative Efficiency" (B), determines just how potent this effect is. For a domain that is very different from the pretraining data (like mathematics), this efficiency coefficient is enormous, signifying that injection is critically important. For a similar domain (like Wikipedia), the coefficient is much smaller, as the model is less prone to forgetting in the first place.
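To make that structure concrete, here is a minimal sketch of the law's shape in code. The functional form, the coefficient names, and every default value below are illustrative placeholders rather than the paper's fitted constants; the only things taken from the description above are the additive baseline, a numerator that grows with the amount of unique finetuning data, and a denominator in which injection multiplies the model's effective parameter count.

```python
def predicted_pretraining_loss(
    L0: float,           # baseline pretraining loss before finetuning begins
    N: float,            # model parameter count
    D_ft: float,         # amount of unique finetuning data (e.g. tokens)
    p: float,            # injected pretraining-data fraction (0.01 for 1%)
    A: float = 1.0,      # placeholder scale of the forgetting term
    B: float = 100.0,    # placeholder "parameter relative efficiency" of injection
    alpha: float = 0.3,  # placeholder exponent on finetuning data
    beta: float = 0.3,   # placeholder exponent on effective model size
) -> float:
    """Schematic forgetting law: baseline loss plus a forgetting fraction.

    More unique finetuning data (D_ft) worsens forgetting; a larger model (N)
    and data injection (p, amplified by B) fight it by inflating the model's
    effective size in the denominator.
    """
    forgetting = (A * D_ft ** alpha) / ((N * (1.0 + B * p)) ** beta)
    return L0 + forgetting
```

On this toy form, even a tiny p paired with a large B shrinks the forgetting term noticeably, which is the qualitative behavior the paper reports for distant domains like mathematics.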
This model isn't just theoretical; it's incredibly accurate. Across 12 different domains, it predicts the final pretraining loss with a mean error of just 0.49%.
Key Insights Beyond the 1% Rule
This powerful model for forgetting yields several other novel and highly practical insights.
1. Finetuning Performance is Safe
A natural fear is that mixing in old data will hurt the model's performance on the new task. The authors show this is not the case. The final validation loss on the finetuning task is barely affected by injecting a small amount of pretraining data. In fact, for smaller models, the injection acts as a healthy regularizer, preventing overfitting and sometimes leading to even better performance on the target domain.
2. Extrapolation is a Superpower
The true value of a scaling law is predicting the future. The authors confirm that their model is excellent for extrapolation. By running cheap experiments on smaller models (e.g., a 334M parameter model), they could accurately predict the forgetting and finetuning performance of much larger, more expensive models (1.3B+ parameters). This allows labs to forecast the results of a 7-hour run on 8 GPUs using a 30-minute experiment on 4 GPUs, saving immense amounts of time and energy.
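As a rough illustration of how that workflow could look in practice, the sketch below fits the placeholder form from earlier to a handful of hypothetical small-model measurements and then queries it at a larger parameter count. The data points, the use of SciPy's curve_fit, and the initial guesses are all assumptions made for illustration, not the paper's actual fitting procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

def forgetting_law(x, L0, A, B, alpha, beta):
    """Same schematic form as above; x packs (N, D_ft, p)."""
    N, D_ft, p = x
    return L0 + (A * D_ft ** alpha) / ((N * (1.0 + B * p)) ** beta)

# Hypothetical measurements from cheap small-model runs:
# (parameter count, unique finetuning tokens, injection fraction) -> pretraining loss
N    = np.array([125e6, 125e6, 334e6, 334e6, 334e6, 760e6])
D_ft = np.array([1e9,   4e9,   1e9,   4e9,   8e9,   4e9])
p    = np.array([0.00,  0.01,  0.00,  0.01,  0.01,  0.01])
L_pt = np.array([3.25,  3.12,  3.15,  3.02,  3.05,  2.95])  # made-up losses

popt, _ = curve_fit(forgetting_law, (N, D_ft, p), L_pt,
                    p0=[2.8, 1.0, 50.0, 0.3, 0.3], maxfev=20_000)

# Extrapolate to a larger, more expensive model: 1.3B parameters, 1% injection
print(forgetting_law((1.3e9, 4e9, 0.01), *popt))
```

With enough real measurements, this fit-then-extrapolate loop is what lets a 30-minute small-model experiment stand in for a multi-hour large-model run.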
3. You Don't Need the Whole Haystack
Practically speaking, does this technique require streaming from a petabyte-scale pretraining dataset? The answer is no. A key experiment shows that a surprisingly small pool of unique pretraining tokens is sufficient for the injection to be effective. This makes the method far more accessible and easier to implement than one might assume.
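For intuition, here is a minimal sketch, under assumed details, of what the data-loading side might look like: a finetuning stream in which roughly 1% of batches are drawn, with replacement, from a small fixed pool of pretraining sequences. The pool size, the batch abstraction, and the exact mixing rule are illustrative choices, not the paper's implementation.

```python
import random

def mixed_batches(finetune_batches, pretrain_pool, inject_rate=0.01, seed=0):
    """Yield finetuning batches, occasionally swapping in a batch sampled
    (with replacement) from a small, fixed pool of pretraining data."""
    rng = random.Random(seed)
    for batch in finetune_batches:
        if rng.random() < inject_rate:
            yield rng.choice(pretrain_pool)  # the small pool is recycled throughout
        else:
            yield batch

# Example: a pool of only 50 pretraining batches serves an entire finetuning run
finetune_stream = (f"ft_batch_{i}" for i in range(1_000))
pretrain_pool = [f"pt_batch_{j}" for j in range(50)]
for batch in mixed_batches(finetune_stream, pretrain_pool):
    ...  # feed the batch to the training step
```

Whether injected batches replace or supplement finetuning batches is a design choice; the sketch replaces them so the total token budget stays fixed.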