This is one of the strongest and most profound papers I’ve read recently. And no — it didn’t train a model that achieved state-of-the-art results across benchmarks, nor did it propose a new architecture or training method. What the authors set out to do is explain a phenomenon called grokking through the lens of compression: both compression of the data and compression of the model.
This topic is a bit dense, so I’ll try to explain it step-by-step in a simplified way.
What Is Grokking?
Grokking is a phenomenon that happens during neural network training when we continue training well past the point where validation loss hits its first minimum. At first, as expected, the model enters an overfitting phase and validation loss rises. But then something strange happens: at a certain point, validation loss begins to decrease again, suggesting the model is transitioning from memorization to generalization. In simple terms, the model finally "gets" the problem.
This typically occurs in overparameterized models, where the number of trainable parameters is much larger than what is theoretically needed to fit the dataset (there’s a more precise mathematical characterization, but it involves nontrivial complexity theory and isn’t needed for this overview). Grokking is related to the double descent phenomenon and to the lottery ticket hypothesis line of research. Interestingly, if you keep training, the validation loss doesn’t level off at some plateau; it keeps dropping and converges toward zero.
So What Does This Have to Do With Compression?
To explain this, we need to introduce two key concepts: The first is called the Minimum Description Length (MDL) principle. This principle says that if we want to optimally compress a dataset using a model, we need to minimize the sum of two terms:
The entropy of the dataset after being passed through the model, and
The complexity of the model itself.
This idea builds directly on Shannon’s coding theorem, which says that the lower the entropy of the data, the more effectively it can be compressed. Now, estimating the entropy of a dataset given a model is something we roughly know how to do. For classification tasks, for instance, you can simply use cross-entropy loss.
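To make the two MDL terms concrete, here is a minimal Python sketch (my own toy illustration, not code from the paper): the total description length is the bits needed to describe the model plus the bits needed to describe the data once the model is known, with cross-entropy playing the role of the data term.

import math

def mdl_cost(model_bits, avg_cross_entropy_nats, num_examples):
    """Two-part MDL code length in bits: L(model) + L(data | model)."""
    data_bits = num_examples * avg_cross_entropy_nats / math.log(2)  # nats -> bits
    return model_bits + data_bits

# A larger model that fits the data slightly better can still lose on total description length:
print(mdl_cost(model_bits=8e6, avg_cross_entropy_nats=0.40, num_examples=50_000))
print(mdl_cost(model_bits=1e6, avg_cross_entropy_nats=0.45, num_examples=50_000))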
But estimating the complexity of a model is harder. We first need to define something called Kolmogorov Complexity (KC). The KC of a data string d is the length of the shortest computer program that outputs d. For example, a line of repeated 1s has low KC (it’s easy to describe), while a random sequence of 0s and 1s has high KC (it takes nearly as much space to describe as the data itself). Importantly, KC is not computable in general.
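KC itself can’t be computed, but an off-the-shelf compressor gives a crude, computable upper bound, which is enough to see the intuition. A small Python illustration (my own, not from the paper):

import os
import zlib

def compressed_size(data: bytes) -> int:
    # Length of the zlib-compressed string: a rough, computable stand-in for KC.
    return len(zlib.compress(data))

print(compressed_size(b"1" * 1000))       # highly regular -> compresses to a few dozen bytes
print(compressed_size(os.urandom(1000)))  # random bytes -> stays near 1000, no real compression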
Enter Rate–Distortion Theory
Another important concept is the rate–distortion function r: given an input x and a distortion tolerance ε, it gives the minimum number of bits (or KC) required to describe an output y that is within ε distortion of x. Of course, the notion of “distortion” depends on the chosen distance metric. In the paper’s context, x and y are models (represented by their weights): x is a fully (regularly) trained model M, and y is a coarse-grained version of M, denoted CS.
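In pseudocode-ish Python, the quantity being minimized looks roughly like the sketch below (the helper names are mine, chosen for illustration; the paper works with formal Kolmogorov complexity rather than an explicit search over candidates):

def rate_distortion(full_model, candidates, description_length, distortion, eps):
    """Smallest description length of any coarse-grained candidate within eps distortion
    of the full model; returns infinity if no candidate is close enough."""
    feasible = [c for c in candidates if distortion(full_model, c) <= eps]
    return min((description_length(c) for c in feasible), default=float("inf"))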
A coarse-grained model is a simplified version of M obtained, for example, through quantization, pruning, or replacing weight matrices with low-rank approximations. Even a regularized model can be seen as coarse-grained relative to an unregularized one. The distance function d(x, y) the authors use to compute r is the difference in loss between the full model and the coarse-grained one.
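Here is a hedged sketch of one such coarse-graining, quantization, together with the loss-difference distortion; the toy linear-regression data and helper names are mine, not the paper’s setup:

import numpy as np

def quantize(w, num_levels=16):
    """Coarse-grain a weight vector by snapping every entry to a small grid of values."""
    lo, hi = w.min(), w.max()
    step = (hi - lo) / (num_levels - 1)
    return lo + step * np.round((w - lo) / step)

def loss_gap(loss_fn, w_full, w_coarse):
    """Distortion in the paper's sense: difference in loss between full and coarse models."""
    return abs(loss_fn(w_coarse) - loss_fn(w_full))

# Toy example: a random linear model on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = rng.normal(size=5)
mse = lambda w_: float(np.mean((X @ w_ - y) ** 2))
print(loss_gap(mse, w, quantize(w, num_levels=8)))  # fine grid -> small gap
print(loss_gap(mse, w, quantize(w, num_levels=2)))  # cruder grid -> usually a larger gap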
Back to Grokking
The paper’s main claim is that as training progresses, the model becomes increasingly compressible. That is: there exists a coarse-grained model CS that achieves nearly the same performance (within ε) as the original model M. And this happens precisely during the grokking phase. Moreover, at this point, the description length of the data (as compressed through the model) starts declining steadily - meaning the model is now encoding the dataset using a simpler, more compressible structure.
Why is this meaningful? Because the model is achieving low cross-entropy loss through compression: its rate–distortion value is dropping, which means it is finding simpler explanations for the same data.
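One way to picture the claim: at each training checkpoint, ask how aggressively you can quantize the weights while staying within ε of the original loss. A rough Python sketch, reusing the quantize and loss_gap helpers from above (this is my operationalization for intuition, not the paper’s exact procedure):

def compressibility(w, loss_fn, eps, levels_to_try=(2, 4, 8, 16, 32, 64, 128)):
    """Smallest number of quantization levels whose loss stays within eps of the full model.
    Fewer levels means a more compressible model; the claim is that this kind of quantity
    drops once grokking sets in."""
    for levels in levels_to_try:
        if loss_gap(loss_fn, w, quantize(w, levels)) <= eps:
            return levels
    return None  # not compressible within eps at any of these grid sizes

# During training you would log compressibility(w_t, loss_fn, eps) at each checkpoint t
# and watch it fall as the model moves from memorization to generalization.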
Hope I managed to explain this paper clearly - it’s one of those rare cases where theory, information compression, and deep learning dynamics come together to shed light on something truly counterintuitive.
https://arxiv.org/abs/2412.09810