Do All Tokens Need the Same Amount of "Thinking"? Mixture-of-Recursions Says No.
Mike's Daily Paper: 02.08.25 - Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Background: The Efficiency Squeeze
Let's start with a truth we all know in the AI world: making language models bigger unlocks incredible power, but it comes at a huge cost. The compute and memory needed to train and serve these models mean they are mostly confined to a few giant data centers. This has sparked a broad search for more efficient model designs.
So far, this search has followed two main paths. The first is parameter efficiency: getting more performance out of fewer weights. A common trick here is parameter sharing, where the same set of weights is reused in different parts of the model. The second is adaptive computation, where the model spends extra compute only on the parts of the input that are genuinely hard and lets the simpler parts take a cheaper route.
While both ideas work well on their own, a single model that does both at once has been missing. Recursive Transformers, which apply a shared stack of layers over and over, seemed like a natural starting point because of their built-in parameter sharing. However, most of them apply a fixed number of recursion steps to every token, so they cannot really adapt to the input.
The Big Idea: Mixture-of-Recursions (MoR)
This is where the reviewed paper, "Mixture-of-Recursions" (MoR), makes its mark. It introduces a unified framework that combines both kinds of efficiency in one simple design.
At its core, MoR is a Recursive Transformer: it applies a shared "recursion block" (a stack of layers) multiple times to process text, which keeps the parameter count low. The real novelty is how it decides how many times to apply that block. Instead of a fixed number for all tokens, MoR uses lightweight routers that decide on the fly how many recursion steps each individual token needs.
Think of it this way: for a simple token like the word "the," the router might decide that one pass through the block is enough. But for a more meaningful or complex token like "defensively," the router might send it through the block three times, giving it more "thinking" time. This is where the parameter savings and the compute savings meet.
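To make the routing idea concrete, here is a minimal, hypothetical sketch of a per-token depth router. The class name, sizes, and the hard argmax are illustrative assumptions, not the paper's implementation (which trains its routers end-to-end with soft scores); the point is only the shape of the decision: one recursion depth per token.

```python
# A minimal, hypothetical sketch of per-token depth routing, assuming a tiny
# linear scoring head and a hard argmax over candidate depths.
import torch
import torch.nn as nn

class DepthRouter(nn.Module):
    def __init__(self, d_model: int, max_recursions: int = 3):
        super().__init__()
        # One logit per candidate depth (1..max_recursions).
        self.scorer = nn.Linear(d_model, max_recursions)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) -> depths: (batch, seq_len), values in 1..max_recursions
        logits = self.scorer(hidden)
        return logits.argmax(dim=-1) + 1

router = DepthRouter(d_model=64)
h = torch.randn(2, 8, 64)   # a toy batch: 2 sequences of 8 tokens
print(router(h))            # e.g. "the" might get depth 1, "defensively" depth 3
```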
The Novelty: A Trifecta of Efficiency
The genius of MoR is that it doesn't just bolt two ideas together; each efficiency gain enables the next. The framework's novelty comes from three connected parts that work together:
Parameter Sharing via Recursion: The foundation of MoR is reusing a single block of parameters. This naturally cuts down on the number of unique weights the model needs, making the model itself smaller and lighter from the start.
Adaptive "Thinking" Depth via Routing: This is the main new idea in the design. By training a router to give each token its own number of recursion steps, MoR avoids the rigid, same-for-everything approach of older models. This isn't just a trick added later; it's a basic part of how the model is trained from scratch, allowing it to learn how to use its computer power wisely.
Smarter Memory Access: This is a powerful and direct result of the adaptive depth. In a normal Transformer, the Key-Value (KV) cache is a big memory problem during inference. With MoR, if a token leaves early after just one recursion, the model doesn't need to calculate or store its KV pairs for the deeper steps. This smart, on-the-fly caching reduces memory traffic and, most importantly, reduces the costly attention calculation to only the tokens that are still active at that depth.
This three-in-one package lets MoR tie weights to save parameters, route tokens to save computation, and selectively cache key-values to save memory traffic, all inside one model; a rough sketch of how the pieces fit together follows below.
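As an illustration of how the three mechanisms could interact at inference time, here is a hedged toy sketch: one shared block reused at every depth (weight tying), per-token depths from a router (adaptive computation), and per-depth state stored only for the tokens still active (selective caching). The stand-in block, function names, and cache format are assumptions made for clarity, not the paper's code.

```python
# A hedged toy sketch of weight tying + adaptive depth + selective caching.
import torch
import torch.nn as nn

def recursive_forward(shared_block: nn.Module,
                      hidden: torch.Tensor,    # (seq_len, d_model), single sequence for clarity
                      depths: torch.Tensor,    # (seq_len,) per-token recursion counts
                      max_recursions: int = 3):
    cache = {}  # depth -> states of the tokens that were active at that depth
    for r in range(1, max_recursions + 1):
        active = depths >= r                  # tokens that still need "thinking" at depth r
        if not active.any():
            break
        out = shared_block(hidden[active])    # the same weights are reused at every depth
        cache[r] = out.detach()               # stand-in for storing this depth's K/V pairs
        hidden = hidden.clone()
        hidden[active] = out                  # tokens that exited earlier keep their last state
    return hidden, cache

block = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
depths = torch.tensor([1, 3, 2, 1, 1, 3, 2, 1])
_, cache = recursive_forward(block, torch.randn(8, 64), depths)
print({r: tuple(c.shape) for r, c in cache.items()})  # fewer rows are stored at deeper steps
```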
The "How": A Glimpse Under the Hood
The paper explores different ways to build this, focusing on two main choices:
Routing Strategies: Deciding how to route tokens comes down to two options. With expert-choice routing, each recursion step acts as an "expert" that picks the top-k tokens allowed to continue. This keeps the compute per step fixed, but because the top-k selection compares tokens across the sequence, it can leak information from later positions during training (a causality issue). With token-choice routing, each token is assigned its full recursion path at the very start. This avoids the causality problem but can leave the work unbalanced, with some recursion depths getting too many tokens and others too few. The expert-choice selection step is sketched after these two points.
KV Caching Strategies: The authors also propose two ways to handle the KV cache. Recursion-wise caching stores KV pairs only for the tokens active at each recursion depth, which saves both memory and attention compute. The alternative, recursive KV sharing, computes and caches all KV pairs at the first recursion and reuses them at every deeper step; this cuts memory traffic further and can speed up prefill (the initial processing of a prompt), making it a good fit when memory is tight. Both options are sketched below.
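First, a minimal, hypothetical sketch of the expert-choice selection step: the recursion step scores every token and keeps only the top-k. The scoring head and the value of k are assumptions for illustration; in the paper the routers are trained jointly with the model.

```python
# A minimal, hypothetical sketch of expert-choice routing at one recursion step.
import torch
import torch.nn as nn

def expert_choice_select(hidden: torch.Tensor, scorer: nn.Linear, k: int) -> torch.Tensor:
    # hidden: (seq_len, d_model) -> boolean mask of the k tokens this step keeps
    scores = scorer(hidden).squeeze(-1)            # (seq_len,)
    keep = torch.zeros(hidden.size(0), dtype=torch.bool)
    keep[scores.topk(k).indices] = True
    return keep

scorer = nn.Linear(64, 1)
mask = expert_choice_select(torch.randn(8, 64), scorer, k=4)
print(mask)   # exactly 4 of the 8 tokens continue to the next recursion depth
```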
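Second, a hedged sketch contrasting the two caching options for a single attention layer's key/value projections. The projection names (w_k, w_v) and shapes are assumptions; the point is only where the K/V tensors come from at deeper recursion steps and how many rows end up stored.

```python
# A hedged sketch of recursion-wise caching vs. recursive KV sharing.
import torch
import torch.nn as nn

d_model = 64
w_k, w_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

def recursion_wise(hidden_at_depth: torch.Tensor, active: torch.Tensor):
    # Recursion-wise caching: compute K/V from the current depth's hidden
    # states, but only for the tokens still active at this depth.
    sub = hidden_at_depth[active]
    return w_k(sub), w_v(sub)

def recursive_sharing(kv_from_first_recursion):
    # Recursive KV sharing: reuse the K/V computed at the first recursion for
    # every deeper step, trading some fidelity for memory and prefill speed.
    return kv_from_first_recursion

h1 = torch.randn(8, d_model)                 # hidden states at the first recursion
shared_kv = recursive_sharing((w_k(h1), w_v(h1)))
active = torch.tensor([True, True, False, True, False, True, False, False])
k_rw, _ = recursion_wise(torch.randn(8, d_model), active)
print(k_rw.shape, shared_kv[0].shape)        # (4, 64) stored vs (8, 64) reused everywhere
```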
The Bottom Line: Pushing the Pareto Frontier
The experimental results are strong. Across model sizes from 135M to 1.7B parameters, MoR sets a new standard for efficiency.
At equal training compute, MoR models achieve better results (lower validation loss and higher few-shot accuracy) than both vanilla and earlier recursive baselines, despite having up to 50% fewer parameters. When trained on the same amount of data, MoR reaches better quality while using roughly 25% less training compute, and it also cuts training time and peak memory.
The design also scales well. As model size grows, MoR not only matches but often beats the much larger vanilla Transformer, all while using about a third of the unique parameters.
Why This Matters: A Conceptual Shift
Mixture-of-Recursions is more than just a clever trick. It's a new way of thinking about how models are built and how they work. It treats model "depth" not as a fixed number, but as a flexible resource to be used as needed for each token.
This framework recasts a model's "thinking" process as a form of latent reasoning, where the amount of "thinking" depends on how hard each token is. By combining parameter sharing with adaptive computation, MoR offers a powerful, scalable way to get the performance of much larger models without their huge costs. It's a complete package that points to a future of smarter, more efficient, and more accessible language models.
https://arxiv.org/abs/2507.10524