Nested Learning: Is Deep Learning Just a Stack of Optimizers?
NESTED LEARNING: THE ILLUSION OF DEEP LEARNING ARCHITECTURES, Mike’s daily paper review: 13.11.25
Today’s paper fundamentally challenges the “stack of layers” paradigm that has defined deep learning for decades. The core, novel idea is to re-imagine a model not as a deep architecture, but as a deep stack of nested optimization problems.
This new paradigm is called Nested Learning (NL). It proposes that every component of a model, from its recurrent state to its optimizer, is its own learning module, complete with its own objective and, most importantly, its own update frequency.
The authors use the analogy of brain waves to explain this. Think of your model as having different clocks:
Low-Frequency: The “pre-training” step. In a standard Transformer, the Feed-Forward Network (FFN) parameters are updated during this phase and then frozen. Their update frequency drops from 1 (during training) to 0 (at inference).
High-Frequency: The recurrent state of an RNN (or the attention cache in a Transformer). These parameters update at every single token to reflect the immediate context.
The “illusion” of deep learning, they argue, is that we flatten this multi-frequency hierarchy into a single “architecture” and treat “training” as a monolithic, one-time event. NL provides a mathematical framework to deconstruct this and treat learning as a continuous, multi-timescale process.
The first “aha” moment from this framework is the deconstruction of standard optimizers. The paper’s novel claim is that an optimizer is a learning module.
Gradient Descent (GD): In NL, this isn’t just an update rule. It’s a 1-level optimization problem. The network’s weight matrix W is reframed as an “associative memory” M. The paper shows that the standard GD update W_t+1 = W_t - η*∇L(W_t) is mathematically equivalent to taking a single gradient step to solve the optimization problem min L(M(K); V), where K is the input data and V is the “local surprise signal” (the gradient of the loss with respect to the layer’s output).
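To make this 1-level reading concrete, here is a minimal PyTorch sketch for a single linear layer: the autograd gradient of an ordinary outer loss equals the gradient of the inner associative-memory objective <W*k, v>, where k is the input and v is the local surprise signal dLoss/dy. The squared-error outer loss and all variable names are my illustrative choices, not the paper’s code.

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3, requires_grad=True)   # linear layer weights, shape (out, in)
k = torch.randn(3)                          # input ("key")
y_target = torch.randn(4)

y = W @ k                                   # the layer's output
loss = 0.5 * (y - y_target).pow(2).sum()    # any differentiable outer loss
loss.backward()

v = (y - y_target).detach()                 # local surprise signal: dLoss/dy
grad_memory = torch.outer(v, k)             # gradient of <W k, v> w.r.t. W

assert torch.allclose(W.grad, grad_memory)  # same gradient under either reading
W_next = W.detach() - 0.1 * W.grad          # one GD step = one memory write
```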
GD with Momentum: The authors argue this is a 2-level nested optimization.
Level 1 (Slow): The main network weights, W, which are updated by the momentum term: W_t+1 = W_t + m_t+1. This is the outer optimization loop.
Level 2 (Fast): The momentum term itself, m. The paper reveals that the momentum update rule (m_t+1 = β*m_t - η*∇L(W_t)) is also a single gradient step solving its own, inner optimization problem. The momentum buffer m is an associative memory (a simple linear one) whose entire job is to compress the history of past gradients.
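A tiny NumPy sketch of this two-level reading; the standard hyperparameter names beta and eta are mine, not the paper’s notation.

```python
import numpy as np

def momentum_step(W, m, grad, beta=0.9, eta=0.01):
    """One nested update: fast inner memory step, then slow outer weight step."""
    m = beta * m - eta * grad    # Level 2 (fast): compress the gradient history
    W = W + m                    # Level 1 (slow): W_t+1 = W_t + m_t+1
    return W, m

# Toy usage: minimize 0.5 * ||W||^2, whose gradient is W itself.
W, m = np.ones(3), np.zeros(3)
for _ in range(100):
    W, m = momentum_step(W, m, grad=W)
```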
This reframing leads to the paper’s first set of novel proposals. If the momentum term is just a simple memory, why not make it more powerful? They suggest:
More Expressive Memory: Replace the simple linear update rule for m with a full-blown MLP. This creates “Deep Momentum Gradient Descent (DMGD),” where the optimizer itself is a deep model that learns to capture complex gradient dynamics, rather than just linearly averaging them.
More Expressive Objectives: Instead of the momentum being just a simple memory that accumulates gradients (a Hebbian-style rule), the paper proposes changing its internal objective to make it error-correcting: the memory tries to predict the next gradient and updates based on the prediction error. This lets the momentum memory “manage its capacity” more intelligently; it learns to ignore redundant information and focus on new, surprising gradients, making the stored gradient history far more efficient. A toy sketch combining both proposals follows below.
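A heavily hedged sketch combining both proposals: the momentum buffer becomes a small MLP (the “deep” memory), and its inner objective is error-correcting, so its own weights move only on the part of the new gradient it failed to predict. The architecture, shapes, and hyperparameters are illustrative assumptions, not the paper’s exact DMGD formulation.

```python
import torch
import torch.nn as nn

class DeepMomentum(nn.Module):
    """Momentum as a learned memory: an MLP trained online to predict gradients."""
    def __init__(self, dim, hidden=32, inner_lr=1e-2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))
        self.inner_opt = torch.optim.SGD(self.net.parameters(), lr=inner_lr)
        self.prev_grad = None

    def step(self, grad):
        grad = grad.detach()
        if self.prev_grad is not None:
            # Error-correcting inner objective: predict the new gradient from the
            # previous one; only the prediction error ("surprise") updates the memory.
            pred = self.net(self.prev_grad)
            inner_loss = 0.5 * (pred - grad).pow(2).sum()
            self.inner_opt.zero_grad()
            inner_loss.backward()
            self.inner_opt.step()
        self.prev_grad = grad
        with torch.no_grad():
            return self.net(grad)    # the memory's read-out is the update direction

# Toy driver: feed the gradients of 0.5 * ||w||^2 (which are just w) through the
# learned momentum and use its read-out as the update direction.
w = torch.ones(8)
mom = DeepMomentum(dim=8)
for _ in range(50):
    w = w - 0.1 * mom.step(w.clone())
```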
The second “aha” moment is applying this same logic to architectures. Take a linear attention model (or RNN). The NL perspective splits this into two distinct optimization problems operating at different frequencies:
Level 1 (Outer Loop): The “pre-training” of the query, key, and value projection matrices (W_q, W_k, W_v). This is a slow-frequency optimization that runs over the entire dataset and then stops.
Level 2 (Inner Loop): The recurrent state update M_t = M_t-1 + v_t * k_t^T. The paper shows this update is also a single step of a GD algorithm, solving an inner optimization problem (min_M -<M*k_t, v_t>, a dot-product similarity objective) at every single token. This state M is an associative memory learning to map keys to values within the immediate context.
NL provides a clean separation between these two processes. We typically backpropagate through the Level 2 process to update the Level 1 parameters. NL suggests these are two distinct learning systems.
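A quick NumPy check of the Level 2 claim above: the familiar linear-attention state update is exactly one gradient-descent step (learning rate 1) on the dot-product similarity objective, with the Level 1 projection matrices held fixed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.normal(size=(d, d))                      # recurrent associative-memory state
k, v = rng.normal(size=d), rng.normal(size=d)    # current key and value

grad_inner = -np.outer(v, k)     # gradient of  -<M k, v>  with respect to M
M_gd  = M - 1.0 * grad_inner     # one GD step on the inner objective
M_rnn = M + np.outer(v, k)       # the usual linear-attention / RNN state update

assert np.allclose(M_gd, M_rnn)  # identical: the recurrence is an optimizer step
```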
The paper’s primary methodological contribution, the Continuum Memory System (CMS), is the direct and elegant application of this entire theory. In a standard Transformer, the FFN (MLP) block is the static, “long-term memory,” frozen after pre-training (update frequency = 0). CMS replaces this single, static FFN with a chain of FFNs, each operating at a different update frequency.
This means you can have:
A Low-Frequency FFN that stores the original pre-trained knowledge. Its parameters never get gradient updates after pre-training.
A Mid-Frequency FFN that accumulates gradients for, say, 1000 steps and then updates its weights once.
A High-Frequency FFN that accumulates gradients for only 16 steps and then updates its weights.
This creates a “continuum” of memory, directly inspired by synaptic consolidation in the brain. It allows a model to continually learn and consolidate new information into its parameters at different timescales, without catastrophically forgetting the core knowledge stored in the low-frequency blocks.
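A hedged PyTorch sketch of this idea under my own assumptions: every block is a residual FFN that receives gradients on every backward pass, but only commits an optimizer step on its own schedule, accumulating gradients in between. The block definition, chunk sizes, and optimizer choice are illustrative, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class FrequencyFFN(nn.Module):
    """A residual FFN that applies its accumulated gradients every `period` steps."""
    def __init__(self, dim, period, lr=1e-3):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.period = period                       # 0 = frozen after pre-training
        self.opt = torch.optim.SGD(self.ffn.parameters(), lr=lr)
        self.steps = 0

    def forward(self, x):
        return x + self.ffn(x)

    def maybe_update(self):
        self.steps += 1
        if self.period and self.steps % self.period == 0:
            self.opt.step()                        # consolidate accumulated gradients
            self.opt.zero_grad()

# A continuum of memories: frozen, slow, and fast FFNs chained together.
dim = 16
cms = nn.ModuleList([
    FrequencyFFN(dim, period=0),      # low frequency: pre-trained core, never updated
    FrequencyFFN(dim, period=1000),   # mid frequency: one consolidation per 1000 steps
    FrequencyFFN(dim, period=16),     # high frequency: one consolidation per 16 steps
])

x = torch.randn(2, dim)
for block in cms:                     # forward through the chain
    x = block(x)
loss = x.pow(2).mean()                # stand-in for the model's actual loss
loss.backward()                       # gradients accumulate in every block
for block in cms:
    block.maybe_update()              # each block commits on its own clock
```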
Finally, the HOPE model is the flagship architecture built from these principles. It combines a self-modifying recurrent module (which learns its own update rule) with the multi-frequency Continuum Memory System for knowledge storage.
By reframing everything as a nested optimization, this paper doesn’t just propose a new model; it suggests a new axis for model design, moving beyond just adding layers to adding levels of learning defined by update frequency.
https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/

