The Sculptor in the Machine: Why Your Optimizer Isn’t Just a Race Car
Optimizers Qualitatively Alter Solutions, And We Should Leverage This | Mike's Daily Paper: 02.10.25
For the better part of a decade, the deep learning community has treated optimizers like race cars. The mission was simple: get to the bottom of the loss landscape as fast as possible. We benchmarked them on speed: iterations, FLOPs, and wall-clock time. Adam was faster than SGD, and we celebrated. The implicit assumption, borrowed from the clean, predictable world of convex optimization, was that the destination was pre-determined. We were all driving to the same valley; some just had better engines.
The paper we review argues that this mental model is not only incomplete but also profoundly misleading. In the wild, non-convex terrain of neural network training, the optimizer is not a race car. It’s a sculptor. The path it carves through the high-dimensional parameter space doesn’t just determine the speed of arrival; it determines the very shape of the final statue. This is the paper’s central, paradigm-shifting insight: the choice of learning algorithm is not merely a tool for convergence but a powerful and underexploited source of inductive bias. It actively shapes the qualitative nature of the solution we find.
The Convex Illusion
Our obsession with convergence speed is an artifact of a bygone era’s anxieties. Early skepticism about neural networks centered on the fear of getting stuck in “bad” local minima. A wave of empirical and theoretical work in the early 2010s seemed to pacify these fears, suggesting that for sufficiently large models, most local minima were of similar quality. The landscape, we were told, was effectively “well-behaved.”
This narrative, while useful, blinded us to a deeper truth of non-convexity: the existence of many different kinds of good solutions. If multiple distinct, high-performing minima exist, then the algorithm we use to navigate the landscape becomes a critical factor in determining which one we find. An optimizer that takes greedy, independent steps (like SGD) will follow a different trajectory and land in a different basin of attraction than one that understands the landscape’s curvature and the intricate correlations between parameters (like a second-order method). They aren’t just taking different routes to the same city; they are ending up in entirely different countries.
The Optimizer as a Source of Bias
The paper argues that we should view the optimizer as a primary lever for controlling how a model learns. The learning rule is fundamentally a mechanism for credit assignment—deciding which of the millions of parameters gets blamed for an error. A simple optimizer assigns this blame locally and myopically. More sophisticated methods, particularly those using non-diagonal preconditioners, perform this assignment with a richer understanding of parameter interplay.
This difference has profound consequences. An optimizer that corrects for “wasteful movement” in parameter space, where updates along different dimensions effectively cancel each other out, will encourage the network to find more localized and efficient representations. It nudges the solution towards a lower-dimensional subspace, a property that is incredibly desirable for tasks like continual learning, where minimizing interference is key. This isn’t a bias we are adding with an explicit regularization term; it’s an emergent property of the optimization dynamics itself. The learning process becomes a tool for encoding desiderata, like sparsity or robustness, directly into the solution.
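To make the contrast concrete before turning to the paper's examples, here is a minimal NumPy sketch (my own illustration, not code from the paper) of a diagonal, Adam-like preconditioned step versus a non-diagonal one on a toy quadratic with coupled parameters. The matrix, starting point, and constants are arbitrary choices for illustration.

```python
import numpy as np

# Toy quadratic with strongly coupled parameters: L(w) = 0.5 * w^T A w.
# The off-diagonal entries are the "parameter interplay" the text refers to.
A = np.array([[3.0, 2.9],
              [2.9, 3.0]])
w = np.array([1.0, 0.0])        # the optimum is at the origin
g = A @ w                       # gradient at w: [3.0, 2.9]

# Diagonal (Adam-like) preconditioning: each coordinate is rescaled on its
# own, blind to how the two parameters co-vary.
v = g ** 2                      # crude per-coordinate second-moment estimate
step_diag = g / (np.sqrt(v) + 1e-8)

# Non-diagonal preconditioning: the step is bent by an estimate of the
# cross-parameter curvature (here the exact inverse of A, for illustration).
step_full = np.linalg.solve(A, g)

print("diagonal step    :", step_diag)   # ~[1.0, 1.0] -> needlessly moves w[1]
print("non-diagonal step:", step_full)   # ~[1.0, 0.0] -> only moves w[0]
```

On this convex toy both optimizers would eventually reach the same minimum, but the diagonal step wastes movement on the second coordinate that later has to be undone. In a non-convex landscape such differences in direction compound, steering the trajectory into a different basin and, ultimately, a qualitatively different solution.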
The paper grounds this abstract idea with concrete, compelling examples. In continual learning, for instance, a second-order optimizer like Shampoo, which accounts for the relationships between parameters, finds solutions that are more robust to catastrophic forgetting than those found by Adam. It does so by learning more compressed and localized representations, effectively using the model’s capacity more efficiently. This isn’t a small tweak; it’s a qualitatively different solution, one that is better suited for a dynamic environment.
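For readers who have not met Shampoo, the following heavily simplified sketch shows the flavor of its update for a single weight matrix: the gradient is preconditioned on both sides by accumulated row- and column-statistics, so no entry is updated in isolation. The function name, the hyperparameters, and the omission of details such as grafting, dampening, and update scheduling are my simplifications, not the reference implementation.

```python
import numpy as np

def shampoo_like_step(W, G, L, R, lr=0.01, eps=1e-4):
    """One heavily simplified Shampoo-style step for a single weight matrix.

    L and R accumulate the row-side and column-side gradient statistics; the
    update is preconditioned on both sides, so every entry of W is adjusted
    with knowledge of how whole rows and columns of the gradient co-vary.
    """
    L += G @ G.T                                  # left (output-side) statistics
    R += G.T @ G                                  # right (input-side) statistics

    def inv_fourth_root(M):
        # M^(-1/4) via eigendecomposition; eps keeps it well-conditioned.
        vals, vecs = np.linalg.eigh(M)
        return vecs @ np.diag((vals + eps) ** -0.25) @ vecs.T

    W -= lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R

# Usage on a toy 4x3 weight matrix with a random stand-in "gradient".
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
L, R = np.eye(4) * 1e-4, np.eye(3) * 1e-4
G = rng.normal(size=(4, 3))
W, L, R = shampoo_like_step(W, G, L, R)
```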
Another example reimagines a technique for inducing sparsity as a custom preconditioner. The optimizer is designed so that weights near zero find it difficult to grow (effectively creating saddles in the loss landscape), which biases learning towards solutions where only a few parameters become large. Here, the optimizer deliberately slows down convergence along certain directions to achieve a desired structural property, sparsity; that is a trade-off the “race car” mentality would never permit.
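A minimal sketch of how such a preconditioner could look, assuming the simplest possible construction: each gradient entry is scaled by the magnitude of its weight. The scaling rule, constants, and function name are my illustrative choices, not necessarily the paper's exact recipe.

```python
import numpy as np

def sparsity_biased_step(w, grad, lr=0.1):
    # Precondition the gradient by |w|: near-zero weights see an almost-flat
    # (saddle-like) landscape and barely move, so only weights that already
    # carry signal can grow.
    return w - lr * np.abs(w) * grad

# Two weights receive the identical pull, but start at different magnitudes.
w = np.array([1.0, 1e-3])
grad = np.array([-1.0, -1.0])
for _ in range(20):
    w = sparsity_biased_step(w, grad)

print(w)  # roughly [6.7, 0.007]: same gradient, very different outcomes
```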
Expressivity vs. Reachability
Perhaps the sharpest point the paper makes is the distinction between what a model can represent and what it can learn. We have long debated the theoretical expressivity of architectures, asking questions like, “Is this model Turing complete?” This, the authors argue, is a largely academic exercise if we ignore the constraints of learning.
An architecture defines a vast universe of possible functions. But the optimizer, coupled with initialization and data, carves out a much smaller, reachable subset within that universe. A recurrent neural network might theoretically be able to represent a complex algorithm, but if gradient-based methods can’t find that solution from a random starting point, its theoretical power is moot. The optimizer, therefore, defines the effective expressivity of our models. It is the gatekeeper between the representable and the reachable.
This reframes the entire practice of model design. We are used to baking biases into architectures. This paper makes a compelling case that it is equally valid, and sometimes more elegant, to bake them into our optimizers. There is a duality here that we have largely ignored.
The authors call for a fundamental shift in how we evaluate and design learning algorithms. Instead of asking “How fast is it?” we should be asking “What kind of solution does it produce?” It’s a call to move beyond the racetrack and enter the sculptor’s studio.
https://arxiv.org/abs/2507.12224