Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning Processes
Mike’s Daily Paper Review: 14.06.25
This paper is quite heavy, but I tried to make the review accessible (I didn’t dive too deep myself because the paper is genuinely complex).
There’s something deceptively simple about Stochastic Gradient Descent, or SGD. For years, it has been the backbone of machine learning (ML), especially deep learning, yet the question of why it works has largely remained in the realm of vague intuition. Most explanations circle back to fuzzy claims like “it finds flat minima” or “noise helps escape local minima.” The paper I’m reviewing today tries to bring order to this, and unlike most work in the field, it offers a genuinely new angle: it describes SGD as a diffusion process evolving over time, seen through the lens of partial differential equations (PDEs).
The authors aim to shift our understanding of learning dynamics. No longer is it about tracking a single point in weight space rolling down a loss landscape. Instead, they propose describing the entire probability distribution over possible configurations, namely the density over the network’s weight space, as it evolves in time. If you come from mathematical physics, this will immediately remind you of the Fokker–Planck equation, which describes how particles diffuse in a system. The analogy here is powerful: the weights are like particles moving along the gradient of the loss function, with added noise from the stochastic nature of SGD (i.e., mini-batch selection).
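For readers who want the equation on the table, here is the textbook Fokker–Planck form this kind of analysis rests on (written in generic notation with isotropic noise of strength σ; the paper’s precise noise model may be richer). The density ρ(θ, t) over weights θ drifts along the negative gradient of the loss L while diffusing:

```latex
% Fokker--Planck equation for the weight density \rho(\theta, t):
% drift down the loss gradient plus diffusion from mini-batch noise.
% (Standard form with isotropic noise of strength \sigma; the paper's
% exact noise model may differ.)
\[
\frac{\partial \rho}{\partial t}
  = \nabla_\theta \cdot \bigl( \rho \, \nabla_\theta L(\theta) \bigr)
  + \frac{\sigma^2}{2} \, \Delta_\theta \rho
\]
```

The first term transports probability mass downhill; the second, diffusive term is the mini-batch noise spreading the density out.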
What’s striking is that this physical model doesn’t just mimic what SGD does; it explains why SGD succeeds. For example, when analyzing the kinetic energy of the system, the stochastic “noise” from mini-batch selection turns out to be far more than a nuisance: it plays a critical role in stability. It balances progress so that the system neither races ahead too quickly nor gets stuck in unstable regions. The authors show that energy constraints dictate the pace of learning, and that the noise magnitude is directly tied to how deeply the system can descend into the loss landscape.
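To see the role of noise concretely, here is a minimal, self-contained sketch (my own toy illustration, not an experiment from the paper): noisy gradient descent on a one-dimensional double-well loss, started in the shallow basin. With the noise switched off, the iterate stays where it started; with enough noise, it can cross the barrier and reach the deeper minimum.

```python
import numpy as np

# Toy double-well loss: a shallow minimum near theta = +1 and a deeper one
# near theta = -1 (the 0.3*theta term tilts the wells). Illustrative only;
# this is not the loss function analyzed in the paper.
def loss(theta):
    return (theta**2 - 1.0) ** 2 + 0.3 * theta

def grad(theta):
    return 4.0 * theta * (theta**2 - 1.0) + 0.3

def run(sigma, steps=50_000, lr=1e-3, theta0=1.0, seed=0):
    """Gradient step plus Gaussian noise of scale sigma (a Langevin-style update)."""
    rng = np.random.default_rng(seed)
    theta, best = theta0, loss(theta0)
    for _ in range(steps):
        theta = theta - lr * grad(theta) + sigma * np.sqrt(lr) * rng.standard_normal()
        best = min(best, loss(theta))  # deepest point the trajectory has reached
    return theta, best

# Outcomes are stochastic, but the trend is robust: larger sigma lets the
# trajectory descend deeper into the landscape.
for sigma in (0.0, 0.5, 1.0):
    theta, best = run(sigma)
    print(f"sigma={sigma:.1f}: final theta={theta:+.2f}, best loss seen={best:+.3f}")
```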
The paper also makes a sharp conceptual distinction between two views of learning:
The local view, which tracks parameter updates step by step, and
The global view, which describes the entire evolving distribution as a continuous probability flow over weight space.
Just like in physics, shifting from a pointwise to a distributed description uncovers insights that were previously hidden. Suddenly, we can ask not just where the weights are going, but where they're concentrating, how they spread, and how the structure of the loss function shapes that behavior.
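To make the contrast concrete, here is a small follow-up to the sketch above (again my own illustration, not the paper’s experiment): instead of following one trajectory, we push a whole cloud of initializations through the same noisy update and look at the resulting empirical density.

```python
import numpy as np

# Same toy double-well loss as in the previous sketch, but now we evolve an
# ensemble of weights and ask where the probability mass concentrates
# (the global view), instead of following one trajectory (the local view).
def grad(theta):
    return 4.0 * theta * (theta**2 - 1.0) + 0.3

rng = np.random.default_rng(0)
sigma, lr, steps = 1.0, 1e-3, 20_000

# 2,000 weight initializations spread uniformly over [-2, 2].
thetas = rng.uniform(-2.0, 2.0, size=2_000)
for _ in range(steps):
    noise = rng.standard_normal(thetas.shape)
    thetas = thetas - lr * grad(thetas) + sigma * np.sqrt(lr) * noise

# The PDE picture predicts the density settles toward a Gibbs-like profile
# with most of its mass in the deeper well (near theta = -1).
print(f"fraction of mass in the deeper well: {np.mean(thetas < 0.0):.2f}")
print(f"histogram over [-2, 2]: {np.histogram(thetas, bins=8, range=(-2, 2))[0]}")
```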
One of the most impressive parts of the paper is the analysis of temporal recursion. The authors don’t stop at showing that SGD converges; they investigate how its recursive structure, based on repeatedly following gradients, aligns with the continuous dynamics of the physical PDEs. This comparison between discrete-time recursion and continuous diffusion lets them articulate, for the first time, general principles about when SGD succeeds, when it might fail, and how that behavior can be controlled.
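The bridge between the two pictures is easy to write down, even if the analysis built on it is not. In generic notation (my own summary of the standard correspondence, not the paper’s exact formulation), one SGD step with learning rate η and Gaussian noise ξ_k is the Euler–Maruyama discretization of a Langevin SDE:

```latex
% One discrete SGD step, with learning rate \eta and noise \xi_k \sim \mathcal{N}(0, I):
\[
\theta_{k+1} = \theta_k - \eta \, \nabla L(\theta_k) + \sigma \sqrt{\eta} \, \xi_k
\]
% is the Euler--Maruyama discretization, with time step \eta, of the SDE
\[
d\theta_t = -\nabla L(\theta_t) \, dt + \sigma \, dW_t ,
\]
% whose density \rho(\theta, t) evolves by the Fokker--Planck equation above.
```

Under this correspondence, asking when SGD succeeds becomes asking how faithfully the discrete recursion tracks the continuous flow, and what breaks when η or σ is pushed too far.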
But what struck me most is that this whole framing opens the door to future development. If we accept the paradigm that SGD is not merely a greedy downhill process but a physical system governed by differential laws, then maybe we shouldn’t focus on improving SGD per se. Instead, we could shift to PDE-guided training, where we directly model the desired evolution of the distribution and solve backwards for the dynamics that produce it.
In that sense, this paper doesn’t just explain SGD’s past; it offers a bold future for deep learning. A future where we’re not blindly groping through high-dimensional surfaces, but designing dynamical systems with explicit physical structure. It’s nothing short of a paradigm shift, one that may change how we approach optimization entirely.
https://arxiv.org/abs/2501.08425