Physics Meets AI: How a New Model Learns Language Without Predicting a Single Token.
Mike’s Daily Paper: 04.08.25
Rethinking Transformers Through the Lens of Physics: The Rise of Energy-Based Models
For years, the dominant paradigm for training LLMs has been deceptively simple: teach them to predict the next word. This autoregressive, likelihood-based approach has been wildly successful, but it has inherent limitations. Models trained this way think locally, token by token. They can lose track of global coherence, struggle with long-range dependencies, and find it difficult to satisfy complex, holistic constraints.
But what if, instead of teaching a model to predict the next step, we could teach it to recognize a good outcome when it sees one? A paper from a team at Stanford proposes exactly this, reframing the Transformer not as a sequential predictor, but as an Energy-Based Model (EBM). This isn't just a new architecture; it's a new philosophy, one that trades the local logic of likelihood for the global intuition of a physical system.
The Core Idea: From Predicting Tokens to Scoring Sequences
At its heart, an Energy-Based Model doesn't calculate the probability of a piece of data directly. Instead, it assigns a scalar value, energy, to any possible configuration. The core principle is simple: configurations with low energy are more probable, more stable, more "correct." Configurations with high energy are unlikely.
The authors of this paper apply this concept to language. Their Energy-Based Transformer (EBT) doesn't predict tokens. It reads an entire sequence of text and outputs a single number: its energy. A well-formed, coherent, and logical sentence gets a very low energy score. A garbled or nonsensical one gets a high score.
This is a fundamental shift. Unlike a standard GPT model, which is inherently directional and processes text one token at a time, an EBT is fully bidirectional. It can evaluate the global coherence of a sentence by looking at all its parts simultaneously, much like a human reader would.
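To make this concrete: energy-based models typically define a probability p(x) ∝ exp(−E(x)), so lowering a sequence's energy raises its probability. Below is a minimal, illustrative sketch of such a sequence scorer in PyTorch. It is not the paper's architecture (the class name, layer sizes, and mean-pooling are assumptions for illustration); it only shows the key structural difference from a GPT-style model: a bidirectional encoder with no causal mask, mapping a whole sequence to a single scalar.

```python
import torch
import torch.nn as nn

class EnergyScorer(nn.Module):
    """Toy energy-based sequence scorer (illustrative, not the paper's EBT)."""
    def __init__(self, vocab_size=50_000, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # No causal mask: every position attends to the whole sequence.
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.energy_head = nn.Linear(d_model, 1)

    def forward(self, token_ids):                    # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids))      # (batch, seq_len, d_model)
        pooled = h.mean(dim=1)                       # summarize the whole sequence
        return self.energy_head(pooled).squeeze(-1)  # one scalar energy per sequence
```

Given such a scorer, a real sentence and a scrambled copy of it can be compared directly: a trained model should assign the real one the lower energy.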
Training Without Likelihood: The Art of Contrast
So how do you train such a model? If you can't maximize the likelihood of the next token, what's the objective? The answer is contrastive learning.
The training process is elegant (a minimal code sketch follows below):
You show the model a "positive" example, a real sentence from the training data, and teach it to assign this sentence a low energy score.
Then, you show it a "negative" example, a corrupted version of the sentence, perhaps with a few words randomly replaced. You teach the model to assign this nonsensical sentence a high energy score.
By repeating this process millions of times, the EBT learns to build an "energy landscape" for the entire space of possible sentences. Valid language resides in the low-energy valleys, while everything else is pushed up into the high-energy mountains.
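As a rough illustration of that loop, here is one way such a contrastive objective could look in code. The hinge (margin) loss and random-token corruption below are common choices for training EBMs, not necessarily the paper's exact objective; `corrupt` and `contrastive_step` are hypothetical helpers written against the toy scorer sketched above.

```python
import torch

def corrupt(token_ids, vocab_size, p=0.15):
    """Build a 'negative' example by randomly replacing a fraction of tokens."""
    noise = torch.randint_like(token_ids, vocab_size)
    mask = torch.rand(token_ids.shape, device=token_ids.device) < p
    return torch.where(mask, noise, token_ids)

def contrastive_step(model, real_ids, vocab_size, optimizer, margin=1.0):
    """Push real sequences toward low energy and corrupted ones toward high energy."""
    e_real = model(real_ids)                       # (batch,) energies of real text
    e_fake = model(corrupt(real_ids, vocab_size))  # (batch,) energies of corrupted text
    # Hinge loss: penalize whenever real text is not at least `margin` lower in energy.
    loss = torch.relu(margin + e_real - e_fake).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Over millions of such steps, the scorer carves out exactly the landscape described above: real text settles into the valleys, corrupted text gets pushed onto the peaks.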
Thinking and Generating with a Gradient
This global perspective is what unlocks the "Thinker" in the paper's title. Because the EBT scores the whole sequence, it excels at tasks requiring holistic reasoning and constraint satisfaction, where autoregressive models often fail.
Generation, however, is a different beast. You can't sample from an energy landscape directly. Instead, the model has to find the low-energy valleys. The authors use an iterative technique inspired by physics called Langevin dynamics, a type of MCMC sampling. The process looks like this (a rough code sketch follows below):
Start with a sequence of pure random noise (random tokens).
Calculate the energy of this garbage sequence.
Slightly nudge the tokens in the direction that most reduces the energy (i.e., move down the gradient of the energy function).
Repeat this process hundreds of times.
Slowly, iteratively, the random sequence is refined, settling from the high-energy mountains down into a low-energy valley, emerging as a coherent, well-formed sentence. While this process is slower than standard autoregressive generation, it allows for a much more controlled and globally-aware form of creation.
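Here is a rough sketch of that refinement loop. Because tokens are discrete, gradient steps are usually taken in a continuous space (for example, token embeddings or logits); the sketch assumes an energy function defined over embeddings and glosses over mapping the result back to discrete tokens, so treat it as illustrative rather than the paper's exact procedure.

```python
import torch

def langevin_generate(energy_fn, seq_len, d_model,
                      steps=300, step_size=0.05, noise_scale=0.01):
    """Refine a random sequence by repeatedly stepping downhill on the energy,
    with a little injected noise (Langevin-style). `energy_fn` maps a
    (1, seq_len, d_model) tensor of token embeddings to a scalar energy."""
    x = torch.randn(1, seq_len, d_model, requires_grad=True)  # start from pure noise
    for _ in range(steps):
        energy = energy_fn(x)                        # score the whole sequence at once
        grad, = torch.autograd.grad(energy.sum(), x)
        with torch.no_grad():
            x -= step_size * grad                    # move down the energy gradient
            x += noise_scale * torch.randn_like(x)   # stochastic Langevin noise
    return x.detach()  # low-energy embeddings, still to be decoded back into tokens
```

Hundreds of these small downhill steps play the role that a single left-to-right pass plays in a GPT-style model.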
Why It's a Scalable Learner and Thinker
The paper provides strong evidence that this approach scales. As the models get bigger, their ability to distinguish good sequences from bad ones improves, and the quality of the generated samples gets better.
More importantly, the energy-based framework is incredibly flexible. You are no longer yoked to next-token prediction. Want a model that generates positive movie reviews? Just add another "energy term" to the training objective that penalizes negative sentiment. This modularity makes EBTs a powerful tool for controllable generation.
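As a toy illustration of that modularity (with `language_energy` and `sentiment_energy` as hypothetical scoring functions, not anything defined in the paper), composing constraints can be as simple as summing energy terms before running the same gradient-based refinement:

```python
def combined_energy(x, language_energy, sentiment_energy, weight=2.0):
    """Total energy is a weighted sum: the fluency term keeps text well-formed,
    the sentiment term is low for positive reviews. Minimizing the sum
    satisfies both constraints at once; `weight` trades them off."""
    return language_energy(x) + weight * sentiment_energy(x)
```

The generation loop above would then descend this combined landscape instead of the fluency term alone.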
This work forces us to reconsider the foundations of our current models. It suggests that the path to more robust, coherent, and controllable AI may not lie in simply scaling up next-token prediction, but in building models that understand language on a more holistic, physical level.
https://arxiv.org/abs/2507.02092