Capacity vs. Complexity: Understanding What a Model Can Do vs. What It Actually Learns
In machine learning, we often throw around the word “complexity” loosely, especially when we talk about model size, architecture depth, or parameter count. But buried beneath this everyday usage are two very different ideas:
Model capacity: the theoretical expressivity of a model, i.e., what it can represent.
Function complexity: the intricacy of the specific function the model actually learns, i.e., what it does represent.
And conflating the two leads to critical misunderstandings, especially in deep learning, where huge models often generalize surprisingly well.
Let’s dive in by first dissecting capacity more formally.
Model Capacity: The Hypothesis Set
At its core, model capacity is about the hypothesis class H, the set of all functions your model could possibly learn, given infinite data, unlimited compute, and perfect optimization. There are multiple ways to formally measure how “big” or “rich” H is. Each measure captures different structural aspects of expressivity:
VC Dimension
The Vapnik–Chervonenkis (VC) dimension quantifies how well your model can fit arbitrary binary labels. Formally, it's the size of the largest set of inputs that your model class can shatter — that is, realize every possible binary labeling using some function in H. If your class can shatter n points, it can express 2^n labelings.
So:
High VC dimension ⇒ can memorize more arbitrary patterns ⇒ high capacity.
But also ⇒ higher risk of overfitting without strong regularization.
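Shattering can be checked by brute force on a toy case. The sketch below (all names are illustrative, not from any library) enumerates a small grid of 2-D linear classifiers with bias and confirms that three points in general position can be given every one of the 2^3 = 8 labelings, which is exactly what "VC dimension ≥ 3" means for this class:

```python
# Brute-force shattering check: can 2-D linear classifiers (with bias)
# realize all 2^3 = 8 labelings of three points in general position?
from itertools import product

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def labeling(w1, w2, b):
    # +1 on the side where w.x + b > 0, else -1
    return tuple(1 if w1 * x + w2 * y + b > 0 else -1 for (x, y) in points)

realized = set()
for w1, w2, b in product((-1.0, 0.0, 1.0),
                         (-1.0, 0.0, 1.0),
                         (-1.0, -0.5, 0.5, 1.0)):
    realized.add(labeling(w1, w2, b))

print(len(realized))  # 8 -> these three points are shattered
```

The same search run on four points would fail to find all 16 labelings: the VC dimension of linear classifiers in the plane (with bias) is exactly 3.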
Rademacher Complexity
While VC dimension is a worst-case capacity measure (independent of data), Rademacher complexity is data-dependent. It measures how well your model class can align with pure noise on your actual dataset. You assign random ±1 labels to the inputs and check how well the best function in your class can correlate with those labels. The better it fakes structure in randomness, the higher the complexity. Rademacher complexity often yields tighter, more practical generalization bounds and reflects capacity on the data you actually care about.
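The "correlate with random labels" recipe can be estimated directly by Monte Carlo. For norm-bounded linear functions f(x) = ⟨w, x⟩ with ‖w‖ ≤ 1, the supremum over the class has a closed form by Cauchy–Schwarz, which makes a small sketch possible (the dataset here is a synthetic stand-in):

```python
# Monte Carlo estimate of empirical Rademacher complexity for the class
# {f(x) = <w, x> : ||w|| <= 1}. By Cauchy-Schwarz the supremum over w of
# (1/n) sum_i sigma_i <w, x_i> equals ||(1/n) sum_i sigma_i x_i||.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))        # stand-in dataset

vals = []
for _ in range(1000):
    sigma = rng.choice((-1.0, 1.0), size=n)   # random +-1 "noise labels"
    vals.append(np.linalg.norm(sigma @ X / n))
rad = float(np.mean(vals))
print(rad)  # roughly sqrt(d) / sqrt(n) for this Gaussian data
```

Note how the estimate shrinks as n grows with d fixed: the same class finds it harder to fake structure on more data, which is exactly the data-dependence that worst-case measures like VC dimension miss.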
Covering Numbers
Covering numbers measure how many small “balls” in function space are needed to cover your hypothesis class under some norm (e.g., L2, uniform). Each ball contains functions that are similar within ε error.
If your class contains wildly different functions (spread out across function space), you need many balls → high covering number → high capacity. If the class contains mostly smooth, similar functions, you need fewer → lower capacity. This geometric perspective is useful in statistical learning theory, compression, and generalization analysis.
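A greedy ε-net makes this concrete. The sketch below covers a small illustrative family f_a(x) = sin(ax) under the sup norm on a sampled grid (a crude upper bound on the true covering number, and the family is my choice, not from the text):

```python
# Greedy epsilon-net: upper-bound the covering number of a function family
# under the uniform (sup) norm, evaluated on a sample grid.
import numpy as np

xs = np.linspace(0.0, 1.0, 100)
# Illustrative family: f_a(x) = sin(a * x), a in [0, 5]
fs = np.array([np.sin(a * xs) for a in np.linspace(0.0, 5.0, 500)])

def greedy_cover(fs, eps):
    # Repeatedly pick an uncovered function as a ball center and mark
    # everything within sup-distance eps of it as covered.
    uncovered = np.ones(len(fs), dtype=bool)
    centers = 0
    while uncovered.any():
        i = np.argmax(uncovered)                  # first uncovered function
        dist = np.abs(fs - fs[i]).max(axis=1)     # sup-norm distances
        uncovered &= dist > eps
        centers += 1
    return centers

print(greedy_cover(fs, 0.5), greedy_cover(fs, 0.1))
# smaller eps -> more balls needed -> higher covering number
```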
NTK Rank / Tangent Kernel Spectrum
In the infinite-width limit, neural networks exhibit linearized dynamics governed by the Neural Tangent Kernel (NTK).
The rank or spectrum of the NTK reflects how “wide” the reachable function space is — essentially a proxy for capacity in overparameterized settings.
Full-rank NTK ⇒ network can learn a wide variety of functions.
Low-rank NTK ⇒ the function space is narrow, limiting what the model can express.
This connects model architecture (width, depth, activation) directly to function space geometry and capacity.
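Outside the infinite-width limit, the empirical NTK of a finite network can be computed directly as the Gram matrix of parameter gradients, K(x, x′) = ⟨∇_θ f(x), ∇_θ f(x′)⟩. A minimal sketch for a one-hidden-layer tanh network with scalar input (gradients written out by hand, no autodiff):

```python
# Empirical NTK of a tiny one-hidden-layer tanh network f(x) = sum_j v_j tanh(w_j x).
# K(x, x') = <grad_theta f(x), grad_theta f(x')>, computed by hand in NumPy.
import numpy as np

rng = np.random.default_rng(0)
width = 256
w = rng.normal(size=width)                     # input-to-hidden weights
v = rng.normal(size=width) / np.sqrt(width)    # hidden-to-output weights

def param_grad(x):
    h = np.tanh(w * x)
    # concatenated gradient w.r.t. (v, w): [df/dv_j, df/dw_j]
    return np.concatenate([h, v * (1.0 - h**2) * x])

xs = np.linspace(-1.0, 1.0, 8)
G = np.stack([param_grad(x) for x in xs])
K = G @ G.T                                    # 8x8 empirical NTK Gram matrix

eigs = np.linalg.eigvalsh(K)
print(np.sort(eigs)[::-1])  # spectrum of the kernel on these inputs
```

How quickly these eigenvalues decay is the practical version of "NTK rank": a fast-decaying spectrum means the linearized dynamics can only move efficiently in a few directions of function space.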
Why This Matters for Modern Deep Nets
Modern deep networks, especially large-scale transformers, have immense capacity. They can perfectly interpolate training sets, even with noisy or random labels. So clearly, they could learn high-complexity, high-variance functions if they wanted to.
But here’s the twist: they usually don’t. And this is the heart of the mystery: why do models with enormous capacity often learn relatively simple solutions that generalize well? This is where the complexity of the learned function — not the size of the hypothesis class — becomes the more relevant lens.
The answer lies in:
The geometry of the data manifold,
The inductive biases of the architecture,
The implicit regularization of the optimizer,
And often, sheer luck in the optimization path.
Measuring Complexity: The Learned Function
If model capacity is about how large the hypothesis set H is, then complexity is about the information content of the specific function your model actually learns. Here are key ways to measure or characterize complexity:
Kolmogorov Complexity
This is the length of the shortest program (in bits) that outputs the function. Simple functions (like linear mappings or identity functions) have low Kolmogorov complexity. Arbitrary functions with no regularity (e.g., memorized noise) have high Kolmogorov complexity. If your trained model maps input to output in a way that can be compressed or described simply, it's low-complexity.
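Kolmogorov complexity is uncomputable in general, but the idea is visible in miniature with off-the-shelf compression as a stand-in for "shortest program." The regular string below has a tiny program ("print '01' repeated 5000 times"); the random one does not:

```python
# Compressed size as a rough stand-in for Kolmogorov complexity:
# a regular string has a short description, a random one does not.
import random
import zlib

random.seed(0)
regular = "01" * 5000                                            # short program exists
irregular = "".join(random.choice("01") for _ in range(10000))   # no short program

reg_c = len(zlib.compress(regular.encode()))
irr_c = len(zlib.compress(irregular.encode()))
print(reg_c, irr_c)  # the regular string compresses to a tiny fraction
```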
Norms of Model Parameters
In practice, especially in deep learning, we often use proxy metrics like:
ℓ2 norm of the weights,
Spectral norms of layers,
Path norms in ReLU nets.
These norms act as a measure of "function smoothness" or regularity. Many generalization bounds depend on these norms.
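All three proxies are a few lines of NumPy for a toy two-layer ReLU net f(x) = V·relu(Wx) (the weights below are random placeholders, just to show the computations):

```python
# Norm-based complexity proxies for a toy two-layer ReLU net f(x) = V relu(W x).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))    # first-layer weights
V = rng.normal(size=(1, 16))    # second-layer weights

# l2 norm of all parameters
l2 = np.sqrt((W**2).sum() + (V**2).sum())
# product of per-layer spectral norms (largest singular values)
spectral = np.linalg.norm(W, 2) * np.linalg.norm(V, 2)
# path norm: sum over all input-to-output paths of the product of |weights|
path = (np.abs(V) @ np.abs(W)).sum()

print(l2, spectral, path)
```

Each proxy penalizes a different notion of "wiggliness": the spectral product bounds the network's Lipschitz constant, while the path norm is invariant to the layer-wise rescalings that leave a ReLU net's function unchanged.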
Minimum Description Length (MDL)
From an information-theoretic standpoint: a function is simple if you can encode both the model and the data it explains with few bits. Compression techniques (e.g., pruning, quantization, distillation) reveal something deep: models that generalize well can usually be compressed. Compression is a proxy for low function complexity.
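A crude MDL-style probe: quantize a weight vector to a small codebook (a two-part code: codebook plus indices) and measure the deflated size. Smooth, structured "weights" compress to far fewer bits than unstructured noise of the same length (both arrays below are synthetic stand-ins for trained weights):

```python
# MDL-style probe: quantize weights to a small codebook, then deflate.
# Structured weights yield a much shorter description than noise.
import zlib
import numpy as np

rng = np.random.default_rng(0)
n = 8192
structured = np.sin(np.linspace(0.0, 8.0 * np.pi, n))  # smooth "weights"
noisy = rng.normal(size=n)                             # unstructured "weights"

def description_length(w, levels=64):
    # Map to `levels` uniform quantization bins, then compress the indices.
    q = np.round((w - w.min()) / (w.max() - w.min()) * (levels - 1))
    return len(zlib.compress(q.astype(np.uint8).tobytes(), level=9))

dl_structured = description_length(structured)
dl_noisy = description_length(noisy)
print(dl_structured, dl_noisy)  # structured weights need far fewer bytes
```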
Flat Minima and Hessian Spectra
Empirical studies show that SGD tends to converge to flat minima, regions in weight space where the loss doesn't change much. Flatness often correlates with low function complexity and better generalization.
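In practice "sharpness" is probed through the Hessian spectrum; in one dimension that is just the second derivative, which a finite-difference stencil approximates directly. A minimal sketch on two toy losses that both have a minimum at w = 0:

```python
# Sharpness probe via a central second-difference approximation of loss''(w0).
# Two toy losses with a minimum at w = 0: one flat, one sharp.
flat = lambda w: w**4          # second derivative 0 at the minimum
sharp = lambda w: 50.0 * w**2  # second derivative 100 at the minimum

def curvature(loss, w0=0.0, eps=1e-3):
    # central second difference ~ loss''(w0)
    return (loss(w0 + eps) - 2 * loss(w0) + loss(w0 - eps)) / eps**2

print(curvature(flat), curvature(sharp))  # ~0 vs ~100
```

For a real network the analogue is the top eigenvalues of the loss Hessian at the found minimum, usually estimated with Hessian-vector products rather than forming the matrix.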
Why Complexity, Not Capacity, Explains Generalization
All the best-performing models today have massive capacity. But they tend to converge to low-complexity solutions:
Neural nets trained on real-world data tend to learn smooth, structured functions.
Large language models memorize far less than you'd expect given their size.
Even when models interpolate the training data perfectly, their learned functions often lie on low-complexity manifolds.
The real story isn’t in what your model can represent. It’s in what it actually ends up representing, and that story is told by complexity.
Some Final Thoughts
We tend to obsess over model size, parameter count, and architecture. But those are proxies for capacity, and capacity is just potential. It’s like handing someone an entire dictionary and asking them to write a poem.
Complexity is what they write.
And if we want to understand generalization, robustness, and scaling in modern ML, we need to stop measuring the book by its cover. We need to read the function inside.

