A Deep Mathematical Tour of Generative Models in Deep Learning
A short survey of the most popular generative models in deep learning
Generative modeling aims to learn probability distributions over complex, high-dimensional data spaces. In deep learning, this typically involves parameterizing a distribution p_θ(x) using neural networks, with the goal of sampling from or computing probabilities under this distribution in a way that closely aligns with observed data. The field has evolved rapidly, giving rise to a rich taxonomy of generative models, each with distinct mathematical motivations, architectural constraints, and inference strategies.
This post takes a mathematically rigorous tour through several key generative model families, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Diffusion Models, Consistency Models, Rectified Flows, and Energy-Based Models (EBMs).
Variational Autoencoders (VAEs)
VAEs are rooted in variational inference. They define a latent variable model:
p_θ(x, z) = p_θ(x∣z) p(z)
where x is data and z is a latent (unknown) variable “from which x is generated”. The marginal likelihood
p_θ(x) = ∫ p_θ(x∣z) p(z) dz
is generally intractable, so VAEs introduce an approximate posterior q_ϕ(z∣x) and optimize the Evidence Lower Bound (ELBO):
log p_θ(x) ≥ E_{q_ϕ(z∣x)}[log p_θ(x∣z)] − KL(q_ϕ(z∣x) ∥ p(z))
Key ideas:
Amortized inference via a recognition network q_ϕ(z∣x)
Reparameterization trick, writing z = μ_ϕ(x) + σ_ϕ(x) ⊙ ϵ with ϵ∼N(0, I), so that gradients can flow through the sampling step to the encoder parameters (sketched in code after this list)
KL divergence regularization encouraging structured latent representations
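To make this concrete, here is a minimal sketch of the negative ELBO with the reparameterization trick, assuming hypothetical encoder and decoder modules: the encoder returns the mean and log-variance of a diagonal Gaussian q_ϕ(z∣x), and the decoder outputs Bernoulli probabilities.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    # Amortized inference: q_phi(z|x) = N(mu, diag(exp(log_var)))
    mu, log_var = encoder(x)

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps

    # Reconstruction term E_q[log p_theta(x|z)] (Bernoulli decoder assumed)
    x_hat = decoder(z)
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")

    # KL(q_phi(z|x) || N(0, I)) in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    # Negative ELBO, to be minimized
    return recon + kl
```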
Extensions of VAEs include:
Beta-VAEs: Introduce a scaling factor on the KL term to control disentanglement.
Hierarchical VAEs: Employ multiple layers of latent variables to capture more complex dependencies.
Importance-weighted Autoencoders (IWAE): Use tighter bounds with multiple samples from q_ϕ(z∣x).
Generative Adversarial Networks (GANs)
GANs define an implicit generative model by transforming latent samples z∼p(z) through a generator G_θ(z) to produce data x. Rather than defining a likelihood, GANs introduce a discriminator D_ϕ(x) and optimize a minimax game:
min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]
Here the discriminator D is trained to distinguish real data from generated samples, while the generator G is trained to produce samples that the discriminator cannot tell apart from real data.
The objective can be interpreted as minimizing the Jensen-Shannon divergence between the data distribution p_data and the distribution induced by the generator p_G. Advanced GAN variants such as the Wasserstein GAN (WGAN) replace the JS divergence with the Earth Mover's Distance, improving stability during training.
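As an illustration, the sketch below performs one adversarial update, assuming hypothetical G and D networks (with D returning logits) and their optimizers. It uses the non-saturating generator loss that is commonly substituted for the pure minimax form in practice.

```python
import torch
import torch.nn.functional as F

def gan_step(x_real, G, D, opt_g, opt_d, latent_dim=128):
    z = torch.randn(x_real.size(0), latent_dim)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    d_real = D(x_real)
    d_fake = D(G(z).detach())  # detach so G is not updated here
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: non-saturating loss, maximize log D(G(z))
    d_fake = D(G(z))
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```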
Diffusion Models
Generative diffusion models are a class of deep generative models that construct data samples by reversing a gradual noising process. They leverage the principle of score-based modeling and stochastic differential equations (SDEs) to define a forward process that progressively destroys structure in the data, and a learned reverse process that reconstructs it. This approach has achieved state-of-the-art results in image, audio, and 3D generation, offering advantages in sample diversity and training stability over GANs.
Forward Process
Given a data sample x_0∼p_data(x), the forward (diffusion) process gradually adds Gaussian noise over T steps:
q(x_t ∣ x_{t−1}) = N(x_t; √(1 − β_t) x_{t−1}, β_t I)
with a variance schedule β_1, …, β_T, typically increasing over time. This defines a Markov chain:
q(x_{1:T} ∣ x_0) = ∏_{t=1}^{T} q(x_t ∣ x_{t−1})
A key property is that the marginal distribution q(x_t ∣ x_0) is also Gaussian:
q(x_t ∣ x_0) = N(x_t; √ᾱ_t x_0, (1 − ᾱ_t) I)
where α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s. This allows direct sampling of noisy data at arbitrary time steps.
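Because q(x_t ∣ x_0) is Gaussian, noisy samples at any time step can be drawn in closed form. A minimal sketch, assuming a simple linear β schedule (one common but not unique choice):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # variance schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_s alpha_s

def q_sample(x0, t, noise=None):
    """Draw x_t ~ q(x_t | x_0) in one shot, without simulating the chain."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```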
Reverse Process: Denoising
The generative model learns the reverse-time process:
p_θ(x_{t−1} ∣ x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t))
Typically, the model is trained to predict the mean μ_θ, while the covariance Σ_θ is fixed or parameterized simply. An alternative and more effective formulation is to train a neural network ϵ_θ(x_t, t) to predict the noise added at step t:
L_simple(θ) = E_{x_0, ϵ, t}[ ∥ ϵ − ϵ_θ(√ᾱ_t x_0 + √(1 − ᾱ_t) ϵ, t) ∥² ]
where ϵ∼N(0, I). This loss has a simple interpretation: the network learns to denoise a sample corrupted by known noise.
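Training then reduces to regressing the injected noise. A minimal sketch, reusing the schedule and q_sample helper from the snippet above; model is a hypothetical network taking (x_t, t):

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0):
    t = torch.randint(0, T, (x0.size(0),))   # uniform random time step per example
    noise = torch.randn_like(x0)             # eps ~ N(0, I)
    x_t = q_sample(x0, t, noise)             # closed-form forward sample
    return F.mse_loss(model(x_t, t), noise)  # || eps - eps_theta(x_t, t) ||^2
```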
Likelihood and Sampling
Diffusion models are latent-variable models with a tractable evidence lower bound (ELBO):
log p_θ(x_0) ≥ E_q[ log p_θ(x_0 ∣ x_1) ] − ∑_{t=2}^{T} E_q[ KL( q(x_{t−1} ∣ x_t, x_0) ∥ p_θ(x_{t−1} ∣ x_t) ) ] − KL( q(x_T ∣ x_0) ∥ p(x_T) )
Although the reverse process requires many steps to sample (typically hundreds to thousands), recent advances such as DDIM (Denoising Diffusion Implicit Models) and fast ODE/SDE solvers have significantly accelerated sampling.
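For reference, a sketch of the basic ancestral (DDPM-style) sampling loop implied by the reverse process, reusing the schedule above and fixing the reverse variance to β_t (one standard choice):

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape):
    x = torch.randn(shape)  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)
        # Posterior mean: mu = (x - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x
```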
Score-Based Formulation of Generative Diffusion Models
An equivalent continuous-time formulation uses SDEs. The forward SDE:
dx = f(x, t) dt + g(t) dw
defines a noise-injection process, and the reverse-time SDE:
dx = [ f(x, t) − g(t)² ∇_x log p_t(x) ] dt + g(t) dw̄
requires estimating the score function ∇_x log p_t(x). This is learned via score matching using denoising score networks.
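A minimal sketch of denoising score matching at a single noise level σ, with a hypothetical score_net(x, sigma) standing in for the score model; the regression target follows directly from the Gaussian corruption kernel:

```python
import torch

def dsm_loss(score_net, x0, sigma):
    noise = torch.randn_like(x0)
    x_noisy = x0 + sigma * noise
    # For Gaussian corruption, grad_x log q(x_noisy | x0) = -(x_noisy - x0) / sigma^2
    target = -noise / sigma
    pred = score_net(x_noisy, sigma)
    return ((pred - target) ** 2).mean()
```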
Diffusion models unify score-based generative modeling and likelihood-based learning, providing a principled framework with strong theoretical backing and impressive empirical results across modalities.
Consistency Models
Consistency models are a class of generative models that accelerate sampling by enforcing consistency across denoising predictions at different noise levels. Unlike diffusion models, which generate samples through a long sequence of incremental denoising steps, consistency models aim to collapse this process into a small number of evaluations, often just a few, while retaining high sample quality. Consistency models can be viewed as close relatives of the generative diffusion models we discussed earlier.
Let x_0∼p_data be a clean data point, and let x_t be a noisy version of x_0 at noise level t ∈ [0, 1], typically generated using a forward corruption process. The core idea is to learn a function f_θ(x_t, t) that directly maps a noisy input at any time t back to a clean sample estimate. A consistency condition is then imposed to ensure that predictions made at different time steps are compatible with each other:
f_θ(x_t, t) = f_θ(x_s, s) for all s, t ∈ [0, 1] when x_t and x_s come from the same x_0
To train this model, one generates a pair (x_t, x_s) such that both originate from the same clean sample x_0, i.e.,
x_t = C(x_0, t), x_s = C(x_0, s)
with C representing the corruption process. The consistency loss is then defined as:
L_consistency = E_{x_0, t, s}[ ∥ f_θ(x_t, t) − f_θ(x_s, s) ∥² ]
Additionally, a supervised reconstruction loss can be added at the final denoising level t_min ≈ 0:
L_recon = E_{x_0}[ ∥ f_θ(x_{t_min}, t_min) − x_0 ∥² ]
The total loss is a weighted combination:
L = L_consistency + λ L_recon
This formulation allows the model to learn a single function f_θ that generalizes across all noise levels, effectively collapsing the denoising trajectory. Sampling becomes extremely efficient: starting from pure noise at the highest noise level (or from a partially corrupted sample x_t∼C(x_0, t)), one can directly obtain an approximation of the clean data using a single or a small number of function evaluations. Empirically, consistency models achieve quality comparable to diffusion models like DDPM or DDIM, but with drastically reduced sampling time, making them especially attractive for real-time applications.
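The sketch below mirrors the simplified objective described above, assuming a hypothetical network f_theta(x, t) and a toy additive-Gaussian corruption C(x_0, t) = x_0 + t·ϵ; full consistency models additionally use an EMA target network and a parameterization tied to a diffusion ODE, both omitted here.

```python
import torch
import torch.nn.functional as F

def corrupt(x0, t, noise):
    # Toy corruption C(x_0, t): add noise scaled by the level t (shared noise draw)
    return x0 + t.view(-1, *([1] * (x0.dim() - 1))) * noise

def consistency_loss(f_theta, x0, lam=1.0):
    noise = torch.randn_like(x0)
    t = torch.rand(x0.size(0))        # higher noise level
    s = torch.rand(x0.size(0)) * t    # lower noise level, s <= t
    x_t, x_s = corrupt(x0, t, noise), corrupt(x0, s, noise)

    # Consistency term: predictions at different levels should agree
    # (the lower-noise prediction acts as a stop-gradient target)
    l_cons = F.mse_loss(f_theta(x_t, t), f_theta(x_s, s).detach())

    # Reconstruction term at a near-zero noise level
    t0 = torch.full_like(t, 1e-3)
    l_rec = F.mse_loss(f_theta(corrupt(x0, t0, noise), t0), x0)

    return l_cons + lam * l_rec
```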
Rectified Flows
Rectified flows are a class of generative models that define a deterministic transformation from a simple base distribution (e.g., standard Gaussian) to a complex data distribution by learning a continuous-time velocity field, avoiding the stochastic nature and inefficiencies of diffusion models. Given a prior sample z_0∼N(0,I), rectified flows define a transport map via the ordinary differential equation (ODE):
dx_t/dt = v_θ(x_t, t), t ∈ [0, 1], x_0 = z_0
where v_θ is a neural network representing the time-dependent velocity field. The goal is to find v_θ such that integrating the ODE from t = 0 to t = 1 transports the initial distribution p_0 to the data distribution p_1. A key idea is to "rectify" the trajectories of samples by learning a velocity field that makes their flow paths as straight as possible, minimizing transport complexity.
The training objective matches the velocity field to the straight-line displacement between paired samples x_0∼p_0 and x_1∼p_1, evaluated along the interpolated distributions p_t that lie between p_0 and p_1. This leads to the loss:
L(θ) = E_{t∼U[0, 1], x_0, x_1}[ ∥ v_θ(x_t, t) − (x_1 − x_0) ∥² ]
In practice, the intermediate points are constructed using a linear interpolation:
x_t = (1 − t) x_0 + t x_1
Unlike score-based SDE formulations, no separate score estimate is required: the velocity field is learned by direct regression onto the interpolation direction. Because rectified flows define deterministic paths, samples can be generated with a single forward ODE integration, often requiring orders of magnitude fewer steps than diffusion models while preserving sample quality.
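A minimal sketch of the rectified-flow regression loss and an Euler-integration sampler, assuming a hypothetical velocity network v_theta(x, t):

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(v_theta, x1):
    x0 = torch.randn_like(x1)                      # z_0 ~ N(0, I)
    t = torch.rand(x1.size(0))
    t_b = t.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1.0 - t_b) * x0 + t_b * x1              # linear interpolation
    target = x1 - x0                               # straight-line velocity
    return F.mse_loss(v_theta(x_t, t), target)

@torch.no_grad()
def rf_sample(v_theta, shape, steps=20):
    x = torch.randn(shape)                         # start at p_0 = N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * v_theta(x, t)                 # explicit Euler step of the ODE
    return x
```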
Energy-Based Models (EBMs)
Energy-Based Models (EBMs) are probabilistic models that define a scalar energy function E(x; θ) over the data space X, where x ∈ X represents a data sample and θ are the model parameters. The model assigns lower energy to more likely or preferred configurations, with the goal of learning a distribution that minimizes the energy for observed data while maximizing it for noise or unobserved configurations. The probability distribution p(x; θ) over data x is given by the Boltzmann distribution:
p(x; θ) = exp(−E(x; θ)) / Z(θ)
where
Z(θ) = ∫_X exp(−E(x; θ)) dx
is the partition function, which normalizes the distribution. The challenge in EBMs lies in efficiently computing Z(θ), which is often intractable for high-dimensional data.
Learning in EBMs typically involves adjusting the parameters θ to minimize the energy of the observed data. One common method for training EBMs is Contrastive Divergence, which approximates the gradient of the log-likelihood log p(x; θ) using a stochastic process:
∇_θ log p(x; θ) ≈ −E_{x∼p_data}[ ∇_θ E(x; θ) ] + E_{x∼q}[ ∇_θ E(x; θ) ]
where q(x) is the distribution of x under the model after a few steps of a Markov chain, and p_data(x) is the true data distribution. This gradient-based approach allows for learning without the need for computing the intractable partition function Z(θ), though it introduces challenges related to convergence and the quality of the approximation.
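A sketch of contrastive-divergence-style training in which short-run Langevin dynamics supplies the model samples q(x); energy is a hypothetical network mapping a batch of inputs to per-example scalar energies.

```python
import torch

def langevin_negatives(energy, x_init, steps=20, step_size=0.01):
    # Short-run Langevin chain: x <- x - eta * grad E(x) + sqrt(2 * eta) * noise
    x = x_init.clone().requires_grad_(True)
    for _ in range(steps):
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = x - step_size * grad + (2 * step_size) ** 0.5 * torch.randn_like(x)
        x = x.detach().requires_grad_(True)
    return x.detach()

def cd_loss(energy, x_data):
    x_neg = langevin_negatives(energy, torch.randn_like(x_data))
    # Push energy down on data and up on model samples; Z(theta) is never computed.
    return energy(x_data).mean() - energy(x_neg).mean()
```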
EBMs are flexible, allowing for a variety of architectures and applications, including generative modeling, image generation, and representation learning. However, the difficulty of exact inference and efficient training methods continues to be a significant challenge in scaling EBMs to large, high-dimensional datasets.
Closing Thoughts
Generative models are increasingly at the heart of modern machine learning, not just for image or text generation, but for representation learning, simulation, and reasoning under uncertainty. From variational inference to optimal transport, these frameworks rest on deep mathematical ideas. A robust understanding of their underpinnings enables principled extensions, hybrid architectures, and more interpretable learning systems.
As the field evolves, we are beginning to see synthesis between these approaches—score-based methods with adversarial training, diffusion models in latent flow spaces, EBMs regularizing VAEs. The next frontier will likely involve unifying perspectives across probabilistic modeling, dynamical systems, and optimal control.
For those aiming to innovate in generative modeling, the math is not just a foundation—it’s a design space.