This paper presents an interesting approach to training generative diffusion models: an enhancement of Flow Matching (FM), which has become the leading technique for training such models. Essentially, FM trains a model to transport samples along a trajectory (typically a straight line, which is the simplest path, though other works have used more complex ones) between a simple Gaussian distribution and the data distribution (images, video, or audio). The main claim is that using this method, one can generate data in just a single iteration.
The model is trained to predict a velocity (the time derivative of the trajectory) at every time point t, where t indexes the position along the trajectory, moving from pure noise (t = 0) to a data point (t = 1). Once this velocity is estimated, a data sample can be generated by numerically solving an ODE, plugging in the predicted velocity along the way. For a linear trajectory, this velocity is constant (as it is the derivative of a straight line). However, because training averages over many crossing straight paths, the velocity field the model actually learns traces out curved, complex trajectories; integrating such a flow with only a few large steps incurs a sizable discretization error, which leads to poor sample quality unless many ODE steps are used.
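To make this concrete, here is a minimal sketch of the FM objective and Euler sampling (my own toy PyTorch code, not taken from the paper; `velocity_model` is a stand-in for whatever network is used):

```python
import torch

def fm_training_pair(x1):
    """Sample a point on the straight path between noise x0 and data x1,
    and return the constant target velocity along that path."""
    x0 = torch.randn_like(x1)                 # pure noise endpoint (t = 0)
    t = torch.rand(x1.shape[0], 1)            # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1               # linear interpolation
    target_velocity = x1 - x0                 # d/dt of the straight line
    return x_t, t, target_velocity

@torch.no_grad()
def euler_sample(velocity_model, x0, n_steps=128):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data)."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_model(x, t)     # one Euler step
    return x
```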
To improve this, the paper suggests replacing straight-line paths with piecewise linear trajectories, essentially a linear spline, rather than forcing the model to always follow one global straight path. The displacement of a data point within each small segment depends only on the current point x_t, the timestep t, and the spline granularity d (I'll expand on that later). These short segments are referred to as shortcuts in the paper. The model is trained to estimate them using a consistency loss, which encourages the model to behave "consistently" across consecutive shortcut segments: a single larger shortcut should produce the same displacement as the two smaller shortcuts it spans, composed one after the other. The loss is derived from a simple combination of the update rules for these two adjacent shortcut steps.
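A rough sketch of how such a consistency target can be formed is below (hedged: in my sketch, `shortcut_model(x, t, dt)` predicts the average velocity over a segment of length dt, and I take dt = 1/d for granularity d; the paper's exact parameterization may differ):

```python
import torch

def consistency_target(shortcut_model, x_t, t, dt):
    """Target for one shortcut of length 2*dt, built from two shortcuts of
    length dt. `shortcut_model(x, t, dt)` is a placeholder network predicting
    the average velocity over the segment [t, t + dt]."""
    with torch.no_grad():
        s1 = shortcut_model(x_t, t, dt)           # first small shortcut
        x_next = x_t + dt * s1                    # follow it for one segment
        s2 = shortcut_model(x_next, t + dt, dt)   # second small shortcut
        return (s1 + s2) / 2                      # average velocity over [t, t + 2*dt]

def consistency_loss(shortcut_model, x_t, t, dt):
    """Encourage one big shortcut (length 2*dt) to agree with two small ones."""
    pred = shortcut_model(x_t, t, 2 * dt)
    return ((pred - consistency_target(shortcut_model, x_t, t, dt)) ** 2).mean()
```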
Then, the authors combine this consistency loss with the standard FM loss (computed along the straight trajectory). Since the shortcut path can be constructed at different granularities, i.e., with varying numbers of linear sub-segments, training leverages this by exposing the model to multiple resolutions. Given a timestep t (noise level), a noisy input sample, and a spline granularity d, the model is trained to predict the data-point shift over the next segment (there are d such steps in total). The point is then moved accordingly, which amounts to a single Euler step of the ODE since the shift is constant within the segment. That new point is again passed through the model to predict the next shift, and the consistency loss is applied across two such consecutive shifts.
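Putting the pieces together, a combined training step might look roughly like this (again a hedged sketch under my own assumptions: granularities are powers of two, dt = 1/d, the FM target x1 - x0 is used at the finest granularity, and handling of t near 1 is omitted; the paper's exact sampling scheme and loss weighting may differ):

```python
import torch

# Reuses consistency_loss from the previous sketch.

def training_step(shortcut_model, x1, max_log2_d=7):
    """One combined update: FM loss along the straight path plus the
    shortcut consistency loss at a randomly sampled granularity."""
    x0 = torch.randn_like(x1)                 # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)            # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1               # point on the straight path

    # Flow Matching part: at the finest granularity, the shortcut target is
    # taken to be the constant velocity of the straight path, x1 - x0.
    dt_min = 1.0 / (2 ** max_log2_d)
    fm_pred = shortcut_model(x_t, t, dt_min)
    fm_loss = ((fm_pred - (x1 - x0)) ** 2).mean()

    # Consistency part: sample a granularity d = 2^k (step size 1/d) and ask
    # one shortcut of length 2/d to match two consecutive shortcuts of length 1/d.
    k = int(torch.randint(1, max_log2_d + 1, (1,)))
    dt = 1.0 / (2 ** k)
    c_loss = consistency_loss(shortcut_model, x_t, t, dt)

    return fm_loss + c_loss
```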
Overall, a really interesting and well-written paper; I highly recommend reading it!