Multimodal Latent Language Modeling with Next-Token Diffusion
Mike's Deep Learning Daily Paper: 26.04.25
Today is Saturday, so this review will be lighter and fairly short. The focus is on multimodal generative models capable of "understanding" and generating data from multiple modalities, i.e., text, images, audio, and the like. The paper essentially connects latent generative modeling of textual data with that of more continuous data (even though that data, too, is discretized). The authors achieve this by training generative diffusion models for the different data types in the latent space; in other words, the model learns to generate latent representations both for text and for other modalities such as images and audio.
Unlike many other works, the authors train not only the multimodal generative model itself but also the embedding model responsible for producing latent representations for each modality. Typically, in diffusion models, the embedding model is based on a VAE (Variational Autoencoder). However, the authors propose a slight but important modification to the standard VAE: instead of having the encoder predict both a mean vector and a variance vector for the latent representation, it predicts only the mean vector, and the latent is obtained by adding Gaussian noise with a fixed, predefined variance (a hyperparameter). According to the authors, this change prevents the collapse (i.e., shrinking toward zero) of the learned variances that a standard VAE encoder tends to suffer from, a problem that would otherwise harm the diversity of the generated images.
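Here is a minimal sketch of that fixed-variance reparameterization in PyTorch; the class name, layer sizes, and the sigma value are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class FixedVarianceEncoder(nn.Module):
    """Encoder that outputs only a mean vector; the latent is sampled
    with a fixed, predefined standard deviation (a hyperparameter).
    Illustrative sketch, not the authors' exact architecture."""

    def __init__(self, in_dim: int, latent_dim: int, sigma: float = 0.1):
        super().__init__()
        self.to_mean = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.GELU(),
            nn.Linear(256, latent_dim),
        )
        self.sigma = sigma  # fixed std instead of a learned variance head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = self.to_mean(x)
        # Reparameterization with a constant std: z = mu + sigma * eps.
        # Because sigma can never collapse to zero, the latents keep a
        # minimum level of stochasticity for the downstream diffusion model.
        eps = torch.randn_like(mean)
        return mean + self.sigma * eps


# Usage: encode a batch of flattened patch features into latent tokens.
enc = FixedVarianceEncoder(in_dim=768, latent_dim=32, sigma=0.1)
z = enc(torch.randn(4, 768))  # -> shape (4, 32)
```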
The training process extends beyond just text. The authors also train a VAE for non-textual data such as images and audio. These are divided into tokens (image patches for images, time segments for audio) and fed to the model as a sequence; in fact, the model treats every modality as sequential data. For text and audio this is very natural, since they have a clear inherent order. For images there is no strict natural sequence, but one can still impose different orders (for example, left-to-right and top-to-bottom, or even right-to-left and bottom-to-top). This flexibility in sequencing image patches is an interesting design decision.
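To make the patch ordering concrete, here is a small sketch of how an image could be cut into non-overlapping patches and flattened in a chosen scan order; the 16-pixel patch size and the raster-scan convention are assumptions for illustration, not the paper's exact setup.

```python
import torch

def image_to_patch_sequence(img: torch.Tensor, patch: int = 16,
                            reverse: bool = False) -> torch.Tensor:
    """Split a (C, H, W) image into non-overlapping patches and flatten
    them into a sequence. `reverse=True` flips the scan order
    (right-to-left, bottom-to-top) to show that the ordering of image
    patches is a design choice."""
    c, h, w = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    if reverse:
        patches = patches.flip(0)
    return patches  # (num_patches, patch_dim), ready for the VAE encoder


seq = image_to_patch_sequence(torch.randn(3, 256, 256), patch=16)
print(seq.shape)  # torch.Size([256, 768])
```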
The diffusion model for non-textual data is trained to denoise a noisy latent representation given its context (all models in the paper are, of course, autoregressive). After denoising, the clean latent vector is passed to the VAE decoder, and the overall training objective is to reconstruct the original data as accurately as possible. For textual data, the noise is applied directly to the token embeddings, and a diffusion model is trained to recover them; in addition, a separate linear layer is trained to map the latent vector back into the token space (essentially producing a softmax distribution over the vocabulary). It's worth noting that the diffusion models are trained jointly with the VAE components, both encoder and decoder, which creates a tightly coupled and efficient training procedure.
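As a rough sketch of what such a conditional denoising step could look like, the snippet below shows a small MLP head that takes the noisy latent, the noise level, and the autoregressive hidden state of the current position and regresses the clean latent; the architecture, the linear noise schedule, and the plain MSE loss are simplifying assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Tiny denoising head: given the noisy latent z_t, the noise level t,
    and the autoregressive hidden state h (the context), predict the
    clean latent. Simplified sketch of a next-token diffusion step."""

    def __init__(self, latent_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + hidden_dim + 1, 512),
            nn.GELU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_noisy, t, h):
        return self.net(torch.cat([z_noisy, t, h], dim=-1))


def diffusion_loss(head, z_clean, h):
    """One training step: noise the clean latent, then regress back to it."""
    t = torch.rand(z_clean.size(0), 1)            # random noise level in [0, 1]
    eps = torch.randn_like(z_clean)
    z_noisy = (1 - t) * z_clean + t * eps         # simple linear schedule (assumption)
    z_pred = head(z_noisy, t, h)
    return ((z_pred - z_clean) ** 2).mean()       # reconstruct the clean latent


head = DiffusionHead(latent_dim=32, hidden_dim=1024)
loss = diffusion_loss(head, torch.randn(4, 32), torch.randn(4, 1024))
loss.backward()
```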
Finally, to clearly separate textual from non-textual data during generation, the authors introduce special tokens that act as separators between different modalities. This allows the model to better understand the modality it is working with at each point during generation.
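A toy illustration of how such a mixed sequence could be laid out; the marker names <boi> and <eoi> are hypothetical placeholders rather than the paper's actual special tokens.

```python
# Hypothetical layout of a mixed text-image sequence with boundary markers.
# The model can condition on these markers to know which modality it is
# currently generating.
text_tokens = ["A", "photo", "of", "a", "cat"]
image_latents = [f"z_{i}" for i in range(256)]   # 256 latent patch tokens

sequence = text_tokens + ["<boi>"] + image_latents + ["<eoi>"]
# One plausible convention: a marker like <boi> signals the switch from
# text prediction to latent-image prediction, and <eoi> switches back.
print(sequence[:8])   # ['A', 'photo', 'of', 'a', 'cat', '<boi>', 'z_0', 'z_1']
```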