The FFT Strikes Back: An Efficient Alternative to Self-Attention
Mike's Daily Deep Learning Paper Review – 28.02.25
Introduction:
Anyone who's been following me long enough knows that I have a soft spot for papers featuring the Fourier Transform or any other periodic transform (such as the Discrete Cosine Transform, DCT). This affinity stems from the five years I spent as a researcher, algorithm engineer, and lecturer in the field of signal processing, specifically in wireless communication systems.
This paper proposes a mechanism that replaces self-attention with a layer that transforms token representations into the frequency domain (i.e., applies a Fourier Transform). The key claim is that this method has a level of expressiveness (i.e., the ability to model the same types of functions) comparable to that of Transformers. However, this claim is only partially proven theoretically (a full proof is not given in the paper).
Why Consider a Frequency-Domain Approach Instead of Self-Attention?
The main advantage of this Fourier-based nonlinear transformation approach is, of course, its lower computational complexity. Instead of the O(N^2) complexity of self-attention, the proposed mechanism operates at O(N log N), where N is the sequence length - a significant reduction. It is well known that the Fast Fourier Transform (FFT) can be computed in O(N log N) time, and even with the nonlinear transformations introduced in the frequency domain, the overall complexity of the proposed mechanism remains O(N log N).
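As a rough back-of-the-envelope illustration (my own numbers, not taken from the paper), here is how the two operation counts compare for the 10K-token example used later in this post:

```python
# Rough operation-count comparison for a 10K-token sequence.
# Proportionality only: constants, head counts, and hidden dims are ignored.
import math

N = 10_000
self_attention = N ** 2            # ~1.0e8
fft_based = N * math.log2(N)       # ~1.3e5
print(f"{self_attention:.1e} vs {fft_based:.1e}")  # roughly a 750x gap
```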
How Does This Work?
Fourier Transform on Token Representations
The first step is to apply the FFT along the sequence (token) axis, meaning each embedding dimension is transformed separately.
Example: If we have 10K tokens, each represented by a 1024-dimensional vector, this results in 1024 independent Fourier Transforms, each of length 10K.
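To make the shape bookkeeping concrete, here is a minimal PyTorch sketch of this step (my own illustration, not the authors' code):

```python
import torch

# The 10K x 1024 example from above, with a small batch in front.
batch, seq_len, dim = 2, 10_000, 1024
x = torch.randn(batch, seq_len, dim)

# FFT along the sequence axis: one length-10K transform per embedding channel.
X_freq = torch.fft.fft(x, dim=1)

print(X_freq.shape, X_freq.dtype)  # torch.Size([2, 10000, 1024]) torch.complex64
```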
Nonlinear Processing in the Frequency Domain
Compute the mean of all transformed representations in the frequency space.
Pass the result through an MLP (fully connected layers), which outputs a transformed version of the sequence, preserving the original shape (10K×1024 in this example).
Add the result to a base weight matrix W_base, which consists entirely of ones (this whole step is sketched in code below).
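A minimal, self-contained sketch of this step with toy sizes; the names (e.g., context_mlp) and the real/imaginary split are my own simplifications, and the paper's exact MLP layout may differ:

```python
import torch
import torch.nn as nn

# Toy sizes so the MLP stays small; the post's running example is 10K x 1024.
batch, seq_len, dim = 2, 128, 64
X_freq = torch.fft.fft(torch.randn(batch, seq_len, dim), dim=1)

# Global context: mean of the frequency-domain representation over the sequence axis.
# The complex mean is split into real/imaginary features so an ordinary
# real-valued MLP can process it (a simplification assumed here).
context = X_freq.mean(dim=1)                                     # (batch, dim), complex
context_feats = torch.cat([context.real, context.imag], dim=-1)  # (batch, 2*dim), real

# MLP that outputs a modulation tensor with the sequence's original shape.
context_mlp = nn.Sequential(
    nn.Linear(2 * dim, 4 * dim),
    nn.GELU(),
    nn.Linear(4 * dim, seq_len * dim),
)
delta_w = context_mlp(context_feats).view(batch, seq_len, dim)

# Base weight matrix of ones plus the learned delta.
W = torch.ones(seq_len, dim) + delta_w                           # (batch, seq_len, dim)
```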
Adaptive Reweighting of the Fourier Transformed Representations
Perform an element-wise multiplication between the weight matrix produced in the previous step and the original FFT output.
This results in a nonlinear reweighting of the Fourier Transform of the token embeddings, where the weights are learned dynamically (see the short sketch below).
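The reweighting itself is a single element-wise product. In this toy sketch, W stands in for the W_base-plus-MLP output from the previous step:

```python
import torch

batch, seq_len, dim = 2, 128, 64
X_freq = torch.fft.fft(torch.randn(batch, seq_len, dim), dim=1)                # original FFT output
W = torch.ones(batch, seq_len, dim) + 0.1 * torch.randn(batch, seq_len, dim)   # stand-in learned weights

# Each frequency/channel entry of the FFT output is scaled by its learned weight,
# i.e. a data-dependent filter applied in the frequency domain.
X_mod = X_freq * W                                                             # complex, same shape
```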
Final Nonlinearity & Inverse Fourier Transform (IFFT)
Apply modReLU (a ReLU variant for complex numbers: the ReLU acts on the magnitude while the phase is preserved) to the complex-valued representation.
Convert back to the original token space using the Inverse FFT (IFFT); both sub-steps are sketched below.
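A sketch of both sub-steps, assuming the standard modReLU definition (ReLU on the magnitude with a learnable bias, phase preserved); taking the real part after the IFFT is a simplification on my side:

```python
import torch

def modrelu(z: torch.Tensor, bias: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # modReLU: ReLU is applied to the magnitude (plus a learnable bias),
    # while the phase of the complex input is preserved.
    mag = torch.abs(z)
    return torch.relu(mag + bias) * (z / (mag + eps))

batch, seq_len, dim = 2, 128, 64
X_mod = torch.fft.fft(torch.randn(batch, seq_len, dim), dim=1)  # stand-in for the reweighted tensor
bias = torch.zeros(dim)                                         # learnable parameter in a real layer

X_act = modrelu(X_mod, bias)

# Back to the token domain. The IFFT output is complex in general, so taking
# the real part is a simplifying assumption made here.
y = torch.fft.ifft(X_act, dim=1).real                           # (batch, seq_len, dim)
```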
And the Results? Not Bad at All:
The experimental results show that this FFT-based approach performs competitively while maintaining significantly lower complexity than standard self-attention mechanisms. Definitely an interesting direction for more efficient sequence modeling!