Introducing Neural Tangent Kernel (NTK)
A review of a classic deep learning paper by Jaehoon Lee, Jascha Sohl-Dickstein, and co-authors (NeurIPS 2019).
Paper Overview
This paper, titled Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent, provides a rigorous mathematical characterization of the training dynamics of wide neural networks: their evolution under gradient descent is effectively described by a linearized model in function space, obtained from a first-order Taylor expansion of the network function around its initialization.
The core result of the paper is that the Neural Tangent Kernel (NTK) governs the learning dynamics of wide neural networks. Specifically, in the infinite-width limit the NTK remains constant during training (and approximately constant at large finite width), leading to an analytically tractable linear differential equation for the function's evolution. This framework builds on previous work on Neural Network Gaussian Processes (NNGPs) and extends it to capture the full trajectory of learning rather than just the statistics at initialization.
The Neural Tangent Kernel, Without the Tough Math
The Neural Tangent Kernel (NTK) is a mathematical tool used to describe how a neural network learns during training. It helps us understand how small changes in the network's weights affect its output.
Imagine a neural network as a complex function that takes inputs (like an image) and produces outputs (like a classification). During training, the network updates its weights using gradient descent, meaning it adjusts its internal parameters step by step to minimize errors. The NTK captures how these weight updates influence the network’s predictions.
The NTK is a large matrix indexed by pairs of inputs: the entry for inputs x and x′ is the inner product of the network's parameter gradients at x and x′, so it measures how a weight update driven by one example changes the prediction on another. If the NTK stays (approximately) constant during training, which happens in very wide networks, the network behaves like a simple linear model, even though it is a deep, nonlinear function. This allows us to predict the learning behavior of the network using mathematical formulas, rather than running experiments.
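As a concrete illustration, here is a minimal sketch of computing the empirical NTK of a tiny two-layer network via finite-difference Jacobians. The architecture, sizes, and all function names are illustrative assumptions, not from the paper:

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's code): empirical NTK of a
# tiny one-hidden-layer tanh network, via finite-difference Jacobians.

rng = np.random.default_rng(0)

def init_params(d_in=3, width=64):
    return {
        "W1": rng.normal(size=(width, d_in)) / np.sqrt(d_in),
        "W2": rng.normal(size=(1, width)) / np.sqrt(width),
    }

def forward(params, x):
    # x: (d_in,) -> scalar network output
    h = np.tanh(params["W1"] @ x)
    return (params["W2"] @ h)[0]

def flat_grad(params, x, eps=1e-5):
    # Central finite-difference gradient of the scalar output w.r.t. all weights.
    grads = []
    for k in params:
        g = np.zeros_like(params[k])
        it = np.nditer(params[k], flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            old = params[k][idx]
            params[k][idx] = old + eps
            up = forward(params, x)
            params[k][idx] = old - eps
            down = forward(params, x)
            params[k][idx] = old
            g[idx] = (up - down) / (2 * eps)
        grads.append(g.ravel())
    return np.concatenate(grads)

def empirical_ntk(params, X):
    # Theta[i, j] = <grad f(x_i), grad f(x_j)>: a Gram matrix of gradients.
    J = np.stack([flat_grad(params, x) for x in X])
    return J @ J.T

params = init_params()
X = rng.normal(size=(4, 3))         # four random inputs
Theta = empirical_ntk(params, X)    # 4x4 kernel matrix
print(Theta.shape)                  # (4, 4)
```

Because Θ is a Gram matrix of gradient vectors, it is automatically symmetric and positive semi-definite, which is what makes the kernel-methods view possible.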
Main Paper Contributions:
The paper makes the following key contributions:
Linearization of Training Dynamics: The function space dynamics of wide networks are equivalent to those of a linear model obtained from their first-order Taylor expansion.
Constancy of the NTK: The NTK, which characterizes function evolution under gradient descent, remains approximately constant throughout training, even for sufficiently wide finite-width networks.
Gaussian Process Interpretation: The trained network function remains Gaussian-distributed, governed by a mean function that evolves deterministically under gradient descent.
Closed-Form Training Dynamics: For mean squared error (MSE) loss, the function evolution is explicitly solvable in terms of the NTK eigenvalues.
These results provide a precise mathematical foundation for understanding deep learning as a special case of kernel methods, bridging nonlinear neural networks with classical linear models in function space.
Gradient Flow and Linearization
The training of neural networks under gradient descent follows the continuous-time gradient flow equation:

dθ_t/dt = −η ∇_θ L(θ_t)

where θ_t represents the network parameters at time t, η is the learning rate, and L(θ) is the loss function. The key observation is that in the infinite-width limit, the network evolution simplifies dramatically due to the linearization of function dynamics. To formalize this, define the network output function (f is the trained neural net):

f_t(x) = f(x; θ_t)

The first-order Taylor expansion of f_t(x) around the initialization θ_0 gives:

f_t(x) ≈ f_0(x) + ∇_θ f_0(x)ᵀ (θ_t − θ_0)

Introducing the weight perturbation vector

ω_t ≡ θ_t − θ_0,

the function evolution obeys the linearized model:

f_t^lin(x) = f_0(x) + ∇_θ f_0(x)ᵀ ω_t

Using this, the function space evolution on the training inputs X is:

df_t(X)/dt = −η Θ_0(X, X) ∇_{f_t(X)} L

where Θ_0(X, X) is the Neural Tangent Kernel, defined as:

Θ_0(X, X′) = ∇_θ f_0(X) ∇_θ f_0(X′)ᵀ
Recall that the NTK is a kernel describing the evolution of deep neural networks during training by gradient descent; in the formula above it is evaluated at the network initialization θ_0.
Key Consequences of Linearization
The function dynamics become linear in the parameter updates, reducing the complexity of the nonlinear optimization problem. The NTK controls the learning trajectory, and since it remains approximately constant, the function's evolution is predictable. For small learning rates, gradient flow approximates kernel gradient descent in function space. This linearization becomes exact in the infinite-width limit, and empirically holds for moderately wide networks.
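The linearization can be checked numerically: after a small parameter perturbation, the true network output should match its first-order Taylor expansion at θ_0 up to second-order error. The tiny architecture and step size below are illustrative assumptions:

```python
import numpy as np

# Sketch (illustrative assumptions): for a small parameter perturbation,
# the true output of a tiny tanh network is close to its Taylor expansion.

rng = np.random.default_rng(3)
d, width = 3, 32
W1 = rng.normal(size=(width, d)) / np.sqrt(d)
W2 = rng.normal(size=(1, width)) / np.sqrt(width)
x = rng.normal(size=d)

def f(W1, W2):
    # Scalar output of the one-hidden-layer network at the fixed input x.
    return (W2 @ np.tanh(W1 @ x))[0]

# Analytic gradients of f w.r.t. both weight matrices at (W1, W2).
h = np.tanh(W1 @ x)
gW2 = h[None, :]                                    # df/dW2
gW1 = (W2[0] * (1 - h**2))[:, None] * x[None, :]    # df/dW1

eps = 1e-3
dW1 = eps * rng.normal(size=W1.shape)
dW2 = eps * rng.normal(size=W2.shape)

true = f(W1 + dW1, W2 + dW2)
linear = f(W1, W2) + np.sum(gW1 * dW1) + np.sum(gW2 * dW2)
print(abs(true - linear))   # small: second-order in the perturbation size
```

The gap between `true` and `linear` shrinks quadratically as the perturbation shrinks, which is the sense in which small-step gradient descent stays close to the linearized model.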
Gaussian Process Correspondence and Bayesian Interpretation
A central insight of the paper is the connection between wide neural networks and Gaussian processes (GPs). At initialization, the network function f_0(x) follows a Gaussian process prior:

f_0(X) ~ N(0, K(X, X))

where K(X, X) is the Neural Network Gaussian Process (NNGP) kernel, given by the expectation over random initializations:

K(x, x′) = E_θ[ f(x; θ) f(x′; θ) ]

Under gradient descent, the function distribution remains Gaussian, but now evolves according to the NTK:

f_t(X) ~ N(μ_t(X), Σ_t(X, X))

where, on the training inputs with targets Y, the mean and covariance satisfy:

μ_t(X) = (I − e^{−η Θ_0 t}) Y
Σ_t(X, X) = e^{−η Θ_0 t} K(X, X) e^{−η Θ_0 t}
Closed-Form Training Dynamics and Eigenvalue Analysis
For the squared loss

L = ½ ‖f_t(X) − Y‖²

the differential equation governing function evolution is explicitly solvable. Expanding in the eigenbasis of Θ_0, with eigenvalues λ_i and eigenvectors e_i,

Θ_0 = Σ_i λ_i e_i e_iᵀ,

the neural network trajectory is:

f_t(X) = Y + Σ_i e^{−η λ_i t} ⟨e_i, f_0(X) − Y⟩ e_i

Each eigenmode decays exponentially at rate η λ_i, so the learning speed is determined by the NTK spectrum. Modes with large λ_i converge quickly, while small eigenvalues correspond to slowly learned components.
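This closed form is easy to evaluate numerically. The sketch below uses a random positive semi-definite matrix as a stand-in for the NTK, together with synthetic targets and initial outputs, and evolves the training-set predictions in the kernel's eigenbasis:

```python
import numpy as np

# Sketch (synthetic Theta, y, f0, eta): closed-form MSE training dynamics
# under a fixed kernel, evaluated in the kernel's eigenbasis.

rng = np.random.default_rng(1)
n = 5
A = rng.normal(size=(n, n))
Theta = A @ A.T                    # symmetric PSD stand-in for the NTK
y = rng.normal(size=n)             # training targets
f0 = rng.normal(size=n)            # network outputs at initialization
eta = 0.1                          # learning rate

lam, E = np.linalg.eigh(Theta)     # eigenvalues lam_i, eigenvectors e_i

def f_t(t):
    # f_t(X) = y + sum_i exp(-eta * lam_i * t) * <e_i, f0 - y> * e_i
    coeffs = E.T @ (f0 - y)
    return y + E @ (np.exp(-eta * lam * t) * coeffs)

print(np.allclose(f_t(0.0), f0))           # True: t=0 recovers initialization
print(np.linalg.norm(f_t(50.0) - y))       # residual shrinks as t grows
```

Each coefficient ⟨e_i, f_0 − y⟩ is damped by e^{−η λ_i t}, so directions with large eigenvalues are fit almost immediately while the smallest-eigenvalue directions dominate the late phase of training.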
Conclusion:
This paper provides a mathematically rigorous characterization of neural network training in the large-width limit, showing that gradient descent effectively transforms deep networks into linear models in function space. By proving that training dynamics are governed by a fixed kernel, the paper establishes a direct mathematical link between deep learning and well-understood mathematical frameworks.

