LoRA vs. Full Fine-Tuning: An Illusion of Equivalence
Mike's Daily Deep Learning Paper Review – 01.03.25
Today marks my 200th daily review since I started writing these summaries nine months ago. To celebrate this milestone, I’ll keep things short with a relatively light paper. The paper compares the effects of fine-tuning with LoRA versus full fine-tuning, where all model weights are updated.
Quick Recap: How LoRA Works
In LoRA, we train low-rank matrices that are added to the weight matrices of selected layers. A low-rank update of size m × n is factorized into the product BA of two smaller matrices: B of size m × r and A of size r × n, where r ≪ min(m, n) (hence low-rank), so the effective weight becomes W = W_0 + BA. Instead of fine-tuning the full model, LoRA updates only A and B while keeping the original weights W_0 frozen. The modification is applied to specific layers, typically the query (Q) and value (V) projections in the attention layers.
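To make the parameterization concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The class name LoRALinear, the initialization, and the alpha/r scaling are my own illustrative choices, not the paper's code:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = x @ (W_0 + scale * B A)^T + bias, with W_0 frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # freeze W_0
        m, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)  # r x n
        self.B = nn.Parameter(torch.zeros(m, r))         # m x r, zero-init so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        delta = (self.B @ self.A) * self.scale           # low-rank update (delta W)
        return self.base(x) + x @ delta.T
```

Because only A and B receive gradients, the number of trainable parameters per layer drops from m·n to r·(m + n).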
What Did the Authors Compare?
The paper examines how the weight matrices change after fine-tuning—comparing full fine-tuning vs. LoRA fine-tuning. Specifically, they analyze the singular vectors of the trained weight matrices using Singular Value Decomposition (SVD).
Quick SVD Recap:
SVD decomposes any matrix A into three factors: A = U S V^T, where U and V have orthonormal columns (mutually orthogonal and unit-norm) and S is a diagonal matrix containing the singular values. By studying the singular vectors, the authors aimed to understand how much the structure of the weight matrices changes after fine-tuning with LoRA versus full fine-tuning.
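As a quick illustration (the variable names and matrix size are placeholders, not the paper's setup), torch.linalg.svd gives exactly these factors:

```python
import torch

W0 = torch.randn(768, 768)        # stand-in for a pre-trained weight matrix
U, S, Vh = torch.linalg.svd(W0)   # W0 = U @ diag(S) @ Vh

# Columns of U (and rows of Vh) are the singular vectors, ordered by the
# singular values in S from largest to smallest.
recon_error = (U @ torch.diag(S) @ Vh - W0).abs().max()
print(recon_error)                # tiny value: just floating-point noise
```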
Intruder Dimensions (InDims): New Directions That Appear After Fine-Tuning
The authors introduce the concept of an Intruder Dimension (InDim): a singular vector of the fine-tuned weight matrix that has no close counterpart among the singular vectors of the original weight matrix W_0. In other words, it is a new direction that “intrudes” into the weight structure during training. How is an InDim defined? For each singular vector of the fine-tuned weight matrix, the authors look for a matching singular vector in W_0, measuring similarity with cosine similarity. If no sufficiently similar singular vector is found, that dimension is classified as an Intruder Dimension (InDim): a direction added by fine-tuning rather than inherited from the pre-trained weights.
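A minimal sketch of how such a count could be computed, assuming we compare left singular vectors and use placeholder choices for the cosine-similarity threshold and for how many top singular vectors to check (the paper's exact settings may differ):

```python
import torch

def count_intruder_dims(W0: torch.Tensor, W_ft: torch.Tensor,
                        k: int = 10, threshold: float = 0.5) -> int:
    """Count top-k singular vectors of the fine-tuned matrix that have no
    close match (by |cosine similarity|) among the singular vectors of W0.

    k and threshold are illustrative placeholders, not the paper's settings.
    """
    U0, _, _ = torch.linalg.svd(W0)
    U_ft, _, _ = torch.linalg.svd(W_ft)

    # Cosine similarity between singular vectors is just their dot product,
    # since the columns of U are unit-norm.
    sims = (U_ft[:, :k].T @ U0).abs()       # (k, n) similarities
    best_match = sims.max(dim=1).values     # best match for each fine-tuned vector
    return int((best_match < threshold).sum())

# Example: a fine-tuned matrix whose update points in a brand-new direction.
W0 = torch.randn(256, 256)
W_ft = W0 + 5.0 * torch.randn(256, 1) @ torch.randn(1, 256)  # strong rank-1 change
print(count_intruder_dims(W0, W_ft))
```

Running this on a pre-trained matrix and its fine-tuned counterpart gives a per-layer intruder-dimension count of the kind the paper reports.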
Key Findings
LoRA Introduces More Intruder Dimensions Than Full Fine-Tuning: The number of InDims is higher in LoRA fine-tuning than in full fine-tuning, where intruder dimensions are largely absent. This suggests that LoRA alters the spectral structure of the weights in a qualitatively different way, even when both methods reach similar accuracy on the fine-tuning task.
LoRA Rank (r) Affects How Much the Model Forgets: For very low-rank LoRA (r ≈ 1), the number of InDims is relatively low. As r increases, the number of InDims rises, meaning the model is forgetting more of its original structure. However, at higher r values (e.g., r = 64), the number of InDims starts decreasing again and LoRA begins to behave more like full fine-tuning. This suggests a nonlinear relationship between LoRA rank and forgetting.
LoRA Leads to More Forgetting Than Full Fine-Tuning: The authors argue that models fine-tuned with LoRA tend to "forget" more of their pre-trained knowledge than models that undergo full fine-tuning. This aligns with previous findings that fully fine-tuned models perform better on out-of-distribution (OOD) data, meaning they generalize better beyond the fine-tuning dataset.
Important Note:
The study focused on fine-tuning encoder-based models, specifically RoBERTa. The conclusions might differ for decoder-based architectures like GPT-style models.
Final Thoughts
This paper challenges the common assumption that LoRA fine-tuning is equivalent to full fine-tuning. The results suggest that while LoRA is computationally efficient, it alters the model’s internal representations differently, potentially leading to more forgetting of pre-trained knowledge. The implications for OOD generalization and optimal LoRA rank selection make this an interesting area for further research.
https://arxiv.org/abs/2410.21228