[FEAT] Integrate LoRA-One into PEFT
Feature request
Paper: https://arxiv.org/abs/2502.01235 (ICML 2025 Oral Presentation)
Reference code: https://github.com/YuanheZ/LoRA-One
Content Overview
This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters from the one-step full gradient, subspace alignment can be achieved immediately, for both linear and nonlinear models. Building on this theory, we propose a theory-driven algorithm, LoRA-One, for which we establish linear convergence (as well as generalization guarantees) and show that incorporating preconditioners provably helps mitigate the effects of ill-conditioning. In addition, our theory reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation.
Main Contributions
We theoretically prove:
- standard LoRA aligns with the top-$r$ singular subspace of the first-step full fine-tuning gradient;
- LoRA achieves fast linear convergence, in both optimization and generalization, if we initialize the adapters using the best rank-$r$ approximation of the first-step full gradient.
Grounded in our theory, we establish an optimal gradient-based initialization, clarifying the suboptimality of previous gradient-based methods such as LoRA-GA and LoRA-SB. Our method is supported by performance improvements across a wide range of instruction-following, math, and code benchmarks.
Algorithmic Overview
For each weight matrix, we first compute the gradient $\nabla_{W} L$ under full fine-tuning on a batch and perform an SVD on $-\nabla_{W} L$ to obtain $U$, $\Sigma$, $V$. We then initialize LoRA via

$$\mathbf{A}_{0}=\frac{1}{\sqrt{\gamma}}\, U_{[:,:r]}\, \mathrm{Diag}(\Sigma_{[:r]})^{1/2}\,,\quad \mathbf{B}_{0}=\frac{1}{\sqrt{\gamma}}\, \mathrm{Diag}(\Sigma_{[:r]})^{1/2}\, V_{[:,:r]}^{\top}\,,\quad W_{\mathrm{adapted}} = W_{\mathrm{pre}}+\frac{\alpha}{\sqrt{r}}\,\mathbf{A}_{0} \mathbf{B}_{0}\,,$$

so that $\mathbf{A}_{0}\mathbf{B}_{0} = \frac{1}{\gamma}\, U_{[:,:r]}\, \mathrm{Diag}(\Sigma_{[:r]})\, V_{[:,:r]}^{\top}$ is the best rank-$r$ approximation of $-\nabla_{W} L$, scaled by $1/\gamma$. This is equivalent to performing one step of best rank-$r$ gradient descent under full fine-tuning with learning rate $\frac{\alpha}{\gamma\sqrt{r}}$ at initialization. The SVD is implemented via randomized SVD, which is very efficient.
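For concreteness, here is a minimal PyTorch sketch of this initialization for a single weight matrix. This is an illustration under assumptions, not the reference implementation: the helper name `lora_one_init` is hypothetical, `torch.svd_lowrank` stands in for the randomized SVD mentioned above, and the `r`/`gamma` values in the usage comment are placeholders.

```python
import torch

def lora_one_init(weight_grad, r, gamma):
    """Hypothetical helper: build LoRA factors from the one-step full
    fine-tuning gradient of a single weight matrix W (shape: out x in).

    Returns A0 (out x r) and B0 (r x in) such that A0 @ B0 equals the
    best rank-r approximation of -weight_grad, scaled by 1/gamma.
    """
    # Randomized SVD of the negative gradient: -grad ~= U diag(S) V^T.
    U, S, V = torch.svd_lowrank(-weight_grad, q=r)
    sqrt_s = S.sqrt()
    # Split the singular values evenly across the two factors.
    A0 = (U * sqrt_s) / gamma**0.5    # (out, r)
    B0 = (V * sqrt_s).T / gamma**0.5  # (r, in)
    return A0, B0

# Usage (r and gamma are placeholder values):
# A0, B0 = lora_one_init(W.grad, r=8, gamma=64.0)
# W_adapted = W_pre + (alpha / r**0.5) * (A0 @ B0), with W_pre left frozen.
```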
Experiments
See the paper for benchmark results on natural language understanding, mathematical reasoning, and code generation.
Your contribution
The code implementation is similar to PiSSA and LoRA-GA. The core idea is to replace the randomly initialized LoRA adapters with matrices from the SVD. One additional requirement is the first-step full gradient computation, which has been implemented in a custom PEFT version by LoRA-GA. Any suggestions or guidance on this are welcome.
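To illustrate the gradient part, here is a hedged sketch of how that first-step full gradient could be estimated before initialization. The helper name `estimate_full_gradients` is hypothetical, and `model(**batch).loss` assumes a Hugging Face-style model; LoRA-GA's actual estimate_gradient is more careful (e.g., hook-based offloading), so this is only a minimal version of the idea.

```python
import torch

def estimate_full_gradients(model, dataloader, num_batches=1):
    """Hypothetical helper: estimate the first-step full fine-tuning
    gradient for every base weight, without taking an optimizer step.

    Returns a dict mapping parameter names to gradients, offloaded to
    CPU so they do not stay on the GPU alongside the model.
    """
    model.train()
    for p in model.parameters():
        p.requires_grad_(True)

    for i, batch in enumerate(dataloader):
        if i >= num_batches:
            break
        loss = model(**batch).loss  # assumes an HF-style model output
        loss.backward()             # accumulates grads across batches

    grads = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            grads[name] = (p.grad / num_batches).detach().cpu()
        p.grad = None               # free GPU memory immediately
        p.requires_grad_(False)     # re-freeze the base weights
    return grads
```

The returned gradient for each target weight would then be fed into the SVD-based initialization sketched above.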
Hi @YuanheZ, I've read through your work and found it quite interesting. I'd be happy to author a PR to integrate it into PEFT, if you're okay with that and not planning to do it yourself. Looking forward to your reply!
Thanks for bringing this to our attention @YuanheZ. I haven't checked the details, but please correct me if I misunderstand: For LoRA-One, we would also need to implement the technique from LoRA-GA in order to produce the gradient approximation without having to perform an actual full fine-tuning step (otherwise, it would defeat the purpose of being memory efficient), is that right?
> I'd be happy to author a PR to integrate it into PEFT, if you're okay with that and not planning to do it yourself. Looking forward to your reply!
Thanks for the offer @sambhavnoobcoder, let's wait for @YuanheZ's response.
Hi @sambhavnoobcoder,
Thank you so much for your kind words and for taking the time to review my work — I really appreciate your interest! I’m actually planning to handle the integration into PEFT myself for practice, so I’d prefer to proceed with it on my end.
Thanks again for the generous offer and your support!
Hello @BenjaminBossan,
I think your understanding of the gradient approximation is right. Since LoRA-One needs the first-step gradients from full fine-tuning, we need the efficient approach from LoRA-GA to calculate them. I think LoRA-GA has provided the code for that. The key part is the estimate_gradient function in https://github.com/Outsider565/LoRA-GA/blob/main/peft/src/peft/utils/lora_ga_utils/lora_ga_utils.py, though I'm not sure whether it is okay to merge it into PEFT. The remaining part is similar to PiSSA: take the SVD of the gradients and initialize the LoRA adapters from it, but without modifying the base weights.
Any guidance on this would be much appreciated!
Fair enough @YuanheZ ! I won’t lie, I was really looking forward to coding this up myself for some weekend fun 😄. But I completely get it — best of luck with your implementation, and happy to lend a hand anytime!
> I think your understanding of the gradient approximation is right. Since LoRA-One needs the first-step gradients from full fine-tuning, we need the efficient approach from LoRA-GA to calculate them. I think LoRA-GA has provided the code for that. The key part is the estimate_gradient function in https://github.com/Outsider565/LoRA-GA/blob/main/peft/src/peft/utils/lora_ga_utils/lora_ga_utils.py, though I'm not sure whether it is okay to merge it into PEFT.
IMO the ideal path forward would be to get LoRA-GA into PEFT first and then build LoRA-One on top of that. I checked the LoRA-GA repo and they don't have any license. Therefore, we cannot simply copy the code from there. However, the authors there might be receptive to adding a license or maybe even to creating a PR to add LoRA-GA to PEFT. Maybe @sambhavnoobcoder you would like to take the lead on that? Of course, if we implement it in PEFT, some changes may be needed; we will have to see.
Yes, thank you @BenjaminBossan. I'll gladly take it up. I'll initiate the process, contact the authors there about their plans and permission, and once I have a green flag from their end, I'll open a PR here right away.
Hi @BenjaminBossan, updating on the LoRA-GA integration: I have opened PR #2926 as well as issue #2927 to track it. Hoping to coordinate this and bring it to PEFT soon.