Feature Request: Add Adaptive Singular Value Decomposition-Based Orthogonal Subspace Fine-Tuning
Feature request
We propose adding a new parameter-efficient fine-tuning method based on adaptive singular value decomposition (SVD) for continual learning in LLMs. The core idea is to decompose weight matrices into high-rank and low-rank subspaces and to constrain updates to the low-rank subspace while freezing the high-rank directions, effectively preventing catastrophic forgetting.
Method details
This method decomposes each weight matrix into orthogonal components via SVD: $\mathbf{W} = \mathbf{U} \Sigma \mathbf{V}^\top$
We freeze the top-$r$ singular directions in both $\mathbf{U}$ and $\mathbf{V}$, which correspond to subspaces encoding knowledge from previously learned tasks, and fine-tune only the remaining low-rank directions. This allows us to repurpose unused capacity in the weight matrix without interfering with critical past representations.
Formally, for a matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$:
- The high-rank (frozen) subspace has rank $r$: $\mathbf{U}_{\text{high}} \in \mathbb{R}^{n \times r}$, $\mathbf{V}_{\text{high}} \in \mathbb{R}^{n \times r}$
- The low-rank (trainable) subspace has rank $n - r$: $\mathbf{U}_{\text{low}} \in \mathbb{R}^{n \times (n-r)}$, $\mathbf{V}_{\text{low}} \in \mathbb{R}^{n \times (n-r)}$
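To make the split concrete, here is a minimal PyTorch sketch of the decomposition (hypothetical helper names, not the PR's implementation; it only illustrates which factors are frozen and which are trainable):

```python
import torch

def split_svd_subspaces(W: torch.Tensor, r: int):
    """Split W = U @ diag(S) @ V.T into a frozen high-rank part (top-r directions)
    and a trainable low-rank part (remaining n - r directions).
    Hypothetical helper, for illustration only; names do not match the PR."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    V = Vh.T  # use the W = U @ diag(S) @ V.T convention from the write-up

    # Top-r singular directions: frozen, assumed to encode prior-task knowledge
    frozen = (U[:, :r], S[:r], V[:, :r])

    # Remaining directions: the trainable low-rank subspace
    trainable = (
        torch.nn.Parameter(U[:, r:].clone()),
        torch.nn.Parameter(S[r:].clone()),
        torch.nn.Parameter(V[:, r:].clone()),
    )
    return frozen, trainable


def reconstruct_weight(frozen, trainable):
    """Reassemble the full weight matrix from both subspaces."""
    U_h, S_h, V_h = frozen
    U_l, S_l, V_l = trainable
    return U_h @ torch.diag(S_h) @ V_h.T + U_l @ torch.diag(S_l) @ V_l.T
```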
Total compute (time complexity):
- $2nr + 2n(n-r) = 2n^2$ multiplications (twice that of a dense linear layer)
Memory complexity:
- $2n^2$ (matrix size after SVD)
- $2n(n - r)$ (gradients of trainable parameters)
- $4n(n - r)$ (optimizer state for trainable parameters)
Compared to full fine-tuning (which uses $4n^2$), our method is memory-efficient as long as we freeze at least $\frac{2}{3}$ of the weight matrix. Unlike methods like LoRA, it introduces no additional parameters after training: we reconstruct the original matrix, preserving the exact architecture and parameter count. This makes it a practical alternative for continual learning and multi-task adaptation without extra memory or parameter overhead.
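For completeness, the $\frac{2}{3}$ threshold follows directly from the numbers above: total memory stays at or below full fine-tuning's $4n^2$ exactly when

$$2n^2 + 2n(n-r) + 4n(n-r) \le 4n^2 \;\Longleftrightarrow\; 6n(n-r) \le 2n^2 \;\Longleftrightarrow\; r \ge \tfrac{2n}{3}.$$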
Paper: Sculpting Subspaces: Constrained Full Fine-Tuning in LLMs for Continual Learning
Code: https://github.com/NikhilNayak-debug/mini_trainer
Why this fits well in PEFT:
- It fine-tunes only part of the SVD-decomposed matrix (low-rank subspace), keeping high-rank components frozen.
- It avoids task-specific parameter growth and preserves memory efficiency.
- It provides strong continual learning performance with fixed model size.
Your contribution
We have implemented the method and opened a PR with:
- Core logic under `svd_utils.py`
- Integration into PEFT via `wrap_model_with_svd`
- A usage example in `examples/orthogonal_subspace_learning`
- Initial tests in `tests/test_svd_utils.py`
Looking forward to your feedback on incorporating this into PEFT.
Hey @NikhilNayak-debug, thanks for the suggestion and willingness to implement adaptive SVD for PEFT.
I skimmed the paper and your draft PR and I think this would make a fine addition to PEFT. Let's start integrating! It's probably best if you open a new PR on the PEFT repo instead of on your private fork so that we can collaborate there more effectively.
Let's not use "SVD" as the method name but something less easily mistaken for good old singular value decomposition; seeing `utils/svd_utils` had me do a double take. ASVD or OSF (orthogonal subspace fine-tuning), perhaps?
It would be best to have a tuner class like all other tuners in src/peft/tuners/. Once you've settled on a name, you should base your implementation there, e.g. src/peft/tuners/asvd/. You can base your implementation on something existing like src/peft/tuners/lora (or trainable tokens, which might be a bit leaner). The model you'll define would be the equivalent of your `ModelWithSVD` and the config class would be a formalization of the `svd_config` object. The goal is to remove `wrap_model_with_svd` and replace it with something like this to utilize the common API:
```python
from peft import AsvdConfig, get_peft_model

base_model = ...
peft_model = get_peft_model(base_model, AsvdConfig(target_modules=[...]))
```
This would also make it possible to increase test coverage by adding a test to test_custom_models.py which will test things like model saving, loading, instantiation and compatibility with a basic range of layers, a good foundation for development. We may have to add exceptions for functionality like merging/unmerging since I don't think this method will support these operations (correct me if I'm wrong). Later on I think we could do some more model-specific tests in the test file that you already provided if we find something that needs additional testing.
I thought about the `optim_wrapper`: do you think it could be replaced by gradient hooks so that the user doesn't have to wrap the optimizer?
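To illustrate what I mean, here's a rough sketch of the hook mechanism (hypothetical names; `project_out_frozen_directions` is just a stand-in for whatever projection OSF actually needs):

```python
import torch

def project_out_frozen_directions(grad: torch.Tensor, U_high: torch.Tensor) -> torch.Tensor:
    # Placeholder projection: strip gradient components lying in the frozen subspace.
    return grad - U_high @ (U_high.T @ grad)

def attach_gradient_hook(param: torch.nn.Parameter, U_high: torch.Tensor) -> None:
    # Tensor.register_hook fires once the gradient for this parameter has been computed;
    # returning a tensor replaces that gradient, so no optimizer wrapper is required.
    param.register_hook(lambda grad: project_out_frozen_directions(grad, U_high))
```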
In the paper, the cosine similarity of layer inputs and activations is used to compute the layer importance. Am I missing something, or is this not done in the draft implementation?
Do you think that it would be possible to copy the decomposed low-rank values instead of modifying them in-place so that we don't lose the ability to swap between tuner adapters?
Hey @githubnemo, great to see this moving forward! Really appreciate the thoughtful feedback and your willingness to consider integrating this into PEFT. We’re from the same team at Red Hat AI Innovation that contributed the recent SQuat PR on KV cache quantization to Transformers, so it’s exciting to continue building on efficient fine-tuning and inference methods across both repos. Thanks again for your time and support--it’s great to collaborate with maintainers so open to community contributions!
Thanks so much @githubnemo, these are great suggestions; really appreciate the detailed feedback. We will go ahead and make the changes as you outlined.
A couple of points to clarify and get your thoughts on:
- **Unified LoRA implementation?** Do you think it makes sense to reuse the LoRA code path for this method? Conceptually, it is quite similar: in our case, the original matrix becomes the frozen high-rank subspace, and the trainable adapter corresponds to the low-rank subspace. Like LoRA's $W = W_{\text{orig}} + W_{\text{new}}$, we are doing $W = W_{\text{high}} + W_{\text{low}}$, where $W_{\text{low}} = U \Sigma V^\top$, corresponding to the bottom singular values, can be mapped to LoRA's $B \times A$ (see the sketch after this list). The main additional step is enforcing orthogonality after each gradient update. Should we unify it with the existing LoRA implementation, or keep it as a separate module to maintain cleaner separation and modularity?
- **On effective rank estimation** We found that using a predetermined budget for the low-rank subspace (based on the number of fine-tuning stages) works just as well in practice as more complex rank-estimation methods. That's why we currently don't compute layerwise importance using cosine similarity in the draft implementation, but we can explore adding it back as an option if needed.
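To make the mapping concrete, here is a rough sketch (illustrative only; names don't match our implementation) of how the low-rank factors could be folded into LoRA-style $B$ and $A$ matrices:

```python
import torch

def low_rank_to_lora_factors(U_low: torch.Tensor, S_low: torch.Tensor, V_low: torch.Tensor):
    """Map W_low = U_low @ diag(S_low) @ V_low.T onto a LoRA-style product B @ A.

    Shapes follow the write-up above: U_low (n, n-r), S_low (n-r,), V_low (n, n-r).
    Splitting the singular values symmetrically keeps both factors well scaled.
    Note: unlike plain LoRA, OSF additionally has to re-enforce orthogonality
    after each gradient step, which this mapping alone does not capture.
    """
    sqrt_s = torch.sqrt(S_low)
    B = U_low * sqrt_s        # (n, n - r), plays the role of LoRA's B
    A = (V_low * sqrt_s).T    # (n - r, n), plays the role of LoRA's A
    return B, A
```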
> 1. **Unified LoRA implementation?** Do you think it makes sense to reuse the LoRA code path for this method? Conceptually, it is quite similar: in our case, the original matrix becomes the frozen high-rank subspace, and the trainable adapter corresponds to the low-rank subspace.
If your question is whether to implement OSF as a LoRA variant, I don't think it is a good fit. While I agree that you can lay it out to look quite similar, I'd think of a variant as basically the same thing with one aspect changed. Here several aspects change: we're 'modifying' the base weights by decomposing them, we're aligning the gradients, and even though we're currently not weighing in the effective rank estimation from the paper, it could very well be that we do in the future. We can certainly use LoRA as a base for the new tuner implementation, though.
Hello @githubnemo,
Thanks again for the helpful feedback and suggestions. We have made the changes you recommended and opened the PR here: PR Link
This includes:
- Moving the implementation to a new `osftuner` class following the standard PEFT repository structure
- Removing `wrap_model_with_svd` in favor of `get_peft_model(OSFConfig)`, following PEFT's standard API
- Replacing the optimizer wrapper with backward hooks for gradient projection
- Preserving in-place decomposition of the matrix
We still need to add tests to validate everything works as expected. Here is the checklist we are planning to cover; please let us know if there is anything else you would suggest:
- [x] Add OSF model tests to `test_custom_models.py` (save/load, instantiation, basic forward pass)
- [x] Check compatibility with `nn.Linear`. `Conv` layers are currently unsupported but can be added later for vision models.
- [x] Add a test to verify gradient projection logic (see the sketch after this list)
- [x] Validate config save/load
- [x] Ensure merge/unmerge behavior is explicitly unsupported and gracefully handled
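As a sketch of what the gradient-projection test could look like (hypothetical, not the exact test in the PR; it only checks the math of projecting out the frozen directions):

```python
import torch

def test_gradient_projection_stays_out_of_frozen_subspace():
    """Hypothetical test: after projection, the gradient has no component along
    the frozen high-rank directions. Names are placeholders, not the PR's API."""
    torch.manual_seed(0)
    n, r = 16, 12  # freeze the top r of n directions

    W = torch.randn(n, n)
    U, _, _ = torch.linalg.svd(W)
    U_high = U[:, :r]

    grad = torch.randn(n, n)
    projected = grad - U_high @ (U_high.T @ grad)

    # Overlap with the frozen subspace should vanish up to numerical noise.
    overlap = U_high.T @ projected
    assert torch.allclose(overlap, torch.zeros_like(overlap), atol=1e-5)
```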
Thank you again, looking forward to your feedback!
Hello @githubnemo, thank you again for your guidance so far.
The current PR ensures that the core OSFT functionality works and follows a unified implementation structure consistent with other PEFT methods in the repo. We have implemented the tuner using the standard get_peft_model flow, replaced the optimizer wrapper with backward hooks, and verified that the in-place decomposition works as expected.
In parallel, we have been polishing a more robust version of OSFT in our research repo mini_trainer, which includes enhancements that are independent of PEFT integration:
- Support for OSFT with FSDP
- Optimized SVD initialization for distributed environments
- Mixed-precision SVD computation for reduced numerical error (see the sketch after this list)
- Generalized support for custom layer patterns across different model families (not just LLaMA)
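For reference, the mixed-precision SVD item above amounts to something like the following simplified sketch (the actual mini_trainer code may differ):

```python
import torch

def svd_mixed_precision(W: torch.Tensor):
    """Run the SVD in float32 for numerical stability, then cast the factors
    back to the weight's original dtype (e.g. bfloat16). Simplified sketch."""
    orig_dtype = W.dtype
    U, S, Vh = torch.linalg.svd(W.to(torch.float32), full_matrices=False)
    return U.to(orig_dtype), S.to(orig_dtype), Vh.to(orig_dtype)
```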
We wanted to finalize and validate the base implementation within PEFT first, including adding the comprehensive tests outlined in the previous comment, before incorporating these robustness improvements.
Would love your feedback on the current implementation and the plan moving forward. Please let us know if there's anything you would recommend adjusting at this stage. Thank you!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
We are actively working with the HuggingFace team to merge this method into the PEFT repository. We intend to keep this issue active until the integration is complete.