Add LoRA-MPO integration for enhanced parameter efficiency
This PR adds the parameter-efficient fine-tuning method MPOP [1], which integrates Matrix Product Operator (MPO) decomposition with LoRA (referred to as `lorampo` here) to improve parameter efficiency and training stability.
Key changes:
- Implement the `lorampo` method in MLP layers using MPO-based initialization
- Add a `lora_mpo` configuration option to `LoraConfig`
- Update training scripts and utilities to support training with `lorampo`
- Add an example training script for `lorampo` experiments
Features:
- Integration with existing LoRA infrastructure
- Support for MPO-based weight initialization
- Backward compatibility with standard LoRA
This enhancement allows users to leverage MPO decomposition for more efficient parameter adaptation while maintaining the simplicity of LoRA usage.
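As background for what MPO-based initialization involves, here is a minimal NumPy sketch of an MPO (tensor-train) decomposition of a weight matrix via sequential SVDs, together with its exact reconstruction. The function names and the particular dimension factorization are illustrative assumptions, not the PR's actual implementation:

```python
import numpy as np

def mpo_decompose(W, row_dims, col_dims):
    """Decompose W (prod(row_dims) x prod(col_dims)) into MPO cores via
    sequential SVDs. Each core has shape (bond_in, row_i, col_i, bond_out)."""
    n = len(row_dims)
    # Interleave row/col factors so each core's (row_i, col_i) pair is adjacent
    T = W.reshape(*row_dims, *col_dims)
    perm = [ax for pair in zip(range(n), range(n, 2 * n)) for ax in pair]
    T = T.transpose(perm)
    cores, bond = [], 1
    for i in range(n - 1):
        T = T.reshape(bond * row_dims[i] * col_dims[i], -1)
        U, S, Vt = np.linalg.svd(T, full_matrices=False)
        rank = len(S)  # exact decomposition; truncate here for compression
        cores.append(U.reshape(bond, row_dims[i], col_dims[i], rank))
        T = np.diag(S) @ Vt
        bond = rank
    cores.append(T.reshape(bond, row_dims[-1], col_dims[-1], 1))
    return cores

def mpo_reconstruct(cores):
    """Contract MPO cores back into the full matrix."""
    n = len(cores)
    T = cores[0]
    for core in cores[1:]:
        T = np.tensordot(T, core, axes=([-1], [0]))
    T = T.squeeze(0).squeeze(-1)  # drop the size-1 boundary bond dims
    # Move all row indices before all column indices, then flatten
    perm = list(range(0, 2 * n, 2)) + list(range(1, 2 * n, 2))
    T = T.transpose(perm)
    rows = int(np.prod([c.shape[1] for c in cores]))
    cols = int(np.prod([c.shape[2] for c in cores]))
    return T.reshape(rows, cols)

# Round-trip check on a small matrix: 8x27 split into 3 cores
W = np.random.default_rng(0).standard_normal((8, 27))
cores = mpo_decompose(W, row_dims=[2, 2, 2], col_dims=[3, 3, 3])
print(len(cores), np.allclose(W, mpo_reconstruct(cores)))  # → 3 True
```

Without rank truncation the round trip is exact; truncating `rank` at each SVD is what yields the parameter savings.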
[1] Liu et al. Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators. ACL 2021
For convenience, here are a few notes to help with a quick review.
(1) How to quickly test this PR
You can directly test the lorampo script via peft/examples/sft/run_peft_mpo.sh.
Simply modify the following two lines:
```shell
export model_path="YOUR_MODEL_PATH"  # e.g., "Qwen/Qwen3-0.6B"
export output_dir="./"               # e.g., "./checkpoints"
```
Then run the script to verify functionality.
(2) What tests have been done
We validated the PR under the following setting:
- Model: Qwen3-0.6B
- Dataset: smangrul/ultrachat-10k-chatml
- Training: 1 epoch of fine-tuning

LoRA results:
```
{'eval_loss': 1.738003134727478, 'eval_runtime': 83.4676, 'eval_samples_per_second': 21.949, 'eval_steps_per_second': 2.744, 'eval_entropy': 1.7323376929395584, 'eval_num_tokens': 8800671.0, 'eval_mean_token_accuracy': 0.5938055174319504, 'epoch': 1.0}
{'train_runtime': 1227.0524, 'train_samples_per_second': 7.41, 'train_steps_per_second': 0.117, 'train_loss': 1.7760656763623643, 'epoch': 1.0}
```
lorampo results:
```
{'eval_loss': 1.7559425830841064, 'eval_runtime': 57.0015, 'eval_samples_per_second': 32.139, 'eval_steps_per_second': 4.017, 'eval_entropy': 1.7077645209158352, 'eval_num_tokens': 8800671.0, 'eval_mean_token_accuracy': 0.592191962956341, 'epoch': 1.0}
{'train_runtime': 879.3554, 'train_samples_per_second': 10.341, 'train_steps_per_second': 0.163, 'train_loss': 1.792860364580488, 'epoch': 1.0}
```
Observation: Compared to LoRA, the `lorampo` method takes noticeably less training time while achieving similar performance.
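For a quick sanity check (not part of the PR), the relative wall-clock speedup can be computed directly from the `train_runtime` values reported above:

```python
# Speedup of lorampo over LoRA, using the train_runtime values reported above
lora_runtime = 1227.0524     # seconds, LoRA run
lorampo_runtime = 879.3554   # seconds, lorampo run
speedup = lora_runtime / lorampo_runtime
print(f"lorampo is {speedup:.2f}x faster in wall-clock training time")  # → 1.40x
```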
Hi, could a maintainer please approve and run the pending workflows for this PR? They’re currently blocked with “2 workflows awaiting approval”. Thanks!
Thank you very much for taking the time to read it and for your thoughtful comments! Yes, I am the first author of the paper. Let me clarify the relation between MPOP and LoRA.

1. Historical context and conceptual relation: MPOP was proposed slightly earlier than LoRA and belongs to the same family of parameter-efficient tuning methods. The key idea of MPOP is to introduce a structure-agnostic low-dimensional adaptation without modifying the model architecture, which makes it highly deployment-friendly.
2. Mathematical formulation: While LoRA constrains parameter updates to a low-rank subspace via \Delta W = BA, MPOP decomposes the parameter matrix W into multiple smaller tensors through a tensor-network representation (we use five cores in the paper). Conceptually, LoRA can be seen as the special case of MPOP in which the number of decomposition cores is 2, so expressing MPOP in a LoRA-style formulation is straightforward: each MPO core acts as an intermediate low-rank projection.
3. Implementation plan: Since MPOP can be viewed as a LoRA variant, I fully agree that it makes sense to implement it as a LoraVariant subclass. I'll refactor the implementation accordingly.
4. Code adjustments:
   - I'll vendor the minimal code from matrix2mpo_plus directly into a new mpo_utils.py file to remove the external dependency.
   - I'll translate all Chinese comments into English.
   - I'll add a separate example script instead of editing the existing one.
I’ll push these updates shortly. Thank you again for the clear guidance and for helping make this integration cleaner!
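The correspondence in point 2 can be illustrated with a small NumPy check showing that a 2-core MPO with bond dimension r is exactly a rank-r LoRA update (the shapes below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 12, 4  # d x k weight update, rank / bond dimension r

# LoRA: Delta W = B @ A, a rank-r update
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, k))
delta_w = B @ A

# The same update as a 2-core MPO: cores shaped (bond_in, rows, cols, bond_out)
core1 = B.reshape(1, d, 1, r)  # bond_in = 1, carries the row index
core2 = A.reshape(r, 1, k, 1)  # bond_out = 1, carries the column index
mpo_update = np.tensordot(core1, core2, axes=([3], [0])).reshape(d, k)

print(np.allclose(delta_w, mpo_update))  # → True
```

Contracting the two cores over the bond index reproduces B @ A term by term, which is why n = 2 collapses to LoRA.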
Thanks for answering my questions, your answers make sense.
MPOP decomposes the parameter matrix W into multiple smaller tensors through a tensor-network representation (we use 5 in the paper). Conceptually, LoRA can be seen as a special case of MPOP where the number of decomposition cores is 2
IIUC, this would correspond to n=2 in Eq. 1 of the paper, is that right? Do you have data on how well MPOP performs with n=2? I could only find Fig. 2b with the reconstruction error, but no experimental results.
I’ll push these updates shortly.
Thanks, looking forward to it.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.