Can I use Lora fine-tuning twice?
I’m planning to work with a two-stage LoRA fine-tuning pipeline (Stage 1: SFT with code completion outputs; Stage 2: SFT with full-code outputs; RL follows). My question is: When I continue training the same LoRA adapter in Stage 2, will I risk overwriting or degrading the knowledge learned during Stage 1? In other words, does continuing on the same adapter effectively preserve the Stage 1 capabilities, or should I be using a separate adapter (or merging strategy) to ensure both sets of skills remain intact? Thank you for any guidance or best-practice pointers!
This is a difficult question to answer, as it depends on a lot of factors. If you have the resources, I would test both approaches. However, I think it's likely that the 2nd stage would interfere with the first stage if you don't include an objective that preserves the learning from the first stage.
If you want to avoid handling two adapters, what you could do is merge the first adapter after stage 1 and create a new one in stage 2. Another advantage of this approach is that you can use different hyper-parameters for the 2nd adapter (e.g. different rank and alpha). If you use the same adapter, you run the risk that the LoRA hyper-parameters that work best for stage 1 are not optimal for stage 2.
Thank you for the detailed explanation!
In my case, I initially tried training on a dataset of incomplete code → completed code pairs, but the results were not very good. So I changed my approach — I provided strong supervision signals and explicit labels, so that the model would only output the missing code fragments that need to be filled in. This worked surprisingly well.
However, my final goal is to make the model generate the entire completed code, not just the missing parts. That’s why I’m considering a second fine-tuning stage.
Regarding your suggestion about merging the LoRA adapter with the base model before starting Stage 2 — would this merging process risk losing what the model learned in Stage 1 (i.e., its ability to generate correct completions)? My intuition is that if I merge and then fine-tune a new LoRA adapter, the model might actually build upon the completion knowledge learned in Stage 1, potentially improving the quality of the full-code outputs. Do you think this reasoning makes sense?
Regardless of whether you merge the first adapter and then train a second one, or continue training on the first adapter, there is always a risk that the model will forget something it should remember. This is an inherent risk of fine-tuning, no matter the approach. Therefore, I think for your goal this makes no difference.
If you really want to prevent forgetting, I would suggest augmenting your evaluation pipeline to include metrics for the stage 1 task, so you can monitor whether any forgetting occurs. In case you see that the model forgets, you can:
- Change some hyper-parameters to hopefully prevent this (e.g. different learning rate, LoRA rank, etc.)
- Augment the objective in stage 2 to also include the completion task from stage 1 in your loss.
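One simple way to implement the second option is to replay a fixed fraction of stage 1 examples inside each stage 2 batch, so the completion task keeps contributing to the loss. A minimal sketch (the `mixed_batches` helper, the `replay_ratio` value, and the example dicts are all hypothetical, not part of peft):

```python
import random

def mixed_batches(stage2, stage1_replay, batch_size=8, replay_ratio=0.25, seed=0):
    """Yield stage 2 training batches that each contain a fixed fraction of
    stage 1 'replay' examples, so the stage 1 objective stays in the loss."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    stage2 = list(stage2)
    rng.shuffle(stage2)
    for start in range(0, len(stage2) - n_new + 1, n_new):
        batch = stage2[start:start + n_new] + rng.sample(stage1_replay, n_replay)
        rng.shuffle(batch)
        yield batch

# Illustrative usage with dummy examples:
stage1 = [{"task": "completion", "id": i} for i in range(20)]
stage2 = [{"task": "full_code", "id": i} for i in range(60)]
batches = list(mixed_batches(stage2, stage1))
```

The right replay ratio is task-dependent; monitoring the stage 1 metrics, as suggested above, is what tells you whether it is high enough.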
Would OFT help in your case? https://huggingface.co/docs/peft/conceptual_guides/oft
It might be better to just train with LoRA twice. Or maybe the second time you could apply OFT, but I'm not sure whether peft fully supports that without errors?
You can train with one PEFT method (say, LoRA), merge the adapter (merge_and_unload()), then train with a different PEFT method (say, OFT). This is because after merging, the model just behaves like a normal base model. Whether OFT is a good fit for this particular problem is a different question, you would have to try it out.
What is not recommended is to use two different PEFT methods without merging. It might technically work but can cause problems down the line.
Thank you so much, this method will be quite helpful!
@tohokulgq If you have no further questions, feel free to close the issue. In case you find some results worth sharing with the community, you can post them here too.