[Warning] `Merge lora module to 4-bit linear may get different generations`
System Info
peft 0.14.0, transformers 4.48.0, bitsandbytes 0.45.0
Who can help?
@BenjaminBossan @sayakpaul
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder
- [ ] My own task or dataset (give details below)
Reproduction
code:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_id = "gemma-2-27b-it"
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_storage=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_id,
quantization_config=quantization_config,
attn_implementation="sdpa",
torch_dtype=torch.bfloat16,
use_cache=True,
)
peft_model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = peft_model.merge_and_unload()  # <-- this call emits the warning below
Warning:
UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
Expected behavior
merge_and_unload() completes the merge correctly and without a warning.
There is no way to avoid this. When merging weights in such a low precision regime, rounding errors are unavoidable. We give this warning purely to make users aware of that fact, not because they did anything wrong.
What you can try for your use case is to load the base model without quantization, merge the LoRA weights into the unquantized model, and then quantize the merged model to 4 bit. Please verify afterwards whether this gives better results or not. But the overall issue with low precision will still remain.
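A minimal sketch of that suggestion, reusing base_model_id and adapter_path from the reproduction above ("merged-model" is a placeholder output path):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model WITHOUT quantization, in the compute dtype used for training.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
)

# Merge the LoRA weights into the full-precision base weights.
peft_model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = peft_model.merge_and_unload()

# Save the merged, still unquantized checkpoint; quantize it afterwards.
merged_model.save_pretrained("merged-model")
```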
Hey @BenjaminBossan
do you mean this?:
1. Load Quantized Model
2. SFT Train
3. Save Adapters
4. Load Unquantized Model
5. Merge and unload
6. Save Merged Model
7. Inference Quantized Merged
Exactly, just make sure in step 7 that you load the merged model with the intended quantization applied. Some users have reported that this yields better results for them than merging into the quantized weights. But please verify that this is true for your use case (and report back please!).
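For step 7, a minimal sketch of loading the saved merged checkpoint with the intended 4-bit quantization applied, reusing the config from the reproduction ("merged-model" is a placeholder path for the checkpoint saved in step 6):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantization is applied on the fly while loading the merged checkpoint.
quantized_merged = AutoModelForCausalLM.from_pretrained(
    "merged-model",
    quantization_config=quantization_config,
)
```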
@BenjaminBossan Thanks for the note.
I will try and report here.
Step 7 can significantly degrade the results compared to just loading the LoRA adapter on top of the quantized model. That's because the merged LoRA weights get quantized again. I found that quantizing the model at step 6 with AWQ or GPTQ works significantly better than with bitsandbytes.
Thanks for sharing your findings @benjamin-marie. It is true that merging will degrade precision, but it improves runtime performance, so it's a trade-off.
> I found that quantizing the model at step 6 with AWQ or GPTQ works significantly better than with bitsandbytes.
Do you mean without step 7 (merging) or do you mean that AWQ and GPTQ are better when merging the LoRA weights?
1. Load Quantized Model (bitsandbytes)
2. SFT Train
3. Save Adapters
4. Load Unquantized Model
5. Merge and unload
6. Save Merged Model
I agree that all these steps are correct and yield a model that should perform the same as the adapter obtained at the end of SFT. But this merged model is much larger than the quantized one we fine-tuned. Intuitively, quantizing it again with bnb could be optimal, since we used bnb during SFT. In my experiments, it can severely degrade the results. I don't know why; maybe after merging, the weight distribution is difficult for bnb to handle.
However, quantizing the merged model with GPTQ or AWQ instead of bnb usually yields better results, with a perplexity much closer to that of the unquantized merged model.
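For reference, a sketch of the GPTQ route via transformers (this assumes the optimum and GPTQ backend packages are installed; "merged-model" is a placeholder path for the checkpoint saved in step 6, and the calibration settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

merged_path = "merged-model"
tokenizer = AutoTokenizer.from_pretrained(merged_path)

# Calibrate and quantize the merged weights to 4 bit while loading.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized = AutoModelForCausalLM.from_pretrained(
    merged_path,
    quantization_config=gptq_config,
    device_map="auto",
)

quantized.save_pretrained("merged-model-gptq")
tokenizer.save_pretrained("merged-model-gptq")
```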
Thanks for explaining further. This is probably a topic we should explore further, as it has come up a few times in the past. Ideally, we can collect some best practices and share them in the docs. I'm very interested in running some experiments with different steps and quantization techniques. If you have any code to share (or checkpoints), please feel free to do so.
My experiments are almost one year old. I'll rerun some experiments with the updated packages and reevaluate everything. And I'll share the results and a notebook.
Fantastic, thanks a lot!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I'll post an update this week about this. Probably tomorrow.
@benjamin-marie Nice
I confirm that quantizing the model with bitsandbytes after merging significantly degrades its quality.
So far, I've only experimented with merging QLoRA adapters. It seems that accurately quantizing a model with a merged QLoRA is much more challenging than quantizing a standard model (i.e., one that hasn't been merged). I even suspect that merging a LoRA adapter in general, i.e., not only QLoRA, could significantly alter the weight distribution, potentially increasing the number of outlier parameters, which would make the model much more difficult to accurately quantize. To the best of my knowledge, this is understudied.
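A minimal sketch of how one could check this hypothesis, assuming both the base and the merged model are loaded unquantized (the helper names and the threshold are illustrative, not an established metric):

```python
import torch

def outlier_ratio(weight: torch.Tensor) -> float:
    # max |w| relative to the standard deviation; larger values mean heavier tails
    w = weight.detach().float()
    return (w.abs().max() / w.std()).item()

def compare_outliers(base_model, merged_model, factor: float = 1.5):
    # Report layers whose outlier ratio grew by more than `factor` after merging.
    base_layers = dict(base_model.named_modules())
    for name, module in merged_model.named_modules():
        if isinstance(module, torch.nn.Linear) and name in base_layers:
            before = outlier_ratio(base_layers[name].weight)
            after = outlier_ratio(module.weight)
            if after > factor * before:
                print(f"{name}: outlier ratio {before:.1f} -> {after:.1f}")
```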
I have much better results when quantizing, after merging, with AWQ instead of bnb, which is somewhat counterintuitive since during QLoRA fine-tuning the model was quantized with bnb.
You can check my experiment in the notebook here: https://colab.research.google.com/drive/1MXPiZRHlojSQGUIrnbcqWy-rmlWNX9bp?usp=sharing
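For completeness, a sketch of the AWQ route with the AutoAWQ package ("merged-model" is a placeholder path; the quantization settings are the usual illustrative defaults, not tuned values):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

merged_path = "merged-model"
quant_path = "merged-model-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(merged_path)
tokenizer = AutoTokenizer.from_pretrained(merged_path)

# Calibrate and quantize the merged weights, then save the AWQ checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```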
Thanks for sharing your results @benjamin-marie. Would you be interested in updating the PEFT docs to include your results with a link to the experiment? Of course, this is just a single test, but it could still be helpful to many users.
Could you please explain what the custom code does and how it differs?
Also, did you test: Load the unquantized model, merge the adapter, then quantize it with bnb? Maybe that's your 2nd to last entry in the table, I'm not sure.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
gentle ping @benjamin-marie
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I had a similar issue recently while working with SigLIP: my AUC dropped by 6 points.
I don't think the 4-bit layer should store the weights re-quantized during merge_and_unload, because during training and inference we compute in a different dtype (BF16 in my case). As it stands, we return different results with a warning message that doesn't really explain the issue.
So, to match the expected results of an unloaded adapter, we should swap the layer out for a Linear variant and keep the merged weights in the compute dtype, so that users don't see these unexpected drops in performance.
This way it's also more compositional: users can then quantize downstream, in either an aware or a naive manner, and it is significantly easier to debug because the pipeline sits at the top level of the user's code rather than being hidden within numerous layers (sometimes even inherited layers).
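A minimal sketch of what that swap could look like (a hypothetical helper, not existing PEFT behavior), assuming bitsandbytes' dequantize_4bit to recover the merged weights in the compute dtype:

```python
import torch
import bitsandbytes as bnb

def swap_4bit_for_linear(model, dtype=torch.bfloat16):
    # Hypothetical helper: replace every bnb Linear4bit with a plain nn.Linear
    # holding the (merged) weights dequantized in the compute dtype, so the
    # merge result is not stored re-quantized to 4 bit.
    targets = [
        (name, module)
        for name, module in model.named_modules()
        if isinstance(module, bnb.nn.Linear4bit)
    ]
    for name, module in targets:
        weight = bnb.functional.dequantize_4bit(
            module.weight.data, module.weight.quant_state
        ).to(dtype)
        new_linear = torch.nn.Linear(
            module.in_features, module.out_features, bias=module.bias is not None
        )
        new_linear.weight = torch.nn.Parameter(weight, requires_grad=False)
        if module.bias is not None:
            new_linear.bias = torch.nn.Parameter(module.bias.data.to(dtype))
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, new_linear)
    return model
```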