[Warning] `Merge lora module to 4-bit linear may get different generations`
System Info
peft 0.14.0, transformers 4.48.0, bitsandbytes 0.45.0
Who can help?
@BenjaminBossan @sayakpaul
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder
- [ ] My own task or dataset (give details below)
Reproduction
code:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

base_model_id = "gemma-2-27b-it"
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_storage=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_id,
quantization_config=quantization_config,
attn_implementation="sdpa",
torch_dtype=torch.bfloat16,
use_cache=True,
)
peft_model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = peft_model.merge_and_unload()  # <-- this call emits the warning below
Warning:
UserWarning: Merge lora module to 4-bit linear may get different generations due to rounding errors.
Expected behavior
merge_and_unload() completes the merge correctly and without a warning.
There is no way to avoid this. When merging weights in such a low precision regime, rounding errors are unavoidable. We give this warning purely to make users aware of that fact, not because they did anything wrong.
What you can try for your use case is to load the base model without quantization, merge the LoRA weights into the unquantized model, and then quantize the merged model to 4 bit. Please verify afterwards whether this gives better results or not. But the overall issue with low precision will still remain.
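A minimal sketch of that suggestion, reusing base_model_id and adapter_path from the reproduction above ("merged-model" is a placeholder output path):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model WITHOUT quantization, in the compute dtype used for training.
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
)

# Merge the LoRA weights into the full-precision base weights.
peft_model = PeftModel.from_pretrained(base_model, adapter_path)
merged_model = peft_model.merge_and_unload()

# Save the merged, still unquantized checkpoint; quantize it afterwards.
merged_model.save_pretrained("merged-model")
```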
Hey @BenjaminBossan
do you mean this?:
1. Load Quantized Model
2. SFT Train
3. Save Adapters
4. Load Unquantized Model
5. Merge and unload
6. Save Merged Model
7. Inference Quantized Merged
Exactly, just make sure in step 7 that you load the merged model with the intended quantization applied. Some users have reported that this yields better results for them than merging into the quantized weights. But please verify that this is true for your use case (and report back please!).
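For step 7, a minimal sketch of loading the saved merged checkpoint with the intended 4-bit quantization applied, reusing the config from the reproduction ("merged-model" is a placeholder path for the checkpoint saved in step 6):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantization is applied on the fly while loading the merged checkpoint.
quantized_merged = AutoModelForCausalLM.from_pretrained(
    "merged-model",
    quantization_config=quantization_config,
)
```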
@BenjaminBossan Thanks for the note.
I will try and report here.
Step 7 can significantly degrade the results compared to just loading the LoRA adapter on top of the quantized model. That's because the merged LoRA weights get quantized again. I found that quantizing the model at step 6 with AWQ or GPTQ works significantly better than with bitsandbytes.
Thanks for sharing your findings @benjamin-marie. It is true that merging will degrade precision, but it improves runtime performance, so it's a trade-off.
> I found that quantizing the model at step 6 with AWQ or GPTQ works significantly better than with bitsandbytes.
Do you mean without step 7 (merging) or do you mean that AWQ and GPTQ are better when merging the LoRA weights?
1. Load Quantized Model (bitsandbytes)
2. SFT Train
3. Save Adapters
4. Load Unquantized Model
5. Merge and unload
6. Save Merged Model
I agree that all these steps are correct and yield a model that should perform the same as the adapter obtained at the end of SFT. But this merged model is much larger than the quantized one we fine-tuned. Intuitively, quantizing it again with bnb could be optimal, since we used bnb during SFT. In my experiments, it can severely degrade the results. I don't know why; maybe after merging, the weight distribution is difficult for bnb to handle.
However, quantizing the merged model with GPTQ or AWQ instead of bnb usually yields better results, with a perplexity much closer to that of the unquantized merged model.
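For reference, a sketch of the GPTQ route via transformers (this assumes the optimum and GPTQ backend packages are installed; "merged-model" is a placeholder path for the checkpoint saved in step 6, and the calibration settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

merged_path = "merged-model"
tokenizer = AutoTokenizer.from_pretrained(merged_path)

# Calibrate and quantize the merged weights to 4 bit while loading.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized = AutoModelForCausalLM.from_pretrained(
    merged_path,
    quantization_config=gptq_config,
    device_map="auto",
)

quantized.save_pretrained("merged-model-gptq")
tokenizer.save_pretrained("merged-model-gptq")
```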
Thanks for explaining further. This is probably a topic we should explore further, as it has come up a few times in the past. Ideally, we can collect some best practices and share them in the docs. I'm very interested in running some experiments with different steps and quantization techniques. If you have any code to share (or checkpoints), please feel free to do so.
My experiments are almost one year old. I'll rerun some experiments with the updated packages and reevaluate everything. And I'll share the results and a notebook.
Fantastic, thanks a lot!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I'll post an update this week about this. Probably tomorrow.
@benjamin-marie Nice
I confirm that quantizing the model with bitsandbytes after merging significantly degrades its quality.
So far, I've only experimented with merging QLoRA adapters. It seems that accurately quantizing a model with a merged QLoRA is much more challenging than quantizing a standard model (i.e., one that hasn't been merged). I even suspect that merging a LoRA adapter in general, i.e., not only QLoRA, could significantly alter the weight distribution, potentially increasing the number of outlier parameters, which would make the model much more difficult to accurately quantize. To the best of my knowledge, this is understudied.
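A minimal sketch of how one could check this hypothesis, assuming both the base and the merged model are loaded unquantized (the helper names and the threshold are illustrative, not an established metric):

```python
import torch

def outlier_ratio(weight: torch.Tensor) -> float:
    # max |w| relative to the standard deviation; larger values mean heavier tails
    w = weight.detach().float()
    return (w.abs().max() / w.std()).item()

def compare_outliers(base_model, merged_model, factor: float = 1.5):
    # Report layers whose outlier ratio grew by more than `factor` after merging.
    base_layers = dict(base_model.named_modules())
    for name, module in merged_model.named_modules():
        if isinstance(module, torch.nn.Linear) and name in base_layers:
            before = outlier_ratio(base_layers[name].weight)
            after = outlier_ratio(module.weight)
            if after > factor * before:
                print(f"{name}: outlier ratio {before:.1f} -> {after:.1f}")
```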
I have much better results when quantizing, after merging, with AWQ instead of bnb, which is somewhat counterintuitive since during QLoRA fine-tuning the model was quantized with bnb.
You can check my experiment in the notebook here: https://colab.research.google.com/drive/1MXPiZRHlojSQGUIrnbcqWy-rmlWNX9bp?usp=sharing
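For completeness, a sketch of the AWQ route with the AutoAWQ package ("merged-model" is a placeholder path; the quantization settings are the usual illustrative defaults, not tuned values):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

merged_path = "merged-model"
quant_path = "merged-model-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(merged_path)
tokenizer = AutoTokenizer.from_pretrained(merged_path)

# Calibrate and quantize the merged weights, then save the AWQ checkpoint.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```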
Thanks for sharing your results @benjamin-marie. Would you be interested in updating the PEFT docs to include your results with a link to the experiment? Of course, this is just a single test, but it could still be helpful to many users.
Could you please explain what the custom code does and how it differs?
Also, did you test: Load the unquantized model, merge the adapter, then quantize it with bnb? Maybe that's your 2nd to last entry in the table, I'm not sure.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
gentle ping @benjamin-marie
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I had a similar issue recently while working with SigLIP: my AUC dropped by 6 points.
I don't think the 4-bit layer should store the weights re-quantized during merge_and_unload, because during training and inference we compute in a different dtype (BF16 in my case). As it stands, we return different results with a warning message that doesn't really explain the issue.
So, to match the expected results of an unloaded adapter, we should swap the layer out for a Linear variant and keep the merged weights in the compute dtype, so that users don't see these unexpected drops in performance.
This way it's also more compositional: users can then quantize downstream, in either an aware or a naive manner, and it is significantly easier to debug because the pipeline sits at the top level of the user's code rather than being hidden within numerous layers (sometimes even inherited layers).
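A minimal sketch of what that swap could look like (a hypothetical helper, not existing PEFT behavior), assuming bitsandbytes' dequantize_4bit to recover the merged weights in the compute dtype:

```python
import torch
import bitsandbytes as bnb

def swap_4bit_for_linear(model, dtype=torch.bfloat16):
    # Hypothetical helper: replace every bnb Linear4bit with a plain nn.Linear
    # holding the (merged) weights dequantized in the compute dtype, so the
    # merge result is not stored re-quantized to 4 bit.
    targets = [
        (name, module)
        for name, module in model.named_modules()
        if isinstance(module, bnb.nn.Linear4bit)
    ]
    for name, module in targets:
        weight = bnb.functional.dequantize_4bit(
            module.weight.data, module.weight.quant_state
        ).to(dtype)
        new_linear = torch.nn.Linear(
            module.in_features, module.out_features, bias=module.bias is not None
        )
        new_linear.weight = torch.nn.Parameter(weight, requires_grad=False)
        if module.bias is not None:
            new_linear.bias = torch.nn.Parameter(module.bias.data.to(dtype))
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, new_linear)
    return model
```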