peft
Add INC dispatcher
What does this PR do?
Adds an Intel Neural Compressor (INC) dispatcher to support loading LoRA weights into an INC-quantized model.
Intel Neural Compressor modifies layer names or introduces new layers during quantization. Names can be seen here (INC adds the prefix "Patched"). HuggingFace PEFT does not have a dispatcher that handles the new names.
Before this PR, we get the following error when trying to load LoRA weights into an INC-quantized model:
Traceback (most recent call last):
File "/root/./test.py", line 124, in <module>
pipeline.load_lora_weights("lora_model", adapter_name="user_lora")
File "/usr/local/lib/python3.10/dist-packages/diffusers/loaders/lora_pipeline.py", line 1846, in load_lora_weights
self.load_lora_into_transformer(
File "/usr/local/lib/python3.10/dist-packages/diffusers/loaders/lora_pipeline.py", line 1948, in load_lora_into_transformer
inject_adapter_in_model(lora_config, transformer, adapter_name=adapter_name, **peft_kwargs)
File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 76, in inject_adapter_in_model
peft_model = tuner_cls(model, peft_config, adapter_name=adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)
File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py", line 142, in __init__
super().__init__(model, config, adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)
File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 181, in __init__
self.inject_adapter(self.model, adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)
File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 509, in inject_adapter
self._create_and_replace(peft_config, adapter_name, target, target_name, parent, current_key=key)
File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py", line 237, in _create_and_replace
new_module = self._create_new_module(lora_config, adapter_name, target, device_map=device_map, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py", line 348, in _create_new_module
raise ValueError(
ValueError: Target module PatchedLinear(
original=Linear, in_features=3072, out_features=3072, bias=True, scale_input dtype=float, scale_weight dtype=float
(quant_input): QuantInput(lp_dtype=torch.float8_e4m3fn, hp_dtype=torch.bfloat16, scale_inv dtype=float)
(dequant_output): QuantDequantNone(lp_dtype=torch.float8_e4m3fn, hp_dtype=torch.bfloat16, doesn't quantize nor dequantize)
) is not supported.
Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv1d`,
`torch.nn.Conv2d`, `torch.nn.Conv3d`, `transformers.pytorch_utils.Conv1D`, `torch.nn.MultiheadAttention.`.
After this PR, LoRA weights load into an INC-quantized model without errors, and the model's performance is similar to that of the original quantized model ✅
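To illustrate the dispatcher idea described above, here is a minimal, self-contained sketch. It does not use PEFT or INC: every class and function in it (`Linear`, `PatchedLinear`, `LoraLinear`, `IncLoraLinear`, `dispatch_default`, `dispatch_inc`, `create_new_module`) is a toy stand-in chosen for illustration, and the real PEFT code may be organized differently. It only shows why a layer type that no dispatcher recognizes ends in the `ValueError` above, and how adding one more dispatcher resolves it.

```python
from typing import Callable, List, Optional


class Linear:
    """Toy stand-in for torch.nn.Linear (assumption, not the real class)."""


class PatchedLinear:
    """Toy stand-in for the layer INC substitutes for Linear after quantization."""


class LoraLinear:
    """Toy LoRA wrapper for plain Linear layers."""

    def __init__(self, base_layer):
        self.base_layer = base_layer


class IncLoraLinear(LoraLinear):
    """Toy LoRA wrapper for INC-patched layers."""


def dispatch_default(target) -> Optional[LoraLinear]:
    # Handles only the unquantized layer type.
    return LoraLinear(target) if isinstance(target, Linear) else None


def dispatch_inc(target) -> Optional[LoraLinear]:
    # Handles only the layer type introduced by quantization.
    return IncLoraLinear(target) if isinstance(target, PatchedLinear) else None


def create_new_module(target, dispatchers: List[Callable]) -> LoraLinear:
    # Ask each dispatcher in turn; the first one that returns a module wins.
    for dispatch in dispatchers:
        new_module = dispatch(target)
        if new_module is not None:
            return new_module
    raise ValueError(f"Target module {type(target).__name__} is not supported.")
```

With only `dispatch_default` in the list, a `PatchedLinear` target raises `ValueError`; with `dispatch_inc` placed before it, the same target is wrapped in `IncLoraLinear`.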
Thanks for the PR to support INC. Could you please provide a short example to test this out? I assume some features like merging of LoRA weights would not work yet?
@BenjaminBossan Thanks for the review. A hello-world style example can be found in the Intel Neural Compressor (INC) repo, but it does not include a LoRA loading example.
To actually test this PR with PEFT, here's an example of FLUX model inference on HPU via the optimum-habana pipeline (this demonstrates loading a PEFT-tuned model with FP8 quantization):
#!/usr/bin/env python
import os
import torch
# Example: FLUX model inference on HPU via optimum-habana pipeline
from optimum.habana.diffusers import GaudiFluxPipeline
hpu_configs = {
    "use_habana": True,
    "use_hpu_graphs": True,
    "sdp_on_bf16": True,
    "gaudi_config": "Habana/stable-diffusion",
}
pipe = GaudiFluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16, **hpu_configs)
prompt = "A picture of sks dog in a bucket"
# Quantize FLUX transformer to FP8 using INC
from neural_compressor.torch.quantization import FP8Config, convert, prepare, finalize_calibration
quant_configs = {
    "mode": "AUTO",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "allowlist": {"types": [], "names": []},
    "blocklist": {"types": [], "names": []},
    "dump_stats_path": "/tmp/hqt_output/measure",
}
config = FP8Config(**quant_configs)
pipe.transformer = prepare(pipe.transformer, config)
pipe(prompt)
finalize_calibration(pipe.transformer)
pipe.transformer = convert(pipe.transformer)
# Load LoRA weights with PEFT
pipe.load_lora_weights("dsocek/lora-flux-dog", adapter_name="user_lora") # <--- FAILS WITHOUT INC DISPATCHER
# Run inference
image = pipe(prompt).images[0]
image.save("dog.png")
Requirements:
pip install optimum-habana sentencepiece neural-compressor[pt] peft
I assume some features like merging of LoRA weights would not work yet?
We could address merging and other additional features in a different PR, if that's ok
Thanks for providing the example. I have a small comment. Could you please also run make style?
I think it would be great to add it to peft/examples/stable_diffusion/. Let's also add a check at the top of the script for whether the required hardware is present, plus a short description of what the script does.
Sounds good, I will fix style with make style in next commit. Are you thinking to add a standalone HPU INC quantized Flux LoRA inference example py file, or somehow integrate into existing example script? A standalone can be done quickly. Do you also want to have example with and without quantization (somehow control via args or so)?
A check is_hpu_available() was recently added to accelerate (see here), should we consider adding similar check in peft/import_utils and just call this from the example?
We could address merging and other additional features in a different PR, if that's ok
Generally, that would be fine, we have other quantization methods without merging support. However, right now, the implementation dispatches to the vanilla LoRA layers, so if a user calls merge, the normal merge method is called, which would result in an error (I assume, I haven't tested) or may even fail silently. This is not very user friendly. IMO it would be better to dispatch to special INC layers, similar to what we do with other quantization methods. The layers would not implement merge and unmerge, or raise a NotImplementedError, so that users know what's going on. Later, it will be easier to add those methods to the INC layers. WDYT?
I see, so you would like to add specific INC LoRA layers which define merge and unmerge (like say in HQQ) at least as placeholders with NotImplementedError, correct?
I am not sure when/how is merge/unmerge called with diffusers (sorry for my misunderstanding here) to even test what happens currently :) I know we can call something like pipe.set_adapters(["user_lora", "user_lora2"], adapter_weights=[0.7, 0.8]) for example to "merge" 2 adapters but I think maybe this is not the same merge/unmerge on layer level. I also know one can use pipe.fuse_lora() which will fuse lora with actual model (but we can't unfuse then), will this trigger merge?
Are you thinking to add a standalone HPU INC quantized Flux LoRA inference example py file, or somehow integrate into existing example script? A standalone can be done quickly. Do you also want to have example with and without quantization (somehow control via args or so)?
Standalone is fine.
A check is_hpu_available() was recently added to accelerate (see here), should we consider adding similar check in peft/import_utils and just call this from the example?
Normally, we would just use accelerate, but since it's so recent, I think it's better to copy+paste with a comment about the source.
I see, so you would like to add specific INC LoRA layers which define merge and unmerge (like say in HQQ) at least as placeholders with NotImplementedError, correct?
Yes, that's the suggestion.
I am not sure when/how is merge/unmerge called with diffusers (sorry for my misunderstanding here) to even test what happens currently :)
It looks like it's used here: https://github.com/huggingface/diffusers/blob/026507c06cdabfb0c13ddeb1ac33f5d8e244361f/src/diffusers/loaders/peft.py#L698-L737
But even if it weren't: This is not exclusive to diffusion models, right? People could use INC for normal LLMs and then later may want to merge the weights, for instance for faster inference. So having the option would be good in the long run, but not necessary for a start. If you need help with testing, LMK.
@BenjaminBossan Thanks for quick reply and clarification. All sounds good. I will implement the placeholders and should be able to test them via merge/unmerge with an LLM case, rather than diffusers case.
Normally, we would just use accelerate, but since it's so recent, I think it's better to copy+paste with a comment about the source.
I assume you mean copy paste this function directly in the example script and not add to peft/import_utils, correct?
I will implement the placeholders and should be able to test them via merge/unmerge with an LLM case, rather than diffusers case.
Thanks, sounds good.
I assume you mean copy paste this function directly in the example script and not add to peft/import_utils, correct?
If it's only needed there, then yes.
@BenjaminBossan Thanks again for quick turnaround!
I updated PR with 2 more commits, but basically:
- added INC linear layer class with merge/unmerge placeholders
- added new example in examples/stable_diffusion
Additionally, I tested both of these. The new example works without issues on my side. For the placeholders, I created a small LLM test case, and both placeholders work as expected:
merge() test:
...
2025-04-24 14:13:32 [INFO][quantize.py:226] Start to convert model with fp8_quant.
2025-04-24 14:13:34 [INFO][test_llm.py:69] Conversion end.
Traceback (most recent call last):
File "/root/./test.py", line 80, in <module>
model = model.merge_and_unload() # Attempt to merge LoRA weights
File "/root/peft/src/peft/tuners/lora/model.py", line 903, in merge_and_unload
return self._unload_and_optionally_merge(
File "/root/peft/src/peft/tuners/lora/model.py", line 533, in _unload_and_optionally_merge
target.merge(safe_merge=safe_merge, adapter_names=adapter_names)
File "/root/peft/src/peft/tuners/lora/inc.py", line 46, in merge
raise NotImplementedError("Merging LoRA with INC layers is not yet implemented")
NotImplementedError: Merging LoRA with INC layers is not yet implemented
unmerge() test:
...
2025-04-24 14:12:30 [INFO][quantize.py:226] Start to convert model with fp8_quant.
2025-04-24 14:12:32 [INFO][test_llm.py:69] Conversion end.
Traceback (most recent call last):
File "/root/./test.py", line 81, in <module>
model = model.unmerge_adapter() # Attempt to unmerge LoRA weights
File "/root/peft/src/peft/tuners/tuners_utils.py", line 611, in unmerge_adapter
module.unmerge()
File "/root/peft/src/peft/tuners/lora/inc.py", line 52, in unmerge
raise NotImplementedError("Unmerging LoRA from INC layers is not yet implemented")
NotImplementedError: Unmerging LoRA from INC layers is not yet implemented
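The placeholder behaviour shown in the two tracebacks can be sketched as follows. This is a self-contained illustration only: `LoraLinearBase` and `IncLoraLinearSketch` are hypothetical toy classes, not the code in `peft/tuners/lora/inc.py`, and only the two error messages are taken from the output above.

```python
class LoraLinearBase:
    """Toy stand-in for a regular LoRA layer that supports merging."""

    def __init__(self):
        self.merged = False

    def merge(self, safe_merge=False, adapter_names=None):
        # A real layer would fold the LoRA delta into the base weight here.
        self.merged = True

    def unmerge(self):
        # A real layer would subtract the LoRA delta again here.
        self.merged = False


class IncLoraLinearSketch(LoraLinearBase):
    """Toy INC variant: merging is rejected explicitly instead of failing silently."""

    def merge(self, safe_merge=False, adapter_names=None):
        raise NotImplementedError("Merging LoRA with INC layers is not yet implemented")

    def unmerge(self):
        raise NotImplementedError("Unmerging LoRA from INC layers is not yet implemented")
```

Overriding both methods on a dedicated subclass keeps the failure explicit for users and leaves a clear place to add real merge logic later.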
Let me know if any more changes are needed to this PR.
I assume that to use INC, I would need the correct hardware, i.e. I wouldn't be able to test this locally or on the PEFT CI
Yes, you would need Intel Gaudi (HPU) for this
Would the plan be that Intel is running its own CI to check that it works? In that case, we should add a unit test, even if it's just a very functional one. For instance, we could copy the bnb test here and make adjustments for INC. Of course, the test should be skipped if the required packages and/or hardware are not available. Intel could then run this test on their CI and inform us if there is any regression in PEFT that breaks the integration.
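A skip-guarded test of the kind suggested here could look roughly like the sketch below. It is an assumption-laden illustration, not the test that was added: the class name, the helper `_has_module`, and the guard flag are hypothetical, the guard only checks that the `neural_compressor` and `habana_frameworks` packages are installed (not that a Gaudi device is actually present), and the test body is a smoke check on the imports used in the example script rather than a real LoRA test.

```python
import importlib.util
import unittest


def _has_module(name: str) -> bool:
    # True if the top-level package can be found without importing it.
    return importlib.util.find_spec(name) is not None


# Package-level guard only; it does not prove an HPU device is attached.
INC_STACK_PRESENT = _has_module("neural_compressor") and _has_module("habana_frameworks")


@unittest.skipUnless(INC_STACK_PRESENT, "requires neural-compressor and the Intel Gaudi software stack")
class TestIncLoraSmoke(unittest.TestCase):
    def test_inc_quantization_api_importable(self):
        # Same imports as in the FLUX example script from this thread.
        from neural_compressor.torch.quantization import FP8Config, convert, prepare

        self.assertTrue(callable(FP8Config))
        self.assertTrue(callable(prepare))
        self.assertTrue(callable(convert))
```

On a machine without the required packages the whole class is reported as skipped rather than failed, which is the behaviour requested for the PEFT CI.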
Yes we can easily add such a test in Optimum-Habana, which uses PEFT in a number of samples. I am contributing to OH actively so should be possible to add a CI test which will run this or similar example and if there is future failure it will be caught quickly and addressed. We already have a number of PEFT related tests in OH for llms or diffusers.
@BenjaminBossan I fixed the small issue you indicated, see my earlier comment for your other concern. Thanks again for quick and thorough review!
Also I did a final test and the updated example works on HPU as before, and on a system without HPU device now we get error as expected:
$ python inc_flux_lora_hpu.py
2025-04-25 20:43:06 [WARNING][auto_accelerator.py:418] Auto detect accelerator: CUDA_Accelerator.
Traceback (most recent call last):
File "/root/peft/examples/stable_diffusion/inc_flux_lora_hpu.py", line 35, in <module>
raise RuntimeError("HPU device not found. This code requires Intel Gaudi device to run.")
RuntimeError: HPU device not found. This code requires Intel Gaudi device to run.
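For readers who want to reproduce the hardware check discussed above, a minimal sketch follows. It is modelled loosely on the idea of `is_hpu_available()` in accelerate but is not a copy of it; the exact logic in accelerate or in the example script may differ. It assumes that importing `habana_frameworks.torch` registers `torch.hpu`; `require_hpu` is a hypothetical helper name, and the error message is the one from the traceback above.

```python
import importlib.util


def is_hpu_available() -> bool:
    """Best-effort check for an Intel Gaudi (HPU) device."""
    # Without the Habana bridge package there is no HPU support at all.
    if importlib.util.find_spec("habana_frameworks") is None:
        return False
    try:
        import torch
        import habana_frameworks.torch  # noqa: F401  (assumed to register torch.hpu)
    except ImportError:
        return False
    return hasattr(torch, "hpu") and torch.hpu.is_available()


def require_hpu() -> None:
    # Fail early with a clear message instead of a confusing error later on.
    if not is_hpu_available():
        raise RuntimeError("HPU device not found. This code requires Intel Gaudi device to run.")
```

Calling `require_hpu()` at the top of a script gives the early, explicit failure shown above on systems without a Gaudi device.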
@BenjaminBossan Thanks again for a thorough review. I incorporated your final suggestions into the PR:
- added note about external handling of CI tests to inc.py
- added section on INC quantization in the docs (with caveat subsection)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.