What does this PR do?

Adds an Intel Neural Compressor (INC) dispatcher to support loading LoRA weights into an INC-quantized model.

Intel Neural Compressor modifies layer names or introduces new layers during quantization. The names can be seen here (INC adds the prefix "Patched"). Hugging Face PEFT does not have a dispatcher that handles the new names.

Before this PR, we get the following error when trying to load LoRA weights into an INC-quantized model:

Traceback (most recent call last):
  File "/root/./test.py", line 124, in <module>
    pipeline.load_lora_weights("lora_model", adapter_name="user_lora")
  File "/usr/local/lib/python3.10/dist-packages/diffusers/loaders/lora_pipeline.py", line 1846, in load_lora_weights
    self.load_lora_into_transformer(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/loaders/lora_pipeline.py", line 1948, in load_lora_into_transformer
    inject_adapter_in_model(lora_config, transformer, adapter_name=adapter_name, **peft_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/mapping.py", line 76, in inject_adapter_in_model
    peft_model = tuner_cls(model, peft_config, adapter_name=adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py", line 142, in __init__
    super().__init__(model, config, adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 181, in __init__
    self.inject_adapter(self.model, adapter_name, low_cpu_mem_usage=low_cpu_mem_usage)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/tuners_utils.py", line 509, in inject_adapter
    self._create_and_replace(peft_config, adapter_name, target, target_name, parent, current_key=key)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py", line 237, in _create_and_replace
    new_module = self._create_new_module(lora_config, adapter_name, target, device_map=device_map, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/peft/tuners/lora/model.py", line 348, in _create_new_module
    raise ValueError(
ValueError: Target module PatchedLinear(
  original=Linear, in_features=3072, out_features=3072, bias=True, scale_input dtype=float, scale_weight dtype=float
  (quant_input): QuantInput(lp_dtype=torch.float8_e4m3fn, hp_dtype=torch.bfloat16, scale_inv dtype=float)
  (dequant_output): QuantDequantNone(lp_dtype=torch.float8_e4m3fn, hp_dtype=torch.bfloat16, doesn't quantize nor dequantize)
) is not supported. 
Currently, only the following modules are supported: `torch.nn.Linear`, `torch.nn.Embedding`, `torch.nn.Conv1d`, 
`torch.nn.Conv2d`, `torch.nn.Conv3d`, `transformers.pytorch_utils.Conv1D`, `torch.nn.MultiheadAttention.`.

After this PR, LoRA weights load into the INC-quantized model without any issues, and the model's performance is similar to that of the original quantized model ✅
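For readers unfamiliar with the dispatch step that fails in the traceback above, here is a minimal, dependency-free sketch of the idea. It is not the code in this PR: the classes Linear, PatchedLinear, LoraLinear and IncLoraLinear and the functions dispatch_inc, dispatch_default and create_new_module are hypothetical stand-ins. Only the "Patched" prefix and the "is not supported" wording come from this thread.

```python
# Toy model of a dispatcher chain. All names here are illustrative stand-ins,
# not the real PEFT or INC classes.


class Linear:
    """Stand-in for torch.nn.Linear."""


class PatchedLinear:
    """Stand-in for the layer INC puts in place of Linear during quantization."""


class LoraLinear:
    """Stand-in for a regular LoRA wrapper around a base layer."""

    def __init__(self, base_layer):
        self.base_layer = base_layer


class IncLoraLinear(LoraLinear):
    """Stand-in for a LoRA wrapper dedicated to INC-quantized layers."""


def dispatch_inc(target):
    # Match on the class name, since INC marks its layers with a "Patched" prefix.
    if type(target).__name__ == "PatchedLinear":
        return IncLoraLinear(target)
    return None


def dispatch_default(target):
    # The default dispatcher only knows about the plain layer type.
    if isinstance(target, Linear):
        return LoraLinear(target)
    return None


def create_new_module(target, dispatchers):
    # Try each dispatcher in order; the first one that returns a module wins.
    for dispatcher in dispatchers:
        new_module = dispatcher(target)
        if new_module is not None:
            return new_module
    raise ValueError(f"Target module {type(target).__name__} is not supported.")
```

With only dispatch_default in the chain, a PatchedLinear target falls through to the ValueError, which mirrors the failure above. Adding dispatch_inc ahead of it gives the quantized layer its own wrapper.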

Apr 15 '25 23:04 dsocek

Thanks for the PR to support INC. Could you please provide a short example to test this out? I assume some features like merging of LoRA weights would not work yet?

Apr 16 '25 09:04 BenjaminBossan

@BenjaminBossan Thanks for the review. A hello-world style example can be found in the Intel Neural Compressor (INC) repo, but it does not include a LoRA loading example.

To actually test this PR with PEFT, here's an example of FLUX model inference on HPU via the optimum-habana pipeline (this demonstrates loading a PEFT-tuned model with FP8 quantization):

#!/usr/bin/env python
import os
import torch

# Example: FLUX model inference on HPU via optimum-habana pipeline
from optimum.habana.diffusers import GaudiFluxPipeline
hpu_configs = {
    "use_habana": True,
    "use_hpu_graphs": True,
    "sdp_on_bf16": True,
    "gaudi_config": "Habana/stable-diffusion",
}
pipe = GaudiFluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16, **hpu_configs)
prompt = "A picture of sks dog in a bucket"

# Quantize FLUX transformer to FP8 using INC
from neural_compressor.torch.quantization import FP8Config, convert, prepare, finalize_calibration
quant_configs = {
    "mode": "AUTO",
    "observer": "maxabs",
    "scale_method": "maxabs_hw",
    "allowlist": {"types": [], "names":  []},
    "blocklist": {"types": [], "names":  []},
    "dump_stats_path": "/tmp/hqt_output/measure"
}
config = FP8Config(**quant_configs)
pipe.transformer = prepare(pipe.transformer, config)
pipe(prompt)
finalize_calibration(pipe.transformer)
pipe.transformer = convert(pipe.transformer)

# Load LoRA weights with PEFT
pipe.load_lora_weights("dsocek/lora-flux-dog", adapter_name="user_lora") # <--- FAILS WITHOUT INC DISPATCHER

# Run inference
image = pipe(prompt).images[0]
image.save("dog.png")

Requirements:

pip install optimum-habana sentencepiece neural-compressor[pt] peft

Apr 19 '25 16:04 dsocek

I assume some features like merging of LoRA weights would not work yet?

We could address merging and other additional features in a different PR, if that's ok

Apr 19 '25 16:04 dsocek

Thanks for providing the example. I have a small comment. Could you please also run make style?

I think it would be great to add it to peft/examples/stable_diffusion/. Let's also add a check at the top of the script that the required hardware is present, plus a short description of what the script does.

Sounds good, I will fix style with make style in next commit. Are you thinking to add a standalone HPU INC quantized Flux LoRA inference example py file, or somehow integrate into existing example script? A standalone can be done quickly. Do you also want to have example with and without quantization (somehow control via args or so)?

A check is_hpu_available() was recently added to accelerate (see here), should we consider adding similar check in peft/import_utils and just call this from the example?

We could address merging and other additional features in a different PR, if that's ok

Generally, that would be fine, we have other quantization methods without merging support. However, right now, the implementation dispatches to the vanilla LoRA layers, so if a user calls merge, the normal merge method is called, which would result in an error (I assume, I haven't tested) or may even fail silently. This is not very user friendly. IMO it would be better to dispatch to special INC layers, similar to what we do with other quantization methods. The layers would either not implement merge and unmerge or raise a NotImplementedError, so that users know what's going on. Later, it will be easier to add those methods to the INC layers. WDYT?
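To make the suggestion concrete, here is a small sketch of the two behaviours being contrasted. LoraLayerSketch and IncLoraLayerSketch are hypothetical names, not PEFT classes, and the "vanilla" merge here only flips a flag. The error messages are the ones that appear in the tracebacks later in this thread.

```python
class LoraLayerSketch:
    """Hypothetical stand-in for a vanilla LoRA layer."""

    def __init__(self, base_layer):
        self.base_layer = base_layer
        self.merged = False

    def merge(self, *args, **kwargs):
        # A real layer would fold the LoRA delta into the base weight here.
        # On a quantized base layer that could go wrong without any warning.
        self.merged = True

    def unmerge(self, *args, **kwargs):
        self.merged = False


class IncLoraLayerSketch(LoraLayerSketch):
    """Hypothetical INC-specific layer: merging fails loudly instead of silently."""

    def merge(self, *args, **kwargs):
        raise NotImplementedError("Merging LoRA with INC layers is not yet implemented")

    def unmerge(self, *args, **kwargs):
        raise NotImplementedError("Unmerging LoRA from INC layers is not yet implemented")
```

The point of the dedicated subclass is that a caller gets an explicit NotImplementedError, and real merge logic can later replace the two placeholder bodies without touching the dispatcher.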

I see, so you would like to add specific INC LoRA layers which define merge and unmerge (like say in HQQ) at least as placeholders with NotImplementedError, correct?

I am not sure when/how is merge/unmerge called with diffusers (sorry for my misunderstanding here) to even test what happens currently :) I know we can call something like pipe.set_adapters(["user_lora", "user_lora2"], adapter_weights=[0.7, 0.8]) to "merge" 2 adapters, but I think this may not be the same as merge/unmerge at the layer level. I also know one can use pipe.fuse_lora(), which will fuse the LoRA with the actual model (but we can't unfuse then). Will this trigger merge?

Apr 22 '25 22:04 dsocek

Are you thinking to add a standalone HPU INC quantized Flux LoRA inference example py file, or somehow integrate into existing example script? A standalone can be done quickly. Do you also want to have example with and without quantization (somehow control via args or so)?

Standalone is fine.

A check is_hpu_available() was recently added to accelerate (see here), should we consider adding similar check in peft/import_utils and just call this from the example?

Normally, we would just use accelerate, but since it's so recent, I think it's better to copy+paste with a comment about the source.
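A rough sketch of what such a copied check could look like at the top of the example script. This is a simplified approximation, not a verbatim copy of the accelerate function, and it assumes that importing habana_frameworks.torch is what makes torch.hpu available. The helper require_hpu is a hypothetical name; its error message matches the one shown later in this thread.

```python
import importlib.util


def is_hpu_available() -> bool:
    """Best-effort HPU check (simplified sketch, modelled on accelerate's helper)."""
    # Without the Habana PyTorch bridge or torch itself there is nothing to probe.
    if importlib.util.find_spec("habana_frameworks") is None:
        return False
    if importlib.util.find_spec("torch") is None:
        return False
    try:
        import habana_frameworks.torch  # noqa: F401  (assumed to register torch.hpu)
        import torch
    except ImportError:
        return False
    return hasattr(torch, "hpu") and torch.hpu.is_available()


def require_hpu() -> None:
    # Fail early with a clear message instead of deep inside the pipeline.
    if not is_hpu_available():
        raise RuntimeError("HPU device not found. This code requires Intel Gaudi device to run.")
```

Calling require_hpu() before any model loading keeps the failure cheap on machines without a Gaudi device.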

I see, so you would like to add specific INC LoRA layers which define merge and unmerge (like say in HQQ) at least as placeholders with NotImplementedError, correct?

Yes, that's the suggestion.

I am not sure when/how is merge/unmerge called with diffusers (sorry for my misunderstanding here) to even test what happens currently :)

It looks like it's used here: https://github.com/huggingface/diffusers/blob/026507c06cdabfb0c13ddeb1ac33f5d8e244361f/src/diffusers/loaders/peft.py#L698-L737

But even if it weren't: This is not exclusive to diffusion models, right? People could use INC for normal LLMs and then later may want to merge the weights, for instance for faster inference. So having the option would be good in the long run, but not necessary for a start. If you need help with testing, LMK.
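For context on what layer-level merging means, here is a toy, pure-Python illustration of the LoRA identity W x + s * B (A x) = (W + s * B A) x. The matrices, the scale s, and the helper names are made up for the example. It uses exact integers and says nothing about how merging would interact with FP8-quantized weights.

```python
def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    # Multiply two matrices given as lists of rows.
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


W = [[1, 2], [3, 4]]  # base weight, 2x2
A = [[1, -1]]         # LoRA down-projection, rank 1 (1x2)
B = [[2], [0]]        # LoRA up-projection (2x1)
scale = 3             # LoRA scaling factor
x = [5, 7]            # input vector

# Unmerged path: the base layer and the adapter run separately and are summed.
base_out = matvec(W, x)
lora_out = matvec(B, matvec(A, x))
unmerged = [b + scale * l for b, l in zip(base_out, lora_out)]

# Merged path: the adapter is folded into the weight once, then one matmul runs.
delta = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
merged = matvec(W_merged, x)

assert merged == unmerged
```

The merged path does a single matrix-vector product per call, which is where the inference speedup comes from.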

Apr 23 '25 10:04 BenjaminBossan

@BenjaminBossan Thanks for quick reply and clarification. All sounds good. I will implement the placeholders and should be able to test them via merge/unmerge with an LLM case, rather than diffusers case.

Normally, we would just use accelerate, but since it's so recent, I think it's better to copy+paste with a comment about the source.

I assume you mean copy paste this function directly in the example script and not add to peft/import_utils, correct?

Apr 24 '25 00:04 dsocek

I will implement the placeholders and should be able to test them via merge/unmerge with an LLM case, rather than diffusers case.

Thanks, sounds good.

I assume you mean copy paste this function directly in the example script and not add to peft/import_utils, correct?

If it's only needed there, then yes.

Apr 24 '25 09:04 BenjaminBossan

@BenjaminBossan Thanks again for quick turnaround!

I updated PR with 2 more commits, but basically:

added INC linear layer class with merge/unmerge placeholders
added new example in examples/stable_diffusion

Additionally, I tested both of these. The new example works without issues on my side. For the placeholders, I created a small LLM test case, and both placeholders work as expected:

merge() test:

...
2025-04-24 14:13:32 [INFO][quantize.py:226] Start to convert model with fp8_quant.
2025-04-24 14:13:34 [INFO][test_llm.py:69] Conversion end.
Traceback (most recent call last):
  File "/root/./test.py", line 80, in <module>
    model = model.merge_and_unload()  # Attempt to merge LoRA weights
  File "/root/peft/src/peft/tuners/lora/model.py", line 903, in merge_and_unload
    return self._unload_and_optionally_merge(
  File "/root/peft/src/peft/tuners/lora/model.py", line 533, in _unload_and_optionally_merge
    target.merge(safe_merge=safe_merge, adapter_names=adapter_names)
  File "/root/peft/src/peft/tuners/lora/inc.py", line 46, in merge
    raise NotImplementedError("Merging LoRA with INC layers is not yet implemented")
NotImplementedError: Merging LoRA with INC layers is not yet implemented

unmerge() test:

...
2025-04-24 14:12:30 [INFO][quantize.py:226] Start to convert model with fp8_quant.
2025-04-24 14:12:32 [INFO][test_llm.py:69] Conversion end.
Traceback (most recent call last):
  File "/root/./test.py", line 81, in <module>
    model = model.unmerge_adapter()  # Attempt to unmerge LoRA weights
  File "/root/peft/src/peft/tuners/tuners_utils.py", line 611, in unmerge_adapter
    module.unmerge()
  File "/root/peft/src/peft/tuners/lora/inc.py", line 52, in unmerge
    raise NotImplementedError("Unmerging LoRA from INC layers is not yet implemented")
NotImplementedError: Unmerging LoRA from INC layers is not yet implemented

Let me know if any more changes are needed to this PR.

Apr 24 '25 14:04 dsocek

I assume that to use INC, I would need the correct hardware, i.e. I wouldn't be able to test this locally or on the PEFT CI

Yes, you would need Intel Gaudi (HPU) for this

Would the plan be that Intel is running its own CI to check that it works? In that case, we should add a unit test, even if it's just a very functional one. For instance, we could copy the bnb test here and make adjustments for INC. Of course, the test should be skipped if the required packages and/or hardware are not available. Intel could then run this test on their CI and inform us if there is any regression in PEFT that breaks the integration.
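As a rough idea of how such a test could skip itself, here is a sketch using the standard library's unittest. The gate INC_STACK_PRESENT, the test class, and run_suite are hypothetical, the package names checked are assumptions about what the INC/Gaudi stack installs, and the test body is only a placeholder.

```python
import importlib.util
import unittest


def _has_package(name: str) -> bool:
    return importlib.util.find_spec(name) is not None


# Hypothetical gate: both INC and the Habana bridge must be importable.
INC_STACK_PRESENT = _has_package("neural_compressor") and _has_package("habana_frameworks")


@unittest.skipUnless(INC_STACK_PRESENT, "requires neural_compressor and an Intel Gaudi (HPU) software stack")
class TestIncLoraSmoke(unittest.TestCase):
    def test_stack_is_importable(self):
        # Placeholder body: a real test would quantize a small model with INC,
        # inject a LoRA adapter, and run a forward pass.
        self.assertTrue(INC_STACK_PRESENT)


def run_suite() -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestIncLoraSmoke)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

On a machine without the stack the test is reported as skipped rather than failed, so the same file can sit in a CI that has no Gaudi hardware.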

Yes, we can easily add such a test in Optimum-Habana (OH), which uses PEFT in a number of samples. I am actively contributing to OH, so it should be possible to add a CI test that runs this or a similar example. Any future failure would then be caught and addressed quickly. We already have a number of PEFT-related tests in OH for LLMs and diffusers.

Apr 25 '25 20:04 dsocek

@BenjaminBossan I fixed the small issue you indicated, see my earlier comment for your other concern. Thanks again for quick and thorough review!

Also, I did a final test: the updated example works on HPU as before, and on a system without an HPU device we now get an error, as expected:

$ python inc_flux_lora_hpu.py
2025-04-25 20:43:06 [WARNING][auto_accelerator.py:418] Auto detect accelerator: CUDA_Accelerator.
Traceback (most recent call last):
  File "/root/peft/examples/stable_diffusion/inc_flux_lora_hpu.py", line 35, in <module>
    raise RuntimeError("HPU device not found. This code requires Intel Gaudi device to run.")
RuntimeError: HPU device not found. This code requires Intel Gaudi device to run.

Apr 25 '25 20:04 dsocek

@BenjaminBossan Thanks again for a thorough review. I incorporated your final suggestions into the PR:

added note about external handling of CI tests to inc.py
added section on INC quantization in the docs (with caveat subsection)

Apr 28 '25 14:04 dsocek

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Apr 28 '25 15:04 HuggingFaceDocBuilderDev

Add INC dispatcher
