[LoRA] Quanto Flux LoRA can't load

Open Mino1289 opened this issue 11 months ago • 35 comments

Describe the bug

Cannot load LoRAs into quanto-quantized Flux.

import torch 
from diffusers import FluxTransformer2DModel, FluxPipeline
from huggingface_hub import hf_hub_download
from optimum.quanto import qfloat8, quantize, freeze
from transformers import T5EncoderModel

bfl_repo = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

transformer = FluxTransformer2DModel.from_single_file("https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors", torch_dtype=dtype)
quantize(transformer, weights=qfloat8)
freeze(transformer)

text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype)
pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2

pipe.load_lora_weights(
    hf_hub_download("ByteDance/Hyper-SD", "Hyper-FLUX.1-dev-8steps-lora.safetensors"), adapter_name="hyper-sd"
)

Logs

ERROR:
Traceback (most recent call last):
  File "/home/user/genAI/test.py", line 56, in <module>
    pipe.load_lora_weights(
  File "/home/user/miniconda3/lib/python3.12/site-packages/diffusers/loaders/lora_pipeline.py", line 1867, in load_lora_weights
    transformer_lora_state_dict = self._maybe_expand_lora_state_dict(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/miniconda3/lib/python3.12/site-packages/diffusers/loaders/lora_pipeline.py", line 2490, in _maybe_expand_lora_state_dict
    base_weight_param = transformer_state_dict[base_param_name]
                        ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
KeyError: 'single_transformer_blocks.0.attn.to_k.weight'

System Info

Python 3.12, diffusers 0.32.0 (I also tested 0.32.1 and an install from git)

Who can help?

@sayakpaul

Mino1289 avatar Jan 09 '25 17:01 Mino1289

Can you try with diffusers installation from main?

pip uninstall diffusers -y
pip install git+https://github.com/huggingface/diffusers

sayakpaul avatar Jan 10 '25 01:01 sayakpaul

Can you try with diffusers installation from main?

pip uninstall diffusers -y
pip install git+https://github.com/huggingface/diffusers

The problem still exists after this operation.

lhjlhj11 avatar Jan 10 '25 02:01 lhjlhj11

Do you have a minimal reproducible snippet? The provided one isn't minimal and self-contained. I keep asking for that because we have an integration test for Kohya LoRAs here:

https://github.com/huggingface/diffusers/blob/83ba01a38d94466ab16ab99c0d2bd74e463561de/tests/lora/test_lora_layers_flux.py#L847

It was run yesterday, too, and it worked fine.

sayakpaul avatar Jan 10 '25 02:01 sayakpaul

Do you have a minimal reproducible snippet? The provided one isn't minimal and self-contained. I keep asking for that because we have an integration test for Kohya LoRAs here:

https://github.com/huggingface/diffusers/blob/83ba01a38d94466ab16ab99c0d2bd74e463561de/tests/lora/test_lora_layers_flux.py#L847

It was run yesterday, too, and it worked fine.

This issue only occurs when loading LoRA after quantizing the FLUX transformer using optimum.quanto. If the model is not quantized, LoRA can be loaded normally. In version 0.31 of diffusers, LoRA could be loaded successfully even after quantization.

tyyff avatar Jan 10 '25 02:01 tyyff

Do you have a minimal reproducible snippet? The provided one isn't minimal and self-contained. I keep asking for that because we have an integration test for Kohya LoRAs here:

https://github.com/huggingface/diffusers/blob/83ba01a38d94466ab16ab99c0d2bd74e463561de/tests/lora/test_lora_layers_flux.py#L847

It was run yesterday, too, and it worked fine.

transformer = FluxTransformer2DModel.from_single_file("https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors", torch_dtype=dtype)
quantize(transformer, weights=qfloat8)
freeze(transformer)
text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)
pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype)
pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2
# this is a 8steps lora
self.pipe.load_lora_weights(load_file(os.path.join(self.model_root, self.config["8steps_lora"]), device=self.device), adapter_name="8steps")
self.pipe.set_adapters(["8steps"], adapter_weights=[0.125])

lhjlhj11 avatar Jan 10 '25 02:01 lhjlhj11

@tyyff if you could help me with a minimally reproducible snippet that would be great, ideally with a supported quantization backend like bitsandbytes.

sayakpaul avatar Jan 10 '25 07:01 sayakpaul

I used the script and quantization method from here: https://gist.github.com/sayakpaul/b664605caf0aa3bf8585ab109dd5ac9c (the script is by AmericanPresidentJimmyCarter).

Mino1289 avatar Jan 10 '25 14:01 Mino1289

@tyyff if you could help me with a minimally reproducible snippet that would be great, ideally with a supported quantization backend like bitsandbytes.

Can you solve the problem with the flux-fp8 version? Thanks!!!

lhjlhj11 avatar Jan 11 '25 00:01 lhjlhj11

@tyyff if you could help me with a minimally reproducible snippet that would be great, ideally with a supported quantization backend like bitsandbytes.

Or can a diffusers version below 0.32.0 support Flux Redux?

lhjlhj11 avatar Jan 11 '25 00:01 lhjlhj11

@tyyff if you could help me with a minimally reproducible snippet that would be great, ideally with a supported quantization backend like bitsandbytes.

Just a combination of two examples from the article on using Flux

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxPriorReduxPipeline, FluxControlPipeline, FluxTransformer2DModel, FluxPipeline
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
from diffusers.utils import load_image
from image_gen_aux import DepthPreprocessor
from huggingface_hub import hf_hub_download

text_encoder_8bit = T5EncoderModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="text_encoder_2",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)

transformer_8bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
)

control_pipe = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

control_pipe.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora", adapter_name="depth")
control_pipe.load_lora_weights(
    hf_hub_download("ByteDance/Hyper-SD", "Hyper-FLUX.1-dev-8steps-lora.safetensors"), adapter_name="hyper-sd"
)
control_pipe.set_adapters(["depth", "hyper-sd"], adapter_weights=[0.85, 0.125])
control_pipe.enable_model_cpu_offload()

prompt = "A robot made of exotic candies and chocolates of different kinds. The background is filled with confetti and celebratory gifts."
control_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/robot.png")

processor = DepthPreprocessor.from_pretrained("LiheYoung/depth-anything-large-hf")
control_image = processor(control_image)[0].convert("RGB")

image = control_pipe(
    prompt=prompt,
    control_image=control_image,
    height=1024,
    width=1024,
    num_inference_steps=8,
    guidance_scale=10.0,
    generator=torch.Generator().manual_seed(42),
).images[0]

image.save("out.jpg")

Error:

    control_pipe.load_lora_weights("black-forest-labs/FLUX.1-Depth-dev-lora", adapter_name="depth")
  File "/usr/local/lib/python3.10/dist-packages/diffusers/loaders/lora_pipeline.py", line 1856, in load_lora_weights
    has_param_with_expanded_shape = self._maybe_expand_transformer_param_shape_or_error_(
  File "/usr/local/lib/python3.10/dist-packages/diffusers/loaders/lora_pipeline.py", line 2359, in _maybe_expand_transformer_param_shape_or_error_
    expanded_module = torch.nn.Linear(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/linear.py", line 99, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parameter.py", line 40, in __new__
    return torch.Tensor._make_subclass(cls, data, requires_grad)
RuntimeError: Only Tensors of floating point and complex dtype can require gradients

Yakonrus avatar Jan 12 '25 18:01 Yakonrus

@tyyff if you could help me with a minimally reproducible snippet that would be great, ideally with a supported quantization backend like bitsandbytes.

import torch
from diffusers import FluxTransformer2DModel, FluxPipeline
from transformers import T5EncoderModel, CLIPTextModel
import os
from optimum.quanto import freeze, qfloat8, quantize
import random

bfl_repo = "black-forest-labs/FLUX.1-schnell"
dtype = torch.bfloat16

transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer", torch_dtype=dtype)
quantize(transformer, weights=qfloat8)
freeze(transformer)

text_encoder_2 = T5EncoderModel.from_pretrained(bfl_repo, subfolder="text_encoder_2", torch_dtype=dtype)
quantize(text_encoder_2, weights=qfloat8)
freeze(text_encoder_2)

pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=None, text_encoder_2=None, torch_dtype=dtype)
pipe.transformer = transformer
pipe.text_encoder_2 = text_encoder_2
pipe.to(torch.device("cuda"))
pipe.load_lora_weights("Shakker-Labs/FLUX.1-dev-LoRA-Logo-Design", weight_name="FLUX-dev-lora-Logo-Design.safetensors")
seed = random.randint(1, 1 << 32)
image = pipe(
    prompt="logo,Minimalist,A bunch of grapes and a wine glass",
    guidance_scale=1.,
    output_type="pil",
    num_inference_steps=8,
    generator=torch.Generator("cpu").manual_seed(seed)
).images[0]

image.save("test.png")
  File "/nanjgrowth-train-public/root/nanjgrowth-public-1/tangweiye/training/flux_finetuning/test_utils/test_lora_snippet.py", l
ine 23, in <module>                                                                                                             
    pipe.load_lora_weights("Shakker-Labs/FLUX.1-dev-LoRA-Logo-Design", weight_name="FLUX-dev-lora-Logo-Design.safetensors")     
  File "/root/micromamba/envs/twy_diffusers/lib/python3.10/site-packages/diffusers/loaders/lora_pipeline.py", line 1866, in load
_lora_weights                                                                                                                   
    transformer_lora_state_dict = self._maybe_expand_lora_state_dict(                                                           
  File "/root/micromamba/envs/twy_diffusers/lib/python3.10/site-packages/diffusers/loaders/lora_pipeline.py", line 2415, in _may
be_expand_lora_state_dict                                                                                                       
    base_weight_param = transformer_state_dict[base_param_name]                                                                 
KeyError: 'single_transformer_blocks.0.attn.to_k.weight' 

This is my pip requirements.txt:

absl-py==2.1.0
accelerate==1.2.1
annotated-types==0.7.0
bitsandbytes==0.45.0
certifi==2024.12.14
charset-normalizer==3.4.1
deepspeed==0.15.4
diffusers==0.32.1
einops==0.8.0
filelock==3.13.1
fsspec==2024.2.0
grpcio==1.68.1
hjson==3.1.0
huggingface-hub==0.27.0
idna==3.10
importlib_metadata==8.5.0
Jinja2==3.1.3
Markdown==3.7
MarkupSafe==2.1.5
mpmath==1.3.0
msgpack==1.1.0
networkx==3.2.1
ninja==1.11.1.3
numpy==1.26.3
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.560.30
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.1.105
nvidia-nvtx-cu12==12.1.105
optimum-quanto==0.2.6
packaging==24.2
peft==0.14.0
pillow==10.2.0
protobuf==5.29.2
psutil==6.1.1
py-cpuinfo==9.0.0
pydantic==2.10.4
pydantic_core==2.27.2
PyYAML==6.0.2
regex==2024.11.6
requests==2.32.3
safetensors==0.4.5
sentencepiece==0.2.0
six==1.17.0
sympy==1.13.1
tensorboard==2.18.0
tensorboard-data-server==0.7.2
tokenizers==0.21.0
torch==2.4.1+cu121
torchaudio==2.4.1+cu121
torchvision==0.19.1+cu121
tqdm==4.67.1
transformers==4.47.1
triton==3.0.0
typing_extensions==4.12.2
urllib3==2.3.0
Werkzeug==3.1.3
zipp==3.21.0

tyyff avatar Jan 13 '25 06:01 tyyff

Tracking here: https://github.com/huggingface/diffusers/issues/10550.

sayakpaul avatar Jan 13 '25 06:01 sayakpaul

I tested with v0.31.0-release and it fails with:

Error
Traceback (most recent call last):
  File "/home/sayak/diffusers/check_fp8.py", line 22, in <module>
    pipe.load_lora_weights(
  File "/home/sayak/diffusers/src/diffusers/loaders/lora_pipeline.py", line 1846, in load_lora_weights
    self.load_lora_into_transformer(
  File "/home/sayak/diffusers/src/diffusers/loaders/lora_pipeline.py", line 1949, in load_lora_into_transformer
    incompatible_keys = set_peft_model_state_dict(transformer, state_dict, adapter_name, **peft_kwargs)
  File "/home/sayak/peft/src/peft/utils/save_and_load.py", line 445, in set_peft_model_state_dict
    load_result = model.load_state_dict(peft_model_state_dict, strict=False, assign=True)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2564, in load_state_dict
    load(self, state_dict)
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2552, in load
    load(child, child_state_dict, child_prefix)  # noqa: F821
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2552, in load
    load(child, child_state_dict, child_prefix)  # noqa: F821
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2552, in load
    load(child, child_state_dict, child_prefix)  # noqa: F821
  [Previous line repeated 1 more time]
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2535, in load
    module._load_from_state_dict(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/optimum/quanto/nn/qmodule.py", line 160, in _load_from_state_dict
    deserialized_weight = WeightQBytesTensor.load_from_state_dict(
  File "/home/sayak/.pyenv/versions/3.10.12/envs/diffusers/lib/python3.10/site-packages/optimum/quanto/tensor/weights/qbytes.py", line 77, in load_from_state_dict
    inner_tensors_dict[name] = state_dict.pop(prefix + name)
KeyError: 'time_text_embed.timestep_embedder.linear_1.base_layer.weight._data'

Tracking it here: https://github.com/huggingface/diffusers/issues/10550#issuecomment-2588917365

sayakpaul avatar Jan 14 '25 09:01 sayakpaul

As I read the issue and PR you linked, the issue I'm facing is most likely due to quanto not being supported by peft. Using BitsAndBytesConfig should bypass the problem, right? I'll try later.

Mino1289 avatar Jan 14 '25 18:01 Mino1289

Yes, you're right. 4-bit support is being added in https://github.com/huggingface/diffusers/pull/10578/.

However, I just edited your issue title a bit to reflect that Quanto support needs to be added. Hope that is okay with you.

sayakpaul avatar Jan 15 '25 01:01 sayakpaul

And in 8-bit? My issue is about qfloat8, which is 8-bit.

The name change is fine with me. Thanks for the quick support!

Mino1289 avatar Jan 15 '25 09:01 Mino1289

Both 4-bit and 8-bit bitsandbytes models should be able to load LoRAs.

For 8-bit, make sure you install peft from source. If you face problems, please open a new issue.

sayakpaul avatar Jan 15 '25 09:01 sayakpaul
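
For reference, the bitsandbytes route described above might look roughly like the sketch below. This is a sketch rather than a confirmed working script: it assumes diffusers and peft installed from source, picks 4-bit NF4 as one possible quantization config, and reuses the Hyper-SD LoRA from the original report.

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel
from huggingface_hub import hf_hub_download

# Quantize only the transformer with bitsandbytes (4-bit NF4 here; load_in_8bit=True is the 8-bit variant).
transformer_4bit = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=DiffusersBitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16
    ),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer_4bit,
    torch_dtype=torch.bfloat16,
)

# With recent diffusers + peft this should no longer raise the KeyError seen above.
# Device placement / offloading and the actual generation call are omitted here.
pipe.load_lora_weights(
    hf_hub_download("ByteDance/Hyper-SD", "Hyper-FLUX.1-dev-8steps-lora.safetensors"),
    adapter_name="hyper-sd",
)
pipe.set_adapters(["hyper-sd"], adapter_weights=[0.125])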

And diffusers 0.32.1 or from source?

Mino1289 avatar Jan 15 '25 13:01 Mino1289

Source.

sayakpaul avatar Jan 15 '25 13:01 sayakpaul
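
In other words, install both libraries from their GitHub repositories (standard source-install commands, shown here for convenience):

pip uninstall diffusers peft -y
pip install git+https://github.com/huggingface/diffusers
pip install git+https://github.com/huggingface/peft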

Hi, we are getting this error only with diffusers > 0.31; with diffusers == 0.31 we can load the LoRA and perform inference. This setup works:

diffusers 0.31.0
transformers 4.48.1
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1

The same setup with diffusers > 0.31 doesn't work and fails with KeyError: 'single_transformer_blocks.0.attn.to_k.weight'.

@sayakpaul

Amitg1 avatar Jan 23 '25 15:01 Amitg1

Can you provide a reproducible snippet?

sayakpaul avatar Jan 23 '25 16:01 sayakpaul

@sayakpaul

I think he posted here https://github.com/huggingface/diffusers/issues/10512#issuecomment-2586234886

nitinmukesh avatar Jan 23 '25 17:01 nitinmukesh

Not the same person :) Our example is pretty much the same, except we are not executing quantize and freeze, and we are using the following class to load the quantized transformer:

    class QuantizedFluxTransformer2DModel(QuantizedDiffusersModel):
        base_class = FluxTransformer2DModel

and the following to load the quantized encoder:

    class QuantizedModelForTextEncoding(QuantizedTransformersModel):
        auto_class = AutoModelForTextEncoding

The repo we are using is Disty0/FLUX.1-dev-qint8.

diffusers 0.31.0
transformers 4.48.1
torch 2.5.1
torchaudio 2.5.1
torchvision 0.20.1

Amitg1 avatar Jan 23 '25 18:01 Amitg1

https://github.com/huggingface/diffusers/issues/10512#issuecomment-2589386647

sayakpaul avatar Jan 24 '25 06:01 sayakpaul

@sayakpaul The issue above still persists in the latest diffusers version. I think the problem comes from trying to mix different dtypes: I have a qint8-quantized transformer and an unquantized fp16 LoRA safetensors file. The two don't mix.

Error: 'single_transformer_blocks.0.attn.to_k.weight'

adapter_id1 = "C:/Users/xxxxx/.cache/huggingface/hub/lora/FLUX/aidmaHyperrealism-FLUX-v0.3.safetensors"
pipe.load_lora_weights(adapter_id1)  # <<< produces error

FWIW, a workaround is to load the LoRA into your transformer BEFORE quantizing it. Saw this on Reddit:

The second issue is to load the unquantized lora. The trick is to load it before quantizing the transformer, like this:

transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer", torch_dtype=dtype)

pipe: FluxPipeline = FluxPipeline(scheduler=None, text_encoder=None, tokenizer=None, text_encoder_2=None, tokenizer_2=None, vae=None, transformer=transformer)

print("Loading lora")
pipe.load_lora_weights(lora_weights)
print("Fusing lora")
pipe.fuse_lora()
pipe.unload_lora_weights()
transformer.to(device, dtype=dtype)

print("Quantizing transformer")
quantize(transformer, weights=qtype)
freeze(transformer)
transformer.to(device, dtype=dtype)
flush()

ukaprch avatar Feb 01 '25 16:02 ukaprch
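
For anyone who wants to try that route end to end, here is a more self-contained sketch of the same fuse-before-quantize idea. The LoRA path and lora_scale are placeholders, and qfloat8 is just one possible qtype; this is not a verified script.

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel
from optimum.quanto import freeze, qfloat8, quantize

bfl_repo = "black-forest-labs/FLUX.1-dev"
dtype = torch.bfloat16

# Load the unquantized transformer and wrap it in a dummy pipeline,
# only to reuse the LoRA loading/fusing machinery.
transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer", torch_dtype=dtype)
lora_pipe = FluxPipeline(
    scheduler=None, text_encoder=None, tokenizer=None,
    text_encoder_2=None, tokenizer_2=None, vae=None, transformer=transformer,
)

# Fuse the LoRA into the bf16 weights, then drop the adapter.
lora_pipe.load_lora_weights("path/to/your_lora.safetensors")  # placeholder path
lora_pipe.fuse_lora(lora_scale=0.125)                         # pick the strength before quantizing
lora_pipe.unload_lora_weights()

# Only now quantize (and freeze) the already-fused transformer.
quantize(transformer, weights=qfloat8)
freeze(transformer)

# The transformer can then be plugged into a full FluxPipeline as in the earlier snippets.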

Thanks for the info!

  1. Quanto support is not there in the PEFT which I confirmed here: https://github.com/huggingface/diffusers/issues/10550#issuecomment-2588917365. Another: https://github.com/huggingface/diffusers/issues/10512#issuecomment-2589386647
  2. BnB should be supported: https://github.com/huggingface/diffusers/pull/10578.
  3. BnB + Control LoRA will be supported soon: #10588

What am I missing?

sayakpaul avatar Feb 01 '25 17:02 sayakpaul

Full Quanto support, i.e. the ability to (somehow) load an unquantized LoRA into a quantized transformer. What I showed you above is an "on the fly" load process. What I have elected to do in the meantime, because I don't use bitsandbytes but rather quanto, is to create a dummy pipeline, load the transformer BEFORE quantization, then load the LoRA of choice and quantize both into a new quantized transformer. This really does work. The only drawback is that you're stuck with whatever strength value you initially fused the LoRA at. Here's my quantization code for loading the LoRA and transformer, and then saving the result so I can load it for future inference:

def quantize_transformer_lora_from_single_file():
    import json
    from pathlib import Path

    import torch
    from diffusers import FluxTransformer2DModel, FluxPipeline
    from optimum.quanto import freeze, qint8, quantize, quantization_map

    ################## FLUX QUANTIZE TRANSFORMER FROM SINGLE FILE #########################
    # here is my transformer
    base_model = 'C:/Users/xxxxx/.cache/huggingface/hub/Civitai Models/fluxmania_III.safetensors'

    dtype = torch.bfloat16

    # create dummy pipeline for the purpose of loading the transformer and lora:
    transformer = FluxTransformer2DModel.from_single_file(base_model, subfolder="transformer", torch_dtype=dtype)
    print('load pipeline')
    pipe = FluxPipeline(scheduler=None, text_encoder=None, tokenizer=None, text_encoder_2=None, tokenizer_2=None, vae=None, transformer=transformer)

    # we can load one lora or multiple here in our dummy pipeline:
    adapter_id1 = "C:/Users/xxxxx/.cache/huggingface/hub/lora/FLUX/aidmaFLUXPro1.1-FLUX-v0.3.safetensors"
    print('load lora')
    pipe.load_lora_weights(adapter_id1)
    print('fuse lora')
    # make sure to specify the weight!!
    pipe.fuse_lora(lora_scale=0.7)  # <<< choose your strength wisely as you will need to requantize to change it!!
    pipe.unload_lora_weights()

    print('Quantize Transformer')
    quantize(transformer, weights=qint8)
    freeze(transformer)

    print('save directory')
    save_directory = "./flux-dev/lora/aidmaFLUXPro1.1/fluxmania/fluxtransformer2dmodel_qint8"
    transformer.save_pretrained(save_directory)
    qmap_name = Path(save_directory, "quanto_qmap.json")
    qmap = quantization_map(transformer)
    with open(qmap_name, "w", encoding="utf8") as f:
        json.dump(qmap, f, indent=4)
    print('Transformer done')

    return

ukaprch avatar Feb 02 '25 13:02 ukaprch
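
The matching load step isn't shown above. Based on optimum-quanto's documented requantize() serialization API, reloading that saved transformer could look roughly like the sketch below; the checkpoint filename and the meta-device construction are assumptions, and a checkpoint this size may be sharded into several safetensors files, whose state dicts would need to be merged first.

import json
import torch
from diffusers import FluxTransformer2DModel
from optimum.quanto import requantize
from safetensors.torch import load_file

save_directory = "./flux-dev/lora/aidmaFLUXPro1.1/fluxmania/fluxtransformer2dmodel_qint8"

# State dict and quantization map written out by the function above.
state_dict = load_file(f"{save_directory}/diffusion_pytorch_model.safetensors")  # filename is an assumption
with open(f"{save_directory}/quanto_qmap.json", "r", encoding="utf8") as f:
    qmap = json.load(f)

# Build an empty transformer with the right architecture, then requantize it in place.
with torch.device("meta"):
    config = FluxTransformer2DModel.load_config("black-forest-labs/FLUX.1-dev", subfolder="transformer")
    transformer = FluxTransformer2DModel.from_config(config).to(torch.bfloat16)

requantize(transformer, state_dict, qmap, device=torch.device("cuda"))

# The transformer can now be assigned to pipe.transformer as in the earlier snippets.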

Any news here?

Amitg1 avatar Feb 18 '25 20:02 Amitg1

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 15 '25 15:03 github-actions[bot]

not stale

bghira avatar Mar 17 '25 22:03 bghira