Add GGUF loader for FluxTransformer2DModel
GGUF is becoming a preferred means of distribution for FLUX fine-tunes.
Transformers recently added general support for GGUF and is slowly adding support for additional model types (the implementation adds a gguf_file param to the from_pretrained method). That PR adds support for loading GGUF files into T5EncoderModel.
I've tested the code with the quants available at https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf/tree/main and it's working with the current Flux implementation in diffusers (a minimal usage sketch follows the examples below).
However, as FluxTransformer2DModel is defined in the diffusers library, support has to be added here to be able to load the actual transformer model, which is what most (if not all) FLUX fine-tunes ship.
Examples that can be used:
- https://civitai.com/models/657607/gguf-fastflux-flux1-schnell-merged-with-flux1-dev with weights quantized as q4_0, q4_1, q5_0, q5_1
- https://civitai.com/models/662958/flux1-dev-gguf-f16 with weights simply converted from f16
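For reference on the text-encoder side, loading one of those T5 quants via the new transformers support looks roughly like this; a minimal sketch, assuming a recent transformers version, and the quant filename is an assumption (pick any file from the repo above):

from transformers import T5EncoderModel

# Sketch of the transformers-side GGUF load via the new gguf_file param.
# The filename below is assumed; use any quant from the linked repo.
text_encoder_2 = T5EncoderModel.from_pretrained(
    "city96/t5-v1_1-xxl-encoder-gguf",
    gguf_file="t5-v1_1-xxl-encoder-Q8_0.gguf",
)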
cc: @yiyixuxu @sayakpaul @DN6
Perhaps after #9213.
Note that exotic FPX schemes are already supported (FP6, FP5, FP4) with torchao. Check out this repo for that: https://github.com/sayakpaul/diffusers-torchao
yes, i'm following that pr closely :)
Also, the torchao work makes all this easier. The request here is not to reimplement any of the quantization work done so far, but to add a diffusers equivalent of transformers.modeling_gguf_pytorch_utils.load_gguf_checkpoint(), which returns a state_dict (with key re-mapping as needed); the rest of the load can then proceed as-is. A minimal sketch of what such a helper could look like follows.
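This sketch assumes gguf-py's GGUFReader and dequantize helpers, and a caller-supplied remap_key callable (hypothetical) for the key re-mapping; it eagerly dequantizes everything, whereas a real integration would likely keep weights quantized and dequantize per-op:

import torch
from gguf import GGUFReader
from gguf.quants import dequantize

def load_gguf_state_dict(path, compute_dtype=torch.float16, remap_key=None):
    # remap_key is a hypothetical callable translating GGUF tensor names to
    # the target model's parameter names, analogous to the re-mapping done in
    # transformers.modeling_gguf_pytorch_utils.load_gguf_checkpoint().
    reader = GGUFReader(path)
    state_dict = {}
    for tensor in reader.tensors:
        name = remap_key(tensor.name) if remap_key else tensor.name
        data = dequantize(tensor.data, tensor.tensor_type)  # unpack blocks to float32
        # GGUF records dimensions in reverse order relative to torch
        shape = tuple(int(d) for d in reversed(tensor.shape))
        state_dict[name] = torch.from_numpy(data.copy()).reshape(shape).to(compute_dtype)
    return state_dict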
Yeah for sure. Thanks for following along!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Right up our alley. Cc: @DN6
@sayakpaul @DN6 if you want to take a look...
A simple implementation of a generic GGUF loader that loads a state_dict:
https://github.com/vladmandic/automatic/blob/dev/modules/ggml/__init__.py
From there it's simple to create the diffusers class; I later use it to create a FluxTransformer2DModel in https://github.com/vladmandic/automatic/blob/56ec09fac8db9fa01f2eeff8f955ef6c91f85451/modules/model_flux.py#L111 (see the sketch below).
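In outline, that second step amounts to something like this; a sketch reusing the hypothetical load_gguf_state_dict from above, with the config location an assumption:

from diffusers import FluxTransformer2DModel
from diffusers.loaders.single_file_utils import convert_flux_transformer_checkpoint_to_diffusers

# Dequantized GGUF state_dict in, diffusers model out.
state_dict = load_gguf_state_dict("flux1-dev-Q8_0.gguf")
state_dict = convert_flux_transformer_checkpoint_to_diffusers(state_dict)
config = FluxTransformer2DModel.load_config("black-forest-labs/FLUX.1-dev", subfolder="transformer")
transformer = FluxTransformer2DModel.from_config(config)
transformer.load_state_dict(state_dict, strict=False)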
Can you provide a simple demo of using a GGUF-format model in diffusers? I don't know exactly how to use the GGUF model.
from modules.model_flux import load_flux_gguf
import torch, pdb
from diffusers import FluxPipeline
file_path = '/maindata/data/shared/public/yang.zhang/models/flux/flux-schnell-dev-merge-q4-1.gguf'
transformer, _ = load_flux_gguf(file_path)
dtype = torch.float16
bfl_repo = '/maindata/data/shared/public/yang.zhang/models/flux/FLUX.1-dev'
pipe = FluxPipeline.from_pretrained(bfl_repo, torch_dtype=dtype, transformer=transformer).to('cuda')
prompt = 'a cat'
cfg = 3.5
step = 30
image = pipe(
prompt,
height=1024,
width=1024,
guidance_scale=cfg,
num_inference_steps=step,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(100),
).images[0]
image.save(f"res/flux-gguf_cfg{cfg}_step{step}.png")
I use the above code but get this error:
06:52:09-580434 INFO Device detect: memory=79.3 optimization=none
06:52:09-592030 INFO Engine: backend=Backend.DIFFUSERS compute=cuda device=cuda attention="Scaled-Dot-Product" mode=no_grad
06:52:09-597271 ERROR styles failed to migrate: file="styles.csv" error=partially initialized module 'modules.shared' has no attribute 'max_workers' (most likely due to a circular import)
06:52:09-610222 INFO Torch parameters: backend=cuda device=cuda config=Auto dtype=torch.bfloat16 vae=torch.bfloat16 unet=torch.bfloat16 context=no_grad nohalf=False nohalfvae=False
upscast=False deterministic=False test-fp16=True test-bf16=True optimization="Scaled-Dot-Product"
06:52:09-613630 ERROR Package: ['onnx'] 'NoneType' object has no attribute 'working_set'
06:52:09-643598 INFO Device: device=NVIDIA A800-SXM4-80GB n=2 arch=sm_90 capability=(8, 0) cuda=12.1 cudnn=90100 driver=470.161.03
06:52:09-661209 ERROR Package: ['gguf'] 'NoneType' object has no attribute 'working_set'
06:52:09-662821 INFO Install: package="gguf" mode=pip
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:22<00:00, 11.45s/it]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:24<00:00, 3.52s/it]
0%| | 0/30 [00:00<?, ?it/s]
╭──────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ────────────────────────────────────────────────────────────────────────────────╮
│ /maindata/data/shared/public/songtao.tian/flux/gguf/automatic/test_gguf.py:18 in <module> │
│ │
│ 17 step = 30 │
│ ❱ 18 image = pipe( │
│ 19 prompt, │
│ │
│ /home/songtao.tian/anaconda3/envs/gguf/lib/python3.10/site-packages/torch/utils/_contextlib.py:116 in decorate_context │
│ │
│ 115 with ctx_factory(): │
│ ❱ 116 return func(*args, **kwargs) │
│ 117 │
│ │
│ /home/songtao.tian/anaconda3/envs/gguf/lib/python3.10/site-packages/diffusers/pipelines/flux/pipeline_flux.py:730 in __call__ │
│ │
│ 729 │
│ ❱ 730 noise_pred = self.transformer( │
│ 731 hidden_states=latents, │
│ │
│ /home/songtao.tian/anaconda3/envs/gguf/lib/python3.10/site-packages/torch/nn/modules/module.py:1553 in _wrapped_call_impl │
│ │
│ 1552 else: │
│ ❱ 1553 return self._call_impl(*args, **kwargs) │
│ 1554 │
│ │
│ /home/songtao.tian/anaconda3/envs/gguf/lib/python3.10/site-packages/torch/nn/modules/module.py:1562 in _call_impl │
│ │
│ 1561 or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1562 return forward_call(*args, **kwargs) │
│ 1563 │
│ │
│ /home/songtao.tian/anaconda3/envs/gguf/lib/python3.10/site-packages/diffusers/models/transformers/transformer_flux.py:447 in forward │
│ │
│ 446 ) │
│ ❱ 447 hidden_states = self.x_embedder(hidden_states) │
│ 448 │
│ │
│ /home/songtao.tian/anaconda3/envs/gguf/lib/python3.10/site-packages/torch/nn/modules/module.py:1553 in _wrapped_call_impl │
│ │
│ 1552 else: │
│ ❱ 1553 return self._call_impl(*args, **kwargs) │
│ 1554 │
│ │
│ /home/songtao.tian/anaconda3/envs/gguf/lib/python3.10/site-packages/torch/nn/modules/module.py:1562 in _call_impl │
│ │
│ 1561 or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1562 return forward_call(*args, **kwargs) │
│ 1563 │
│ │
│ /home/songtao.tian/anaconda3/envs/gguf/lib/python3.10/site-packages/torch/nn/modules/linear.py:117 in forward │
│ │
│ 116 def forward(self, input: Tensor) -> Tensor: │
│ ❱ 117 return F.linear(input, self.weight, self.bias) │
│ 118 │
│ │
│ /maindata/data/shared/public/songtao.tian/flux/gguf/automatic/modules/ggml/gguf_tensor.py:148 in __torch_dispatch__ │
│ │
│ 147 if func in GGML_TENSOR_OP_TABLE: │
│ ❱ 148 return GGML_TENSOR_OP_TABLE[func](func, args, kwargs) │
│ 149 else: │
│ │
│ /maindata/data/shared/public/songtao.tian/flux/gguf/automatic/modules/ggml/gguf_tensor.py:18 in dequantize_and_run │
│ │
│ 17 } │
│ ❱ 18 return func(*dequantized_args, **dequantized_kwargs) │
│ 19 │
│ │
│ /home/songtao.tian/anaconda3/envs/gguf/lib/python3.10/site-packages/torch/_ops.py:667 in __call__ │
│ │
│ 666 # are named "self". This way, all the aten ops can be called by kwargs. │
│ ❱ 667 return self_._op(*args, **kwargs) │
│ 668 │
RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
I get the same error, did you solve it?
Set dtype = torch.bfloat16 in this demo, then run it again. If a new error appears, locate it and set q, k, v to the same dtype.
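That is, in the demo above it is a one-line change (the rest stays the same):

# Match the pipeline dtype to the dequantized GGUF weights (bfloat16)
dtype = torch.bfloat16
pipe = FluxPipeline.from_pretrained(bfl_repo, torch_dtype=dtype, transformer=transformer).to('cuda')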
Hello, I use the GGUF q5 model, but GPU memory usage is higher. Did your GPU memory usage go down?
Any progress here?
I successfully loaded the weights in GGUF format, but only models with 0/1 suffixes work; those with K/S suffixes do not. (diffusers-0.31.0-dev)
import ggml  # the ggml helper module from vladmandic/automatic (modules/ggml)
import torch

def load_flux_gguf(file_path, transformer_config, dtype, device):
    from accelerate import init_empty_weights
    from diffusers import FluxTransformer2DModel
    from diffusers.loaders.single_file_utils import convert_flux_transformer_checkpoint_to_diffusers
    # Build an empty (meta-device) transformer from the config, then fill it
    # with tensors dequantized from the GGUF file.
    with init_empty_weights():
        config = FluxTransformer2DModel.load_config(transformer_config)
        transformer = FluxTransformer2DModel.from_config(config).to(dtype)
        expected_state_dict_keys = list(transformer.state_dict().keys())
    state_dict, stats = ggml.load_gguf_state_dict(file_path, dtype)
    # Remap keys from the original BFL checkpoint layout to the diffusers layout
    state_dict = convert_flux_transformer_checkpoint_to_diffusers(state_dict)
    applied, skipped = 0, 0
    for param_name, param in state_dict.items():
        if param_name not in expected_state_dict_keys:
            skipped += 1
            continue
        applied += 1
        hijack_set_module_tensor_simple(transformer, tensor_name=param_name, value=param, device=device)
        state_dict[param_name] = None  # free the source tensor as we go
    return transformer, None

def hijack_set_module_tensor_simple(module, tensor_name, device, value):
    # Walk the module tree to the direct parent of the target tensor
    if "." in tensor_name:
        splits = tensor_name.split(".")
        for split in splits[:-1]:
            module = getattr(module, split)
        tensor_name = splits[-1]
    old_value = getattr(module, tensor_name)  # raises early if the name is wrong
    with torch.no_grad():
        if tensor_name in module._buffers:
            module._buffers[tensor_name] = value.to(device, non_blocking=True)
        elif value is not None:
            # Preserve the parameter class so tensor subclasses survive the swap
            param_cls = type(module._parameters[tensor_name])
            module._parameters[tensor_name] = param_cls(value, requires_grad=False).to(device, non_blocking=True)

unet_path = '/yourpath/flux1-dev-Q8_0.gguf'
transformer_config = '/yourpath/flux-dev/transformer'
dtype = torch.float16
device = 'cuda:0'
gguf_transformer, _ = load_flux_gguf(unet_path, transformer_config, dtype, device)
import torch
from diffusers import FluxPipeline

dtype = torch.float16
pipe = FluxPipeline.from_pretrained("/yourpath/flux-dev", torch_dtype=dtype)
pipe.transformer = gguf_transformer  # swap in the GGUF-loaded transformer
pipe.to('cuda:0')
prompt = "minimalism,Chinese ink painting,ink painting,close-up,1girl,solo,portrait,closed_eyes,eyeshadow,gloves,makeup,lipstick jewelry,earrings,necklace,hat,long hair,dress,high qulity,extremely detaile,offcial art,Uniform 8K wallpaper,super detailing,32K,"
image = pipe(
    prompt,
    guidance_scale=3.5,
    output_type="pil",
    num_inference_steps=20,
    generator=torch.Generator("cpu").manual_seed(1024)
).images[0]
@zhaowendao30 thanks for this!
Could you maybe modify your comment to include ggml installation instructions and the checkpoint you used?
The ggml module lives at https://github.com/vladmandic/automatic/blob/dev/modules/ggml; copy it to your local path.
Thanks! And the checkpoint you used?
https://huggingface.co/city96/FLUX.1-dev-gguf/tree/main. Only the models with 0/1 suffixes (q4_0, q4_1, q5_0, q5_1, q8_0) work; those with K/S suffixes do not, presumably because k-quants use a super-block layout that needs its own dequantization path.
Just one word of caution: the code relies on the gguf package, which has a really bad installer; see https://github.com/ggerganov/llama.cpp/issues/9566
Being worked on in https://github.com/huggingface/diffusers/pull/9964
Closing since #9964 was merged. Feel free to reopen if there are any issues.
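For reference, the API that landed in #9964 looks roughly like this; a sketch from memory, so check the current diffusers GGUF docs for the exact signature (the checkpoint URL is one of the city96 quants mentioned above):

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Load the GGUF transformer directly, keeping weights quantized and
# dequantizing on the fly at the given compute dtype.
transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)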