Can diffusers support loading and running FLUX with fp8 ?
This is how I use diffusers to load the FLUX model:

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "/ckptstorage/repo/pretrained_weights/black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16,
)
device_number = 0  # index of the GPU to use
device = torch.device(f"cuda:{device_number}" if torch.cuda.is_available() else "cpu")
pipe = pipe.to(device)
```
It takes about 75 seconds on my computer with an A800 GPU. But I found that ComfyUI only needs 22 seconds to load the FLUX model, although it loads the fp8 model. Can diffusers load the FLUX fp8 model? Or is there any other way to speed this up?
What you consider slow is just the model being loaded into VRAM. This depends a lot on your machine, in particular how fast your RAM and VRAM are, and it is also CPU bound, so you need a decent CPU.

Also consider that this only happens the first time. The process is slow even on H100s, but inference itself is fast, so you will only notice it if you reload the model every time you run inference, which is really bad and shouldn't be done.

We do have all kinds of optimizations you can try, from something lossless like group offloading, to layerwise casting, and even quantization with a variety of backends (GGUF, BnB, TorchAO). You can read about them and test the code in the docs here.
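For example, layerwise casting keeps the transformer weights stored in fp8 and upcasts them layer by layer for compute, which is close to what you're after with the fp8 checkpoint. A minimal sketch, assuming a diffusers version where `enable_layerwise_casting` is available on the model (the path is just the local one from your snippet):

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "/ckptstorage/repo/pretrained_weights/black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
# Keep the transformer weights in fp8 storage and upcast each layer to bf16
# only while it computes, roughly halving the transformer's VRAM footprint.
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)
pipe.to("cuda")
```

Note that the checkpoint on disk is still bf16/fp16, so this mainly saves memory; by itself it won't make the initial load faster.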
I really don't recommend lowering the model precision if you have the VRAM to load it at the intended precision; you're lowering the quality of the model just for a quicker start.
@asomoza Hi, I am doing some experiments on image editing. Sometimes I modify a parameter and want to see the effect, and I need to wait for more than a minute each time. Because I see that ComfyUI uses an fp8 unet and t5, I wonder whether diffusers can also load fp8 models. I used the following code:
```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

transformer = FluxTransformer2DModel.from_single_file(
    "/ckptstorage/repo/pretrained_weights/black-forest-labs/FLUX.1-dev/flux1-dev-fp8.safetensors",
    local_files_only=True,
)
pipe = FluxPipeline.from_pretrained(
    "/ckptstorage/repo/pretrained_weights/black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.float16,
)
device_number = 0  # index of the GPU to use
device = torch.device(f"cuda:{device_number}" if torch.cuda.is_available() else "cpu")
pipe = pipe.to(device)
```

but encountered this error:
```
Traceback (most recent call last):
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 390, in load_config
    config_file = hf_hub_download(
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 961, in hf_hub_download
    return _hf_hub_download_to_cache_dir(
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1068, in _hf_hub_download_to_cache_dir
    _raise_on_head_call_error(head_call_error, force_download, local_files_only)
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1587, in _raise_on_head_call_error
    raise LocalEntryNotFoundError(
huggingface_hub.errors.LocalEntryNotFoundError: Cannot find the requested files in the disk cache and outgoing traffic has been disabled. To enable hf.co look-ups and downloads online, set 'local_files_only' to False.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ckptstorage/repo/CVEdit/FlowEdit/run_script2.py", line 49, in <module>
    transformer = FluxTransformer2DModel.from_single_file("/ckptstorage/repo/pretrained_weights/black-forest-labs/FLUX.1-dev/flux1-dev-fp8.safetensors",local_files_only=True)
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/diffusers/loaders/single_file_model.py", line 339, in from_single_file
    diffusers_model_config = cls.load_config(
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/envs/py310/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 417, in load_config
    raise EnvironmentError(
OSError: black-forest-labs/FLUX.1-dev does not appear to have a file named config.json.
```
@EmmaThompson123 To load the FP8 checkpoint, can you try running

```python
transformer = FluxTransformer2DModel.from_single_file(
    "/ckptstorage/repo/pretrained_weights/black-forest-labs/FLUX.1-dev/flux1-dev-fp8.safetensors",
    torch_dtype=torch.bfloat16,
)
```

The first time you try to load the model, we attempt to fetch the model config file if it doesn't exist locally, which is what appears to be happening in your case.
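Since outgoing traffic is disabled on your machine, you can also point `from_single_file` at the config that already sits in your local FLUX.1-dev folder so no Hub lookup is needed. A minimal sketch, assuming your diffusers version supports the `config` and `subfolder` arguments of `from_single_file` (the paths are the ones from your snippet):

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

repo_path = "/ckptstorage/repo/pretrained_weights/black-forest-labs/FLUX.1-dev"

# Read the transformer config from the local diffusers-format repo instead of
# fetching it from the Hub; the fp8 weights are upcast to bf16 on load.
transformer = FluxTransformer2DModel.from_single_file(
    f"{repo_path}/flux1-dev-fp8.safetensors",
    config=repo_path,
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
    local_files_only=True,
)

pipe = FluxPipeline.from_pretrained(
    repo_path,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")
```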
Can you also share the output of `diffusers-cli env`?

Additionally, could you also run this snippet on your machine and share the output along with your CPU specs and available RAM?
```python
import torch
import time

start = time.time()
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "/ckptstorage/repo/pretrained_weights/black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.float16,
)
print("Total Load Time (s): ", time.time() - start)
```
@DN6 The output of diffusers-cli env is:
/opt/conda/envs/py310/lib/python3.10/site-packages/pandas/core/computation/expressions.py:21: UserWarning: Pandas requires version '2.8.4' or newer of 'numexpr' (version '2.8.3' currently installed).
from pandas.core.computation.check import NUMEXPR_INSTALLED
2025-05-29 15:37:10.130809: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-05-29 15:37:10.175051: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.
- 🤗 Diffusers version: 0.33.1
- Platform: Linux-3.10.0-1160.el7.x86_64-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.15
- PyTorch version (GPU?): 2.2.2+cu118 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.2
- Transformers version: 4.49.0
- Accelerate version: 1.7.0
- PEFT version: 0.14.0
- Bitsandbytes version: 0.43.2
- Safetensors version: 0.4.5
- xFormers version: 0.0.25.post1+cu118
- Accelerator: NVIDIA A800 80GB PCIe, 81920 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
The time it takes to load the model into RAM and VRAM varies: sometimes it is 12 seconds + 22 seconds, sometimes 21 seconds + 49 seconds. I think that, in any case, if diffusers could support loading and running fp8 FLUX, it would be faster than float16.
It would be nice to be able to do so. The comfy community is often very active in preparing models in this format, so it would be a good option to have for more consumer or enthusiast hardware and slower internet connections.

Also, do we know whether this gives better results than, say, bitsandbytes int8 or TorchAO int8? Maybe it does, given that the comfy community seems to have settled on it. It would be good to compare results.

Regarding the mixed load times, I notice this as well. I think it varies depending on what was recently accessed on the disk drive, so you will get inconsistent loading times.
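For anyone wanting to run that comparison, here is a minimal sketch of the bitsandbytes int8 route, assuming bitsandbytes and accelerate are installed and reusing the local path from earlier in this thread; the TorchAO route is analogous with its own config class:

```python
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

repo_path = "/ckptstorage/repo/pretrained_weights/black-forest-labs/FLUX.1-dev"

# Quantize only the transformer to int8 with bitsandbytes while it loads.
transformer = FluxTransformer2DModel.from_pretrained(
    repo_path,
    subfolder="transformer",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.bfloat16,
)

pipe = FluxPipeline.from_pretrained(
    repo_path,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
# Moves components onto the GPU as they are needed; a plain `.to("cuda")` may
# not be supported for 8-bit bitsandbytes modules.
pipe.enable_model_cpu_offload()
```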