manual cast: torch.bfloat16 when using fp8 combined flux.dev models causes VRAM issues with LoRAs
Expected Behavior
When using separate loaders for the unet, clip, and vae, the console reports: model weight dtype torch.bfloat16, manual cast: None. The same behavior is expected for the combined models too. The fp8 combined models in question are: flux1.dev fp8, clip_l, and t5xxl fp8 e4m3fn.
Actual Behavior
The console instead reports: model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16. This doesn't happen when using separate loaders, only with the combined/checkpoint loader, and it causes significant VRAM issues when using LoRAs.
Steps to Reproduce
1. Download a combined flux dev model, such as https://huggingface.co/Comfy-Org/flux1-dev
2. Load the model using the checkpoint loader or similar.
3. Observe the problem in the console. Additionally, using any Ostris LoRA will give OOM errors.
Debug Logs
2024-08-15 01:49:49.382 [ComfyUI-0] [STDERR] To see the GUI go to: http://127.0.0.1:7823
2024-08-15 01:50:28.189 [ComfyUI-0] [STDERR] got prompt
2024-08-15 01:50:28.500 [ComfyUI-0] [STDERR] model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
2024-08-15 01:50:28.503 [ComfyUI-0] [STDERR] model_type FLUX
2024-08-15 01:50:47.522 [ComfyUI-0] [STDERR] Using pytorch attention in VAE
2024-08-15 01:50:47.524 [ComfyUI-0] [STDERR] Using pytorch attention in VAE
2024-08-15 01:50:48.388 [ComfyUI-0] [STDERR] Requested to load FluxClipModel_
2024-08-15 01:50:48.388 [ComfyUI-0] [STDERR] Loading 1 new model
2024-08-15 01:50:57.779 [ComfyUI-0] [STDERR] loaded straight to GPU
2024-08-15 01:50:57.779 [ComfyUI-0] [STDERR] Requested to load Flux
2024-08-15 01:50:57.779 [ComfyUI-0] [STDERR] Loading 1 new model
2024-08-15 01:51:02.870 [ComfyUI-0] [STDERR] Prompt executed in 34.68 seconds
2024-08-15 01:51:03.369 [ComfyUI-0] [STDERR] got prompt
2024-08-15 01:51:03.584 [ComfyUI-0] [STDERR] model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
2024-08-15 01:51:03.585 [ComfyUI-0] [STDERR] model_type FLUX
2024-08-15 01:51:24.262 [ComfyUI-0] [STDERR] Using pytorch attention in VAE
2024-08-15 01:51:24.267 [ComfyUI-0] [STDERR] Using pytorch attention in VAE
2024-08-15 01:51:25.453 [ComfyUI-0] [STDERR] Requested to load FluxClipModel_
2024-08-15 01:51:25.453 [ComfyUI-0] [STDERR] Loading 1 new model
2024-08-15 01:51:33.667 [ComfyUI-0] [STDERR] loaded straight to GPU
2024-08-15 01:51:33.667 [ComfyUI-0] [STDERR] Requested to load Flux
2024-08-15 01:51:33.667 [ComfyUI-0] [STDERR] Loading 1 new model
2024-08-15 01:51:39.680 [ComfyUI-0] [STDERR] Requested to load FluxClipModel_
2024-08-15 01:51:39.680 [ComfyUI-0] [STDERR] Loading 1 new model
Other
No response
Got the same issue - it's causing horrible generation speed (20s/it instead of 2s/it) on a regular RTX 4070 12GB.
There have been several updates and rollbacks over the past few days. Have you updated to the latest version and is the same issue still occurring?
Yup:
ComfyUI Revision: 2542 [0f9c2a78] | Released on '2024-08-14'
Also:
Total VRAM 12282 MB, total RAM 32691 MB
pytorch version: 2.4.0+cu124
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 : cudaMallocAsync
Using pytorch cross attention
For what it's worth, I tried setting up a separate installation with older PyTorch and CUDA versions:
Total VRAM 12282 MB, total RAM 32691 MB
pytorch version: 2.2.2+cu118
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 : cudaMallocAsync
Using pytorch cross attention
With an AIO model, I am now getting faster speed (~4s/it) - still slower than what I had with the unet model on 2.4.0+cu124 (~2s/it), though.
Any idea whether something in the recent PyTorch or CUDA versions is causing the memory allocation/speed issue?
Tested another mix of versions:
Total VRAM 12282 MB, total RAM 32691 MB
pytorch version: 2.3.1+cu121
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 : cudaMallocAsync
Using pytorch cross attention
It also works, although not optimally, with the same speed as 2.2.2+cu118.
Could you try 2.4.0 + cu121 and 2.3.1 + cu124?
I have the same problem with flux dev fp8 and f8_e5m2/e4m3fn weights; the model is always manually cast to bfloat16, and I also get spontaneous OOMs and crashes in Comfy.
This is still broken, BTW.
It will cause a 50% drop in performance if you use Ada-generation, FP8-optimized data center cards.
@wogam How did you even get it to work with separate loaders? Which loaders were used?
Has anyone solved this problem?
Still the same issue:
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
Updated to the latest ComfyUI (update_comfyui_and_python_dependencies).
RAM: 64GB, Processor: Core i7 14700K, GPU: NVIDIA 4070 Ti 16GB
Same issue on a 4090; I tried multiple ways to get rid of it.
Same on an H100: torch.float8_e4m3fn is always cast to torch.bfloat16, and nothing I've tried changes it.
Same problem here. It really increases the generation time. I would love to know how to fix it.
Happening with the most up-to-date version as well.
python main.py --gpu-only is a good workaround for my case.
I encountered the same error and ComfyUI crashed
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
fish: Job 1, 'python main.py' terminated by signal SIGKILL (Forced quit)
Running dmesg | grep -i "killed process" confirmed that an out-of-memory kill caused the forced quit. I also noticed that my 3090's 24GB of VRAM was barely used at the time of the crash. So, instead of --lowvram mode, I tried --gpu-only mode to let the cast operation run on the GPU. However, generation is as slow as 2s/it, as others have reported.
Still not fixed.
This isn't just an issue for NVIDIA/CUDA; it happens on M4/Metal with 4- and 8-bit quantized models too. Unfortunately, --gpu-only didn't work.
Is there at least a workaround?
RTX 3090 doesn’t have fp8 support, so it will have to cast to bf16. That’s expected behaviour.
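For context, a quick way to check whether a given card can do native fp8 compute at all (a minimal sketch for illustration only, assuming Ada/Hopper means compute capability 8.9 or higher):

import torch

# fp8 tensor-core matmuls need compute capability >= 8.9 (Ada/Hopper).
# Ampere cards like the RTX 3090 (8.6) can store fp8 weights but have to
# cast them up (e.g. to bf16) before computing with them.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability {major}.{minor}, native fp8 compute: {(major, minor) >= (8, 9)}")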
4070 Super here; when selecting e4m3fn in the Load Diffusion Model node, I'm also being manually cast to bfloat16.
torch 2.6.0, cu124, Python 3.10
I used a 4060 Ti 16GB, which probably supports fp8, and, if I'm not mistaken, I got the same behaviour even on a 4090.
I was able to initially bypass the error by hardcoding self.manual_cast_dtype in ComfyUI\comfy\model_base.py:
self.manual_cast_dtype = model_config.scaled_fp8  # self.manual_cast_dtype = model_config.manual_cast_dtype
Which led to this error:
File "C:\Tools\ComfyUI_windows_portable\ComfyUI\comfy\ldm\flux\model.py", line 198, in forward img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: "linspace_cuda" not implemented for 'Float8_e4m3fn'
Unfortunately, as of PyTorch 2.6, it seems this is indeed not implemented:
https://pytorch.org/docs/stable/tensor_attributes.html#torch.dtype
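For reference, a minimal sketch of the limitation and the usual workaround of building the tensor in a supported dtype first (illustrative only, not ComfyUI code):

import torch

# torch.linspace has no kernel for Float8_e4m3fn, so this raises
# RuntimeError: "linspace_cuda" not implemented for 'Float8_e4m3fn':
#   torch.linspace(0, 15, steps=16, device="cuda", dtype=torch.float8_e4m3fn)

# Common pattern: create the tensor in a supported dtype, then cast down.
ids = torch.linspace(0, 15, steps=16, dtype=torch.float32).to(torch.float8_e4m3fn)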
Not confirmed
I have two absolutely identical Radeon 7 cards, one in a VM and the other in a Docker container; the hosts are different and the hardware is different. With absolutely identical driver and torch versions (ROCm 6.2), the first card consistently generates at the same speed, while the second is twice as slow: for example, if the first takes 5 sec per step, the second takes 10 sec per step. I tried to track down the cause in the fall, but the second card suddenly started working identically to the first. Then, after one of the updates, the speed halved again and has stayed that way since. Yesterday I updated the ROCm driver and fully updated ComfyUI; now, when sending a queue of jobs, the generation rate is normal on some workflows, while on others it varies by 2x from job to job, and this applies not only to Flux with a separate clip node - an SDXL checkpoint behaves exactly the same way.
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
I have now compared the ComfyUI output for both cards, and the only difference in the output is this line.
Clip - fp8:
Perhaps this information will narrow down the search for the cause.
I forgot to mention: accordingly, there is not enough VRAM for the CLIP in fp16, and it is offloaded to the CPU.
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
I added --fp8_e4m3fn-unet --fp8_e4m3fn-text-enc to the ComfyUI startup parameters and am still experiencing the same issue. Any ideas?
ComfyAnonymous confirmed that being cast to bf16 is normal when using model weight dtype torch.float8_e4m3fn.
#6913
Yes, if you want to enable fp8 matrix multiplication you can use the --fast command line argument or the fp8_e4m3fn_fast option in the "Load Diffusion Model" node.
It will, however, still show manual cast: torch.bfloat16, because even with fp8 matrix multiplication the accumulation is done in higher precision, so the intermediate values won't be fp8.
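As a rough illustration of what the "manual cast" means (a minimal sketch, not ComfyUI's actual code): the weights stay in fp8 for storage, but are cast up to bf16 right before the matmul, so the math and its accumulation run in higher precision.

import torch

# Weights are stored in fp8 to roughly halve their memory footprint...
weight_fp8 = torch.randn(64, 64).to(torch.float8_e4m3fn)
x = torch.randn(1, 64, dtype=torch.bfloat16)

# ...but each layer "manually casts" them up before computing, so the
# matrix multiplication itself happens in bf16, not fp8.
y = x @ weight_fp8.to(torch.bfloat16).t()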
@PizzaSlice-cmd
- My issue is the performance and the time wasted on it. Is it possible to avoid?
- I tried fp8_e4m3fn_fast but nothing changed; I'm still getting the cast.
I don't see anything wrong here ...
It's surely not normal. I ran a double test: an FP32 pruned model and an fp8_e4m3fn model.
I consistently get the same result with both models, which makes no sense. Both are cast to bfloat16. I can use any Flux model, same result. Something is wrong, 100% sure.
4060ti here.
More explanation and photos in my post here: https://www.reddit.com/r/comfyui/comments/1nsq3gc/exact_same_result_bug_with_fp32_an_fp8_models/