manual cast: torch.bfloat16 when using fp8 combined flux.dev models causes VRAM issues with LoRAs
Expected Behavior
When using separate loaders for the unet, clip, and vae, the console reports: model weight dtype torch.bfloat16, manual cast: None. The same behavior is expected for the combined models too. The fp8 combined models in question are: flux1.dev fp8, clip_l, and t5xxl fp8 e4m3fn.
Actual Behavior
The console instead reports: model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16. This doesn't happen when using separate loaders, only with the combined/checkpoint loader, and it causes significant VRAM issues when using LoRAs.
Steps to Reproduce
1. Download a combined flux dev model, such as https://huggingface.co/Comfy-Org/flux1-dev
2. Load the model using the checkpoint loader or similar.
3. Observe the problem in the console. Additionally, using any Ostris LoRA will give OOM errors.
Debug Logs
2024-08-15 01:49:49.382 [ComfyUI-0] [STDERR] To see the GUI go to: http://127.0.0.1:7823
2024-08-15 01:50:28.189 [ComfyUI-0] [STDERR] got prompt
2024-08-15 01:50:28.500 [ComfyUI-0] [STDERR] model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
2024-08-15 01:50:28.503 [ComfyUI-0] [STDERR] model_type FLUX
2024-08-15 01:50:47.522 [ComfyUI-0] [STDERR] Using pytorch attention in VAE
2024-08-15 01:50:47.524 [ComfyUI-0] [STDERR] Using pytorch attention in VAE
2024-08-15 01:50:48.388 [ComfyUI-0] [STDERR] Requested to load FluxClipModel_
2024-08-15 01:50:48.388 [ComfyUI-0] [STDERR] Loading 1 new model
2024-08-15 01:50:57.779 [ComfyUI-0] [STDERR] loaded straight to GPU
2024-08-15 01:50:57.779 [ComfyUI-0] [STDERR] Requested to load Flux
2024-08-15 01:50:57.779 [ComfyUI-0] [STDERR] Loading 1 new model
2024-08-15 01:51:02.870 [ComfyUI-0] [STDERR] Prompt executed in 34.68 seconds
2024-08-15 01:51:03.369 [ComfyUI-0] [STDERR] got prompt
2024-08-15 01:51:03.584 [ComfyUI-0] [STDERR] model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
2024-08-15 01:51:03.585 [ComfyUI-0] [STDERR] model_type FLUX
2024-08-15 01:51:24.262 [ComfyUI-0] [STDERR] Using pytorch attention in VAE
2024-08-15 01:51:24.267 [ComfyUI-0] [STDERR] Using pytorch attention in VAE
2024-08-15 01:51:25.453 [ComfyUI-0] [STDERR] Requested to load FluxClipModel_
2024-08-15 01:51:25.453 [ComfyUI-0] [STDERR] Loading 1 new model
2024-08-15 01:51:33.667 [ComfyUI-0] [STDERR] loaded straight to GPU
2024-08-15 01:51:33.667 [ComfyUI-0] [STDERR] Requested to load Flux
2024-08-15 01:51:33.667 [ComfyUI-0] [STDERR] Loading 1 new model
2024-08-15 01:51:39.680 [ComfyUI-0] [STDERR] Requested to load FluxClipModel_
2024-08-15 01:51:39.680 [ComfyUI-0] [STDERR] Loading 1 new model
Other
No response
Got the same issue - it's causing horrible generation speed (20s/it instead of 2s/it) on a regular RTX 4070 12GB.
There have been several updates and rollbacks over the past few days. Have you updated to the latest version and is the same issue still occurring?
Yup:
ComfyUI Revision: 2542 [0f9c2a78] | Released on '2024-08-14'
Also:
Total VRAM 12282 MB, total RAM 32691 MB
pytorch version: 2.4.0+cu124
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 : cudaMallocAsync
Using pytorch cross attention
For what it's worth, I tried setting up a separate installation with older PyTorch and CUDA versions:
Total VRAM 12282 MB, total RAM 32691 MB
pytorch version: 2.2.2+cu118
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 : cudaMallocAsync
Using pytorch cross attention
With an AIO model, I am now getting faster speed (~4s/it) - still slower than what I had with the unet model on 2.4.0+cu124 (~2s/it), though.
Any idea whether something in the recent PyTorch or CUDA versions is causing the memory allocation/speed issue?
Tested another mix of versions:
Total VRAM 12282 MB, total RAM 32691 MB
pytorch version: 2.3.1+cu121
Set vram state to: LOW_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4070 : cudaMallocAsync
Using pytorch cross attention
It also works, although not optimally, with the same speed as 2.2.2+cu118.
Could you try 2.4.0 + cu121 and 2.3.1 + cu124?
I have the same problem with flux dev fp8 and f8_e5m2/e4m3fn weights; the model is always manually cast to bfloat16, and I also get spontaneous OOMs and crashes in Comfy.
This is still broken, BTW.
It will cause a 50% drop in performance if you use Ada-generation, FP8-optimized data center cards.
@wogam How did you even get it to work with separate loaders? Which loaders were used?
Has anyone solved this problem?
Still the same issue:
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
Updated to the latest ComfyUI (update_comfyui_and_python_dependencies).
RAM: 64GB, Processor: Core i7 14700K, GPU: NVIDIA 4070 Ti 16GB
Same issue on a 4090; I tried multiple ways to get rid of it.
Same on an H100: torch.float8_e4m3fn is always cast to torch.bfloat16, and nothing I've tried changes it.
Same problem here. It really increases the generation time. I would love to know how to fix it.
Happening with the most up-to-date version as well.
python main.py --gpu-only is a good workaround for my case.
I encountered the same error and ComfyUI crashed
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
fish: Job 1, 'python main.py' terminated by signal SIGKILL (Forced quit)
Running dmesg | grep -i "killed process" confirmed that an out-of-memory kill caused the forced quit. I also noticed that my 3090's 24GB of VRAM was barely used at the time of the crash. So, instead of --lowvram mode, I tried --gpu-only mode to let the cast operation run on the GPU. However, generation is as slow as 2s/it, as others have reported.
Still not fixed.
This isn't just an issue for NVIDIA/CUDA; it happens on M4/Metal with 4- and 8-bit quantized models too. Unfortunately, --gpu-only didn't work.
Is there at least a workaround?
RTX 3090 doesn’t have fp8 support, so it will have to cast to bf16. That’s expected behaviour.
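For context, a quick way to check whether a given card can do native fp8 compute at all (a minimal sketch for illustration only, assuming Ada/Hopper means compute capability 8.9 or higher):

import torch

# fp8 tensor-core matmuls need compute capability >= 8.9 (Ada/Hopper).
# Ampere cards like the RTX 3090 (8.6) can store fp8 weights but have to
# cast them up (e.g. to bf16) before computing with them.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability {major}.{minor}, native fp8 compute: {(major, minor) >= (8, 9)}")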
4070 Super here; when selecting e4m3fn in the Load Diffusion Model node, I'm also being manually cast to bfloat16.
torch 2.6.0, cu124, Python 3.10
I used a 4060 Ti 16GB, which probably supports fp8, and, if I'm not mistaken, I got the same behaviour even on a 4090.
I was able to initially bypass the error by hardcoding self.manual_cast_dtype in ComfyUI\comfy\model_base.py:
self.manual_cast_dtype = model_config.scaled_fp8  # self.manual_cast_dtype = model_config.manual_cast_dtype
Which led to this error:
File "C:\Tools\ComfyUI_windows_portable\ComfyUI\comfy\ldm\flux\model.py", line 198, in forward img_ids[:, :, 1] = img_ids[:, :, 1] + torch.linspace(0, h_len - 1, steps=h_len, device=x.device, dtype=x.dtype).unsqueeze(1) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: "linspace_cuda" not implemented for 'Float8_e4m3fn'
Unfortunately, as of PyTorch 2.6, it seems this is indeed not implemented:
https://pytorch.org/docs/stable/tensor_attributes.html#torch.dtype
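For reference, a minimal sketch of the limitation and the usual workaround of building the tensor in a supported dtype first (illustrative only, not ComfyUI code):

import torch

# torch.linspace has no kernel for Float8_e4m3fn, so this raises
# RuntimeError: "linspace_cuda" not implemented for 'Float8_e4m3fn':
#   torch.linspace(0, 15, steps=16, device="cuda", dtype=torch.float8_e4m3fn)

# Common pattern: create the tensor in a supported dtype, then cast down.
ids = torch.linspace(0, 15, steps=16, dtype=torch.float32).to(torch.float8_e4m3fn)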
Not confirmed
I have two absolutely identical Radeon 7 cards, one in a VM and the other in a Docker container; the hosts are different and the hardware is different. With absolutely identical driver and torch versions (ROCm 6.2), the first card consistently generates at the same speed, while the second is twice as slow: for example, if the first takes 5 sec per step, the second takes 10 sec per step. I tried to track down the cause in the fall, but the second card suddenly started working identically to the first. Then, after one of the updates, the speed halved again and has stayed that way since. Yesterday I updated the ROCm driver and fully updated ComfyUI; now, when sending a queue of jobs, the generation rate is normal on some workflows, while on others it varies by 2x from job to job, and this applies not only to Flux with a separate clip node - an SDXL checkpoint behaves exactly the same way.
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
I have now compared the ComfyUI output for both cards, and the only difference in the output is this line.
Clip - fp8:
Perhaps this information will narrow down the search for the cause.
I forgot to mention: accordingly, there is not enough VRAM for the CLIP in fp16, and it is offloaded to the CPU.
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
I added --fp8_e4m3fn-unet --fp8_e4m3fn-text-enc to the ComfyUI startup parameters and am still experiencing the same issue. Any ideas?
ComfyAnonymous confirmed that being cast to bf16 is normal when using model weight dtype torch.float8_e4m3fn.
#6913
Yes, if you want to enable fp8 matrix multiplication you can use the --fast command line argument or the fp8_e4m3fn_fast option in the "Load Diffusion Model" node.
It will, however, still show manual cast: torch.bfloat16, because even with fp8 matrix multiplication the accumulation is done in higher precision, so the intermediate values won't be fp8.
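As a rough illustration of what the "manual cast" means (a minimal sketch, not ComfyUI's actual code): the weights stay in fp8 for storage, but are cast up to bf16 right before the matmul, so the math and its accumulation run in higher precision.

import torch

# Weights are stored in fp8 to roughly halve their memory footprint...
weight_fp8 = torch.randn(64, 64).to(torch.float8_e4m3fn)
x = torch.randn(1, 64, dtype=torch.bfloat16)

# ...but each layer "manually casts" them up before computing, so the
# matrix multiplication itself happens in bf16, not fp8.
y = x @ weight_fp8.to(torch.bfloat16).t()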
@PizzaSlice-cmd
- My issue is the performance and the time wasted on it. Is it possible to avoid?
- I tried fp8_e4m3fn_fast but nothing changed; I'm still getting the cast.
I don't see anything wrong here ...
It's surely not normal. I ran a double test: an FP32 pruned model and an fp8_e4m3fn model.
I consistently get the same result with both models, which makes no sense. Both are cast to bfloat16. I can use any Flux model, same result. Something is wrong, 100% sure.
4060ti here.
More explanation and photos in my post here: https://www.reddit.com/r/comfyui/comments/1nsq3gc/exact_same_result_bug_with_fp32_an_fp8_models/