"weight_dtype fp8_e4m3fn slower than weight_dtype default" problem seems fixed today
Your question
Just mentioning it for everyone who has had this problem before.
Today I'm getting around 3.6 s/it with weight_dtype fp8_e4m3fn, instead of the roughly 20 s/it I was getting before.
With weight_dtype default it's still around 5 s/it.
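
In case it helps anyone reproduce the comparison outside of ComfyUI, here is a rough standalone PyTorch sketch (not ComfyUI's actual code path; sizes and iteration count are arbitrary) that times a matmul with a weight stored in float8_e4m3fn and upcast to fp16 for compute, versus a weight kept in fp16:

```python
# Rough standalone sketch, NOT ComfyUI's code path: compares a matmul whose
# weight is stored as float8_e4m3fn and upcast to fp16 each call, against a
# weight kept in fp16 the whole time. Sizes and iteration count are arbitrary.
import time
import torch

assert torch.cuda.is_available(), "needs a CUDA GPU"

x = torch.randn(1024, 4096, device="cuda", dtype=torch.float16)
w_fp16 = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
w_fp8 = w_fp16.to(torch.float8_e4m3fn)  # fp8 storage (needs PyTorch >= 2.1)

def bench(fn, iters=50):
    fn()  # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

t_fp16 = bench(lambda: x @ w_fp16)
t_fp8 = bench(lambda: x @ w_fp8.to(torch.float16))  # upcast on every call
print(f"fp16 weight:         {t_fp16 * 1e3:.2f} ms/iter")
print(f"fp8 weight (upcast): {t_fp8 * 1e3:.2f} ms/iter")
```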
Logs
No response
Other
No response
I've just updated.
For me Flux fp8 is still slower than Flux fp16.
I'm on Windows 10, GTX 1070 (8 GB VRAM), 32 GB RAM.
I'm using the UNET loader.
Have you updated all nodes?
Don't know if it's related to the new UI change; the main difference in the log seems to be "[rgthree] NOTE: Will NOT use rgthree's optimized recursive execution as ComfyUI has changed."
I'm using Win10, torch 2.4.0+CUDA12.1 with xformers 0.0.27.post2, 12 GB VRAM + 64 GB RAM, and the UNET loader.
The slow FP8 problem really bothered me for a long time.
I updated everything via the Manager.
> Don't know if it's related to the new UI change; the main difference in the log seems to be "[rgthree] NOTE: Will NOT use rgthree's optimized recursive execution as ComfyUI has changed."

Yeah, that seems to be related to the new UI: https://github.com/rgthree/rgthree-comfy/issues/304
> torch 2.4.0+CUDA12.1 with xformers 0.0.27.post2

This might be a problem. They don't recommend using 2.4 on Windows yet; it should be 2.3.1+cu121 if I recall correctly. There are some issues with 2.4.0 and Windows at the moment (unless you're using a nightly build). Also, xformers isn't really needed these days, since PyTorch is usually equal to or better.
And some older-generation GPUs have half-speed fp16 operations, which these weights are likely being cast to and/or calculated at.
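
If it helps with comparing setups, something like the sketch below (run with the same Python interpreter ComfyUI uses, e.g. the embedded one in the portable Windows build) confirms which torch/CUDA/xformers combination is actually active; the comments showing example output are just illustrations:

```python
# Quick environment check: run with the same Python interpreter ComfyUI uses.
import torch

print("torch:", torch.__version__)                # e.g. 2.3.1+cu121
print("CUDA build:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))

try:
    import xformers
    print("xformers:", xformers.__version__)
except ImportError:
    print("xformers: not installed (PyTorch's built-in attention will be used)")
```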
> And some older-generation GPUs have half-speed fp16 operations, which these weights are likely being cast to and/or calculated at.

I have a Pascal GPU, so this commit ( https://github.com/comfyanonymous/ComfyUI/commit/8115d8cce97a3edaaad8b08b45ab37c6782e1cb4 ) made generation slower for me, with both Flux fp16 and fp8.
But after the recent performance improvements, Flux fp16 got even faster than before. Now Flux GGUF and Flux fp16 are about the same speed for me; only Flux fp8 is slower.
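
For what it's worth, the "half speed fp16" point is easy to check directly; on Pascal cards (compute capability 6.1) a plain fp16 matmul usually comes out no faster, and often slower, than fp32. A rough sketch, again just plain PyTorch with arbitrary sizes:

```python
# Rough check of fp16 vs fp32 matmul throughput on the current GPU.
# Sizes and iteration count are arbitrary.
import time
import torch

n = 4096

def bench(dtype, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    a @ b  # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

print(f"fp32 matmul: {bench(torch.float32) * 1e3:.1f} ms/iter")
print(f"fp16 matmul: {bench(torch.float16) * 1e3:.1f} ms/iter")
```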
> > torch 2.4.0+CUDA12.1 with xformers 0.0.27.post2
>
> This might be a problem. They don't recommend using 2.4 on Windows yet; it should be 2.3.1+cu121 if I recall correctly. There are some issues with 2.4.0 and Windows at the moment (unless you're using a nightly build). Also, xformers isn't really needed these days, since PyTorch is usually equal to or better.
> And some older-generation GPUs have half-speed fp16 operations, which these weights are likely being cast to and/or calculated at.

I tried installing Torch 2.3.1 one or two days ago, and the FP8 speed improved to around 10 s/it, which was still much slower than FP16. So I switched back to Torch 2.4.0 (the official release) and found a few hours ago that the FP8 speed had increased significantly (I happened to try FP8 again at that time), even with Torch 2.4.0.
I don't know the exact reason. I'd searched through all kinds of analyses and possible solutions before, like many others with the same problem, and never managed to fix this FP8 speed issue, but now it seems to be fixed.
I tested xformers and PyTorch cross-attention again with the current ComfyUI update; xformers was a bit faster, roughly 3.6 s/it vs 3.9 s/it.
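
A rough micro-benchmark like the one below can compare the two attention backends directly, outside of ComfyUI. The shapes are made up, and note that xformers expects [batch, seq, heads, head_dim] while PyTorch's scaled_dot_product_attention expects [batch, heads, seq, head_dim]:

```python
# Rough micro-benchmark of the two attention backends, outside of ComfyUI.
# Shapes are arbitrary. xformers expects [batch, seq, heads, head_dim];
# PyTorch's scaled_dot_product_attention expects [batch, heads, seq, head_dim].
import time
import torch
import torch.nn.functional as F
import xformers.ops as xops

b, s, h, d = 1, 4096, 24, 128
q, k, v = (torch.randn(b, s, h, d, device="cuda", dtype=torch.float16)
           for _ in range(3))

def bench(fn, iters=20):
    fn()  # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

t_xf = bench(lambda: xops.memory_efficient_attention(q, k, v))
t_sdpa = bench(lambda: F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)))
print(f"xformers: {t_xf * 1e3:.2f} ms/iter, PyTorch SDPA: {t_sdpa * 1e3:.2f} ms/iter")
```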
> > > torch 2.4.0+CUDA12.1 with xformers 0.0.27.post2
> >
> > This might be a problem. They don't recommend using 2.4 on Windows yet; it should be 2.3.1+cu121 if I recall correctly. There are some issues with 2.4.0 and Windows at the moment (unless you're using a nightly build). Also, xformers isn't really needed these days, since PyTorch is usually equal to or better. And some older-generation GPUs have half-speed fp16 operations, which these weights are likely being cast to and/or calculated at.
>
> I tried installing Torch 2.3.1 one or two days ago, and the FP8 speed improved to around 10 s/it, which was still much slower than FP16. So I switched back to Torch 2.4.0 (the official release) and found a few hours ago that the FP8 speed had increased significantly (I happened to try FP8 again at that time), even with Torch 2.4.0.
> I don't know the exact reason. I'd searched through all kinds of analyses and possible solutions before, like many others with the same problem, and never managed to fix this FP8 speed issue, but now it seems to be fixed.
> I tested xformers and PyTorch cross-attention again with the current ComfyUI update; xformers was a bit faster, roughly 3.6 s/it vs 3.9 s/it.

Which version of xformers did you install?
xformers 0.0.27.post2