
weight_dtype fp8_e4m3fn slower than weight_dtype default problem seems fixed today

Open deepfree2023 opened this issue 1 year ago • 8 comments

Your question

Just mentioning this for everyone who's had this problem before.

I'm getting 3.6x s/it today instead of about 20 s/it before, with weight_dtype fp8_e4m3fn.

And it's still about 5 s/it with weight_dtype default.
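In case anyone wants to sanity-check this on their own setup, here's a rough sketch (my own, not from ComfyUI's code; tensor sizes are arbitrary) that checks whether the installed torch build exposes the fp8_e4m3fn dtype and times the cast-to-fp16 matmul path that GPUs without native fp8 compute likely fall back to:

```python
import time
import torch

print(torch.__version__, torch.version.cuda)
print("float8_e4m3fn available:", hasattr(torch, "float8_e4m3fn"))

if torch.cuda.is_available() and hasattr(torch, "float8_e4m3fn"):
    # Weights stored as fp8 are typically cast up to fp16 before the matmul
    # on GPUs without fp8 tensor cores, so fp8 storage saves VRAM, not compute.
    w8 = torch.randn(4096, 4096, device="cuda").to(torch.float8_e4m3fn)
    x = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(10):
        y = x @ w8.to(torch.float16)  # cast-then-matmul path
    torch.cuda.synchronize()
    print("fp8-stored / fp16-compute matmul:", (time.time() - t0) / 10, "s")
```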

Logs

No response

Other

No response

deepfree2023 avatar Aug 20 '24 10:08 deepfree2023

I've just updated.

For me Flux fp8 is still slower than Flux fp16.

I'm on Windows 10, GTX 1070 (8 GB VRAM), 32 GB RAM.

I'm using the UNET loader.

[image: unet]

JorgeR81 avatar Aug 20 '24 11:08 JorgeR81

Have you updated all nodes?

I don't know if it's related to the new UI change; the main difference in the log seems to be "[rgthree] NOTE: Will NOT use rgthree's optimized recursive execution as ComfyUI has changed."

I'm using Windows 10, torch 2.4.0+CUDA12.1 with xformers 0.0.27.post2, 12 GB VRAM + 64 GB RAM, and the UNET loader.

The slow FP8 problem really bothered me for a long time.
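For reference, this is roughly the check I'd run in the ComfyUI Python environment to confirm which torch / CUDA / xformers builds are actually being picked up (just a quick sketch, nothing ComfyUI-specific):

```python
import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))

try:
    import xformers
    print("xformers:", xformers.__version__)
except ImportError:
    print("xformers not installed (PyTorch cross-attention will be used)")
```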

deepfree2023 avatar Aug 20 '24 15:08 deepfree2023

I updated everything via the Manager.


I don't know if it's related to the new UI change; the main difference in the log seems to be "[rgthree] NOTE: Will NOT use rgthree's optimized recursive execution as ComfyUI has changed."

Yeah, that seems to be related to the new UI: https://github.com/rgthree/rgthree-comfy/issues/304

JorgeR81 avatar Aug 20 '24 15:08 JorgeR81

torch 2.4.0+CUDA12.1 with xformers 0.0.27.post2

This might be a problem. They don't recommend using 2.4 on Windows yet; it should be 2.3.1+cu121 if I recall correctly. There are some issues with 2.4.0 and Windows at the moment (unless you're using a nightly build). Also, xformers isn't really needed these days, since PyTorch is usually equal to or better.

And some older-generation GPUs have half-speed fp16 operations, which these weights are likely being cast to and/or computed in.
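A quick way to see whether a given card is one of those is to time a matmul in both precisions; a rough sketch (my own, assumes a CUDA GPU, sizes arbitrary):

```python
import time
import torch

def bench(dtype, n=4096, iters=20):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

# On cards without fast fp16 (e.g. older Pascal GPUs) fp16 can come out slower.
print("fp32 matmul:", bench(torch.float32), "s")
print("fp16 matmul:", bench(torch.float16), "s")
```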

RandomGitUser321 avatar Aug 20 '24 17:08 RandomGitUser321

And some older generation GPUs have half speed fp16 operations, which these weights are likely being cast to and/or calculated at.

I have a Pascal GPU, so this commit (https://github.com/comfyanonymous/ComfyUI/commit/8115d8cce97a3edaaad8b08b45ab37c6782e1cb4) made generation slower for me, both with Flux fp16 and fp8.

But after the recent performance improvements, Flux fp16 got even faster than before. Now Flux GGUF and Flux fp16 have about the same speed for me; only Flux fp8 is slower.

JorgeR81 avatar Aug 20 '24 18:08 JorgeR81

torch 2.4.0+CUDA12.1 with xformers 0.0.27.post2

This might be a problem. They don't recommend using 2.4 on Windows yet; it should be 2.3.1+cu121 if I recall correctly. There are some issues with 2.4.0 and Windows at the moment (unless you're using a nightly build). Also, xformers isn't really needed these days, since PyTorch is usually equal to or better.

And some older-generation GPUs have half-speed fp16 operations, which these weights are likely being cast to and/or computed in.

I tried installing Torch 2.3.1 a day or two ago, and the FP8 speed improved to 10.x s/it, which was still much slower than FP16. So I switched back to Torch 2.4.0 (official release), and a few hours ago I found that the FP8 speed had improved significantly (I happened to try FP8 at that time), even on Torch 2.4.0.

I don't know the exact reason. Like lots of others with the same problem, I had searched through all kinds of analyses and possible solutions before and never managed to fix this FP8 speed issue, but now it seems to be fixed.

I tested xformers and PyTorch cross-attention again with the current ComfyUI update; xformers was a bit faster, 3.6x s/it vs 3.9x s/it.
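For anyone wanting to reproduce that comparison outside ComfyUI, here's a rough micro-benchmark sketch (my own; shapes are arbitrary and it assumes xformers is installed on a CUDA build) of PyTorch's built-in scaled_dot_product_attention vs xformers' memory-efficient attention, which is roughly what the two settings switch between:

```python
import time
import torch
import torch.nn.functional as F
import xformers.ops as xops

# (batch, seq, heads, head_dim) layout for xformers; SDPA wants heads before seq.
q = torch.randn(1, 4096, 16, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

def bench(fn, iters=20):
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

print("PyTorch SDPA:", bench(lambda: F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2))), "s")
print("xformers:    ", bench(lambda: xops.memory_efficient_attention(q, k, v)), "s")
```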

deepfree2023 avatar Aug 20 '24 18:08 deepfree2023

I tested xformers and PyTorch cross-attention again with the current ComfyUI update; xformers was a bit faster, 3.6x s/it vs 3.9x s/it.

Which version of xformers did you install?

omarei-omoto avatar Aug 21 '24 11:08 omarei-omoto

xformers 0.0.27.post2

deepfree2023 avatar Aug 21 '24 13:08 deepfree2023