Flux and Nvidia P40 - force-fp32 needed on latest source code.
Expected Behavior
Graphics card: Nvidia Tesla P40
Model: Flux Schnell (float8_e4m3fn format)
After commit 8115d8cce97a3edaaad8b08b45ab37c6782e1cb4, ComfyUI takes about 3x as long to generate an image with Flux Schnell, from ~200 seconds to ~600 seconds. Starting ComfyUI with --force-fp32 restores the old generation speed.
Tested with the latest commit (currently f1d6cef71c70719cc3ed45a2455a4e5ac910cd5e); same behavior.
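For reference, the workaround is simply launching ComfyUI with the flag:

```
python main.py --force-fp32
```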
Actual Behavior
As described above: without --force-fp32, generation on the P40 takes roughly 3x as long.
Steps to Reproduce
Run the latest ComfyUI with Flux Schnell on a P40 and compare generation times with and without --force-fp32.
Debug Logs
With --force-fp32:
[START] Security scan
[DONE] Security scan
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2024-08-14 21:55:11.939544
** Platform: Linux
** Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
** Python executable: ComfyUI/venv/bin/python
** ComfyUI Path: ComfyUI
** Log path: ComfyUI/comfyui.log
Prestartup times for custom nodes:
1.3 seconds: ComfyUI/custom_nodes/ComfyUI-Manager
Total VRAM 24439 MB, total RAM 128783 MB
pytorch version: 2.4.0+cu121
Forcing FP32, if this improves things please report it.
Set vram state to: NORMAL_VRAM
Device: cuda:0 Tesla P40 : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: ComfyUI/web
ComfyUI/venv/lib/python3.10/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
2024-08-14 21:55:20.191 | WARNING | Jovimetrix.core.utility:<module>:51 - no gifski support
2024-08-14 21:55:20.520 | ERROR | Jovimetrix.sup.stream:<module>:33 - NO SPOUT GL SUPPORT
2024-08-14 21:55:20.520 | ERROR | Jovimetrix.sup.stream:<module>:34 - No module named 'SpoutGL'
ALSA lib seq_hw.c:466:(snd_seq_hw_open) open /dev/snd/seq failed: Permission denied
2024-08-14 21:55:20.540 | ERROR | Jovimetrix.sup.midi:midi_device_names:48 - midi devices are offline
### Loading: ComfyUI-Manager (V2.48.4)
### ComfyUI Revision: 2539 [f1d6cef7] | Released on '2024-08-14'
Import times for custom nodes:
0.0 seconds: ComfyUI/custom_nodes/websocket_image_save.py
0.0 seconds: ComfyUI/custom_nodes/ComfyUI-Manager
0.7 seconds: ComfyUI/custom_nodes/comfyui-dynamicprompts
2.1 seconds: ComfyUI/custom_nodes/Jovimetrix
Starting server
To see the GUI go to: http://0.0.0.0:8188
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
[ WARN:[email protected]] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video0): can't open camera by index
[ERROR:[email protected]] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
[ WARN:[email protected]] global cap_v4l.cpp:999 open VIDEOIO(V4L2:/dev/video1): can't open camera by index
[ERROR:[email protected]] global obsensor_uvc_stream_channel.cpp:158 getStreamChannelGroup Camera index out of range
FETCH DATA from: ComfyUI/custom_nodes/ComfyUI-Manager/extension-node-map.json [DONE]
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float32
model_type FLOW
New prompt: A pretty picture of a cat warrior fighting a band of mouse warriors, conan style
Requested to load FluxClipModel_
Loading 1 new model
clip missing: ['text_projection.weight']
Requested to load Flux
Loading 1 new model
25%|██████████████████████████████████████████████████████████▊ | 1/4 [00:22<01:06, 22.24s/it]
75%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 3/4 [01:39<00:35, 35.04s/it]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:18<00:00, 34.58s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 195.32 seconds
FETCH DATA from: ComfyUI/custom_nodes/ComfyUI-Manager/extension-node-map.json [DONE]
got prompt
New prompt: A pretty picture of a cat warrior fighting a band of mouse warriors, conan style
loaded completely 20862.241284454347 11340.293029785156
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [02:18<00:00, 34.62s/it]
Prompt executed in 166.50 seconds
Without --force-fp32:
[START] Security scan
[DONE] Security scan
## ComfyUI-Manager: installing dependencies done.
** ComfyUI startup time: 2024-08-14 14:27:18.482763
** Platform: Linux
** Python version: 3.10.12 (main, Jul 29 2024, 16:56:48) [GCC 11.4.0]
** Python executable: ComfyUI/venv/bin/python
** ComfyUI Path: ComfyUI
** Log path: ComfyUI/comfyui.log
Prestartup times for custom nodes:
1.5 seconds: ComfyUI/custom_nodes/ComfyUI-Manager
Total VRAM 24439 MB, total RAM 128783 MB
pytorch version: 2.4.0+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 Tesla P40 : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: ComfyUI/web
ComfyUI/venv/lib/python3.10/site-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
2024-08-14 14:27:26.699 | WARNING | Jovimetrix.core.utility:<module>:51 - no gifski support
2024-08-14 14:27:27.026 | ERROR | Jovimetrix.sup.stream:<module>:33 - NO SPOUT GL SUPPORT
2024-08-14 14:27:27.027 | ERROR | Jovimetrix.sup.stream:<module>:34 - No module named 'SpoutGL'
ALSA lib seq_hw.c:466:(snd_seq_hw_open) open /dev/snd/seq failed: Permission denied
2024-08-14 14:27:27.047 | ERROR | Jovimetrix.sup.midi:midi_device_names:48 - midi devices are offline
### Loading: ComfyUI-Manager (V2.48.4)
### ComfyUI Revision: 2492 [66d42332] *DETACHED | Released on '2024-08-08'
Import times for custom nodes:
0.0 seconds: ComfyUI/custom_nodes/websocket_image_save.py
0.1 seconds: ComfyUI/custom_nodes/ComfyUI-Manager
0.7 seconds: ComfyUI/custom_nodes/comfyui-dynamicprompts
2.0 seconds: ComfyUI/custom_nodes/Jovimetrix
Starting server
To see the GUI go to: http://0.0.0.0:8188
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/alter-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/model-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/github-stats.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/custom-node-list.json
[ComfyUI-Manager] default cache updated: https://raw.githubusercontent.com/ltdrdata/ComfyUI-Manager/main/extension-node-map.json
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type FLOW
New prompt: A pretty picture of a cat warrior fighting a band of mouse warriors, conan style
Model doesn't have a device attribute.
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
Loading 1 new model
Requested to load Flux
Loading 1 new model
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [08:15<00:00, 123.90s/it]
Using pytorch attention in VAE
Using pytorch attention in VAE
Model doesn't have a device attribute.
Requested to load AutoencodingEngine
Loading 1 new model
Prompt executed in 596.66 seconds
Other
No response
I have a GTX 1070 (8GB).
Using --force-fp32 also improves my inference speed (by about 50%).
While the fp8 model is loading, its weights need to be upcast to another format. This commit allows them to be upcast to fp16 instead of fp32, so using --force-fp32 seems to revert what the commit does.
I think the purpose of the commit is to save RAM while the model is loading, but in my case it does not work (I still need more than 32 GB of RAM while loading the fp8 model).
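A minimal sketch of why the upcast target matters for memory (the tensor size here is arbitrary, purely for illustration): casting fp8 weights to fp16 doubles their footprint, while casting to fp32 quadruples it.

```python
import torch

# 1 MiB of fp8 weights (illustrative size only)
w8 = torch.zeros(1024, 1024, dtype=torch.float8_e4m3fn)

for dtype in (torch.float16, torch.float32):
    up = w8.to(dtype)  # upcast, since fp8 can't be used directly for compute
    mib = up.element_size() * up.nelement() / 2**20
    print(f"{dtype}: {mib:.0f} MiB")  # fp16 -> 2 MiB, fp32 -> 4 MiB
```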
What I still don't understand is why upcasting to fp16 makes inference slower.
(GitHub just went down for a while here.)
Anyway, I think I've found why this is slower for us on a Pascal GPU:
https://forums.developer.nvidia.com/t/fp16-support-on-gtx-1060-and-1080/53256/2
"All GPUs with compute capability 6.1 (e.g. GTX 1050, 1060, 1070, 1080, Pascal Titan X, Titan Xp, Tesla P40, etc.) have low-rate FP16 performance."
Also, I think this is why Invoke AI does not recommend these cards (https://invoke-ai.github.io/InvokeAI/installation/INSTALL_REQUIREMENTS/): they cannot operate at half precision.
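A quick way to confirm this on your own card (a hedged sketch; device index 0 and the 4096-size matmul are arbitrary choices): check the compute capability, then time the same matmul in fp32 and fp16. On compute capability 6.1 parts like the P40 and GTX 10xx, the fp16 run should come out far slower.

```python
import time
import torch

major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), f"compute capability {major}.{minor}")
# 6.1 (GTX 10xx, Titan Xp, Tesla P40) has low-rate native FP16;
# Volta (7.0) and newer run FP16 at full speed or faster.

def bench(dtype, n=4096, iters=20):
    """Average time of an n x n matmul in the given dtype on cuda:0."""
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

for dtype in (torch.float32, torch.float16):
    print(dtype, f"{bench(dtype) * 1e3:.1f} ms per matmul")
```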
The P40 is slow for Stable Diffusion but fast enough for LLMs in LM Studio or Ollama with GGUF models.
- What drivers do you use on Windows? GRID? Data center?