ComfyUI SDXL generates black images with new --fast arg

Expected Behavior

Actual Behavior

SD15 and Flux work fine, the problem is only with SDXL

ComfyUI version: https://github.com/comfyanonymous/ComfyUI/commit/bb4416dd5b2d7c2f34dc17e18761dd6b3d8b6ead

Steps to Reproduce

default workflow with SDXL model

Debug Logs

V:\comfyu_py311>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast --fp8_e4m3fn-unet --disable-all-custom-nodes --temp-directory "a:\comfyui-temp" --port 8190
Total VRAM 16376 MB, total RAM 130998 MB
pytorch version: 2.4.0+cu121
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4080 : cudaMallocAsync
Using pytorch cross attention
Setting temp directory to: a:\comfyui-temp\temp
[Prompt Server] web root: V:\comfyu_py311\ComfyUI\web
Adding extra search path checkpoints V:/auto1111-webui/models/Stable-diffusion
Adding extra search path configs V:/auto1111-webui/models/Stable-diffusion
Adding extra search path vae V:/auto1111-webui/models/VAE
Adding extra search path loras V:/auto1111-webui/models/Lora
Adding extra search path loras V:/auto1111-webui/models/LyCORIS
Adding extra search path upscale_models V:/auto1111-webui/models/ESRGAN
Adding extra search path upscale_models V:/auto1111-webui/models/RealESRGAN
Adding extra search path upscale_models V:/auto1111-webui/models/SwinIR
Adding extra search path embeddings V:/auto1111-webui/models/embeddings
Adding extra search path hypernetworks V:/auto1111-webui/models/hypernetworks
Adding extra search path controlnet V:/auto1111-webui/models/ControlNet
V:\comfyu_py311\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
Skipping loading of custom nodes
Starting server

To see the GUI go to: http://127.0.0.1:8190
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
V:\comfyu_py311\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Requested to load SDXLClipModel
Loading 1 new model
loaded completely 0.0 1560.802734375 True
V:\comfyu_py311\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
  out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00,  5.98it/s]
Requested to load AutoencoderKL
Loading 1 new model
loaded completely 0.0 159.55708122253418 True
V:\comfyu_py311\ComfyUI\nodes.py:1498: RuntimeWarning: invalid value encountered in cast
  img = Image.fromarray(np.clip(i, 0, 255).astype(np.uint8))
Prompt executed in 12.89 seconds
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
Requested to load SDXLClipModel
Loading 1 new model
loaded completely 0.0 1560.802734375 True
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00,  6.66it/s]
Requested to load AutoencoderKL
Loading 1 new model
loaded completely 0.0 159.55708122253418 True
Prompt executed in 37.42 seconds

Aug 23 '24 10:08 bananasss00

Your torch version is cu121. Update it to cu124 version.

Aug 23 '24 22:08 ltdrdata

Same problem with cu124. The first generation (SD15) is fine; the second (SDXL) comes out black.

Debug log:

V:\comfyu_py311>.\python_embeded\python.exe -s ComfyUI\main.py --windows-standalone-build --fast --fp8_e4m3fn-unet --disable-all-custom-nodes --temp-directory "a:\comfyui-temp"
Total VRAM 16376 MB, total RAM 130998 MB
pytorch version: 2.4.0+cu124
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4080 : cudaMallocAsync
Using pytorch cross attention
Setting temp directory to: a:\comfyui-temp\temp
[Prompt Server] web root: V:\comfyu_py311\ComfyUI\web
Adding extra search path checkpoints V:/auto1111-webui/models/Stable-diffusion
Adding extra search path configs V:/auto1111-webui/models/Stable-diffusion
Adding extra search path vae V:/auto1111-webui/models/VAE
Adding extra search path loras V:/auto1111-webui/models/Lora
Adding extra search path loras V:/auto1111-webui/models/LyCORIS
Adding extra search path upscale_models V:/auto1111-webui/models/ESRGAN
Adding extra search path upscale_models V:/auto1111-webui/models/RealESRGAN
Adding extra search path upscale_models V:/auto1111-webui/models/SwinIR
Adding extra search path embeddings V:/auto1111-webui/models/embeddings
Adding extra search path hypernetworks V:/auto1111-webui/models/hypernetworks
Adding extra search path controlnet V:/auto1111-webui/models/ControlNet
V:\comfyu_py311\python_embeded\Lib\site-packages\kornia\feature\lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)
Skipping loading of custom nodes
Starting server

To see the GUI go to: http://127.0.0.1:8188
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
V:\comfyu_py311\python_embeded\Lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
Requested to load SD1ClipModel
Loading 1 new model
loaded completely 0.0 235.84423828125 True
V:\comfyu_py311\ComfyUI\comfy\ldm\modules\attention.py:407: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
out = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0, is_causal=False)
Requested to load BaseModel
Loading 1 new model
loaded completely 0.0 819.703067779541 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:01<00:00, 17.15it/s]
Requested to load AutoencoderKL
Loading 1 new model
loaded completely 0.0 159.55708122253418 True
Prompt executed in 8.37 seconds
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
Requested to load SDXLClipModel
Loading 1 new model
loaded completely 0.0 1560.802734375 True
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:02<00:00,  8.51it/s]
Requested to load AutoencoderKL
Loading 1 new model
loaded completely 0.0 159.55708122253418 True
V:\comfyu_py311\ComfyUI\nodes.py:1498: RuntimeWarning: invalid value encountered in cast
img = Image.fromarray(np.clip(i, 0, 255).astype(np.uint8))
Prompt executed in 36.40 seconds

Aug 24 '24 07:08 bananasss00

I attempted to download the latest ComfyUI from https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.1.2. Afterward, I updated ComfyUI and torch+cu124. However, the issue with the SDXL model persists.

Aug 24 '24 07:08 bananasss00

--fp8_e4m3fn-unet This option is the problem.

Aug 24 '24 07:08 ltdrdata

The --fast optimization is specifically designed for fp8_e4m3fn. If I disable it, there will be no optimization

Aug 24 '24 08:08 bananasss00

In fact, the background for introducing that option was to help when loading and using FLUX.1 through the Load Diffusion Model function. It seems that the CLI option in question had not been tested. I have already communicated this issue to comfy.

Aug 24 '24 08:08 ltdrdata

Understood, thank you. It's strange that with this option, SD15 and Flux work normally.

Aug 24 '24 08:08 bananasss00

I found that the pony base model works fine with --fp8_e4m3fn-unet, however other SDXL variants do not.

Aug 26 '24 20:08 jkrauss82

any updates?

Sep 08 '24 15:09 Remember2015

Any updates??

Sep 22 '24 03:09 adamjen

Does the issue persist after using the fixed SDXL vae model?

Nov 18 '24 01:11 jetjodh

Yes, it persists. The preview image is already black while the KSampler is running. It does not seem to be related to the VAE: VAE encode/decode runs fine with fp8, otherwise pony would have the same problem. I am using the regular fp16-fixed vanilla SDXL VAE.

Nov 18 '24 06:11 jkrauss82

If you remove --fast from your ComfyUI startup command, does the issue persist?

Nov 18 '24 07:11 adamjen

No, but the point of this issue is that SDXL is not working with --fast. Without --fast things work as normal.

Currently, only pony models seem to work with --fast, older SDXL derivatives like Juggernaut or StarlightXL do not.

Nov 18 '24 07:11 jkrauss82

I did some digging and the problem seems related to this function used in forward_comfy_cast_weights. With only super-limited knowledge of what is going on under the hood I can only speculate we are getting some kind of "out of bounds" error where the available range covered by e4m3 float is not enough to support the range the forward step needs, causing the latent values to be corrupted.

I was hoping to try with e5m2, but this is not supported by cuBLASLt, as pointed out by knowledgeable people here (thus it is also not eligible for fp8_linear, and ComfyUI won't execute the fast path when this dtype is chosen).

I would imagine that, if my speculation is true, we could mitigate the problem by applying some smarter kind of value scaling or maybe just use the possible min/max values should the function yield values exceeding the range covered by e4m3.
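To make the range argument concrete, here is a minimal sketch that derives the largest finite value of the two fp8 formats from their bit layouts. The helper name max_finite is hypothetical and the code is plain Python, not taken from ComfyUI or torch; it only illustrates why e4m3 runs out of range so much earlier than e5m2.

```python
def max_finite(exp_bits, man_bits, bias, ieee_like):
    """Largest finite value of a small float format (illustrative helper)."""
    top_exp = (1 << exp_bits) - 1
    if ieee_like:
        # IEEE-style layout (e5m2): the all-ones exponent is reserved for inf/NaN.
        exp_field = top_exp - 1
        mantissa = (1 << man_bits) - 1
    else:
        # "fn" layout (e4m3fn): only all-ones exponent plus all-ones mantissa is NaN,
        # so the top exponent is usable with every mantissa except the last one.
        exp_field = top_exp
        mantissa = (1 << man_bits) - 2
    return 2.0 ** (exp_field - bias) * (1 + mantissa / (1 << man_bits))


E4M3FN_MAX = max_finite(exp_bits=4, man_bits=3, bias=7, ieee_like=False)
E5M2_MAX = max_finite(exp_bits=5, man_bits=2, bias=15, ieee_like=True)

print(E4M3FN_MAX)  # 448.0
print(E5M2_MAX)    # 57344.0
```

Any activation whose magnitude exceeds 448 has no finite e4m3fn representation, which is consistent with the speculation above.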

I would be happy to help develop and test this further, but I would need some pointers on where to look next from someone with deeper knowledge of this context.

Nov 22 '24 20:11 jkrauss82

After even more digging and trial and error I found that when using the node TorchCompileModel the issue can be resolved and generated images are not black anymore. Downside is that compilation takes quite long (about 225s), after that further executions of the same workflow are fast. Prompt can be changed without re-compilation needed, resolution changes trigger re-compilation.

The speed up of compiled vs. non-compiled (but black image) is there but not as dramatic as fast vs. non-fast.

This observation makes me believe it should be possible to fix the issue somehow to make it work without the need for pre-compiling the model.

Nov 24 '24 16:11 jkrauss82

More observations:

- The latency of TorchCompileModel can be reduced significantly (less than half) by setting the following environment variable: TORCHINDUCTOR_FX_GRAPH_CACHE=1 (docs).
- Decorating fp8_linear in ops.py with @torch.compile removes the initial-generation latency that the TorchCompileModel node incurs and also fixes the black images. However, it is not much faster than the uncompiled graph with --fast, because it compiles only fp8_linear and leaves the rest of the graph alone.
- Setting the torch compiler flag torch._dynamo.config.force_parameter_static_shapes = False allows different input tensors (e.g. a changed image/latent size) to be used without issue when using the function decoration.
- The black image appears again when, for example, changing the batch size or adding a LoRA, since these actions trigger a re-compilation of the graph, which seems to bring back whatever is causing the issue.
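The two settings mentioned above could be applied roughly as follows. This is only a sketch under assumptions: it sets the environment variable in-process before torch is imported (setting it in the shell that launches ComfyUI should work equally well), and maybe_compile is a hypothetical helper standing in for the @torch.compile decoration of fp8_linear. The torch import is guarded so the snippet also runs where torch is not installed.

```python
import os

# Assumption: the variable is set before torch is imported, so that
# Inductor sees it when its configuration is loaded.
os.environ["TORCHINDUCTOR_FX_GRAPH_CACHE"] = "1"

try:
    import torch
    import torch._dynamo

    # Flag quoted in the observations above.
    torch._dynamo.config.force_parameter_static_shapes = False
    HAVE_TORCH = True
except (ImportError, AttributeError):
    # torch missing, or a torch version without this config entry.
    HAVE_TORCH = False


def maybe_compile(fn):
    """Hypothetical helper: wrap fn with torch.compile when torch is usable."""
    return torch.compile(fn) if HAVE_TORCH else fn
```

In a Windows command prompt the variable can instead be set with `set TORCHINDUCTOR_FX_GRAPH_CACHE=1` before starting ComfyUI.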

Nov 30 '24 09:11 jkrauss82

Change this line: https://github.com/comfyanonymous/ComfyUI/blob/v0.3.6/comfy/ops.py#L272

to this:

inn = torch.clamp(input, min=-448, max=448).reshape(-1, input.shape[2]).to(dtype)

worked for me, with no apparent visual or speed degradation~

Basically, the input might overflow the value range of torch.float8_e4m3fn (i.e. ±448), causing inn to be NaN, which in turn causes the calculated o to be NaN as well. Simply adding a clamp seems to fix the issue. (It does not need @torch.compile either.)
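The mechanism can be illustrated with a toy model in plain Python. The function toy_cast_e4m3fn below is an assumption made for illustration: it models only the overflow behaviour (out-of-range magnitudes become NaN) and ignores fp8 rounding, so the exact threshold of a real torch cast may differ. It shows how one out-of-range activation poisons a whole dot product and how clamping first avoids that.

```python
import math

E4M3FN_MAX = 448.0  # largest finite magnitude of float8_e4m3fn


def toy_cast_e4m3fn(x):
    """Toy model of the cast: overflow becomes NaN, rounding is ignored."""
    return x if abs(x) <= E4M3FN_MAX else math.nan


def clamp(x, lo=-E4M3FN_MAX, hi=E4M3FN_MAX):
    """Limit x to the representable range before casting."""
    return max(lo, min(hi, x))


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


activations = [0.5, -3.0, 1200.0]  # one value exceeds the fp8 range
weights = [1.0, 2.0, 0.25]

unclamped = dot([toy_cast_e4m3fn(x) for x in activations], weights)
clamped = dot([toy_cast_e4m3fn(clamp(x)) for x in activations], weights)

print(unclamped)  # nan
print(clamped)    # 106.5
```

The clamped result is finite but not identical to the full-precision dot product (300.5 - 5.5 = 294.5 here), so clamping trades a NaN for a bounded error on the rare out-of-range values.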

Dec 04 '24 16:12 Haoming02

I can confirm this is working, thank you very much for the suggestion! I think you should submit a PR for this.

I wonder if the clamping should not be applied for each of the processing steps within fp8_linear though, namely in this line as well.

Also, does torch handle values exceeding the available range for fp8 e4m3 itself when doing _scaled_mm here or should the output of this operation be clamped as well to be sure?

Dec 05 '24 09:12 jkrauss82

Regarding "be applied for each of the processing steps":

In my hours of testing last night, NaN only ever occurred within the if scale_input is None: path; looking at the else: path, the input is scaled by (1.0 / scale_input), which probably solves the overflow? Though, I did not actually check whether the else: path was ever executed.
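A rough sketch of why scaling the input by (1.0 / scale_input) would avoid the overflow, in plain Python. The scale value here is chosen by hand for the example; how ComfyUI actually obtains scale_input is not shown in this thread, so treat the numbers and names as assumptions.

```python
E4M3FN_MAX = 448.0  # largest finite magnitude of float8_e4m3fn


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


activations = [0.5, -3.0, 896.0]  # 896 is outside the fp8 range
weights = [1.0, 2.0, 0.25]

# Hypothetical scale: chosen so the largest magnitude lands exactly on 448.
scale_input = 2.0

# Scale down before the (imagined) fp8 cast ...
scaled = [x * (1.0 / scale_input) for x in activations]
in_range = all(abs(x) <= E4M3FN_MAX for x in scaled)

# ... and multiply the scale back into the result afterwards.
result = dot(scaled, weights) * scale_input

print(in_range)  # True
print(result)    # 218.5
```

Unlike clamping, scaling keeps the result equal to the unscaled dot product (up to the rounding a real fp8 cast would add), which fits the observation that NaN only showed up in the unscaled path.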

Regarding "does torch handle values exceeding the available range for fp8 itself when doing _scaled_mm":

Since the _scaled_mm operation is essentially written for fp8 specifically, I'd assume the result will always be valid for fp8. (If you try to change the dtype of any of the parameters, the function actually raises errors.)

Dec 05 '24 09:12 Haoming02

I would assume the same. I tried to find a clear comment or documentation in torch about this through web search but found nothing really. The documentation of _scaled_mm is basically nonexistent so far, or my search was bad.

I have tested with models known to cause black images a little more, this time without @torch.compile and images always returned as they should for all settings, batch sizes etc. when the clamping is applied.

So, I would conclude that it suffices to apply clamping in the path where there is no scaled input and leave the rest as is :+1:

Dec 05 '24 10:12 jkrauss82

fixed, thx @Haoming02

Dec 20 '24 19:12 bananasss00