
Extremely Slow VAE decode using Qwen model after the 0.3.54 update

Open LiJT opened this issue 4 months ago • 2 comments

Custom Node Testing

Expected Behavior

Normal VAE decode speed, like Flux. The screenshot below shows Flux Krea generation times.

Image

Flux Krea generation time: KSampler: 21.415 seconds, VAE Decode: 0.392 seconds. VAE decode usually takes less than 1 second.

Actual Behavior

After the ComfyUI 0.3.54 update, why is only the Qwen VAE decode this slow, at the very same resolution as above? If I use the Nunchaku Qwen Image fp4 4-step model, my KSampler generation time is 4.8 seconds and my VAE decode time is 3 seconds, almost identical to the KSampler time. How? It wasn't this slow before.

Qwen Image Fp8

Image

Qwen Image fp8 generation time: KSampler: 54.697 seconds, VAE Decode: 3.851 seconds (almost ten times slower than the Flux VAE decode above)

Nunchaku Qwen Image with 4 Step Lightning

Image

Qwen Image Nunchaku 4-step generation time: KSampler: 4.472 seconds, VAE Decode: 2.151 seconds

Steps to Reproduce

Just run the sample Qwen Image workflow from the templates after updating to ComfyUI 0.3.54.
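For anyone who wants to reproduce these numbers outside the ComfyUI progress output, below is a minimal sketch of how to time a VAE decode pass in isolation with PyTorch. The `DummyVAE` stands in for whichever VAE you load (Flux AutoencodingEngine, Qwen's WanVAE, etc.); the model and latent shapes here are placeholders, not ComfyUI internals. The `torch.cuda.synchronize()` calls matter, because CUDA kernels run asynchronously and a naive timer would otherwise stop before the decode actually finishes.

```python
import time
import torch

def time_decode(vae, latent, device="cpu"):
    """Return (wall-clock seconds for one decode pass, decoded image)."""
    latent = latent.to(device)
    if device.startswith("cuda"):
        torch.cuda.synchronize()  # flush any queued kernels before timing
    start = time.perf_counter()
    with torch.no_grad():
        image = vae.decode(latent)
    if device.startswith("cuda"):
        torch.cuda.synchronize()  # wait for the decode kernels to finish
    return time.perf_counter() - start, image

# Placeholder VAE: a single upsampling layer mapping a 4-channel latent
# to a 3-channel image at 8x spatial scale, like typical image VAEs.
class DummyVAE(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.up = torch.nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)

    def decode(self, z):
        return self.up(z)

elapsed, img = time_decode(DummyVAE(), torch.randn(1, 4, 16, 16))
print(f"decode took {elapsed:.3f}s, output shape {tuple(img.shape)}")
```

Timing the decode node this way on the same latent, once on 0.3.54 and once on the previous release, would isolate the regression from KSampler variance.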

Debug Logs

model_type FLUX
Requested to load Flux
loaded completely 16174.2524269104 122.3087158203125 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:08<00:00,  2.28it/s]
Requested to load AutoencodingEngine
loaded completely 8814.538383483887 159.87335777282715 True
comfyui lumi batcher overwrite task done
Prompt executed in 19.60 seconds
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Using scaled fp8: fp8 matrix mult: False, scale input: False
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load QwenImageTEModel_
loaded completely 22496.484112548827 7909.737449645996 True
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Requested to load QwenImage
loaded completely 22016.08407058716 19483.948791503906 True
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:27<00:00,  2.79s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially 128.0 127.9998779296875 0
comfyui lumi batcher overwrite task done
Prompt executed in 47.25 seconds
got prompt
Requested to load QwenImage
loaded completely 28597.275080108644 19483.948791503906 True
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:27<00:00,  2.79s/it]
0 models unloaded.
loaded partially 128.0 127.9998779296875 0
comfyui lumi batcher overwrite task done
Prompt executed in 34.41 seconds
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Using scaled fp8: fp8 matrix mult: False, scale input: False
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
clip missing: ['text_projection.weight']
Requested to load FluxClipModel_
loaded completely 29205.675 4903.231597900391 True
Using scaled fp8: fp8 matrix mult: True, scale input: True
model weight dtype torch.bfloat16, manual cast: None
model_type FLUX
Requested to load Flux
loaded completely 22883.44331436157 11350.088394165039 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:19<00:00,  1.03it/s]
Requested to load AutoencodingEngine
loaded completely 4173.6408767700195 159.87335777282715 True
comfyui lumi batcher overwrite task done
Prompt executed in 29.06 seconds
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Using scaled fp8: fp8 matrix mult: False, scale input: False
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load QwenImageTEModel_
loaded completely 29205.675 7909.737449645996 True
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Requested to load QwenImage
loaded completely 22764.77462615967 19483.948791503906 True
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:51<00:00,  2.57s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially 128.0 127.9998779296875 0
comfyui lumi batcher overwrite task done
Prompt executed in 65.60 seconds

Other

No response

LiJT avatar Aug 28 '25 15:08 LiJT

Thanks for the heads up, I will downgrade.
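For reference, a sketch of one way to roll back, assuming a git-clone install of ComfyUI; the tag name is an assumption based on the version in this report, so check `git tag` for the actual previous release:

```shell
# Roll a git-cloned ComfyUI back to the release before 0.3.54.
cd ComfyUI
git fetch --tags
git checkout v0.3.53              # previous tagged release (assumption)
pip install -r requirements.txt   # re-sync dependencies for that tag
```

Portable/standalone builds would instead need the matching release archive from the GitHub releases page.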

trollver9000 avatar Sep 10 '25 20:09 trollver9000

Same problem with the Qwen Image GGUF 4-step model. How can I downgrade? Thanks.

gandolfi974 avatar Dec 16 '25 11:12 gandolfi974