
Blockswap ban renders Wan unusable on 3080 Ti

Open rugabunda opened this issue 1 month ago • 16 comments

Expected Behavior

An 8-second Wan video renders efficiently and effectively.

Actual Behavior

The same workflow is now only capable of rendering 0.5-second videos; VRAM is no longer optimized. Wan is effectively broken.

Others have mentioned this issue here: https://github.com/orssorbit/ComfyUI-wanBlockswap/issues/7

Steps to Reproduce

Update ComfyUI.

Debug Logs

What's to log?

Other

This blockswap mod is specifically what allowed me to use Wan in ComfyUI. The recent updates that led to the wanBlockswap ban render Wan unusable on a 12 GB 3080 Ti.

rugabunda avatar Nov 20 '25 12:11 rugabunda

If you want it fixed, post your full logs and workflow.

comfyanonymous avatar Nov 20 '25 20:11 comfyanonymous

Just delete /comfy_extras/nodes_nop.py to make wanBlockSwap work again. I don't know why the author considers it a "placebo" when it simply allows us to run models/resolutions/lengths that are impossible to run without BlockSwap.
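
For context on why deleting that file brings the custom node back: the override is, roughly speaking, a replacement node that just passes the model through untouched. Below is a minimal sketch of that pattern using ComfyUI's usual node-registration conventions; the class name is illustrative and the actual contents of nodes_nop.py may well differ.

```python
# Hypothetical sketch of a "no-op override" node. It exposes the same MODEL-in /
# MODEL-out interface a block-swap node would, but does nothing to the model,
# which is what makes the swap a placebo from the workflow's point of view.
class WanBlockSwapNOP:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"model": ("MODEL",)}}

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "apply"
    CATEGORY = "model_patches"

    def apply(self, model):
        # Pass-through: no transformer blocks are actually moved off the GPU.
        return (model,)


NODE_CLASS_MAPPINGS = {"WanBlockSwapNOP": WanBlockSwapNOP}
```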

abudfv20232323-spec avatar Nov 21 '25 12:11 abudfv20232323-spec

There may have been some updates to Comfy, or to custom nodes, or other issues within the workflow or in PyTorch/CUDA version management (such as switching from PyTorch 2.9.1 to 2.7.1; user reports suggest the latter benchmarks faster), proper clearing of the Triton cache, starting with a fresh conda env, or some combination of these, because after some tweaking it seems close to the VRAM efficiency it had before the recent updates. I'm able to push out videos at 7.5 seconds, where previously I could get 8.0 seconds. It seems that without this wanBlockSwap node and 64 GB of RAM, in the workflows I've used so far it's impossible to use Wan GGUF on a 12 GB 3080 Ti. ComfyUI's built-in algorithm for offloading models doesn't cut it. If there is a way to make it work just as well or better in workflows without custom nodes, I would like to know. So far I have not seen it.
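
For anyone retracing the environment-side variables above (PyTorch/CUDA build, Triton cache, fresh env), here is a minimal Python sketch. The only assumption is Triton's default on-disk cache location, ~/.triton/cache, which the TRITON_CACHE_DIR environment variable can override:

```python
import os
import shutil

import torch

# Report which PyTorch / CUDA build this environment is actually running.
device = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu"
print("torch:", torch.__version__, "| cuda:", torch.version.cuda, "| device:", device)

# Clear the Triton kernel cache (default location; TRITON_CACHE_DIR overrides it).
cache_dir = os.environ.get("TRITON_CACHE_DIR", os.path.expanduser("~/.triton/cache"))
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)
    print("cleared", cache_dir)
else:
    print("no Triton cache found at", cache_dir)
```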

rugabunda avatar Nov 21 '25 22:11 rugabunda

Here is the VRAM utilization of WAN 2.2 720P18f (5 seconds) on an RTX 3060 12GB VRAM. I have no custom nodes; I just took the Wan 2.2 I2V template, adjusted the prompt, and upped the resolution to 720P. No other changes. This screenshot shows the VRAM as it switches between the high-noise and low-noise models:

[Image: VRAM utilization screenshot]

No GGUF.

If you have a VRAM OOM there should be a log that backtraces the OOM. It means a lot to us to see your OOM and how it happened.

Your video resolution matters a lot. So far you have only let us know the duration.

rattus128 avatar Nov 22 '25 05:11 rattus128

I became attached to a pre-made workflow that is now outdated; the most popular one on Civitai.

I just tested the same default workflow you suggested, and it worked, but not perfectly. Comfy's auto memory management is impressive, seemingly better than the wanBlockswap extension's. However, performance is less consistent and often does not use the full capacity of the GPU: on this card a good run reaches at least 77 degrees, but here it is often 66-69, doubling the time. This problem tends to go away when monitors are turned off.

That was using

Wan2_2-I2V-A14B-HIGH_fp8_e4m3fn_scaled_KJ.safetensors
Wan2_2-I2V-A14B-LOW_fp8_e4m3fn_scaled_KJ.safetensors

Weight dtype set to fp8_e4m3fn_fast (with the bat flag); result: 151.69 s/it.

Update: tested using

wan2.2_i2v_high_noise_14B_fp8_scaled.safetensors
wan2.2_i2v_low_noise_14B_fp8_scaled.safetensors

with the default weight_dtype; this cut iteration time from about 5 minutes to about 3 minutes (95.19 s/it) on the first high-noise run, still not using full GPU capacity. The second run reached 67.53 s/it on the low pass. This is about as good as it used to be in blockswap workflows.

But this is inconsistent...

Update:

A third run without changing the previous settings was inconsistent: reaching 140 s/it and low temps (65 degrees), it took 4:08 minutes on the HIGH pass.

The low pass was the best yet at 65.03 s/it (2:10/s); this is an 8-second video @ 601x896. That's with no custom nodes.

It's very inconsistent.
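
One way to quantify that inconsistency is to sample GPU utilization directly rather than inferring load from temperature. A small sketch; torch.cuda.utilization() wraps NVML, so it needs the nvidia-ml-py (pynvml) package installed:

```python
import time

import torch

# Print GPU utilization once a second while a generation is running.
# torch.cuda.utilization() requires the nvidia-ml-py (pynvml) package.
for _ in range(10):
    print(f"GPU utilization: {torch.cuda.utilization()}%")
    time.sleep(1.0)
```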

Is torchcompile baked in?

rugabunda avatar Nov 26 '25 03:11 rugabunda

No, the torch compiler is not baked in, but you are correct that it can underutilize VRAM these days. Thanks for those stats.

Can you confirm your system RAM size, your operating system, and the width and generation of your PCIe bus?

I have been meaning to look at further optimizing WAN native for speed sometime in the future and your data helps.
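
In case it helps with answering the PCIe question, here is a sketch that reads the current link generation and width off the GPU; it assumes the nvidia-ml-py (pynvml) package is installed. nvidia-smi -q reports the same information without any extra packages.

```python
import pynvml

# Query the PCIe link the GPU is currently negotiated at.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
print(f"PCIe gen {gen} x{width}")
pynvml.nvmlShutdown()
```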

rattus128 avatar Nov 26 '25 06:11 rattus128

64 GB RAM; 25H2; PCIe 5.0 x16 slot (primary GPU). Glad this helps; happy to contribute.

rugabunda avatar Nov 26 '25 08:11 rugabunda

@rattus128 I accidentally hit close on this thread.

rugabunda avatar Nov 26 '25 08:11 rugabunda

Also noticing that the video output from the new default ComfyUI workflow is relatively slower every single time, tending towards slow motion.

rugabunda avatar Nov 26 '25 08:11 rugabunda

Another quirk: setting frames to 16/17 in the default workflow resulted in 97% VRAM usage every time and a stalled run, requiring a Comfy reboot. Setting it to 81 results in lower VRAM usage (86% currently) and no stalled run.

rugabunda avatar Nov 26 '25 08:11 rugabunda

I also had problems with my 4070 12GB after the ban: VRAM usage went above 12GB, so speed went from ~45 s/it to something like 5 min/it. But I found out you can reserve more VRAM with the launch option --reserve-vram 2. The default is 1 and it was not enough for me, which caused the slowdown. It would've been nice to have this configurable in the settings, but at least it works now.
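
For anyone tuning that number, a quick sketch for checking how much VRAM other programs are already holding before picking a --reserve-vram value (plain PyTorch; the numbers are device-wide, so they include whatever the desktop, browser, and other apps have allocated):

```python
import torch

# How much VRAM is already in use by everything else on this GPU?
free, total = torch.cuda.mem_get_info()  # bytes on the current device
held_by_others = total - free
gib = 1024 ** 3
print(f"total: {total / gib:.1f} GiB | free: {free / gib:.1f} GiB | "
      f"held by other processes: {held_by_others / gib:.1f} GiB")
```

If that last number regularly exceeds the reserve value, raising --reserve-vram as described above is the obvious knob to try.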

ShockerV avatar Nov 28 '25 09:11 ShockerV

Can I get your workflow?

rattus128 avatar Nov 28 '25 09:11 rattus128

Also @ShockerV, it sounds like you are spilling into GPU shared memory. Can I get a screenshot of the GPU VRAM and shared mem numbers when it's running that slow?

rattus128 avatar Nov 28 '25 09:11 rattus128

I don't remember if it was Wan 2.2 or Qwen that I used, but I tried to replicate it by having VRAM at 1.6 GB and trying the same workflow with 2 and 1 reserve-vram. Wan went from 77.05 s/it to 100.81 s/it with shared GPU memory at ~0.5 GB, and Qwen went from 13.15 s/it to 25.40 s/it with a couple of LoRAs and a union ControlNet. Then I tried using Wan with ~300 MB more VRAM in use by starting LM Studio midway through a generation; shared GPU memory went to ~0.7 GB and speed to 270 s/it. So maybe a program used more VRAM just as I started and caused the slowdown. Or maybe this #10733 update fixed it?

Either way, since I'm using my PC while generating I go a bit over 1 GB of VRAM, so it's better for me to be on the safe side and use 2 for reserve-vram.

ShockerV avatar Nov 29 '25 11:11 ShockerV

Talking about Torch Compile, I'm not sure if this is a normal thing or if something's wrong.

I have an RTX 3070 with 8 GB VRAM, and I can generate I2V videos at a maximum of 576×1024 @ 81 frames.

When using Torch Compile with Q8 GGUF + LoRA, the first run takes around 90-100 s/it to reach the 1st step, but subsequent runs drop to about 45 s/it for high/low noise.

But when using FP8 (either FP8 scaled, E4M3FN, or E5M2), the Torch Compile recompilation logs are way longer compared to GGUF Q8; they even push the ComfyUI startup logs off the command prompt. And the results are slower: FP8 + LoRAs takes around 250-300 s/it to reach the 1st step, and then it gets faster, around 40-43 s/it on subsequent runs.

And then, when not using any LoRAs with FP8 (running the base model only), the recompilation is also much shorter, and the time to reach the 1st step is similar to GGUF Q8 + LoRA, around 90-100 s/it.

Is this normal? I’m using Torch Compile from KJ nodes btw.

I posted the full report on the Triton-Windows repo, but it seems like nobody knows: https://github.com/woct0rdho/triton-windows/issues/168. Also, I don't know if that's the right place to ask...

If there's something wrong with this, I can do a more detailed new test if you need.
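
In case it helps narrow down the FP8-vs-GGUF difference: recompilations can be made visible with standard PyTorch logging. This is generic torch.compile behaviour, not anything KJ-nodes-specific, and the tiny function below exists only to trigger a recompile; setting the TORCH_LOGS=recompiles environment variable before launching ComfyUI has the same effect.

```python
import torch
import torch._dynamo as dynamo

# Log every recompilation and graph break, with the reason for each.
torch._logging.set_logs(recompiles=True, graph_breaks=True)

# Dynamo falls back to eager once a function has recompiled this many times;
# raising the limit keeps it compiling instead of silently giving up.
dynamo.config.cache_size_limit = 64

@torch.compile
def silu_mul(x):
    return torch.nn.functional.silu(x) * x

silu_mul(torch.randn(8, 8))    # first call: compile
silu_mul(torch.randn(16, 16))  # new shape: a recompile is logged
```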

muljanis45 avatar Dec 05 '25 12:12 muljanis45

Indeed, I can't figure it out either. Since the ban I get VRAM OOMs; before the ban, v0.368 didn't OOM (rolling back to that version also works fine). With 16 GB VRAM, 24 frames, 5 seconds, it never OOMed. After the ban, nothing I change fixes it. Really frustrating.

boartxp avatar Dec 06 '25 08:12 boartxp