
[Performance Regression] GGUF models run noticeably slower with only ~50% VRAM usage after removing WAN Block Swap node (compared to previous manual offloading behavior at 80-90% VRAM)

Open Mr-small-2-six opened this issue 3 weeks ago • 20 comments

Custom Node Testing

Expected Behavior

In NORMAL_VRAM mode on a 16GB RTX 4080, GGUF models should fully utilize available VRAM (typically 13–15 GB, or 80–95%) just like the old manual block-offloading or the previous WAN Block Swap implementation did. This higher VRAM occupancy historically resulted in significantly faster generation speeds (higher it/s). After the removal of WAN Block Swap, I expected the new async weight offloading system to maintain similar performance and VRAM utilization, not drop to only ~50% VRAM usage with a noticeable slowdown.

Actual Behavior

Environment

GPU: NVIDIA GeForce RTX 4080 16GB
Total VRAM: 16376 MB
Driver: 576.88
PyTorch: 2.8.0+cu129
xFormers: 0.0.33+5d4b92a.d20251203
ComfyUI version: v0.3.76-14-g519c9411 (2025-12-03)
VRAM state: NORMAL_VRAM (no --lowvram / --medvram flags)
Relevant extensions: ComfyUI-Easy-Use v1.3.4, ComfyUI-Impact-Pack V8.28, ComfyUI-Impact-Subpack V1.3.5, Crystools 1.27.4
Using async weight offloading with 2 streams + pinned memory

Problem Description

After the WAN Block Swap node was removed/deprecated, GGUF models now only use ~45-55% of VRAM (~7-9 GB on a 16GB 4080) even in NORMAL_VRAM mode. Subjectively, generation speed feels significantly slower than before, when I manually offloaded parts of the model to system RAM and VRAM usage routinely reached 80-90% (13-14 GB). Back then, iteration times were faster despite the higher VRAM usage.

Question

Is this massive drop in VRAM utilization (and the accompanying slowdown) intended behavior or a regression/bug in the new async weight offloading implementation? Will the WAN Block Swap node (or an equivalent manual block-level offloading control) ever come back? Many users with 16GB cards preferred the old behavior because it gave higher throughput.

Reproduction

1. Load any Q4/Q5 GGUF Wan 2.2 model.
2. Run a simple i2v workflow in NORMAL_VRAM.
3. Observe that VRAM usage stays below ~9 GB and it/s is obviously lower than in previous versions, where VRAM usage was allowed to go much higher via manual offloading.

Thanks for any clarification!

Steps to Reproduce

run any workflow

Debug Logs

got prompt
loaded partially; 2654.55 MB usable, 0.00 MB loaded, 8475.47 MB offloaded, 2700.53 MB buffer reserved, lowvram patches: 0
100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [05:39<00:00, 33.90s/it]
Requested to load WAN21

Other

No response

Mr-small-2-six avatar Dec 03 '25 12:12 Mr-small-2-six

It's not just GGUF models (and not just Wan). I just ran some tests; I was about to open a new bug report, but I guess I could post it here.

@comfyanonymous @rattus128 ComfyUI updated to this commit. Used --disable-pinned-memory and --disable-async-offload for the test.

My specs: RTX3070Ti 8GB + 64GB RAM, PyTorch 2.9.1+cu130

There is still a problem with the offloading. For the test I used the Wan2.2 I2V template workflow and generated a 640x480, 81-frame video.

VRAM usage with all fp16 models (high, low, TE):

With lightx2v LoRAs: 4.2/8.0GB

loaded partially; 1794.55 MB usable, 818.47 MB loaded, 26434.50 MB offloaded, 945.03 MB buffer reserved, lowvram patches: 394
loaded partially; 1787.55 MB usable, 818.47 MB loaded, 26434.50 MB offloaded, 945.03 MB buffer reserved, lowvram patches: 394

Generation speed: 40.13s/it

Without LoRAs: 5.0/8.0GB

loaded partially; 1787.55 MB usable, 1652.52 MB loaded, 25600.45 MB offloaded, 135.03 MB buffer reserved, lowvram patches: 0
loaded partially; 1787.55 MB usable, 1652.52 MB loaded, 25600.45 MB offloaded, 135.03 MB buffer reserved, lowvram patches: 0

Generation speed: 74.15s/it

VRAM usage with all fp8_scaled models (high, low, TE):

With lightx2v LoRAs: 4.3/8.0GB

loaded partially; 1787.55 MB usable, 909.42 MB loaded, 12719.66 MB offloaded, 877.51 MB buffer reserved, lowvram patches: 386
loaded partially; 1787.55 MB usable, 909.42 MB loaded, 12719.66 MB offloaded, 877.51 MB buffer reserved, lowvram patches: 386

Generation speed: 41.99s/it

Without LoRAs: 5.1/8.0GB

loaded partially; 1787.55 MB usable, 1709.58 MB loaded, 11919.50 MB offloaded, 67.51 MB buffer reserved, lowvram patches: 0
loaded partially; 1787.55 MB usable, 1709.58 MB loaded, 11919.50 MB offloaded, 67.51 MB buffer reserved, lowvram patches: 0

Generation speed: 70.44s/it

LukeG89 avatar Dec 03 '25 14:12 LukeG89

My ComfyUI also suffers from VRAM fluctuations. Unlike the old version, it won't load a checkpoint immediately; it spends maybe 3-5 minutes calculating something with RAM and VRAM before loading the model into VRAM and generating. Even VAE decode takes more time calculating something in VRAM and RAM than before. This happens with an Illustrious checkpoint and LoRA: VRAM increases to maximum and then drops to minimum within 1 or 2 seconds, even with a small model like SAM when I use FaceDetailer. And after updating, this phenomenon still appears even if I roll back to an older version.

pivtienduc avatar Dec 03 '25 16:12 pivtienduc

I'm experiencing an extreme slow-down with ComfyUI 0.3.76 with a workflow which does nothing else but loads a umt5 Q4 GGUF CLIP to device cuda:0 (MultiGPU node), encodes a text and saves the condition to disk (Condition-Utils node).

loaded partially: 3114.55 MB usable, 1106.95 MB loaded, 3404.00 MB offloaded, 2003.00 MB buffer reserved, lowvram patches: 0
Interrupting prompt [...]
Processing interrupted
Prompt executed in 00:27:50

This used to take only a few minutes from a cold start, most of which was the model loading into memory. Now it encodes the text seemingly using only the CPU, which is very slow on my system, while GPU utilization is 0%.

I'm not sure how to test this without a custom node because the default behavior in recent releases seems to be to always run the CLIP on CPU which makes it unusable for me without forcing it to GPU.

0.3.75:

loaded partially: 2740.80 MB usable, 2738.95 MB loaded, 1772.00 MB offloaded, lowvram patches: 0
Condition tensor saved to [...]
Prompt executed in 213.15 seconds

hum-ma avatar Dec 03 '25 20:12 hum-ma

I'm experiencing an extreme slow-down with ComfyUI 0.3.76 with a workflow which does nothing else but loads a umt5 Q4 GGUF CLIP to device cuda:0 (MultiGPU node), encodes a text and saves the condition to disk (Condition-Utils node).

Can you retry this without the MultiGPU loader? It is likely to have complex interactions with some recent changes. If you still get this big CLIP slowdown, cut a fresh issue without MultiGPU and include the workflow, or even just a screenshot if it's only a few small nodes.

I'm interested if you have any difference without GGUF too.

rattus128 avatar Dec 03 '25 21:12 rattus128


Your estimated inference VRAM is extremely high. What are your resolution, frame count, and batch size?

I fixed a bug in the reservation estimator last night that might help you use more VRAM but not by that much.

Is there a lora in play?

The estimation can be a bit low on high-res WAN and it's due for a recalibration. I'll look into that. A workflow that exactly matches your data point on a 4080 16GB would help.

rattus128 avatar Dec 03 '25 21:12 rattus128


Do you have a comparison basis for an older version or is it just about the lora performance discrepancy?

Offloaded LoRAs cost both extra bus and compute traffic, and we have the dilemma of either doing per-step LoRA calculation (to avoid having to store a full copy of the model in RAM) or leaving it offloaded in pieces, which is what we do. LoRA pinning will be added, which should help.

If you have a non-changing LoRA setup, what you can do to make this difference go away is merge your LoRA into the model: load model > load LoRA > save model. This is just a workaround though.
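
As a rough illustration of what that merge does (a generic sketch in plain PyTorch, not ComfyUI's loader code; the tensor layout and the merge_lora_into_model helper are made up for the example), merging just folds each low-rank LoRA pair back into the base weight, so nothing has to be patched or offloaded per step afterwards:

import torch

def merge_lora_into_model(base_weights, lora_pairs, scale=1.0):
    # base_weights: dict of weight name -> tensor (the base diffusion model)
    # lora_pairs:   dict of weight name -> (up, down) low-rank factors
    # Standard LoRA merge: W' = W + scale * (up @ down)
    merged = dict(base_weights)
    for name, (up, down) in lora_pairs.items():
        delta = scale * (up.float() @ down.float())
        merged[name] = (merged[name].float() + delta).to(base_weights[name].dtype)
    return merged

Once the merged model is saved, the sampler sees a single set of weights, so there are no lowvram patches and no per-step LoRA math left to offload.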

rattus128 avatar Dec 03 '25 22:12 rattus128

Do you have a comparison basis for an older version or is it just about the lora performance discrepancy?

No, unfortunately I don't have a comparison with previous versions. A couple of days ago I tried to downgrade ComfyUI for the sake of testing it, but apparently I also have to downgrade other things (like Template package) in order to make ComfyUI start up. So I just gave up 😆

But I know for sure that VRAM was fully used in the past, about 7.4/8.0GB (at 7.6 it would usually OOM).

Offloaded lora cost both extra bus and compute traffic and we have the dilemma of either doing per step lora calculation to avoid having to store a full copy of the model it ram or leaving it offloaded in pieces (which we do). Lora pinning will be added which should help.

If you have a non changing lora setup, what you can do though to make this difference go away is merge your lora to the model. Load model > load lora > save model. This is just a workaround though.

Adding LoRAs has a big impact on that, but as you can see from my simple tests, I ran the workflow with and without the lightx2v LoRAs (no additional LoRAs were added to the template), and in both cases VRAM usage was reduced quite a lot compared to my usual expectation (~5GB instead of ~7GB). This alone affects inference time for me and other users on a daily basis.

To find a temporary solution to this problem, I recently experimented with --reserve-vram by adding negative values (e.g. --reserve-vram -1) to "trick" the current memory estimation, and depending on the workflow and models I was using, I could make full use of GPU memory (with the downside of possible OOMs). But that's just a funny trick 😃

LukeG89 avatar Dec 03 '25 23:12 LukeG89

To find a temporary solution to this problem, I recently experimented with --reserve-vram by adding negative values (e.g. --reserve-vram -1) to "trick" the current memory estimation, and depending on the workflow and models I was using, I could make full use of GPU memory (with the downside of possible OOMs). But that's just a funny trick 😃

How's the performance difference with --reserve-vram -1?

rattus128 avatar Dec 04 '25 00:12 rattus128

It's not just with GGUF models (and not just Wan). I just made some tests, I was about to open a new bug report, but I guess I could post it here.

@comfyanonymous @rattus128 ComfyUI updated to this commit. Used --disable-pinned-memory and --disable-async-offload for the test.

@LukeG89 these options are performance critical, especially pinned memory. If you have a problem with these options left on, we can take that as a straight bug report.

Here is my 3060 (only in a PCIe4 x4) with --reserve-vram 4.4 (this reduces my usable VRAM to 8GB - similar to your 3070ti)

Requested to load WAN21
loaded partially; 2580.70 MB usable, 2162.17 MB loaded, 25089.24 MB offloaded, 405.08 MB buffer reserved, lowvram patches: 0
100%|██████████| 2/2 [01:32<00:00, 46.24s/it]
Prompt executed in 135.53 seconds

Here is --reserve-vram 3, which is the equivalent of your -1 (as I have 4GB more VRAM than you):

Requested to load WAN21
loaded partially; 4014.30 MB usable, 3597.45 MB loaded, 23653.96 MB offloaded, 405.08 MB buffer reserved, lowvram patches: 0
100%|██████████| 2/2 [01:31<00:00, 45.97s/it]

This uses 7.7GB of VRAM. I don't get any major slowdown.

I buffer-reserve a little more than you, but you should be outrunning me significantly on compute power, and I assume you have your card in a working x16 primary slot? Try without those performance-disable flags and let us know if the --reserve-vram level (including negative values) makes a difference.
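
For clarity, the arithmetic behind that equivalence, assuming the usable budget is roughly total VRAM minus the reserve (the real estimator also subtracts an inference/buffer estimate, so these are only back-of-the-envelope numbers):

# Hypothetical numbers, not the actual estimator
rtx3060_like_8gb = 12.0 - 4.4    # ~7.6 GB budget: a 12GB card squeezed down to roughly an 8GB card
rtx3060_boosted  = 12.0 - 3.0    # ~9.0 GB budget with --reserve-vram 3
rtx3070ti_neg    = 8.0 - (-1.0)  # ~9.0 GB budget with --reserve-vram -1 on an 8GB card
print(rtx3060_like_8gb, rtx3060_boosted, rtx3070ti_neg)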

rattus128 avatar Dec 04 '25 01:12 rattus128


What does async offload actually do? It’s the default option now, so I assumed it would boost performance… but for me it feels the opposite.

In my usual text2image workflow (SDXL/ZImage + Wan 2.2 low-noise), things are slower. The Wan text encoder takes way longer with async offload turned on. I also noticed the "buffer reserved" number gets bigger when async offload is enabled, and the WAN text encoder shows 0.00 MB loaded.

async offload disabled

Requested to load Lumina2
FETCH ComfyRegistry Data: 50/110
loaded partially; 5600.80 MB usable, 5074.88 MB loaded, 6664.67 MB offloaded, 525.00 MB buffer reserved, lowvram patches: 114
100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:36<00:00,  4.02s/it]
Requested to load AutoencodingEngine
Unloaded partially: 4906.13 MB freed, 168.75 MB remains loaded, 590.62 MB buffer reserved, lowvram patches: 178
loaded completely; 529.00 MB usable, 159.87 MB loaded, full load: True

0: 1024x800 1 Not very manly man face, 60.3ms
Speed: 29.1ms preprocess, 60.3ms inference, 7.3ms postprocess per image at shape (1, 3, 1024, 800)

0: 1024x800 1 Not very manly man face, 9.9ms
Speed: 4.8ms preprocess, 9.9ms inference, 2.2ms postprocess per image at shape (1, 3, 1024, 800)
Requested to load ZImageTEModel_
Unloaded partially: 84.38 MB freed, 84.38 MB remains loaded, 590.62 MB buffer reserved, lowvram patches: 179
FETCH ComfyRegistry Data: 95/110
loaded completely; 5331.37 MB usable, 4816.76 MB loaded, full load: True
Unloaded partially: 2403.48 MB freed, 2413.28 MB remains loaded, 25.23 MB buffer reserved, lowvram patches: 0
Requested to load Lumina2
loaded partially; 5575.62 MB usable, 5050.62 MB loaded, 6688.93 MB offloaded, 525.00 MB buffer reserved, lowvram patches: 114
100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:14<00:00,  1.56s/it]
Requested to load AutoencodingEngine
Unloaded partially: 3531.87 MB freed, 1518.75 MB remains loaded, 590.62 MB buffer reserved, lowvram patches: 162
loaded completely; 518.07 MB usable, 159.87 MB loaded, full load: True
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 7.59 MB buffer reserved, lowvram patches: 0
gguf qtypes: Q8_0 (169), F32 (73)
Attempting to recreate sentencepiece tokenizer from GGUF file metadata...
Created tokenizer with vocab size of 256384
Dequantizing token_embd.weight to prevent runtime OOM.
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
model weight dtype torch.float8_e5m2, manual cast: torch.float16
model_type FLOW
Requested to load WanTEModel
loaded partially; 5573.62 MB usable, 5556.62 MB loaded, 1139.18 MB offloaded, 17.00 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (10)
0 models unloaded.
Unloaded partially: 34.62 MB freed, 5522.00 MB remains loaded, 17.00 MB buffer reserved, lowvram patches: 0
Requested to load WAN21
loaded partially; 5264.67 MB usable, 4387.10 MB loaded, 9239.23 MB offloaded, 877.56 MB buffer reserved, lowvram patches: 885
(RES4LYF) rk_type: res_2s
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:11<00:00, 17.80s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 7.59 MB buffer reserved, lowvram patches: 0

0: 1024x800 1 Not very manly man face, 39.4ms
Speed: 59.9ms preprocess, 39.4ms inference, 20.4ms postprocess per image at shape (1, 3, 1024, 800)

0: 1024x800 1 Not very manly man face, 8.9ms
Speed: 4.5ms preprocess, 8.9ms inference, 2.4ms postprocess per image at shape (1, 3, 1024, 800)
Requested to load WanTEModel
loaded partially; 5573.62 MB usable, 5556.62 MB loaded, 1139.18 MB offloaded, 17.00 MB buffer reserved, lowvram patches: 0
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 7.59 MB buffer reserved, lowvram patches: 0
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 7.59 MB buffer reserved, lowvram patches: 0
Requested to load WAN21
loaded partially; 5573.62 MB usable, 4696.05 MB loaded, 8930.28 MB offloaded, 877.56 MB buffer reserved, lowvram patches: 905
(RES4LYF) rk_type: res_2s
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:42<00:00, 10.56s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 7.59 MB buffer reserved, lowvram patches: 0
# 😺dzNodes: LayerStyle -> AddGrain Processed 1 image(s).
WAS Node Suite: Image file saved to: F:\AI\ComfyUI-Nightly\ComfyUI\output\Image\Zimage\2025-12-04\img_0002.png
Prompt executed in 306.63 seconds

with async offload

Requested to load Lumina2
loaded partially; 5600.80 MB usable, 4924.88 MB loaded, 6814.67 MB offloaded, 675.00 MB buffer reserved, lowvram patches: 116
100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:35<00:00,  3.90s/it]
Requested to load AutoencodingEngine
Unloaded partially: 4671.76 MB freed, 253.12 MB remains loaded, 759.38 MB buffer reserved, lowvram patches: 177
loaded completely; 444.62 MB usable, 159.87 MB loaded, full load: True

0: 1024x800 1 Not very manly man face, 75.2ms
Speed: 35.4ms preprocess, 75.2ms inference, 18.2ms postprocess per image at shape (1, 3, 1024, 800)

0: 1024x800 1 Not very manly man face, 9.8ms
Speed: 6.2ms preprocess, 9.8ms inference, 2.4ms postprocess per image at shape (1, 3, 1024, 800)
Requested to load ZImageTEModel_
Unloaded partially: 168.75 MB freed, 84.38 MB remains loaded, 759.38 MB buffer reserved, lowvram patches: 179
loaded completely; 5331.37 MB usable, 4816.76 MB loaded, full load: True
Unloaded partially: 2327.78 MB freed, 2488.98 MB remains loaded, 75.70 MB buffer reserved, lowvram patches: 0
Requested to load Lumina2
loaded partially; 5575.62 MB usable, 4900.62 MB loaded, 6838.93 MB offloaded, 675.00 MB buffer reserved, lowvram patches: 116
100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:13<00:00,  1.52s/it]
Requested to load AutoencodingEngine
Unloaded partially: 3297.50 MB freed, 1603.12 MB remains loaded, 759.38 MB buffer reserved, lowvram patches: 161
loaded completely; 433.70 MB usable, 159.87 MB loaded, full load: True
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
gguf qtypes: Q8_0 (169), F32 (73)
Attempting to recreate sentencepiece tokenizer from GGUF file metadata...
Created tokenizer with vocab size of 256384
Dequantizing token_embd.weight to prevent runtime OOM.
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
model weight dtype torch.float8_e5m2, manual cast: torch.float16
model_type FLOW
Requested to load WanTEModel
loaded partially; 5573.62 MB usable, 0.00 MB loaded, 6695.19 MB offloaded, 6009.00 MB buffer reserved, lowvram patches: 0
Attempting to release mmap (49)
loaded partially; 5573.62 MB usable, 0.00 MB loaded, 6695.19 MB offloaded, 6009.00 MB buffer reserved, lowvram patches: 0
Requested to load WAN21
loaded partially; 5264.67 MB usable, 4252.09 MB loaded, 9374.24 MB offloaded, 1012.57 MB buffer reserved, lowvram patches: 889
(RES4LYF) rk_type: res_2s
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:16<00:00, 19.25s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0

0: 1024x800 1 Not very manly man face, 68.3ms
Speed: 73.3ms preprocess, 68.3ms inference, 36.0ms postprocess per image at shape (1, 3, 1024, 800)

0: 1024x800 1 Not very manly man face, 14.1ms
Speed: 5.8ms preprocess, 14.1ms inference, 4.4ms postprocess per image at shape (1, 3, 1024, 800)
Requested to load WanTEModel
loaded partially; 5573.62 MB usable, 0.00 MB loaded, 6695.19 MB offloaded, 6009.00 MB buffer reserved, lowvram patches: 0
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
Requested to load WAN21
loaded partially; 5573.62 MB usable, 4561.04 MB loaded, 9065.29 MB offloaded, 1012.57 MB buffer reserved, lowvram patches: 909
(RES4LYF) rk_type: res_2s
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:45<00:00, 11.44s/it]
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 22.78 MB buffer reserved, lowvram patches: 0
# 😺dzNodes: LayerStyle -> AddGrain Processed 1 image(s).
WAS Node Suite Warning: The path `F:\AI\ComfyUI-Nightly\ComfyUI\output\Image/Zimage/2025-12-04` specified doesn't exist! Creating directory.
WAS Node Suite: Image file saved to: F:\AI\ComfyUI-Nightly\ComfyUI\output\Image\Zimage\2025-12-04\img_0001.png
Prompt executed in 388.75 seconds

mohtaufiq175 avatar Dec 04 '25 02:12 mohtaufiq175

What does async offload actually do? It’s the default option now, so I assumed it would boost performance… but for me it feels the opposite.

Async offload moves weights you are about to use onto the GPU ahead of time, while it is still finishing earlier computation. Unfortunately it is not implemented for ComfyUI-GGUF; neither is the recent memory pinning feature. If you have the non-GGUF models handy, you could give them a go with the native loader and you may see better performance.
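
As a minimal sketch of the idea in plain PyTorch (not ComfyUI's actual implementation; the layer sizes and variable names are made up): offloaded weights sit in pinned system RAM, and a second CUDA stream copies the next layer's weights to the GPU while the compute stream is still busy with the previous layer.

import torch

compute = torch.cuda.current_stream()
copier = torch.cuda.Stream()

# Offloaded weights live in pinned host memory so the host-to-device copy can run truly async.
layers = [torch.randn(4096, 4096).pin_memory() for _ in range(4)]
x = torch.randn(1, 4096, device="cuda")

prev = None
for w_cpu in layers:
    with torch.cuda.stream(copier):
        w_gpu = w_cpu.to("cuda", non_blocking=True)   # start copying this layer's weights
    if prev is not None:
        x = x @ prev                                  # meanwhile, compute with the previous layer
    compute.wait_stream(copier)                       # the copy must finish before we use w_gpu
    w_gpu.record_stream(compute)                      # tell the allocator the compute stream uses it
    prev = w_gpu
x = x @ prev                                          # final layer
torch.cuda.synchronize()

The point is that the copy of layer i overlaps with the compute of layer i-1, so offloading costs little extra time as long as the PCIe bus can keep up.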

In my usual text2image workflow (SDXL/ZImage + Wan 2.2 low-noise), things are slower. The Wan text encoder takes way longer with async offload turned on.

This looks like a bug to me and I will investigate when I get a chance. It looks consistent with @hum-ma's report above.

Thanks for this data.

Can you confirm your GPU model?

rattus128 avatar Dec 04 '25 03:12 rattus128

What does async offload actually do? It’s the default option now, so I assumed it would boost performance… but for me it feels the opposite.

Async offload moves weights you are about to use to the GPU ahead of time while it finished earlier computation. Unfortunately is not implemented for Comfyui-GGUF. Neither is the recent memory pinning feature. If you have the non GGUF models handy you could give them a go with the native loader and you may perform better.

In my usual text2image workflow (SDXL/ZImage + Wan 2.2 low-noise), things are slower. The Wan text encoder takes way longer with async offload turned on.

This looks like a bug to me and I will investigate when I get a chance. It looks consistent with @hum-ma report above.

Thanks for this data.

Can you confirm your GPU model?

I'm using GGUF only for the text encoder in this workflow, while the diffusion model still runs on the native loader. So does that mean I still don't get the benefit of async offload and memory pinning, even if GGUF is used only for the TE?

And my gpu is 3070 8GB

mohtaufiq175 avatar Dec 04 '25 03:12 mohtaufiq175

What does async offload actually do? It’s the default option now, so I assumed it would boost performance… but for me it feels the opposite.

Async offload moves weights you are about to use to the GPU ahead of time while it finished earlier computation. Unfortunately is not implemented for Comfyui-GGUF. Neither is the recent memory pinning feature. If you have the non GGUF models handy you could give them a go with the native loader and you may perform better.

In my usual text2image workflow (SDXL/ZImage + Wan 2.2 low-noise), things are slower. The Wan text encoder takes way longer with async offload turned on.

This looks like a bug to me and I will investigate when I get a chance. It looks consistent with @hum-ma report above. Thanks for this data. Can you confirm your GPU model?

I'm using GGUF only for the text encoder on this wf, while the diffusion model still runs on the native loader. So that mean I still don't get the benefit of async offload and memory pinning, even if GGUF is used only for TE?

And my gpu is 3070 8GB

You should get the async benefit. Your Lumina is a little faster, but your WAN results are strange; I'll see if I can reproduce this performance point, but I'll need to revisit it after other work. What are your resolution, frame count and CFG for WAN?

The text encoder issue should be fixed on latest git. Thanks.

rattus128 avatar Dec 04 '25 05:12 rattus128

What does async offload actually do? It’s the default option now, so I assumed it would boost performance… but for me it feels the opposite.

Async offload moves weights you are about to use to the GPU ahead of time while it finished earlier computation. Unfortunately is not implemented for Comfyui-GGUF. Neither is the recent memory pinning feature. If you have the non GGUF models handy you could give them a go with the native loader and you may perform better.

In my usual text2image workflow (SDXL/ZImage + Wan 2.2 low-noise), things are slower. The Wan text encoder takes way longer with async offload turned on.

This looks like a bug to me and I will investigate when I get a chance. It looks consistent with @hum-ma report above. Thanks for this data. Can you confirm your GPU model?

I'm using GGUF only for the text encoder on this wf, while the diffusion model still runs on the native loader. So that mean I still don't get the benefit of async offload and memory pinning, even if GGUF is used only for TE? And my gpu is 3070 8GB

You should get async benefit. Your lumina is a little faster but your WAN results are strange, I'll see if I can get this performance point but I'll need to revisit this after other stuff. What is your resolution, frames and cfg for WAN?

The text encoder issue should be fixed on latest git. Thanks.

Oh wait, let me clarify. For me this isn't strange; it's always been like this, I guess. You might've thought it strange because the first WAN sampling looks slower than the second? That's just because the resolutions are slightly different.

I’m using WAN 2.2 Low Noise as a refiner for ZImage/SDXL images:

First WAN sampling: 1088×1440, CFG 1, 4 steps, res2s

Second WAN sampling: 1024×1024, CFG 1, 4 steps, res2s

So yeah, the speeds differs a bit.

The thing that made me think async offload wasn't giving me any benefit was actually the text encoder running on the CPU, which made the generation time slower. But like you said, that part's fixed now, and everything looks good for me. Thank you!

mohtaufiq175 avatar Dec 04 '25 06:12 mohtaufiq175


Your estimated inference VRAM is extremely high, what is your resolution and frame count (and batch)?

I fixed a bug in the reservation estimator last night that might help you use more VRAM but not by that much.

Is there a lora in play?

The estimation can be a bit low on high res wan and it's due for a recalibration. I'll look into that. A workflow that exactly matches you data point on 4080 16GB would help

Yes, I am using the 4-step lightx2v LoRA. Resolution is 1024 (dynamically adjusted via LayerUtility: ImageScaleByAspectRatio V2 so both image and latent are exactly 1024 on the longer side).

The workflow is just the default ComfyUI example with the UNet swapped for a GGUF model.

test 4.json

Mr-small-2-six avatar Dec 04 '25 09:12 Mr-small-2-six

Honestly, isn't gguf support a basic feature that should have been implemented in core a long time ago, not something that should be in an extension?

arcum42 avatar Dec 04 '25 09:12 arcum42

I'm experiencing an extreme slow-down with ComfyUI 0.3.76 with a workflow which does nothing else but loads a umt5 Q4 GGUF CLIP to device cuda:0 (MultiGPU node), encodes a text and saves the condition to disk (Condition-Utils node).

Can you retry this without the MultiGPU loader? This is likely to have complex interactions with some recent changes. Cut a fresh issue without MultiGPU and the workflow or even just a screenshot if its only a few small nodes if you still get this big clip

I'm interested if you have any difference without GGUF too.

I made a new venv, installed 0.3.76 in it, and it looks like loading a UMT5-XXL FP8 safetensors model and encoding the first prompt takes around 11 minutes, with the next prompt taking 8 minutes. A GGUF is now just a little slower, at about 13 minutes for the 1st prompt / 8m30s for subsequent prompts. This is the same with or without MultiGPU used for loading, or even installed at all.

Although it's much slower than the few seconds of encoding on top of 3-4 minutes of loading I was used to when the GPU could still be used for text encoding, it's no longer the 30+ minute times from yesterday. I have yet to pinpoint what made this difference. Async offloading was originally disabled because it kills Z-Image LoRA training with an AttributeError about 'record_event' from torch/cuda/streams.py (another issue altogether), but now, having run these tests with it enabled, I tried disabling it again and the time for loading+encoding is still the same (e.g. 13 minutes for GGUF).

The problem remains that the latest ComfyUI is preventing GPU usage for CLIP, but I might have to make a new issue about that. Meanwhile I'll stick with 0.3.75, which I just realized can do the FP8 MultiGPU load+encode in just a little over 100 seconds and subsequent prompts in 5 seconds.

hum-ma avatar Dec 04 '25 15:12 hum-ma

I'm experiencing an extreme slow-down with ComfyUI 0.3.76 with a workflow which does nothing else but loads a umt5 Q4 GGUF CLIP to device cuda:0 (MultiGPU node), encodes a text and saves the condition to disk (Condition-Utils node).

Can you retry this without the MultiGPU loader? This is likely to have complex interactions with some recent changes. Cut a fresh issue without MultiGPU and the workflow or even just a screenshot if its only a few small nodes if you still get this big clip I'm interested if you have any difference without GGUF too.

I made a new venv, installed 0.3.76 on it and looks like loading a UMT5-XXL FP8 safetensors and encoding the first prompt takes around 11 minutes, and encoding the next prompt 8 minutes. A GGUF is now just a little slower at about 13 minutes 1st prompt / 8m30s for subsequent prompts. This is the same with or without MultiGPU used for loading or even installed or not.

The problem remains that the latest ComfyUI is preventing GPU usage for CLIP, but I might have to make a new issue about that. Meanwhile I'll settle with 0.3.75 which I just realized can do FP8 MultiGPU load+encode in just a little over 100 seconds and next prompts in 5 seconds.

We fixed two bugs in CLIP loading today, so hopefully your flows work better. Feel free to pull the latest git, or let us know once 0.3.77 comes out.

rattus128 avatar Dec 04 '25 16:12 rattus128

@LukeG89 these options are performance critical, especially pinned memory. If you have a problem with these options left on we can take that as a straight bug report

I never had problems with pinned memory, but a few days ago async offloading crashed ComfyUI on some occasions; once it even made my computer shut down in protection mode for some reason. I can't say yet whether the latest updates have solved those issues, so I will keep both options enabled from now on and see.

Try without those performance disable flags and else us know if --reserve-vram level (incl negative) is making a difference.

@rattus128 Ok, I made some comparisons and you are right: Maximizing VRAM usage doesn't change much. (it's more of a placebo effect I guess 🤷 )

I reserved VRAM with different negative values based on model precision and LoRAs, to reach ~7.5GB of VRAM usage during inference.

For these tests I updated to this commit and kept pinned memory and async offloading enabled.

I buffer reserve a little more than you but you should be outrunning me significantly on compute power and I assume you have your card in a working x16 primary slot?

Ehm... I'm actually on a laptop 😅 (I should've mentioned that 😆 ) RTX3070Ti 8GB Laptop GPU + 64GB DDR5 SO-DIMM 4800MHz

TEST 1

All fp8_scaled models - NO LoRAs - 20 steps (10high-10low) - CFG 3 - 640x480 81 frames

Args: --disable-all-custom-nodes

VRAM usage: 5.2/8.0GB Pinned mem usage: 11.8/31.9GB

loaded partially; 1794.55 MB usable, 1592.01 MB loaded, 12037.07 MB offloaded, 202.54 MB buffer reserved, lowvram patches: 0
loaded partially; 1787.55 MB usable, 1574.55 MB loaded, 12054.53 MB offloaded, 202.54 MB buffer reserved, lowvram patches: 0

Gen speed: ~68.50s/it


Args: --reserve-vram -1.7 --disable-all-custom-nodes

VRAM usage: 7.5/8.0GB Pinned mem usage: 9.6/31.9GB

loaded partially; 4135.35 MB usable, 3932.83 MB loaded, 9696.25 MB offloaded, 202.51 MB buffer reserved, lowvram patches: 0
loaded partially; 4128.35 MB usable, 3919.88 MB loaded, 9709.19 MB offloaded, 202.51 MB buffer reserved, lowvram patches: 0

Gen speed: ~68.50s/it


TEST 2

All fp8_scaled models - lightx2v LoRAs - 6 steps (2high-4low) - CFG 1 - 640x480 81 frames

Args: --disable-all-custom-nodes

VRAM usage: 4.3/8.0GB Pinned mem usage: 12.6/31.9GB

loaded partially; 1794.55 MB usable, 774.39 MB loaded, 12854.68 MB offloaded, 1012.54 MB buffer reserved, lowvram patches: 388
loaded partially; 1787.55 MB usable, 774.39 MB loaded, 12854.68 MB offloaded, 1012.54 MB buffer reserved, lowvram patches: 388

Gen speed: ~41.50s/it


Args: --reserve-vram -2.5 --disable-all-custom-nodes

VRAM usage: 7.5/8.0GB Pinned mem usage: 9.6/31.9GB

loaded partially; 4954.55 MB usable, 3942.03 MB loaded, 9687.04 MB offloaded, 1012.51 MB buffer reserved, lowvram patches: 341
loaded partially; 4947.55 MB usable, 3922.36 MB loaded, 9706.72 MB offloaded, 1012.51 MB buffer reserved, lowvram patches: 342

Gen speed: ~39.70s/it


TEST 3

All fp16 models - NO LoRAs - 20 steps (10high-10low) - CFG 3 - 640x480 81 frames

Args: --disable-all-custom-nodes

VRAM usage: 5.7/8.0GB Pinned mem usage: 25.3/31.9GB

loaded partially; 1794.55 MB usable, 1388.58 MB loaded, 25864.39 MB offloaded, 405.08 MB buffer reserved, lowvram patches: 0
loaded partially; 1787.55 MB usable, 1382.47 MB loaded, 25870.50 MB offloaded, 405.08 MB buffer reserved, lowvram patches: 0

Gen speed: ~67.80s/it


+ Load Clip device: cpu (otherwise OOM in text encoding phase)

Args: --reserve-vram -1.3 --disable-all-custom-nodes

VRAM usage: 7.5/8.0GB Pinned mem usage: 23.5/31.9GB

loaded partially; 3743.87 MB usable, 3328.96 MB loaded, 23924.01 MB offloaded, 405.08 MB buffer reserved, lowvram patches: 0
loaded partially; 3718.75 MB usable, 3313.67 MB loaded, 23939.30 MB offloaded, 405.08 MB buffer reserved, lowvram patches: 0

Gen speed: ~68.00s/it


TEST 4

All fp16 models - lightx2v LoRAs - 6 steps (2high-4low) - CFG 1 - 640x480 81 frames

Args: --disable-all-custom-nodes

VRAM usage: 4.4/8.0GB Pinned mem usage: 26.2/31.9GB

loaded partially; 1794.55 MB usable, 548.42 MB loaded, 26704.55 MB offloaded, 1215.08 MB buffer reserved, lowvram patches: 396
loaded partially; 1787.55 MB usable, 548.42 MB loaded, 26704.55 MB offloaded, 1215.08 MB buffer reserved, lowvram patches: 396

Gen speed: ~35.50s/it


+ Load Clip device: cpu (otherwise OOM in text encoding phase)

Args: --reserve-vram -2.5 --disable-all-custom-nodes

VRAM usage: 7.5/8.0GB Pinned mem usage: 23.1/31.9GB

loaded partially; 4972.67 MB usable, 3754.04 MB loaded, 23498.93 MB offloaded, 1215.08 MB buffer reserved, lowvram patches: 371
loaded partially; 4947.55 MB usable, 3704.03 MB loaded, 23548.94 MB offloaded, 1215.08 MB buffer reserved, lowvram patches: 372

Gen speed: ~35.00s/it

LukeG89 avatar Dec 04 '25 17:12 LukeG89

We fixed two bugs in clip loading today so hopefully your flows work better. Feel free to pull latest git or let us know once 3.77 comes.

I applied the changes to 0.3.76 and it works perfectly; it loads onto the GPU and encodes as fast as I could hope for on this hardware. Many thanks, it seems I'll be keeping up with the newest releases after all!

hum-ma avatar Dec 04 '25 18:12 hum-ma