OOM - BlockSwap Broken?
I updated to build 1.3.9 of WanVideoWrapper and everything OOMs now. I was previously running either build 1.3.5 or 1.3.6. Before I could run 101 frame generations at 1024x768 with blockswap set to 40 (on a 4090). Now I need to pull the frame count down to 41.
One thing I noticed is that I no longer see a "blocks to swap: " statement printed in the console as the job is spinning up. Is it possible that the latest update broke blockswap? That would be consistent with my symptoms. I'm on Windows 10 with the latest Nvidia drivers and latest ComfyUI portable build.
No, block_swap is working fine; the log entry was removed a while ago when the whole model loading was reworked.
You're probably experiencing the torch.compile Triton cache issue. Try clearing your Triton caches as mentioned in the readme.
Thank you for the response @kijai . Rebuilding the Triton cache helped to reclaim a bit of GPU memory, but I ultimately ended up clearing the cache AND reverting to build 1.3.6. The changes in build 1.3.9 result in a substantial increase in VRAM usage, and it's not clear that the benefit outweighs the downside.
Some data: on 1.3.9, after clearing the Triton cache, I was able to just barely crank out an 81 frame 1024x768 generation, provided I did nothing else on my PC that used VRAM (I was at 23.1 GB VRAM utilized). After clearing the cache and reverting to build 1.3.6, I'm able to generate 101 frames at 1024x768 while only using 14.5 GB VRAM. This is all on a 4090 with 24 GB of VRAM and blockswap set to 40. I'm using only two LoRAs in this workflow.
What workflow? Obviously I'm not experiencing any increase to VRAM usage myself...
This is with your WAN 2.2 14B I2V workflow.
I'm hitting the same problem: when the run reaches the KSampler, ComfyUI exits even though VRAM and RAM are not exhausted. When I switch ComfyUI back to the old version, the problem goes away.
I seem to be experiencing the same issue of increased vram usage. I have a 4090. Previously I would do 20 blocks swapped to generate 1280x720x81. Now I need 35 blocks swapped to avoid OOM.
Did you try clearing the Triton cache, as mentioned in the "news" on the front page: https://github.com/kijai/ComfyUI-WanVideoWrapper
To clear your Triton cache you can delete the contents of following (default) folders:
C:\Users\<username>\.triton and C:\Users\<username>\AppData\Local\Temp\torchinductor_<username>
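If you prefer to script that cleanup, here is a minimal Python sketch that deletes those two default locations. It assumes the default paths above and won't cover setups where TRITON_CACHE_DIR or TORCHINDUCTOR_CACHE_DIR have been overridden.

```python
import getpass
import os
import shutil
import tempfile

# Default cache locations (Windows: %USERPROFILE%\.triton and
# %TEMP%\torchinductor_<username>); adjust if you've overridden
# TRITON_CACHE_DIR or TORCHINDUCTOR_CACHE_DIR.
cache_dirs = [
    os.path.expanduser("~/.triton"),
    os.path.join(tempfile.gettempdir(), f"torchinductor_{getpass.getuser()}"),
]

for path in cache_dirs:
    if os.path.isdir(path):
        shutil.rmtree(path, ignore_errors=True)  # remove cached Triton/Inductor kernels
        print(f"cleared: {path}")
    else:
        print(f"not found, skipped: {path}")
```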
Similar problem here on a 3080 Ti; I had to cut video length from 8 seconds to 4.5 and disable kijai's new Triton compile node. However, downgrading and clearing the cache does not help. wanBlockswap was also recently banned in the official repos... https://github.com/orssorbit/ComfyUI-wanBlockswap/issues/7
https://github.com/comfyanonymous/ComfyUI/releases/tag/v0.3.70
https://github.com/comfyanonymous/ComfyUI/issues/10809
How do I check out 1.3.6?
Total VRAM 23028 MB, total RAM 65228 MB
pytorch version: 2.9.1+cu128
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 : cudaMallocAsync
Enabled pinned memory 29352.0
working around nvidia conv3d memory bug.
Using sage attention
Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr 8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
ComfyUI version: 0.3.71
ComfyUI frontend version: 1.28.9
[Prompt Server] web root: C:\ComfyUI_windows_portable\python_embeded\Lib\site-packages\comfyui_frontend_package\static
Total VRAM 23028 MB, total RAM 65228 MB
pytorch version: 2.9.1+cu128
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 : cudaMallocAsync
Enabled pinned memory 29352.0
Traceback (most recent call last):
File "C:\ComfyUI_windows_portable\ComfyUI\nodes.py", line 2131, in load_custom_node
module_spec = importlib.util.spec_from_file_location(sys_module_name, os.path.join(module_path, "__init__.py"))
^^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'sys_module_name' where it is not associated with a value
Cannot import C:\ComfyUI_windows_portable\ComfyUI\comfy_extras\nodes_nop.py module for custom nodes: cannot access local variable 'sys_module_name' where it is not associated with a value
So far the advice about deleting nodes_nop.py has helped: https://github.com/orssorbit/ComfyUI-wanBlockswap/issues/7#issuecomment-3562803524
I'm noticing a similar issue. Blockswap isn't completely broken, it does reduce VRAM usage, but I'm now having to crank it up from 12 to 28 on workflows that used to work flawlessly in the past. Otherwise I get OOM.
Cleared inductor cache and ensured everything is clean, same issue. Something seems to be consuming significantly more VRAM than it did a month or so ago.
Deleting nodes_nop.py did not help (I didn't think it would, since it's not overriding the blockswap args node in this repo).
I can't even crank up blockswap any more. It was at 38 before, and increasing it to the max of 40 doesn't help.
Still nothing to do with block swapping or it not working; these issues are most likely Triton issues. Updating to PyTorch 2.9.1 and clearing the Triton caches is recommended.
I just tested; running, for example, 2.2 A14B I2V with full block swap uses under 10GB:
Sampling 81 frames at 704x704 with 3 steps
100%|████████████████████████████████████████████████████| 3/3 [00:36<00:00, 12.21s/it]
Allocated memory: memory=0.169 GB
Max allocated memory: max_memory=7.313 GB
Max reserved memory: max_reserved=9.000 GB
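For anyone comparing numbers: peak figures like these can be read straight from PyTorch's allocator stats. A minimal sketch in plain PyTorch, not the wrapper's own logging, and assuming the GB values are computed as GiB:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run the sampler here ...

gib = 1024 ** 3
print(f"Allocated memory: memory={torch.cuda.memory_allocated() / gib:.3f} GB")
print(f"Max allocated memory: max_memory={torch.cuda.max_memory_allocated() / gib:.3f} GB")
print(f"Max reserved memory: max_reserved={torch.cuda.max_memory_reserved() / gib:.3f} GB")
```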
Someone should post a sample workflow so devs can test and see if that problem persists.
I was having the same issues recently: VRAM spikes to 99% with torch compile and full block swapping. I had to clear both caches and let the workflow run 5 or 6 times before my VRAM got back to normal.
Latest update now should make torch.compile less necessary, managed to reduce peak VRAM usage a lot, so if you still experience issues with torch.compile, try running without it.
Comfy chunked rope function?
Not really necessary anymore, but the option remains to further reduce VRAM usage in some situations.
Quick test: saved around 12% VRAM (2 GB) compared to before, both times without torch compile. Only a few percent worse compared to when I get torch compile working.
Just reporting my VRAM usage is a lot lower with the recent commit with torch compile on. Like from 92% used down to 60%. Was able to drop block swap down as a result, leading to faster inference as well.
Only issue now is VAE decode is OOMing me, causing me to hover around 99% VRAM utilization, while inference is only around 84% w/ block swap at 20, 24GB VRAM. Don't really want to enable tiling, so I'm having to increase block swap to get around it.
Full-graph and Sage compile are also working now; they weren't working before: https://github.com/woct0rdho/SageAttention/releases/tag/v2.2.0-windows.post4
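For context on "full graph": with torch.compile, fullgraph=True makes compilation fail on graph breaks instead of silently falling back to eager. A generic illustration with a placeholder module, not the wrapper's actual compile node:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for a transformer block; only here to make
# the example self-contained.
class DummyBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(64, 64)

    def forward(self, x):
        return torch.nn.functional.silu(self.proj(x))

model = DummyBlock().cuda()

# fullgraph=True: raise on graph breaks instead of partially compiling,
# which is what "full-graph ... working" refers to above.
compiled = torch.compile(model, fullgraph=True)

out = compiled(torch.randn(1, 64, device="cuda"))
```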
Okay, I decided to update to the latest commit, and it blows my mind how big the difference is in terms of VRAM usage.
Previously, using I2V at 576×1024, I needed to use 39 blockswap, and VRAM usage during sampling was around 93%. Now it only uses about 65% VRAM, so I can lower the blockswap to around 34 to get back to ~90% VRAM usage.
But I'm having the same issue as @scottmudge: now I often get OOM errors on the VAE decode. It feels kinda random, sometimes it happens on the 2nd run, sometimes on the 3rd…
I'm not sure if my approach is wrong, but I usually set the blockswap number to reach around 90% VRAM usage, and that used to be fine.
Block swap doesn't affect anything for the VAE as long as you have the force_offload enabled on the sampler to fully offload the model before decode. I've not noticed any issues with VAE myself, nor did I change anything about it now.
There is one big thing to take into account though, since pytorch 2.9 there's a bug that triples the VAE memory usage, and to counter that I'm using the ComfyUI core workaround for it, which was added maybe a month ago or so. When launching Comfy it should say:
working around nvidia conv3d memory bug.
If you don't have that, then that definitely explains any VAE issues, and your ComfyUI needs an update. Also make sure you're not using fp32 VAE.
Testing the VAE on its own is helpful to troubleshoot too, something like this should be using ~16GB VRAM:
You can also use the native ComfyUI VAE to decode through the rescale node, it can use slightly less VRAM:
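As a rough outline of that kind of isolated decode test in plain PyTorch (a hedged sketch, not the attached workflow: `vae` stands in for whatever video VAE is loaded and `latents` for a saved sampler output):

```python
import torch

def test_vae_decode(vae, latents, device="cuda"):
    """Decode in isolation and report peak VRAM; vae and latents are placeholders."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)

    vae = vae.to(device)
    with torch.no_grad():
        video = vae.decode(latents.to(device))

    peak = torch.cuda.max_memory_allocated(device) / 1024**3
    print(f"peak VRAM during decode: {peak:.2f} GiB")

    # Mirror what force_offload is meant to guarantee before decode in a full
    # workflow: move the model off the GPU and release the cached blocks.
    vae.to("cpu")
    torch.cuda.empty_cache()
    return video
```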
I'm on the latest comfy with PyTorch 2.10, and I also see "working around NVIDIA conv3d memory bug" in the logs. I’ve also made sure that every node with a force_offload option is enabled.
I ran the test, and doing only the VAE decode → encode seems to work fine; the WanVideoWrapper encode/decode is actually faster. But based on the VRAM debug there's a slight difference between the native and KJ versions. I'm not sure what that value means, though...
The only thing I did notice is that with the native VAE it uses around 85% VRAM when encoding and about 97% when decoding, while with WanVideoWrapper both encode and decode use 97% VRAM.
native
got prompt
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 7.59 MB buffer reserved, lowvram patches: 0
VRAMdebug: free memory before: 7,276,519,424
VRAMdebug: free memory after: 7,396,655,104
VRAMdebug: freed memory: 120,135,680
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 7.59 MB buffer reserved, lowvram patches: 0
VRAMdebug: free memory before: 8,019,509,248
VRAMdebug: free memory after: 7,396,655,104
VRAMdebug: freed memory: -622,854,144
Prompt executed in 54.70 seconds
wan wrapper
got prompt
WanVideoEncode: Encoded latents shape torch.Size([1, 16, 21, 128, 72])
VRAMdebug: free memory before: 7,396,655,104
VRAMdebug: free memory after: 7,396,655,104
VRAMdebug: freed memory: 0
VRAMdebug: free memory before: 7,396,655,104
VRAMdebug: free memory after: 7,396,655,104
VRAMdebug: freed memory: 0
Prompt executed in 46.30 seconds
I've never tried the latent rescale node, but I'll give it a try.
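On those VRAMdebug numbers: they are just free VRAM sampled before and after a node, presumably via torch.cuda.mem_get_info, so a negative "freed memory" only means free VRAM went down across that node. A minimal sketch of the same measurement:

```python
import torch

def vram_debug(run_fn):
    """Report free VRAM before/after a callable, mirroring the VRAMdebug output above."""
    free_before, _ = torch.cuda.mem_get_info()
    result = run_fn()
    free_after, _ = torch.cuda.mem_get_info()
    print(f"free memory before: {free_before:,}")
    print(f"free memory after:  {free_after:,}")
    print(f"freed memory:       {free_after - free_before:,}")  # negative => more VRAM now in use
    return result
```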
If the VAE works on its own but fails after the sampler, then something isn't offloading properly... which workflow/model is used in those cases?
I'm using Wan 2.2 I2V 14B Q8 GGUF, with my own workflow.
I just tried the latent rescale node, and it OOMed on the 2nd run.
pytorch version: 2.10.0.dev20251101+cu128
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : cudaMallocAsync
Enabled pinned memory 22085.0
working around nvidia conv3d memory bug.
Using sage attention
Python version: 3.11.5 (tags/v3.11.5:cce6ba9, Aug 24 2023, 14:38:34) [MSC v.1936 64 bit (AMD64)]
ComfyUI version: 0.3.75
ComfyUI frontend version: 1.32.9
[Prompt Server] web root: F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\comfyui_frontend_package\static
Total VRAM 8192 MB, total RAM 49078 MB
pytorch version: 2.10.0.dev20251101+cu128
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3070 : cudaMallocAsync
Enabled pinned memory 22085.0
Here's the 2nd-run error. As long as I remembered to set the seed to fixed and didn't change the prompt, I can just hit run again and it decodes fine.
Block 37: transfer_time=1.4468s, compute_time=0.0053s, to_cpu_transfer_time=0.0020s
Block 38: transfer_time=1.4815s, compute_time=0.0058s, to_cpu_transfer_time=0.0026s
Block 39: transfer_time=1.5065s, compute_time=0.0050s, to_cpu_transfer_time=0.0021s
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [04:28<00:00, 67.21s/it]
Allocated memory: memory=0.415 GB
Max allocated memory: max_memory=5.879 GB
Max reserved memory: max_reserved=6.125 GB
VRAMdebug: free memory before: 6,987,644,928
VRAMdebug: free memory after: 6,987,644,928
VRAMdebug: freed memory: 0
Requested to load WanVAE
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 7.59 MB buffer reserved, lowvram patches: 0
!!! Exception during processing !!! CUDA error: out of memory
Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 510, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 324, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 298, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 286, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\nodes.py", line 295, in decode
images = vae.decode(samples["samples"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\comfy\sd.py", line 751, in decode
samples = samples_in[x:x+batch_number].to(self.vae_dtype).to(self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: out of memory
Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Prompt executed in 581.95 seconds
got prompt
0 models unloaded.
loaded partially; 0.00 MB usable, 0.00 MB loaded, 242.00 MB offloaded, 7.59 MB buffer reserved, lowvram patches: 0
Prompt executed in 50.51 seconds
Another OOM:
100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [04:38<00:00, 69.72s/it]
Allocated memory: memory=0.572 GB
Max allocated memory: max_memory=6.731 GB
Max reserved memory: max_reserved=6.969 GB
VRAMdebug: free memory before: 6,818,955,264
VRAMdebug: free memory after: 6,818,955,264
VRAMdebug: freed memory: 0
!!! Exception during processing !!! Allocation on device
Traceback (most recent call last):
File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 510, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 324, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 298, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "F:\AI\ComfyUI-Nightly\ComfyUI\execution.py", line 286, in process_inputs
result = f(**inputs)
^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\nodes.py", line 2209, in decode
images = vae.decode(latents, device=device, end_=(end_image is not None), tiled=enable_vae_tiling, tile_size=(tile_x//8, tile_y//8), tile_stride=(tile_stride_x//8, tile_stride_y//8))[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\wanvideo\wan_video_vae.py", line 1346, in decode
video = self.single_decode(hidden_state, device, pbar=pbar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\wanvideo\wan_video_vae.py", line 1300, in single_decode
video = self.model.decode(hidden_state, pbar=pbar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\wanvideo\wan_video_vae.py", line 1093, in decode
out_ = self.decoder(x[:, :, i:i + 1, :, :],
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1783, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1794, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\wanvideo\wan_video_vae.py", line 806, in forward
x = layer(x, feat_cache, feat_idx)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1783, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1794, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\wanvideo\wan_video_vae.py", line 147, in forward
x = self.resample(x)
^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1783, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1794, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\container.py", line 253, in forward
input = module(input)
^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1783, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1794, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\custom_nodes\ComfyUI-WanVideoWrapper\wanvideo\wan_video_vae.py", line 67, in forward
return super().forward(x.float()).type_as(x)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\modules\upsampling.py", line 175, in forward
return F.interpolate(
^^^^^^^^^^^^^^
File "F:\AI\ComfyUI-Nightly\ComfyUI\venv\Lib\site-packages\torch\nn\functional.py", line 4817, in interpolate
return torch._C._nn._upsample_nearest_exact2d(input, output_size, scale_factors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: Allocation on device
Got an OOM, unloading all loaded models.
Prompt executed in 559.72 seconds
OK, I found one thing that may contribute to that: LoRAs weren't being fully offloaded with force_offload since I moved them to use block swap, so depending on your block swap amount, part of them would persist in VRAM. That should be fixed now.
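Not the actual patch, just a hedged sketch of the bookkeeping described above: on force_offload, any LoRA tensors attached to a swapped block have to follow that block to the offload device, otherwise they stay resident in VRAM. Names like `lora_params` are hypothetical.

```python
import torch

def force_offload_blocks(blocks, offload_device=torch.device("cpu")):
    """Illustrative only: offload each block AND any LoRA weights attached to it."""
    for block in blocks:
        block.to(offload_device)  # base weights
        # Hypothetical attribute holding LoRA tensors that aren't registered
        # parameters of the block; without this step they'd stay on the GPU.
        for p in getattr(block, "lora_params", []):
            p.data = p.data.to(offload_device)
    torch.cuda.empty_cache()
```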
Okay, it’s been 5 runs and everything’s working fine so far! Thanks for the fix!

