Question about WanVideoSampler - Can the SAMPLES output be split?
Since I'm running OOM on the WanVideoDecode step of the loop workflow, I thought it could be solved by splitting the SAMPLES output from the WanVideoSampler, decoding it in two steps, and batching the results afterward.
Using a typical Split Latent node doesn't work; is there any workaround?
Try the KJNodes "Get latents from batch" node. What about VAE tiling? It should help avoid OOM.
It can't be split in any way that preserves temporal continuity at the seam; you'll get artifacts when joining the parts.
VAE tiling is the last-resort solution when the issue is spatial dimensions; with the Wan VAE, memory usage doesn't really increase with frame count.
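For the spatial case, tiled decoding works by splitting the latent into overlapping spatial tiles, decoding each tile separately, and blending the overlaps, which is roughly what the VAE-tiling options do internally. A minimal sketch of the idea, with NumPy arrays standing in for latents and a placeholder `decode_tile` function (a real VAE would also upscale each tile by 8x; that is omitted here for simplicity):

```python
import numpy as np

def tiled_decode(latent, decode_tile, tile=32, overlap=8):
    """Decode a 2-D latent in overlapping tiles and blend overlaps by averaging.

    latent: (H, W) array; decode_tile: function mapping a tile to its decoded
    tile (same spatial shape here for simplicity).
    """
    h, w = latent.shape
    out = np.zeros((h, w), dtype=np.float64)
    weight = np.zeros((h, w), dtype=np.float64)
    step = tile - overlap  # tiles overlap by `overlap` pixels
    for y in range(0, h, step):
        for x in range(0, w, step):
            y1, x1 = min(y + tile, h), min(x + tile, w)
            out[y:y1, x:x1] += decode_tile(latent[y:y1, x:x1])
            weight[y:y1, x:x1] += 1.0  # count contributions per pixel
    return out / weight  # average where tiles overlap

# Sanity check: with an identity "decoder", tiled output matches a full decode.
lat = np.arange(64 * 48, dtype=np.float64).reshape(64, 48)
tiled = tiled_decode(lat, lambda t: t)
assert np.allclose(tiled, lat)
```

Peak memory then scales with the tile size rather than the full frame, at the cost of some redundant computation in the overlap regions.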
If you're on PyTorch 2.9, have you updated ComfyUI itself to get the VAE bug fix? And make sure you're not accidentally using an fp32 VAE.
Thanks for the suggestions. Yes, I've dodged the OOMs with a combination of tiling and block swapping, plus a Clean VRAM Used node right after the WanVideo Decode (not sure that last step actually helps, but it can't hurt). This increases the overall processing time a bit, but I guess it is what it is.
I am indeed on PyTorch 2.9 and updated as soon as I read your message, but I haven't seen any changes other than in the workflow templates, so perhaps that VAE problem was patched for me in a previous update.
Regarding model settings, I should have included this in the opening post; using at the moment:
- Model HI: wan2.2_i2v_A14b_high_noise_lightx2v_4step-Q5_K_S.gguf
- Model LOW: wan2.2_i2v_A14b_low_noise_lightx2v_4step-Q5_K_S.gguf | precision - fp16_fast | quant - disabled | load device - offload | attention - sageattn | rms - default
- Text encoder: umt5-xxl-encoder-Q5_K_S.gguf | type - wan
- VAE: wan_2.1_vae.safetensors | precision - bf16
- Acceleration LoRA: wan2.2_i2v_A14b_low_noise_lora_rank64_lightx2v_4step_1022.safetensors | str 1
If you see any mismatch or problem in the config, please let me know.
I don't know what the whole workflow looks like, but apparently you are using "plain" Wan 2.2 I2V. One thing to try, if you have time, is to rebuild the same workflow logic, if possible, with only core/standard ComfyUI nodes (I think it would be possible even with GGUF; I hope I'm not talking out of my a** since I don't use GGUF myself) and see if it works better. In the past I had OOM and/or slow-processing problems with WanVideoWrapper that were alleviated by using core ComfyUI nodes. Another thing would be to update to the latest torch nightly, 2.10.0; it works much better than 2.9.0 in my opinion, at least on my system.
Edit: Another idea about VAE decoding that I've seen around (I haven't tried it myself) is to save the latent to disk without decoding it, then decode it later in another workflow using only the nodes strictly necessary for that operation. (Don't ask me exactly how; I have an idea, but since I've never tried it I can't speak further on the matter.)
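The save-and-decode-later idea is straightforward in principle: the sampler's latent output is just a tensor (ComfyUI passes it around in a dict under the key "samples"), so it can be serialized to disk and decoded in a separate, stripped-down workflow after the sampler and diffusion models have been unloaded. A hedged sketch of the mechanism, using NumPy arrays as stand-ins for the latent tensor (in a real workflow you'd use ComfyUI's Save Latent / Load Latent nodes, or torch.save on the tensor; the shape below is illustrative, not exact):

```python
import os
import tempfile
import numpy as np

def save_latent(latent_dict, path):
    """Persist a sampler's latent output (dict with a 'samples' array) to disk."""
    np.save(path, latent_dict["samples"])

def load_latent(path):
    """Reload the latent for a later, decode-only pass."""
    return {"samples": np.load(path)}

# Workflow 1: sample and save, then exit -- no VAE in memory at this point.
latent = {"samples": np.random.rand(1, 16, 13, 60, 104).astype(np.float32)}
path = os.path.join(tempfile.mkdtemp(), "shot_001.npy")
save_latent(latent, path)

# Workflow 2: load only the latent + VAE and decode, with VRAM to spare.
restored = load_latent(path)
assert restored["samples"].shape == latent["samples"].shape
assert np.allclose(restored["samples"], latent["samples"])
```

This separates the sampler's and the decoder's peak memory in time, though, as noted below, the decode pass itself still needs to fit the latent plus the VAE.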
The HI/LOW models I'm currently using are not the base/plain ones; I chose these in particular because they work fairly well and are the smallest Q5s I've found. I will try to see if the Q8s work by pushing block swap to the limit.
The workflow is Kijai's, same as the WanVideoWrapper nodes, which IMO are the 'official standard' ComfyUI ones, as mentioned on comfy.org. You can check them out (together with the GGUFs, which are at the same link) at:
https://docs.comfy.org/tutorials/video/wan/wan2_2
I have found a decent way to avoid OOMs, but it sacrifices quite a bit of generation speed (mentioned in my post above). Regarding saving the latent for later, I've never tried that, but I believe the core of the problem would be the same: you still need to load the latent plus the VAE and run the operation no matter what, so, as Kijai mentioned above, the only thing that has worked so far is VAE tiling.