
Video as a prompt

Open · Rudra-ai-coder opened this issue 6 months ago · 23 comments

video as context: ByteDance/Video-As-Prompt-Wan2.1-14B https://huggingface.co/ByteDance/Video-As-Prompt-Wan2.1-14B @kijai @kabachuha

Rudra-ai-coder avatar Oct 23 '25 07:10 Rudra-ai-coder

49 frames meh :/

kijai avatar Oct 23 '25 07:10 kijai

What about the ability to stitch multiple videos together (using the 49-frame output as a base)? Idk

BestofthebestinAI avatar Oct 23 '25 08:10 BestofthebestinAI

can be used to create gifs

Rudra-ai-coder avatar Oct 23 '25 11:10 Rudra-ai-coder

A great LoRA substitute. Basically, IP-Adapter, but for VFX

kabachuha avatar Oct 23 '25 11:10 kabachuha

can be used to create gifs

What needs to be done to get it working and implemented in the wrapper? (What's the roadmap of the work Kijai would usually do here?) I'm curious

BestofthebestinAI avatar Oct 23 '25 14:10 BestofthebestinAI

@BestofthebestinAI They add transformer blocks "parallel" to the main ones, resulting in a "mixture of transformers" (like in their Bagel). In the wrapper it may look like VACE or S2V (additional cross-attention), but instead of sequential cross-attention, an additional concatenated attention is used
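As a toy illustration of the difference (not ByteDance's actual code; the shapes, the function name, and the identity Q/K/V projections are all simplifications of mine), sequential cross-attention would run a second attention pass over the reference tokens, while concatenated attention folds the reference tokens into the key/value sequence of a single pass:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def concat_attention(x_main, x_ref):
    # main-branch tokens attend over main + reference tokens in ONE pass,
    # instead of a separate sequential cross-attention over x_ref
    d = x_main.shape[-1]
    kv = np.concatenate([x_main, x_ref], axis=0)   # (n_main + n_ref, d)
    scores = x_main @ kv.T / np.sqrt(d)            # (n_main, n_main + n_ref)
    return softmax(scores) @ kv                    # (n_main, d)
```

In the real model each branch has its own projection weights (the "parallel" blocks), which is what makes it a mixture of transformers rather than plain self-attention over a longer sequence.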

scheme

kabachuha avatar Oct 23 '25 14:10 kabachuha

Lol, I tested it through diffusers and got a weird result. Maybe it's too few steps, of course (20 vs 50).

https://github.com/user-attachments/assets/2277741b-37c1-488b-a25d-5810ade188e3

The reference is a dog from the "deflate" LoRA.

I vibe-coded a ComfyUI custom node for myself just now: https://gist.github.com/kabachuha/b6108cd9ac5e57641badaac45786fd01

It uses full CPU offload, so it may be quite slow.

kabachuha avatar Oct 23 '25 18:10 kabachuha

Reference:

https://github.com/user-attachments/assets/8de00d5c-5926-4cdd-8109-fec9ff79e900

kabachuha avatar Oct 23 '25 18:10 kabachuha

Looks like a lot of those viral clips with cut-as-a-cake, inflate, explode, etc. ;-) I guess that's the aim of the model as well.

Maybe someone clever can extract a LoRA from the model so it can be used in regular workflows. But maybe it's not as "easy/simple" as that.

-- Googled "how to extract lora", and of course Kijai shows up ;-)
Seems like there is a node in KJNodes that might do that (but whether it's usable for this model, I don't know, hehe)
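For what it's worth, LoRA extraction between two checkpoints is conceptually simple: diff each layer's weights against the base model and keep a low-rank SVD approximation of the delta. A minimal numpy sketch (the function name and shapes are mine, not the KJNodes implementation):

```python
import numpy as np

def extract_lora(w_tuned, w_base, rank):
    # low-rank factorization of the fine-tuning weight delta: delta ≈ A @ B
    delta = w_tuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    A = u[:, :rank] * s[:rank]   # (out_dim, rank), singular values folded in
    B = vt[:rank]                # (rank, in_dim)
    return A, B
```

The catch: this only captures changes to weights that exist in the base model. Entirely new parallel blocks have no base counterpart to diff against, so they can't be folded into a LoRA this way.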

RuneGjerde avatar Oct 23 '25 19:10 RuneGjerde

Yes, they are clearly inspired by those clips

Well, it's not just a LoRA. In fact, they add an entire parallel block and change the positional embeddings (RoPE) :) This is slightly more complicated, but should be possible in the wrapper
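On the RoPE point: rotary embeddings encode position by rotating consecutive feature pairs by position-dependent angles, so conditioning on a whole reference video generally means reassigning which position indices the reference tokens get. A toy 1-D sketch of the rotation itself (simplified; the real model applies RoPE over the video's time/height/width axes, and the position-offset line is a hypothetical illustration, not the model's actual scheme):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # rotate consecutive feature pairs of x by angles that grow with position
    seq, dim = x.shape
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) per-pair frequencies
    angles = positions[:, None] * freqs[None, :]    # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# hypothetical: give reference tokens their own index range so they
# don't collide with the target video's positions
# ref_rotated = rope(ref_tokens, np.arange(n_ref) + target_len)
```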

kabachuha avatar Oct 23 '25 19:10 kabachuha

Wow, this is IP-Adapter for videos. Any news?

gastonbarreto2-ai avatar Oct 24 '25 03:10 gastonbarreto2-ai

We are waiting for the mastermind to decide to make it work for us normal people ^^ Although I wish I could do it myself, or learn to

BestofthebestinAI avatar Oct 24 '25 11:10 BestofthebestinAI

Sneak-peeking in the branches, it seems it's already cooking ;-)

I did try it, but had some OOM issues with it. Will try some more later ;-)

RuneGjerde avatar Nov 03 '25 17:11 RuneGjerde

Thank you, Rune. You saw something in the dev branch?

BestofthebestinAI avatar Nov 04 '25 10:11 BestofthebestinAI

@kijai please add a VAP GGUF for use with the Wan 2.1 GGUF

gastonbarreto2-ai avatar Nov 05 '25 03:11 gastonbarreto2-ai

I tried the workflow from the VAP repo, but the videos are not really similar. It takes the reference image and tries to use the motion of the input video in a very rudimentary way. Do you have good examples of how to use it correctly? Do I have to write very long prompts, as in kijai's example?

railep avatar Nov 05 '25 07:11 railep

Will give it a try; I had some OOM issues with the workflow, will see if I can get through

RuneGjerde avatar Nov 05 '25 15:11 RuneGjerde

I can't really get anything out of this. I can get results matching their own code, but even that doesn't seem to work. Either it's extremely limited in how it can be used and the prompts have to be perfect, or their code has some bug, or the Wan version just doesn't work.

kijai avatar Nov 05 '25 15:11 kijai

Wouldn't lose sleep over it, it's a bit of a gimmicky feature I guess ;-)

RuneGjerde avatar Nov 05 '25 15:11 RuneGjerde

Gave it a test; had to lower the resolution a bit to not OOM. But it seems to work OK for me, I guess, at least from a random couple of test runs

https://github.com/user-attachments/assets/7c7ceee8-ca0b-450a-b6f6-07145186ead3

https://github.com/user-attachments/assets/a66e1437-1dd3-45d9-ac12-fc208830e2bf

(I guess the "as a balloon" made them fly away, so that might be the prompting)

I tried some other "effects" too, but it seems tricky to get the description prompt right.
Probably should have added "by some large fingers" or something to the prompt to get an "exact copy", but the transformation seems to work

So, as you said, my initial feeling is that it needs very accurate prompting. I think that might be the challenge

https://github.com/user-attachments/assets/48736da1-f21e-483d-9deb-75abcb580073

Anyway, it was just a quick test; I think the implementation is perhaps OK as is.

RuneGjerde avatar Nov 07 '25 18:11 RuneGjerde

Probably should have added "by some large fingers" or something to the prompt to get an "exact copy", but the transformation seems to work

Large fingers added to the prompt, and it seems OK:

https://github.com/user-attachments/assets/503cde23-8525-4302-aa4e-e1068859c1f8

Anyway, a bit of a fun gimmick I guess, only 49 frames, and you said you had some trouble with the code, so I wouldn't lose sleep over it ;-) Works OK as is, it seems.


https://github.com/user-attachments/assets/32cd5977-72ea-4b88-9bed-b2052ec2f755

https://github.com/user-attachments/assets/b0ad9575-33e7-47ca-a488-5c2fe88c7615

Edit: trying some more, it might be that the crucial part is that the exact same transition words are duplicated in both the reference-video prompt and the target-video prompt. At least I could get away with far simpler prompts when testing it a bit more. Just copy-pasting a generic prompt about the transition into both, and only changing the subject description (and the background if needed), seemed to work, for me at least. But it's just a theory; I only tried a few.
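If that theory holds, a prompt-building helper would just share the transition clause between the two prompts. Something like this (entirely hypothetical, just to make the idea concrete; the names and the example transition text are made up):

```python
# hypothetical helper: keep the transition wording identical in both prompts,
# varying only the subject (and background) description
TRANSITION = "slowly inflates like a balloon and floats up into the air"

def make_prompts(ref_subject: str, target_subject: str) -> tuple[str, str]:
    ref_prompt = f"A {ref_subject} {TRANSITION}."
    target_prompt = f"A {target_subject} {TRANSITION}."
    return ref_prompt, target_prompt
```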

But all in all it seems to work OK. I didn't try all the examples, but a few styles, a few camera movements, and a few funny ones like squish and inflate.

RuneGjerde avatar Nov 07 '25 18:11 RuneGjerde

Nice! How can I try that? @RuneGjerde

BestofthebestinAI avatar Nov 10 '25 14:11 BestofthebestinAI

Nice! How can I try that? @RuneGjerde

It's currently in a separate branch, but maybe Kijai will merge it to main soon, unless he intends to do some more work on it. You can try it if you switch branches, but for other things it's probably best to switch back to main afterwards, since main is more up to date on other fixes that might have been added (although the branch is only 2 commits behind main, so probably no urgency to go back).

git switch vap

git switch main

(inside the WanVideoWrapper folder)

Alternatively, just delete that folder and run:

git clone --single-branch --branch vap https://github.com/kijai/ComfyUI-WanVideoWrapper

And to switch back, delete it again and run:

git clone https://github.com/kijai/ComfyUI-WanVideoWrapper

RuneGjerde avatar Nov 10 '25 16:11 RuneGjerde