And another new Subject2Video Wan 2.1 Model (ByteDance)
https://github.com/bytedance/BindWeave A new Phantom-like model with very similar benchmark results. Not sure if it's worth trying, but I wanted to share it in case someone wants to try it. (I don't think any code modifications are necessary to run it.)
From the examples it looks very interesting. Not as "pasted on top" as VACE sometimes looks; more integrated and natural-looking, with good face consistency. But of course these might be cherry-picked examples ;-) https://lzy-dot.github.io/BindWeave/
It does need some new code because it uses Qwen 2.5 VL 7B as an additional conditioner.
We already have your prompt extender and native CLIP for Qwen Image. Isn't that enough for this?
No, this uses the "raw" output (hidden_states) from Qwen 2.5 VL 7B directly; there are 2 new layers in the model that project those so they can be added to the text embedding. Also, I never implemented the VL version, though there's an implementation of it in core Comfy which could be enough for this.
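To illustrate the idea described above, here's a minimal sketch of projecting raw Qwen-VL hidden states into the text-conditioning space. All dimensions, layer names, and the fusion choice (concatenation along the sequence axis) are assumptions for illustration, not BindWeave's actual implementation:

```python
import torch
import torch.nn as nn

# Assumed dims for illustration: Qwen2.5-VL 7B hidden size 3584,
# text-embedding dim 4096 (both hypothetical here).
QWEN_HIDDEN = 3584
TEXT_DIM = 4096

class VLProjector(nn.Module):
    """Sketch of the 'two new layers' idea: map raw Qwen-VL hidden
    states into the text-embedding space so they can be fused with
    the text conditioning."""
    def __init__(self, in_dim=QWEN_HIDDEN, out_dim=TEXT_DIM):
        super().__init__()
        self.proj_in = nn.Linear(in_dim, out_dim)
        self.proj_out = nn.Linear(out_dim, out_dim)

    def forward(self, vl_hidden):  # (B, S, in_dim)
        return self.proj_out(torch.nn.functional.gelu(self.proj_in(vl_hidden)))

# Dummy tensors standing in for real encoder outputs.
projector = VLProjector()
vl_hidden = torch.randn(1, 77, QWEN_HIDDEN)     # Qwen-VL hidden states
text_embeds = torch.randn(1, 512, TEXT_DIM)     # text-encoder output
# One plausible fusion: concatenate projected VL tokens onto the text tokens.
cond = torch.cat([text_embeds, projector(vl_hidden)], dim=1)  # (1, 589, 4096)
```

Whether the real model concatenates or sums the projected states into the text embedding would need to be confirmed against the BindWeave code.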
Yes, Comfy has Qwen-VL. Maybe simply input its embeddings? (CLIP) On the other hand, because of ComfyUI's memory management, this model can stay in RAM/VRAM on high-VRAM systems, so maybe a new node is needed after all. Like ComfyUI's code, but with KJ offloading / disk caching.
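The offloading idea mentioned above can be sketched very simply: keep the extra encoder in system RAM and only move it to the GPU for the single encode call. This is a generic pattern, not WanVideoWrapper's actual offloading code:

```python
import torch

def encode_with_offload(model, inputs, device=None):
    """Run one encode pass on `device`, then immediately move the
    model back to CPU so it doesn't occupy VRAM between calls."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)                     # load weights onto the GPU (or CPU)
    with torch.no_grad():
        out = model(inputs.to(device))
    model.to("cpu")                      # release the weights from VRAM
    if device == "cuda":
        torch.cuda.empty_cache()         # return freed blocks to the allocator
    return out.cpu()
```

Disk caching would be a further step (serialize the computed embeddings so repeated runs skip the encoder entirely), which this sketch doesn't cover.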
the wizard is cooking ;-) i saw something brewing
Still figuring out the inputs; not the easiest code to reverse engineer... even if not that much new code is needed, the inputs are quite unique. It's doing something already though:
https://github.com/user-attachments/assets/7e1fb6f6-aae4-4907-a180-30b89e4d5479
that looks promising for sure ;-)
Is it intentional that the model gets the images as overlays? Two subjects don’t work.
Lightx2v seems to mess it up somewhat at least; more often it makes the model obey the positioning of the references too much. If you manually place them, it does work:
https://github.com/user-attachments/assets/5e72f3ac-2bf2-411d-92b2-d18c1f68af86
https://github.com/user-attachments/assets/261e49fa-984f-4869-8d4f-1041049945a1
I see. Thank you! Well… I guess I have to code a custom node for that (unless I find one in KJNodes).
The above was just from changing the crop_position in the resize node; the padding obeys that too.
Yeah, I saw that and tried it too. For character tests I take business photos of my wife and myself at similar resolutions, plus a background at the same resolution. To put them left and right I need different resolutions… it worked with 900x350, but the background was only 2/3 filled. And I got interesting results because I forgot to change the prompt :-D In short: I need a workflow to cut the images according to the mask (to make them less wide) and then place them left or right. Should be possible with existing nodes. Thanks again!
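For reference, the compositing step being described (place one cropped reference on each half of a shared canvas) can be sketched in a few lines of NumPy. Canvas size, fill color, and the assumption that each reference already fits its half are illustrative choices, not what any existing node does:

```python
import numpy as np

def place_side_by_side(left, right, canvas_hw=(480, 832), fill=255):
    """Paste two reference images (H, W, 3 uint8 arrays) onto one canvas,
    one per half and centered, so their left/right positions are explicit
    before the combined image is fed to the conditioning."""
    H, W = canvas_hw
    canvas = np.full((H, W, 3), fill, dtype=np.uint8)
    half = W // 2
    for img, x0 in ((left, 0), (right, half)):
        h, w = img.shape[:2]
        # Naive assumption: each reference already fits inside its half;
        # a real node would resize/crop to the mask first.
        y = (H - h) // 2
        x = x0 + (half - w) // 2
        canvas[y:y + h, x:x + w] = img
    return canvas
```

In a real workflow the same effect is what the resize node's crop_position/padding achieves; this just makes the geometry explicit.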
I struggled to get it running without OOM, but by setting the resolution really low I could get through it (though I bet the result suffers too). Had to give it a test ;-)
https://github.com/user-attachments/assets/454e8f06-e666-4c20-acc1-89a687cdcff7
Hollywood watch out ;-) Gonna make my own Snyder cut hehe
I tried and tried but I'm not satisfied. I have the feeling it's basically merging the pictures like a canvas editor and then running i2v inference on the merged picture.
@kijai Can you please provide a workflow for the bindweave branch to test?
Haven't finalized anything yet, but the videos I've posted here should include a workflow.
Gave it another test run, even if it's a work in progress... managed to get much higher resolutions now ;-) Not sure if it was my PC or some code optimization, but no OOM anymore.
https://github.com/user-attachments/assets/196b3704-3552-4620-9480-17ec01ee833f
https://github.com/user-attachments/assets/d7b48186-78af-4f1b-a281-ac978c7a0a68
The Snyder Cut - AI Edition ;-)
While not the best animation, I was surprised the model was able to draw the character from behind, even from the start frame. This also shows overlapping references can work:
https://github.com/user-attachments/assets/520fc89c-f29c-40b3-8f3e-d8236dd49f6d
That looks pretty good. Not locked to left/right then ;-) And in some ways it looks even better, since it casts shadows and all... gives it a bit of perspective.
I think one key thing is to make sure the CLIP vision and Qwen-VL embeds are cropped properly and include your subjects, since CLIP vision is locked to 224x224 resolution. I'm still not sure of the Qwen-VL resolution, but it seems better to crop for it too.
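The standard way to hit CLIP vision's fixed 224x224 input is a short-side resize followed by a center crop; a sketch below. Note the center crop is a default assumption here; the point made above is that you'd want the crop window to actually contain your subject, which might mean shifting `top`/`left`:

```python
import torch
import torch.nn.functional as F

def crop_for_clip_vision(image, size=224):
    """Resize the short side of a (C, H, W) float image to `size`,
    then center-crop to size x size, so the subject survives CLIP
    vision's fixed-resolution input."""
    _, h, w = image.shape
    scale = size / min(h, w)
    nh, nw = round(h * scale), round(w * scale)
    img = F.interpolate(image[None], size=(nh, nw),
                        mode="bilinear", align_corners=False)[0]
    # Center crop; shift top/left instead if the subject is off-center.
    top = (nh - size) // 2
    left = (nw - size) // 2
    return img[:, top:top + size, left:left + size]
```

For a wide frame with the subject at one edge, a center crop like this would miss it entirely, which is exactly why manual cropping of the reference matters.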
Yeah, that changes the output a lot. It looks far more realistic and has depth when the characters are not placed left and right. Odd; what if you do want characters left and right ;) (or maybe a prompt would do that: "woman to the left, man to the right")
Just a completely random run with lazy prompting, just "viking exploring new world". And probably not composed correctly; it was just a quick test run.
https://github.com/user-attachments/assets/99c981c2-55dd-481c-9731-1e64577ef3f5
Odd, what if you do want characters left and right ;) (or maybe a prompt would do that "woman to the left, man to the right")
That works ;-)
https://github.com/user-attachments/assets/6913da6c-2ae9-4c16-a787-c559dbf98ab9
And with some additional prompting, the characters seem to follow better (or I could be imagining that part). At least it's seemingly a bit more realistic than previous attempts where I had them left/right.
https://github.com/user-attachments/assets/ee4f2969-1a4e-4532-bbac-ff9d9c6390cf
Will play around with it a bit ;-)
Wow, ok this is MUCH better. Now it really is useful without cropping the images to right and left. Thanks @kijai
Yeah, it's growing on me for sure. I can quite easily use it to tell a little story with consistent characters, swap out the background for each scene, etc.
@kijai @RuneGjerde can this be used together with WanAnimate? WanAnimate frequently loses subject likeness
@jnpatrick99 Probably not Wan Animate.
But Lynx might work, though I'm not sure. It's an "extra model" that works in WanVideoWrapper, and its strength is keeping the face ID. For info: https://byteaigc.github.io/Lynx/
https://github.com/kijai/ComfyUI-WanVideoWrapper/blob/main/example_workflows/wanvideo_T2V_14B_lynx_example_01.json But it would need some creative node connections; probably just connecting "Add Lynx Embed" before connecting the main node to image_embed at the sampler.
(I'll try later when I have a chance, unless some of the other experienced folks have something.)
But Kijai will know better for sure whether Lynx can be used or not ;-)
@RuneGjerde Thanks, but unfortunately I couldn't get either of them to work. Lynx produces an error about IPAdapter incompatibility with the current model, and BindWeave an error about tensor dimensions :-(