Vision Bridge Transformer
New model just dropped. It runs on top of the Wan2.1 1.3B model.
It can do video restyling, colorization, video frame interpolation, as well as depth-to-video. There's a different model for each of those four tasks.
Hugging Face also has a Space where you can try it out.
Seems interesting, and it can probably run on fairly modest hardware and/or be geared towards longer videos.
Could be worth a look.
Project: https://yuanshi9815.github.io/ViBT_homepage/
Code: https://github.com/Yuanshi9815/ViBT
Models: https://huggingface.co/Yuanshi/ViBT/tree/main
Cheers Y'all
Looks interesting ;-) There was something similar a while ago that could also stylize video to Lego, Ghibli, etc. (can't remember what the model was called, but it's already supported in WanVideoWrapper).
"Scaled transformers: 20B and 1.3B parameter ViBT variants for image/video translation." Wonder if it also works on the bigger Wan.
Wasn't that a Wan VACE LoRA?
Think that was maybe it, yes: https://editto.net/ https://huggingface.co/Kijai/WanVideo_comfy/tree/main/LoRAs/Ditto
Yes, checked the thread history, it was the lego one ;-) https://github.com/kijai/ComfyUI-WanVideoWrapper/issues/1487
Although Vision Bridge seems a bit more powerful for local editing etc., even image editing or removing objects.
Only needs the scheduler, nothing else, so I added that. The model is just 1.3B, so don't expect amazing result quality, but it works. The models work as they are; workflow-wise you encode the video and use the scheduler (don't skip steps or add noise). Defaults are cfg 2.0 and 10-30 steps.
The interpolation model is used differently: for that you can use cfg 1.0, and you need to interleave-repeat the input video frames before encoding.
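"Interleave-repeat" as described above can be sketched like this. This is a minimal illustration of the frame layout, not the wrapper's actual implementation; the function name and the repeat factor of 4 are assumptions:

```python
def interleave_repeat(frames, repeat=4):
    """Repeat each frame `repeat` times in sequence: [0,0,0,0, 1,1,1,1, ...].

    `frames` is any sequence (here stand-in ints; in practice decoded
    video frames, e.g. a list of tensors) fed to the VAE encoder afterwards.
    """
    out = []
    for f in frames:
        out.extend([f] * repeat)
    return out

frames = list(range(4))  # stand-in for 4 decoded video frames
print(interleave_repeat(frames))  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```

The repeated frames give the interpolation model identical anchor latents to "fill in between" during sampling.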
Works ;-)
https://github.com/user-attachments/assets/c5bf0cfc-1df1-4d3c-8278-f4959f580edd
https://github.com/user-attachments/assets/84b41cd8-e684-44ac-8255-c641936adb8c
For anyone wanting to try: WanVideoWrapper - 2.1 ViBT Vision Transformer.json (I added a box with prompt examples from ViBT; not sure if the workflow is 100% correct, but it is what it is ;))
You have to update WanVideoWrapper, though, for the added ViBT scheduler.
https://github.com/user-attachments/assets/f49fb25a-6fba-40ae-be4a-e1d3c01a4bb4
https://github.com/user-attachments/assets/308bbac7-2d11-4877-896d-5e97bc8a698e
I am having some flickering issues with the interpolation model/method. Does anyone have an example?
This is what I am doing based on the original code and kijai's answer, but as you can see, the result looks awful compared to the source video. Any clues?
Hmm, try repeating only after the first frame, so it's frames 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3 and so on that are encoded.
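That pattern (first frame kept once, every later frame repeated) can be sketched as follows; again, this is a hypothetical helper illustrating the frame ordering, with the repeat factor of 4 assumed from the example above:

```python
def interleave_after_first(frames, repeat=4):
    """Keep frame 0 once, then repeat each later frame `repeat` times:
    [0, 1,1,1,1, 2,2,2,2, 3,3,3,3, ...]."""
    if not frames:
        return []
    out = [frames[0]]
    for f in frames[1:]:
        out.extend([f] * repeat)
    return out

print(interleave_after_first(list(range(4))))
# [0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
```

The difference from naive interleave-repeating is only the unrepeated first frame, which also changes the total frame count fed to the encoder.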
@snicolast or @kijai or @RuneGjerde or @Brie-Wensleydale, please help me.

Number of prompts: 1
Section size: 42.0
context window seq len: 152544
Applying FreeNoise
Context schedule enabled: 21 frames, 1 stride, 4 overlap
Using SteadyDancer embeddings:
  cond_pos: torch.Size([16, 42, 226, 128])
  cond_neg: None
  pose_strength_spatial: 1.0
  pose_strength_temporal: 1.0
  start_percent: 0.0
  end_percent: 1.0
  clip_fea: {'clip_embeds': tensor([[[ 0.8313, -0.6837,  0.1037,  ...,  0.2401,  0.7478,  1.4284],
    [-0.1814,  0.0768, -0.1233,  ...,  0.2015, -0.0954, -0.4034],
    [-0.1537,  0.4107, -0.0920,  ...,  0.1780, -0.0231, -0.8815],
    ...,
    [-0.1201,  0.4920, -0.1202,  ..., -0.0707,  0.0109, -0.8797],
    [ 0.1304,  0.2506,  0.0105,  ..., -0.0982,  0.1991, -0.5053],
    [-0.1406,  0.3471, -0.2104,  ...,  0.1502,  0.0788, -0.4737]]], device='cuda:0'), 'negative_clip_embeds': None}
Input sequence length: 152544
Sampling 165 frames at 1024x1816 with 4 steps
0%| | 0/4 [00:00<?, ?it/s]
Error during model prediction: shape '[21, 232448, 16]' is invalid for input of size 77758464
Error during sampling: shape '[21, 232448, 16]' is invalid for input of size 77758464
!!! Exception during processing !!! shape '[21, 232448, 16]' is invalid for input of size 77758464

Traceback (most recent call last):
  File "/root/comfy/ComfyUI/execution.py", line 515, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
  File "/root/comfy/ComfyUI/execution.py", line 329, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
  File "/root/comfy/ComfyUI/execution.py", line 303, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/root/comfy/ComfyUI/execution.py", line 291, in process_inputs
    result = f(**inputs)
  File "/root/comfy/ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/nodes_sampler.py", line 3200, in process
    return super().process(**sampler_inputs)
  File "/root/comfy/ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/nodes_sampler.py", line 3135, in process
    raise e
  File "/root/comfy/ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/nodes_sampler.py", line 2099, in process
    noise_pred_context, _, new_teacache = predict_with_cfg(
  File "/root/comfy/ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/nodes_sampler.py", line 1616, in predict_with_cfg
    raise e
  File "/root/comfy/ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/nodes_sampler.py", line 1486, in predict_with_cfg
    noise_pred_cond, noise_pred_ovi, cache_state_cond = transformer(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/comfy/ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/wanvideo/modules/model.py", line 2279, in forward
    condition_aligned = self.condition_embedding_align(condition_fused.float(), x_noise_clone).to(self.base_dtype)  # Frame-wise Attention Alignment Unit.
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/comfy/ComfyUI/custom_nodes/ComfyUI-WanVideoWrapper/steadydancer/small_archs.py", line 124, in forward
    out = self.cross_attn(query=r_trans,
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.11/site-packages/torch/nn/modules/activation.py", line 1488, in forward
    attn_output, attn_output_weights = F.multi_head_attention_forward(
  File "/usr/local/lib/python3.11/site-packages/torch/nn/functional.py", line 6375, in multi_head_attention_forward
    k = k.view(k.shape[0], bsz * num_heads, head_dim).transpose(0, 1)
RuntimeError: shape '[21, 232448, 16]' is invalid for input of size 77758464
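For what it's worth, the numbers in the RuntimeError themselves show a sequence-length mismatch. This is plain arithmetic on the error message, not an analysis of the wrapper's code, so which input caused the off-by-1024 tokens is only a guess:

```python
# The failing view asks for 21 * 232448 * 16 elements...
target = 21 * 232448 * 16   # 78,102,528
# ...but the key tensor actually contains this many:
actual = 77758464

print(target - actual)      # 344064 elements short

# Dividing out the 21 windows and head_dim of 16 gives the
# per-window sequence length the tensor actually has:
print(actual // (21 * 16))  # 231424, vs the expected 232448

# i.e. the sequence is exactly 1024 tokens shorter than expected:
print(232448 - actual // (21 * 16))  # 1024
```

A token count off by a round number like 1024 usually points at mismatched spatial/temporal dimensions between two conditioning inputs (here, likely the SteadyDancer condition vs. the noise latents), rather than a corrupted tensor.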