DiffSynth-Studio Wan2.1 i2v (image-to-video) inference

During the denoising steps of inference, a large amount of debug output is printed. Maybe something is wrong with the tokenizer:

torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([42525]) 42525 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([257], device='cuda:0', dtype=torch.int32) 257 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([512], device='cuda:0', dtype=torch.int32) 512 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([42525]) 42525 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([257], device='cuda:0', dtype=torch.int32) 257 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([512], device='cuda:0', dtype=torch.int32) 512 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([42525]) 42525 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([257], device='cuda:0', dtype=torch.int32) 257 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([512], device='cuda:0', dtype=torch.int32) 512 False (-1, -1)
torch.Size([1, 42525, 40, 128])
 10%|████████████▌                                                                                                                | 5/50 [02:22<21:25, 28.57s/it]
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([42525]) 42525 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([257], device='cuda:0', dtype=torch.int32) 257 False (-1, -1)
torch.Size([1, 42525, 40, 128])

Feb 26 '25 02:02 trouble-maker007

@trouble-maker007 Fixed. Please update the code. ^_^

Feb 26 '25 02:02 Artiprocher

Hey, can you help me? When I use wan_14b_image_to_video.py, I get an error: RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 1

How can I solve this?

Feb 26 '25 07:02 tiga-dudu

@tiga-dudu Can you provide your code here?

Feb 26 '25 08:02 Artiprocher

Thanks, I have solved it. PNG images need to be converted to RGB before use: image = Image.open(img_path).convert("RGB")
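For readers hitting the same error, here is a minimal sketch of that conversion using Pillow. It assumes the mismatch (4 vs. 3 at dimension 1) comes from a PNG carrying an alpha channel, which is a guess based on the error message rather than something confirmed in this thread. The helper name load_rgb is hypothetical, and the in-memory PNG only stands in for a real input file.

```python
import io

from PIL import Image


def load_rgb(source):
    """Open an image from a path or file object and force three channels."""
    image = Image.open(source)
    # convert("RGB") drops the alpha channel if one is present.
    return image.convert("RGB")


# Build a small RGBA PNG in memory to stand in for a real input file.
buffer = io.BytesIO()
Image.new("RGBA", (8, 8), (255, 0, 0, 128)).save(buffer, format="PNG")

buffer.seek(0)
rgba = Image.open(buffer)
print(rgba.mode, len(rgba.getbands()))  # RGBA 4

buffer.seek(0)
rgb = load_rgb(buffer)
print(rgb.mode, len(rgb.getbands()))  # RGB 3
```

Note that convert("RGB") simply discards transparency, so an image with a transparent background may need to be composited onto a solid color first if the background matters.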

Feb 26 '25 10:02 tiga-dudu

What's the inference time for i2v? @tiga-dudu @Artiprocher

Feb 27 '25 15:02 donghaoye

@donghaoye On my device, the inference time of i2v-14B is almost the same as t2v-14B. Please take a look at the table in our readme file.

https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/wanvideo#wan-video-14b-t2v

Feb 28 '25 02:02 Artiprocher

It took 25 min on my RTX6000.

Feb 28 '25 02:02 donghaoye

Does it not work with English prompts?

https://drive.google.com/file/d/1y2iXLbJ7T63V6IHx5S-n4di8P_HDj-0B/view?usp=sharing

prompt="A small boat is bravely forging ahead through the wind and waves. The vast blue sea is turbulent, with white waves crashing against the hull, yet the boat remains undaunted, steadfastly sailing toward the distant horizon. Sunlight spills across the water, shimmering with golden brilliance, adding a touch of warmth to this magnificent scene. As the camera zooms in, the flag on the boat can be seen fluttering in the wind, symbolizing an indomitable spirit and the courage to venture into the unknown. This powerful and inspiring imagery captures the fearlessness and determination required to face challenges head-on.",
negative_prompt="Vivid tones, overexposed, static, unclear details, subtitles, style, artwork, painting, frame, still, overall grayish, worst quality, low quality, JPEG compression artifacts, unattractive, incomplete, extra fingers, poorly drawn hands, poorly drawn face, distorted, disfigured, malformed limbs, fused fingers, static image, cluttered background, three legs, crowded background figures, walking upside down.",

Feb 28 '25 03:02 donghaoye

Not working

input: https://replicate.delivery/pbxt/MZaaEBkCFWggU7M2ieyaqoecWhL41ijxcnLNMfFIu7SlSn2h/ytvwvdg181rme0cmyngbdf0na0.png

prompt="a fire rages in an apartment",

output: https://drive.google.com/file/d/1tTJemXbrPgzRyzCcF3yCMhEJRKHN-RzS/view?usp=sharing

The result is not consistent with the examples at https://replicate.com/wavespeedai/wan-2.1-i2v-720p/examples

Feb 28 '25 04:02 donghaoye

Setting torch_dtype=torch.bfloat16 fixed the inconsistency.

Feb 28 '25 05:02 donghaoye

@donghaoye Our implementation differs from the original repo in the following ways:

random_device: we generate the Gaussian noise on the CPU, which makes the generated video consistent across devices.
scheduler: we use the standard flow matching scheduler, which is consistent with FLUX.
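To make the scheduler point more concrete: in its simplest form, a flow matching scheduler moves the sample along a predicted velocity with Euler steps between decreasing noise levels. The sketch below is pure Python and is not DiffSynth-Studio's code. The names make_sigmas and euler_step, the shift value of 5.0, and the toy "model" that returns the exact velocity are all assumptions made for illustration.

```python
def make_sigmas(num_steps, shift=5.0):
    """Noise levels from 1.0 (pure noise) down to 0.0 (clean sample).

    The shift remaps a linear schedule so more steps are spent at high
    noise levels. The value 5.0 is an assumed example, not a known default.
    """
    sigmas = []
    for i in range(num_steps + 1):
        s = 1.0 - i / num_steps
        sigmas.append(shift * s / (1.0 + (shift - 1.0) * s))
    return sigmas


def euler_step(sample, velocity, sigma, sigma_next):
    """One Euler step along the predicted velocity."""
    return [x + (sigma_next - sigma) * v for x, v in zip(sample, velocity)]


# Toy setup: x_sigma = (1 - sigma) * clean + sigma * noise, so the exact
# velocity is (noise - clean). A real model would predict this instead.
clean = [0.5, -1.0, 2.0]
noise = [1.2, 0.3, -0.7]
velocity = [n - c for n, c in zip(noise, clean)]

sigmas = make_sigmas(num_steps=10)
sample = list(noise)  # start from pure noise at sigma = 1.0
for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
    sample = euler_step(sample, velocity, sigma, sigma_next)

final = sample
print(final)  # approximately equal to `clean`
```

With an exact velocity the loop recovers the clean sample regardless of the schedule. In practice the model's prediction changes at every step, which is where the choice of noise levels starts to affect the result.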

Feb 28 '25 11:02 Artiprocher