DiffSynth-Studio

Wan2.1 i2v (image-to-video) inference

Open trouble-maker007 opened this issue 9 months ago • 11 comments

During the denoising steps at inference time, a lot of debug output is printed. Maybe something is wrong with the tokenizer:

torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([42525]) 42525 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([257], device='cuda:0', dtype=torch.int32) 257 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([512], device='cuda:0', dtype=torch.int32) 512 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([42525]) 42525 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([257], device='cuda:0', dtype=torch.int32) 257 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([512], device='cuda:0', dtype=torch.int32) 512 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([42525]) 42525 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([257], device='cuda:0', dtype=torch.int32) 257 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([512], device='cuda:0', dtype=torch.int32) 512 False (-1, -1)
torch.Size([1, 42525, 40, 128])
 10%|████████████▌                                                                                                                | 5/50 [02:22<21:25, 28.57s/it]tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([42525]) 42525 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([257], device='cuda:0', dtype=torch.int32) 257 False (-1, -1)
torch.Size([1, 42525, 40, 128])

trouble-maker007 avatar Feb 26 '25 02:02 trouble-maker007

@trouble-maker007 Fixed. Please update the code. ^_^

Artiprocher avatar Feb 26 '25 02:02 Artiprocher

Hey, can you help me? When I use wan_14b_image_to_video.py, I get an error: RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 1

How can I solve this?

tiga-dudu avatar Feb 26 '25 07:02 tiga-dudu

@tiga-dudu Can you provide your code here?

Artiprocher avatar Feb 26 '25 08:02 Artiprocher

> @tiga-dudu Can you provide your code here?

Thanks, I have solved it. PNG images need to be converted to RGB first: image = Image.open(img_path).convert("RGB")
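For anyone hitting the same error, here is a minimal sketch of that fix as a reusable helper. It assumes the size mismatch (4 vs. 3 at dimension 1) comes from a PNG carrying an alpha channel (RGBA), which is likely but not confirmed in this thread. The helper name load_rgb_image is hypothetical and not part of DiffSynth-Studio.

```python
from PIL import Image


def load_rgb_image(img_path):
    """Open an image file and return it as a 3-channel RGB PIL image.

    Hypothetical helper for illustration; not a DiffSynth-Studio function.
    """
    image = Image.open(img_path)
    # PNG files are often stored as RGBA (4 channels) or palette ("P") images.
    # Convert only when the mode is not already RGB.
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image
```

The returned image can then be passed to the pipeline in place of the raw Image.open(...) result.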

tiga-dudu avatar Feb 26 '25 10:02 tiga-dudu

What's the inference time for i2v? @tiga-dudu @Artiprocher

donghaoye avatar Feb 27 '25 15:02 donghaoye

@donghaoye On my device, the inference time of i2v-14B is almost the same as t2v-14B. Please take a look at the table in our readme file.

https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/wanvideo#wan-video-14b-t2v

Artiprocher avatar Feb 28 '25 02:02 Artiprocher

It took 25 minutes on my RTX6000.

donghaoye avatar Feb 28 '25 02:02 donghaoye

Does it not work with English prompts?

https://drive.google.com/file/d/1y2iXLbJ7T63V6IHx5S-n4di8P_HDj-0B/view?usp=sharing

prompt="A small boat is bravely forging ahead through the wind and waves. The vast blue sea is turbulent, with white waves crashing against the hull, yet the boat remains undaunted, steadfastly sailing toward the distant horizon. Sunlight spills across the water, shimmering with golden brilliance, adding a touch of warmth to this magnificent scene. As the camera zooms in, the flag on the boat can be seen fluttering in the wind, symbolizing an indomitable spirit and the courage to venture into the unknown. This powerful and inspiring imagery captures the fearlessness and determination required to face challenges head-on.",
negative_prompt="Vivid tones, overexposed, static, unclear details, subtitles, style, artwork, painting, frame, still, overall grayish, worst quality, low quality, JPEG compression artifacts, unattractive, incomplete, extra fingers, poorly drawn hands, poorly drawn face, distorted, disfigured, malformed limbs, fused fingers, static image, cluttered background, three legs, crowded background figures, walking upside down.",

donghaoye avatar Feb 28 '25 03:02 donghaoye

The result is not consistent with this

torch_dtype=torch.bfloat16 fixed it.

donghaoye avatar Feb 28 '25 05:02 donghaoye

@donghaoye Our implementation is different from the original repo, including:

  • random_device: we generate the Gaussian noise on the CPU, so the generated video is consistent across devices.
  • scheduler: we use the standard flow matching scheduler, which is consistent with FLUX.
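The two points above can be illustrated with a rough PyTorch sketch. This is not the DiffSynth-Studio code: the function names make_initial_noise and flow_matching_euler_step are hypothetical, and the interpolation convention x_sigma = (1 - sigma) * x0 + sigma * noise is an assumption about what "standard flow matching" means here.

```python
import torch


def make_initial_noise(shape, seed, device="cpu", dtype=torch.float32):
    """Draw Gaussian noise on the CPU with a seeded generator, then move it.

    Sampling on the CPU means the same seed gives the same noise no matter
    which device the model itself runs on. Hypothetical helper.
    """
    generator = torch.Generator(device="cpu").manual_seed(seed)
    noise = torch.randn(shape, generator=generator, device="cpu", dtype=torch.float32)
    return noise.to(device=device, dtype=dtype)


def flow_matching_euler_step(sample, velocity, sigma, sigma_next):
    """One Euler step of the flow matching ODE.

    Assumes x_sigma = (1 - sigma) * x0 + sigma * noise, so the velocity the
    model is expected to predict is (noise - x0). Hypothetical helper.
    """
    return sample + (sigma_next - sigma) * velocity
```

With the exact velocity, a single step from sigma = 1 to sigma = 0 recovers the clean sample. In practice the model's predicted velocity is used over many smaller steps.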

Artiprocher avatar Feb 28 '25 11:02 Artiprocher