Wan2.1 I2V (image-to-video) inference
During inference, the denoising loop prints a large amount of debug output; there may be something wrong with the tokenizer:
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([42525]) 42525 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([257], device='cuda:0', dtype=torch.int32) 257 False (-1, -1)
torch.Size([1, 42525, 40, 128])
tensor([42525], device='cuda:0', dtype=torch.int32) 42525 tensor([512], device='cuda:0', dtype=torch.int32) 512 False (-1, -1)
[the three-line pattern above repeats at every denoising step]
10%|████████████▌ | 5/50 [02:22<21:25, 28.57s/it]
@trouble-maker007 Fixed. Please update the code. ^_^
Hey, can you help me? When I use wan_14b_image_to_video.py, I get an error: RuntimeError: The size of tensor a (4) must match the size of tensor b (3) at non-singleton dimension 1
How can I solve this?
@tiga-dudu Can you provide your code here?
Thanks, I have solved it. PNG images often include an alpha channel (RGBA, i.e. 4 channels), while the pipeline expects 3-channel RGB, so you need to convert them:
image = Image.open(img_path).convert("RGB")
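For reference, a minimal loading sketch (img_path is a placeholder; only the .convert("RGB") call is the actual fix):

from PIL import Image

# PNG files often carry an alpha channel (RGBA = 4 channels), but the
# pipeline expects 3-channel RGB input, hence the 4-vs-3 size error.
img_path = "input.png"  # placeholder path
image = Image.open(img_path).convert("RGB")  # drops the alpha channel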
What's the inference time for i2v? @tiga-dudu @Artiprocher
@donghaoye On my device, the inference time of i2v-14B is almost the same as that of t2v-14B. Please take a look at the table in our README file.
https://github.com/modelscope/DiffSynth-Studio/tree/main/examples/wanvideo#wan-video-14b-t2v
It took 25 minutes on my RTX 6000.
Does it not work with English prompts?
https://drive.google.com/file/d/1y2iXLbJ7T63V6IHx5S-n4di8P_HDj-0B/view?usp=sharing
prompt="A small boat is bravely forging ahead through the wind and waves. The vast blue sea is turbulent, with white waves crashing against the hull, yet the boat remains undaunted, steadfastly sailing toward the distant horizon. Sunlight spills across the water, shimmering with golden brilliance, adding a touch of warmth to this magnificent scene. As the camera zooms in, the flag on the boat can be seen fluttering in the wind, symbolizing an indomitable spirit and the courage to venture into the unknown. This powerful and inspiring imagery captures the fearlessness and determination required to face challenges head-on.",
negative_prompt="Vivid tones, overexposed, static, unclear details, subtitles, style, artwork, painting, frame, still, overall grayish, worst quality, low quality, JPEG compression artifacts, unattractive, incomplete, extra fingers, poorly drawn hands, poorly drawn face, distorted, disfigured, malformed limbs, fused fingers, static image, cluttered background, three legs, crowded background figures, walking upside down.",
Not working
prompt="a fire rages in an apartment",
output https://drive.google.com/file/d/1tTJemXbrPgzRyzCcF3yCMhEJRKHN-RzS/view?usp=sharing
The result is not consistent with this https://replicate.com/wavespeedai/wan-2.1-i2v-720p/examples
Setting torch_dtype=torch.bfloat16 fixed it.
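For anyone hitting the same issue: in DiffSynth-Studio the dtype is set when the models are loaded. A rough sketch along the lines of the repo's Wan example scripts (the model path is a placeholder; check the example scripts for the exact files to load):

import torch
from diffsynth import ModelManager, WanVideoPipeline

# Load the weights in bfloat16 instead of float16/float32; this fixed
# the inconsistent i2v results reported above.
model_manager = ModelManager(torch_dtype=torch.bfloat16, device="cuda")
model_manager.load_models(["models/Wan-AI/Wan2.1-I2V-14B-720P"])  # placeholder path
pipe = WanVideoPipeline.from_model_manager(model_manager, device="cuda")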
@donghaoye Our implementation differs from the original repo in several respects, including:
- random_device: we generate the Gaussian noise on the CPU, making the video consistent across different devices (see the sketch below).
- scheduler: we use the standard flow-matching scheduler, which is consistent with FLUX.
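To illustrate the random_device point, here is a minimal sketch of the pattern (the latent shape is illustrative, not the pipeline's actual one): noise drawn from a seeded CPU generator is bit-identical on every machine, whereas CUDA RNG streams can differ between GPU models.

import torch

# Draw the initial Gaussian noise on the CPU with a fixed seed, then move
# it to the GPU. CPU RNG is deterministic across machines, so the same
# seed reproduces the same starting latents (and video) on any device.
generator = torch.Generator(device="cpu").manual_seed(42)
latents = torch.randn((1, 16, 21, 60, 104), generator=generator)  # illustrative shape
latents = latents.to("cuda")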