[Bug] Latent size mismatch in distill script for Wan-Syn-Data-480P

Open EricLina opened this issue 2 months ago • 4 comments

Describe the bug

In the distill script https://github.com/hao-ai-lab/FastVideo/blob/4f3e8751db146c545f81156ddc469e53fb621cbd/examples/distill/Wan2.1-T2V/Wan-Syn-Data-480P/distill_dmd_VSA_t2v_1.3B.slurm#L58 the latent size is set to 21. However, the maximum frame count for Wan-Syn-Data-480P is 77, which corresponds to a temporal latent size of 20. This results in a shape mismatch when running the script.

The script still runs, but will this shape mismatch cause any problems?
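
For reference, the frame-to-latent arithmetic behind the numbers above can be sketched as follows. This is a minimal illustration, not code from FastVideo: it assumes the Wan2.1 VAE compresses time by a factor of 4 and encodes the first frame on its own, and the helper names `frames_to_latent_t` and `latent_t_to_frames` are hypothetical.

```python
# Assumption: the Wan2.1 VAE uses 4x temporal compression, with the first
# frame encoded separately, so valid frame counts have the form 4k + 1.
TEMPORAL_COMPRESSION = 4


def frames_to_latent_t(num_frames: int, ratio: int = TEMPORAL_COMPRESSION) -> int:
    """Return the temporal latent size for a given number of video frames."""
    if num_frames < 1 or (num_frames - 1) % ratio != 0:
        raise ValueError(f"num_frames must be of the form {ratio}k + 1, got {num_frames}")
    return (num_frames - 1) // ratio + 1


def latent_t_to_frames(num_latent_t: int, ratio: int = TEMPORAL_COMPRESSION) -> int:
    """Return the number of video frames covered by a temporal latent size."""
    if num_latent_t < 1:
        raise ValueError(f"num_latent_t must be positive, got {num_latent_t}")
    return (num_latent_t - 1) * ratio + 1


if __name__ == "__main__":
    print(frames_to_latent_t(77))   # 20, the dataset's maximum frame count
    print(latent_t_to_frames(21))   # 81, the frame count implied by a latent size of 21
    print(latent_t_to_frames(16))   # 61
```

Under these assumptions, a latent size of 21 corresponds to 81 frames, which is more than the 77 frames available in the dataset.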

Reproduction

bash FastVideo/examples/distill/Wan2.1-T2V/Wan-Syn-Data-480P/distill_dmd_VSA_t2v_1.3B.slurm

Environment

CUDA 12.8

EricLina, Oct 21 '25 12:10

If you’re using our dataset, you need to set it to 20. This won’t affect quality, but the model will then generate videos optimally at 77 frames.

BrianChen1129, Oct 24 '25 18:10

Thank you. I noticed that the VSA tuning scripts use different settings for latent size. Could you explain why the scripts set it to 16? https://github.com/hao-ai-lab/FastVideo/blob/50da62e722165a8847895a551aa56bc5ee2bb08c/scripts/finetune/finetune_v1_VSA.sh#L27

EricLina, Oct 25 '25 08:10

I think this was due to the restrictions VSA places on dimensions.

SolitaryThinker, Oct 25 '25 08:10

Thank you. I noticed that the VSA tuning scripts use different settings for latent size. Could you explain why the scripts set it to 16?

FastVideo/scripts/finetune/finetune_v1_VSA.sh, line 27 in 50da62e: --num_latent_t 16 \

There are no restrictions. You can set it to any value.

BrianChen1129, Oct 25 '25 09:10