FastVideo [Bug] Latent size mismatch in distill script for Wan-Syn-Data-480P

Describe the bug

In the distill script https://github.com/hao-ai-lab/FastVideo/blob/4f3e8751db146c545f81156ddc469e53fb621cbd/examples/distill/Wan2.1-T2V/Wan-Syn-Data-480P/distill_dmd_VSA_t2v_1.3B.slurm#L58 the latent size is set to 21. However, the maximum frame count for Wan-Syn-Data-480P is 77, which corresponds to a latent temporal length of 20. The configured latent size therefore does not match the shape of the dataset's latents.
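For reference, the 77 → 20 relationship can be checked with a small sketch. It assumes the Wan2.1 VAE compresses the temporal axis by 4x and encodes the first frame on its own, so that latent length = (frames - 1) / 4 + 1. The function names and the `TEMPORAL_STRIDE` constant are hypothetical and are not taken from the FastVideo code base.

```python
# Assumption: 4x temporal compression, first frame encoded separately.
TEMPORAL_STRIDE = 4


def frames_to_latent_t(num_frames: int, stride: int = TEMPORAL_STRIDE) -> int:
    """Latent temporal length for a clip of `num_frames` video frames."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return (num_frames - 1) // stride + 1


def latent_t_to_frames(num_latent_t: int, stride: int = TEMPORAL_STRIDE) -> int:
    """Number of video frames decoded from `num_latent_t` latent frames."""
    if num_latent_t < 1:
        raise ValueError("num_latent_t must be at least 1")
    return (num_latent_t - 1) * stride + 1


if __name__ == "__main__":
    print(frames_to_latent_t(77))  # dataset clips: 20 latent frames
    print(frames_to_latent_t(81))  # 81-frame clips: 21 latent frames
    print(latent_t_to_frames(21))  # a latent size of 21 implies 81 frames
```

Under this assumption, a latent size of 21 corresponds to 81-frame clips, while the 77-frame clips in the dataset produce 20 latent frames.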

The script still runs, but will this shape mismatch cause any problems?

Reproduction

bash FastVideo/examples/distill/Wan2.1-T2V/Wan-Syn-Data-480P/distill_dmd_VSA_t2v_1.3B.slurm

Environment

CUDA 12.8

Oct 21 '25 12:10 EricLina

If you’re using our dataset, you need to set it to 20. This won’t affect the quality, but will generate videos at an optimal 77 frames.

Oct 24 '25 18:10 BrianChen1129

Thank you. I noticed that the VSA tuning scripts use different settings for latent size. Could you explain why the scripts set it to 16? https://github.com/hao-ai-lab/FastVideo/blob/50da62e722165a8847895a551aa56bc5ee2bb08c/scripts/finetune/finetune_v1_VSA.sh#L27

Oct 25 '25 08:10 EricLina

I think this was due to the restrictions VSA places on dimensions.

Oct 25 '25 08:10 SolitaryThinker

Regarding "Could you explain why the scripts set it to 16?" (FastVideo/scripts/finetune/finetune_v1_VSA.sh, line 27 in 50da62e):

--num_latent_t 16 \

There are no restrictions. You can set it to any value.

Oct 25 '25 09:10 BrianChen1129