[Bug] Latent size mismatch in distill script for Wan-Syn-Data-480P
Describe the bug
In the distill script https://github.com/hao-ai-lab/FastVideo/blob/4f3e8751db146c545f81156ddc469e53fb621cbd/examples/distill/Wan2.1-T2V/Wan-Syn-Data-480P/distill_dmd_VSA_t2v_1.3B.slurm#L58 the latent size is set to 21. However, the maximum frame count for Wan-Syn-Data-480P is 77, which means its latents have a temporal length of 20. The configured latent size therefore does not match the data.
The script still runs, but will this mismatch cause any problems?
Reproduction
bash FastVideo/examples/distill/Wan2.1-T2V/Wan-Syn-Data-480P/distill_dmd_VSA_t2v_1.3B.slurm
Environment
CUDA 12.8
If you’re using our dataset, you need to set it to 20. This won’t affect the quality, but the resulting model will then be optimal at generating 77-frame videos.
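For reference, the 77 ↔ 20 correspondence can be checked with a small sketch. It assumes the Wan2.1 VAE is causal with a temporal compression factor of 4 (the first frame maps to one latent, then every 4 frames map to one more); that assumption is not stated in this thread, and the helper names are hypothetical, not part of FastVideo.

```python
def num_latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Temporal latent length for a causal video VAE.

    Assumption: the first frame is encoded on its own and each following
    group of `temporal_stride` frames adds one latent frame.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError("num_frames must have the form temporal_stride * k + 1")
    return (num_frames - 1) // temporal_stride + 1


def num_video_frames(num_latent_t: int, temporal_stride: int = 4) -> int:
    """Inverse mapping: video frame count for a given latent length."""
    if num_latent_t < 1:
        raise ValueError("num_latent_t must be at least 1")
    return (num_latent_t - 1) * temporal_stride + 1


print(num_latent_frames(77))  # 20, the dataset's latent length
print(num_latent_frames(81))  # 21, the value in the distill script
print(num_video_frames(16))   # 61
```

Under this assumption, a latent size of 21 corresponds to 81-frame videos, while the 77-frame dataset yields 20.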
Thank you. I noticed that the VSA tuning scripts use different settings for latent size. Could you explain why the scripts set it to 16? https://github.com/hao-ai-lab/FastVideo/blob/50da62e722165a8847895a551aa56bc5ee2bb08c/scripts/finetune/finetune_v1_VSA.sh#L27
I think this was due to the restrictions VSA placed on dimensions.
FastVideo/scripts/finetune/finetune_v1_VSA.sh
Line 27 in 50da62e
--num_latent_t 16 \
There are no restrictions. You can set it to any value.