Reason for 49 frames (extra split for interpolation)
System Info
N/A
Information
- [ ] The official example scripts
- [ ] My own modified scripts and tasks
Reproduction
N/A
Expected behavior
I am fine-tuning the T2V model and want to understand why the frame count is required to be of the form 4x+1 (for example, 49).
I see that the DownSample3D module in the VAE splits off the first frame and only interpolates the remaining frames:
https://github.com/THUDM/CogVideo/blob/8f1829f1cdb405a10023f9ba7a292799d4d698ff/sat/vae_modules/cp_enc_dec.py#L574
Why not set the frame count to 48? Why do we need a frame that is not interpolated with the others?
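
For readers following along, the split described above can be sketched in plain Python. This is an illustration only, not the code at the linked line: the function name `downsample_time_keep_first` is hypothetical, each frame is stood in for by a single float, and pairwise averaging is assumed as the temporal pooling step.

```python
from __future__ import annotations


def downsample_time_keep_first(frames: list[float]) -> list[float]:
    """Halve the temporal length by averaging neighbouring pairs.

    When the number of frames is odd, the first frame is split off and
    passed through unchanged; only the remaining frames are pooled.
    (Hypothetical helper; pairwise averaging is an assumption.)
    """
    if len(frames) % 2 == 1:
        first, rest = frames[:1], frames[1:]
    else:
        first, rest = [], frames
    # Average non-overlapping pairs: (rest[0], rest[1]), (rest[2], rest[3]), ...
    pooled = [(rest[i] + rest[i + 1]) / 2 for i in range(0, len(rest) - 1, 2)]
    return first + pooled


clip = [float(i) for i in range(49)]          # 49 frames = 4 * 12 + 1
stage1 = downsample_time_keep_first(clip)     # 1 + 24 frames
stage2 = downsample_time_keep_first(stage1)   # 1 + 12 frames
print(len(clip), len(stage1), len(stage2))
```

With two such stages, 49 input frames shrink to 25 and then to 13, and the first frame is never mixed with its neighbours.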
We follow MAGVIT-v2 (https://arxiv.org/html/2310.05737v2). The 4x+1 frame count enables joint training on images and videos.
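
As a rough illustration of the arithmetic behind 4x+1, the sketch below assumes the first frame maps to its own latent frame and every following group of 4 frames maps to one more. The helper name `latent_frame_count` and the default factor of 4 are illustrative and not taken from the repository.

```python
def latent_frame_count(num_frames: int, temporal_factor: int = 4) -> int:
    """Number of latent frames for an input of the form temporal_factor * x + 1.

    Assumption: the first frame is encoded on its own, and the remaining
    frames are compressed in time by `temporal_factor`.
    """
    if num_frames < 1:
        raise ValueError("need at least one frame")
    rest = num_frames - 1
    if rest % temporal_factor != 0:
        raise ValueError(
            f"{num_frames} frames is not of the form {temporal_factor}x+1"
        )
    return 1 + rest // temporal_factor


print(latent_frame_count(1))   # a single image: the x = 0 case
print(latent_frame_count(49))  # a 49-frame clip: the x = 12 case
```

Under this assumption a single image is simply the x = 0 case of a video, which is what lets images and videos share one encoder path, while a count such as 48 does not fit the pattern and is rejected by the check above.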
If I'm only fine-tuning on videos, would it be better to train without the extra frame?
same question