CogVideo: Reason for 49 frames (extra split for interpolation)

I am finetuning the T2V model and want to understand why a frame count of the form 4x+1 is required.

I see that the DownSample3D module in the VAE will split the first frame off, and only interpolate the remaining frames.

https://github.com/THUDM/CogVideo/blob/8f1829f1cdb405a10023f9ba7a292799d4d698ff/sat/vae_modules/cp_enc_dec.py#L574
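
The split described above can be illustrated with a minimal pure-Python sketch. This is not the repository's code: `downsample_time_once` and `downsample_time` are hypothetical helpers that mimic the described behaviour on a plain list of per-frame values. It assumes the remaining frames are averaged in pairs (kernel 2, stride 2, no padding) and that two such stages make up the 4x temporal compression.

```python
def downsample_time_once(frames):
    """Keep the first frame, then average the remaining frames in pairs.

    Hypothetical stand-in for one temporal-downsampling stage. A trailing
    unpaired frame is dropped, as with stride-2 pooling without padding.
    """
    if len(frames) <= 1:
        return list(frames)  # a single image passes through unchanged
    first, rest = frames[0], frames[1:]
    pooled = [(rest[i] + rest[i + 1]) / 2 for i in range(0, len(rest) - 1, 2)]
    return [first] + pooled


def downsample_time(frames, stages=2):
    # Assumption: two stages, i.e. 4x temporal compression overall.
    for _ in range(stages):
        frames = downsample_time_once(frames)
    return frames


clip_49 = [float(i) for i in range(49)]
clip_48 = [float(i) for i in range(48)]
image = [0.0]

print(len(downsample_time(clip_49)))
print(len(downsample_time(clip_48)))
print(len(downsample_time(image)))
```

Under these assumptions, 49 frames reduce to 13 latent frames with nothing discarded and a single image reduces to one latent frame, while 48 frames reduce to 12 and lose an unpaired trailing frame at each stage.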

Why not set the frame count to 48? Why do we need a frame that is not interpolated with the others?

Jan 12 '25 04:01 karan-dalal

We follow MAGVIT-v2 (https://arxiv.org/html/2310.05737v2). A 4x+1 frame count enables joint training on images and videos.
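
As a rough sketch of the arithmetic behind the 4x+1 rule: the first frame maps to its own latent frame and every following group of 4 frames maps to one more, so a single image is simply the x = 0 case. The helper names below are hypothetical, and the compression factor of 4 is taken from the "4x+1" wording rather than from the code.

```python
def latent_frame_count(num_frames, temporal_compression=4):
    """Latent frames for a clip whose length has the form 4x+1."""
    if num_frames < 1 or (num_frames - 1) % temporal_compression != 0:
        raise ValueError(f"{num_frames} is not of the form {temporal_compression}x+1")
    return (num_frames - 1) // temporal_compression + 1


def trim_to_valid_length(num_frames, temporal_compression=4):
    """Largest length of the form 4x+1 that does not exceed num_frames."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    return ((num_frames - 1) // temporal_compression) * temporal_compression + 1


print(latent_frame_count(49))    # video clip
print(latent_frame_count(1))     # single image, x = 0
print(trim_to_valid_length(48))  # a 48-frame clip would need trimming
```

With this rule a 49-frame clip gives 13 latent frames and an image gives 1, while a 48-frame clip would have to be trimmed to 45 frames (or extended to 49) to fit the pattern.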

Jan 14 '25 05:01 yzy-thu

If I'm only finetuning with videos, would it be better to just train without the extra 1?

Jan 14 '25 05:01 karan-dalal

same question

Jan 14 '25 13:01 zhuochen02

same question

Jan 24 '25 08:01 Wang-pengfei