CogVideo

Reason for 49 frames (extra split for interpolation)

Open karan-dalal opened this issue 1 year ago • 4 comments

System Info

N/A

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts and tasks

Reproduction

N/A

Expected behavior

I am fine-tuning the T2V model and want to understand why the frame count is required to be of the form 4x+1.

I see that the DownSample3D module in the VAE will split the first frame off, and only interpolate the remaining frames.

https://github.com/THUDM/CogVideo/blob/8f1829f1cdb405a10023f9ba7a292799d4d698ff/sat/vae_modules/cp_enc_dec.py#L574
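The splitting behavior described above can be illustrated with a minimal sketch. This is not the repository's DownSample3D code: `downsample_time` is a hypothetical helper, plain floats stand in for feature maps, averaging adjacent pairs is an assumption about how the remaining frames are pooled, and applying the step twice assumes the 4x temporal compression comes from two 2x stages.

```python
def downsample_time(frames):
    """Halve the temporal length while leaving the first frame untouched.

    `frames` is a list of per-frame values (floats stand in for feature maps).
    Pairwise averaging is an assumption of this sketch, not the library's code.
    """
    if len(frames) <= 1:
        # A single image has nothing to pool over time.
        return list(frames)
    first, rest = frames[0], frames[1:]
    # Pool the remaining frames in non-overlapping pairs; a trailing
    # unpaired frame would be dropped, so len(rest) should be even.
    pooled = [(rest[i] + rest[i + 1]) / 2 for i in range(0, len(rest) - 1, 2)]
    return [first] + pooled


video = [float(t) for t in range(49)]  # 49 frames: 1 + 48
once = downsample_time(video)          # 1 + 24 = 25 frames
twice = downsample_time(once)          # 1 + 12 = 13 frames
print(len(once), len(twice))           # 25 13
print(twice[0])                        # 0.0, the first frame passes through unchanged
```

Under these assumptions, the first frame is never mixed with its neighbors, and a one-frame input (an image) comes out as one frame.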

Why not set the frame count to 48? Why do we need a first frame that is not interpolated with the others?

karan-dalal • Jan 12 '25 04:01

We follow MAGVIT-v2 (https://arxiv.org/html/2310.05737v2). A frame count of 4x+1 enables joint training on images and videos.
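As a rough illustration of the 4x+1 rule, the arithmetic can be sketched as follows. `latent_frame_count` is a hypothetical helper, not part of the CogVideo code base, and it assumes the first frame is encoded on its own while every following group of 4 frames maps to one latent frame.

```python
def latent_frame_count(num_frames, ratio=4):
    """Return the number of latent frames for a clip of `num_frames` frames.

    Assumes (for this sketch) that the first frame becomes one latent frame
    and each subsequent group of `ratio` frames becomes one more.
    """
    if num_frames < 1 or (num_frames - 1) % ratio != 0:
        raise ValueError(f"frame count must be of the form {ratio}x+1, got {num_frames}")
    return 1 + (num_frames - 1) // ratio


print(latent_frame_count(1))   # 1: a single image maps to one latent frame
print(latent_frame_count(49))  # 13: 1 + 48 / 4
```

With this rule an image is simply the x = 0 case of a video, which is what allows both to pass through the same encoder.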

yzy-thu • Jan 14 '25 05:01

If I'm only fine-tuning on videos, would it be better to train without the extra frame?

karan-dalal • Jan 14 '25 05:01

same question

zhuochen02 • Jan 14 '25 13:01

same question

Wang-pengfei • Jan 24 '25 08:01