CogVideo: About Frame Pack & 3D RoPE

Feature request / 功能建议

Thanks for your work! The paper mentions Frame Pack, which requires generating attention masks (https://arxiv.org/abs/2307.06304). In the forward function, however, I see:
kwargs["input_ids"] = kwargs["position_ids"] = kwargs["attention_mask"] = torch.ones((1, 1)).to(x.dtype)
I'm confused about this mask generation. Also, is 3D RoPE not involved in inference?

Motivation / 动机

Understanding the code

Your contribution / 您的贡献

Aug 07 '24 06:08 burnquiet

This mask is there to get full attention in the sat framework.
Our 2B model does not use RoPE; the subsequent models do.
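
To illustrate why an all-ones mask of shape (1, 1) amounts to full attention, here is a minimal pure-Python sketch. It assumes a multiplicative masking convention, score * m - 10000 * (1 - m), and that size-1 mask dimensions are broadcast over the score matrix the way tensor broadcasting works. The helper masked_attention_weights is hypothetical and is not taken from the sat code, whose implementation is not shown in this thread.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores, mask):
    """Apply a 0/1 mask to raw attention scores, then softmax each row.

    The mask may be n x n or 1 x 1; size-1 dimensions are broadcast.
    Assumed convention: score * m - 10000 * (1 - m).
    """
    rows, cols = len(mask), len(mask[0])
    out = []
    for i, row in enumerate(scores):
        masked = []
        for j, s in enumerate(row):
            m = mask[i if rows > 1 else 0][j if cols > 1 else 0]
            masked.append(s * m - 10000.0 * (1.0 - m))
        out.append(softmax(masked))
    return out

scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -1.0, 0.0]]

ones_1x1 = [[1.0]]                                   # the mask from the question
full = [[1.0] * 3 for _ in range(3)]                 # explicit full-attention mask
causal = [[1.0 if j <= i else 0.0 for j in range(3)] for i in range(3)]

w_1x1 = masked_attention_weights(scores, ones_1x1)
w_full = masked_attention_weights(scores, full)
w_causal = masked_attention_weights(scores, causal)

print(w_1x1 == w_full)    # the 1 x 1 mask behaves like the explicit full mask
print(w_1x1 == w_causal)  # and differs from a causal mask
```

Under these assumptions, every score is multiplied by 1 and nothing is subtracted, so every token attends to every other token.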

Aug 07 '24 12:08 tengjiayan20

Are you releasing any subsequent model soon?

Does your current code include an implementation of NaViT?

Aug 08 '24 13:08 jinhuaca

It seems that, although the paper mentions NaViT, the open-sourced dataloader does not contain the relevant code:

https://github.com/THUDM/CogVideo/blob/main/sat/data_video.py

Aug 08 '24 13:08 jinhuaca

Please see the "update and news" section.
The NaViT-related code is only used for pretraining. We have released the code for inference and finetuning, which does not need the NaViT code. Releasing the NaViT code is planned for later.
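
For readers wondering what kind of mask packing requires, below is a minimal sketch of the block-diagonal attention mask described in the NaViT paper linked above: several examples are concatenated into one sequence, and each token may only attend to tokens from its own example. This is not the unreleased CogVideo pretraining code; the function name build_packed_attention_mask and the convention that segment id 0 marks padding are assumptions made for illustration.

```python
def build_packed_attention_mask(lengths, packed_len):
    """Build segment ids and a block-diagonal 0/1 mask for packed examples.

    lengths:    token count of each example packed into the sequence
    packed_len: total length of the packed sequence, including padding
    """
    if sum(lengths) > packed_len:
        raise ValueError("examples do not fit into the packed sequence")

    # Label every token with the index of the example it came from (1-based).
    segment_ids = []
    for seg, n in enumerate(lengths, start=1):
        segment_ids.extend([seg] * n)
    # Assumed convention: id 0 marks padding tokens at the end.
    segment_ids.extend([0] * (packed_len - len(segment_ids)))

    # Token a may attend to token b only if both belong to the same example.
    # Padding rows are all zero; their outputs would be discarded downstream.
    mask = [[1 if (a == b and a != 0) else 0 for b in segment_ids]
            for a in segment_ids]
    return segment_ids, mask

# Two examples of 3 and 2 tokens packed into a sequence of length 6.
segment_ids, mask = build_packed_attention_mask([3, 2], packed_len=6)
print(segment_ids)
for row in mask:
    print(row)
```

Inference on a single video has only one example per sequence, so no such block structure is needed there.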

Aug 08 '24 16:08 tengjiayan20

Looking forward to the release of the NaViT code.

Aug 23 '24 02:08 colian

Looking forward to it too!

Aug 23 '24 08:08 skeletonNN