About Frame Pack & 3D RoPE
Feature request
Thanks for your work!
The paper mentions Frame Pack, which requires generating attention masks (https://arxiv.org/abs/2307.06304).
In the forward function I see:
kwargs["input_ids"] = kwargs["position_ids"] = kwargs["attention_mask"] = torch.ones((1, 1)).to(x.dtype)
I am confused about how this generates the mask.
Also, is 3D RoPE not used in inference?
Motivation
Understanding the code
Your contribution
- The mask is set this way to get full attention in the sat framework.
- Our 2B model does not use RoPE; the subsequent models do.
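For readers puzzled by the same line, here is a minimal sketch in plain Python (not PyTorch, and not the sat code) of why an all-ones mask of shape (1, 1) amounts to full attention. It assumes a multiplicative mask convention where 1 keeps a score and 0 pushes it down by a large offset; that convention, the `NEG_OFFSET` value, and every function name below are assumptions made for illustration. The `packing_mask` helper is a hypothetical example of the block-diagonal mask that a packing scheme would need, shown only for contrast.

```python
import math

NEG_OFFSET = 10000.0  # assumed penalty applied to masked-out positions


def softmax(row):
    # Numerically stable softmax over one row of scores.
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def broadcast_get(mask, i, j):
    # Mimic broadcasting: a size-1 dimension is reused for every index.
    row = mask[i] if len(mask) > 1 else mask[0]
    return row[j] if len(row) > 1 else row[0]


def attention_weights(scores, mask):
    # Apply the mask to raw scores, then normalize each query row.
    weights = []
    for i, score_row in enumerate(scores):
        masked = []
        for j, s in enumerate(score_row):
            m = broadcast_get(mask, i, j)
            masked.append(s * m - NEG_OFFSET * (1.0 - m))
        weights.append(softmax(masked))
    return weights


def packing_mask(segment_ids):
    # Hypothetical packing mask: tokens attend only within their own segment.
    return [[1.0 if a == b else 0.0 for b in segment_ids] for a in segment_ids]


scores = [[0.5, 1.0, -0.3], [0.2, 0.1, 0.9], [1.5, -1.0, 0.0]]

# A (1, 1) mask of ones broadcasts over every query/key pair: nothing is masked.
full = attention_weights(scores, [[1.0]])

# Packing two samples (tokens 0-1 and token 2) blocks attention across samples.
packed = attention_weights(scores, packing_mask([0, 0, 1]))
```

With the all-ones mask, `full` equals the plain softmax of `scores`, so every token attends to every other token. With the packing mask, the cross-segment weights collapse to zero. Inference on a single sample has nothing to separate, which is consistent with the reply above.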
Are you releasing any subsequent model soon?
Does your current code include an implementation of NaViT?
Although the paper mentions NaViT, the open-sourced dataloader does not seem to contain the relevant code:
https://github.com/THUDM/CogVideo/blob/main/sat/data_video.py
- See the "update and news" section.
- The NaViT-related code is only used for pretraining. We have released the code for inference and finetuning, which does not need the NaViT code. Releasing the NaViT code is planned for later.
Looking forward to the release of the NaViT code.
Looking forward to it too!