jinhuaca
Why do we need the division `tensor.div_(dist.get_world_size(group=self.dp_process_group) / float(self.sequence_parallel_size))`? Is it to scale the loss function up according to the sequence parallel group size? Why not...
It doesn't seem that we need a sequence-parallel-aware loss function, according to this issue: https://github.com/microsoft/DeepSpeed/issues/5248 It seems that this is already handled implicitly by DeepSpeed Ulysses, right? @samadejacobs
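To make the arithmetic in the quoted `div_` line concrete, here is a minimal sketch in plain Python. It assumes (this is my reading, not something confirmed in the thread) that `dp_process_group` spans both the data-parallel and sequence-parallel ranks, so dividing by `world_size / sequence_parallel_size` amounts to dividing by the number of data-parallel replicas. The rank layout, the gradient values, and the helper names are all hypothetical.

```python
# Toy setup (hypothetical): 2 data-parallel replicas x 2 sequence-parallel ranks.
sequence_parallel_size = 2
world_size = 4  # assumed size of the combined group used for the all-reduce

# Made-up per-rank gradients for one parameter.
# Rank r is assumed to belong to replica r // sequence_parallel_size.
grads = [1.0, 3.0, 2.0, 6.0]


def reduce_like_quoted_line(grads, world_size, sp_size):
    """Sum over all ranks, then divide by world_size / sp_size.

    The quoted line divides before the all-reduce; because division is
    linear, dividing after the sum gives the same result.
    """
    total = sum(grads)  # stands in for an all-reduce with SUM
    return total / (world_size / float(sp_size))


def sum_within_replica_then_average(grads, sp_size):
    """Sum the partial gradients inside each sequence-parallel group,
    then average those sums across data-parallel replicas."""
    replicas = [sum(grads[i:i + sp_size]) for i in range(0, len(grads), sp_size)]
    return sum(replicas) / len(replicas)


scaled = reduce_like_quoted_line(grads, world_size, sequence_parallel_size)
grouped = sum_within_replica_then_average(grads, sequence_parallel_size)
plain_mean = sum(grads) / world_size

print(scaled, grouped, plain_mean)
```

Under these assumptions the two functions agree, while a plain mean over all ranks is smaller by a factor of `sequence_parallel_size`. Whether summing within the sequence-parallel group is the right behavior depends on how the per-rank loss is normalized, which is the open question here.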
Do you condition i2v models the same way as t2v models? For example, do you concatenate the image embeddings (instead of text embeddings) with the video tokens as...
This paper also talks about instabilities of flash attention: https://arxiv.org/pdf/2405.02803v1
Neither pos_emd is learnable, so it is expected that they are not in the saved checkpoints.
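As a minimal PyTorch sketch of why a fixed positional embedding can be missing from a checkpoint: a tensor registered with `register_buffer(..., persistent=False)` is neither a parameter nor part of `state_dict()`. The `ToyModel` class and the `pos_embed` name are hypothetical, and whether the repository stores its embeddings this way is an assumption.

```python
import torch
import torch.nn as nn


class ToyModel(nn.Module):
    """Hypothetical model with a fixed (non-learnable) sinusoidal table."""

    def __init__(self, seq_len=4, dim=8):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

        # Build a fixed sinusoidal table of shape (seq_len, dim).
        pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
        idx = torch.arange(0, dim, 2, dtype=torch.float32)
        angles = pos / (10000 ** (idx / dim))
        table = torch.zeros(seq_len, dim)
        table[:, 0::2] = torch.sin(angles)
        table[:, 1::2] = torch.cos(angles)

        # Non-persistent buffer: moves with the module across devices,
        # but is excluded from state_dict() and is not a parameter.
        self.register_buffer("pos_embed", table, persistent=False)

    def forward(self, x):
        # x: (batch, seq_len, dim); the table broadcasts over the batch.
        return self.proj(x + self.pos_embed)


model = ToyModel()
state_keys = sorted(model.state_dict().keys())
param_names = [name for name, _ in model.named_parameters()]
print(state_keys)
```

Note that a buffer registered with the default `persistent=True` would be saved, so being non-learnable alone does not decide whether a tensor ends up in the checkpoint.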
Are you releasing any subsequent model soon? Does your current code include implementation of NaViT?
It seems that, although the paper mentions NaViT, the open sourced dataloader does not contain relevant code sections: https://github.com/THUDM/CogVideo/blob/main/sat/data_video.py