jinhuaca comments

Results: 7 comments by jinhuaca

[BUG] Sequence Parallel(Ulysses) Training Gradient Scaling Issue

Why do we need the division `tensor.div_(dist.get_world_size(group=self.dp_process_group) / float(self.sequence_parallel_size))`? Is it to scale the loss function up according to the sequence parallel group size? Why not...
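For reference, the arithmetic of the divisor in the quoted line can be sketched in plain Python. This is only an illustration of the expression `world_size / float(sequence_parallel_size)`. The helper names `gradient_divisor` and `reduce_and_scale`, the example sizes, and the reading that the process group spans both data-parallel and sequence-parallel ranks are assumptions, not DeepSpeed code.

```python
def gradient_divisor(dp_world_size: int, sequence_parallel_size: int) -> float:
    """Divisor from the quoted expression: group size / sequence parallel size."""
    if sequence_parallel_size <= 0 or dp_world_size % sequence_parallel_size != 0:
        raise ValueError("group size must be a positive multiple of the sequence parallel size")
    return dp_world_size / float(sequence_parallel_size)


def reduce_and_scale(per_rank_grads: list[float], sequence_parallel_size: int) -> float:
    """Sum one gradient value over all ranks (as a SUM all-reduce would), then divide."""
    total = sum(per_rank_grads)
    return total / gradient_divisor(len(per_rank_grads), sequence_parallel_size)


# Hypothetical setup: 8 ranks in the group, sequence parallel size 2.
# The divisor is 8 / 2 = 4, so the summed gradient is divided by 4 rather than 8.
example = reduce_and_scale([0.5] * 8, sequence_parallel_size=2)
print(example)
```

Under this reading, the gradient is averaged over the groups that see different samples, while contributions from ranks that share a sequence are summed rather than averaged. Whether that is the intended behavior is exactly what the question asks.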

When using pure DeepSpeed ulysses and zero stage 3 to continue pre-training, the loss gap between each GPU is too large.[BUG]

It doesn't seem that we need a sequence-parallel-aware loss function, according to this issue: https://github.com/microsoft/DeepSpeed/issues/5248 It seems that this is handled implicitly by DeepSpeed Ulysses, right? @samadejacobs

Great work! when are you planning to release image-to-video models?

Do you do conditioning similarly in i2v models as compared to t2v models? For example, do you concatenate the image embeddings (instead of text embeddings) with the video tokens as...

There was a strange computation error between standard attention and flash-attention2

This paper also talks about instabilities of flash attention: https://arxiv.org/pdf/2405.02803v1

Missing keys: ['pos_embed', 'pos_embed_temporal']

Neither `pos_embed` nor `pos_embed_temporal` is learnable, so it is expected that they are absent from the saved checkpoints.
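A minimal sketch of why a fixed positional embedding does not need to be stored: it can be recomputed deterministically when the checkpoint does not contain it. The sinusoidal formula, the interleaved sin/cos layout, the function name `sincos_pos_embed`, and the toy `checkpoint` dictionary are all assumptions made for illustration; the comment above does not say how the model actually builds `pos_embed`.

```python
import math


def sincos_pos_embed(num_positions: int, dim: int) -> list[list[float]]:
    """Build a fixed (non-learnable) sinusoidal table of shape [num_positions, dim]."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            angle = pos / (10000 ** (2 * i / dim))
            # Interleave sin and cos; the layout is an arbitrary choice here.
            row.extend((math.sin(angle), math.cos(angle)))
        table.append(row)
    return table


# Hypothetical checkpoint that holds learned weights but no positional table.
checkpoint = {"linear.weight": [[0.1, 0.2], [0.3, 0.4]]}

pos_embed = checkpoint.get("pos_embed")
if pos_embed is None:
    # Missing key: rebuild the table instead of treating it as an error.
    pos_embed = sincos_pos_embed(num_positions=4, dim=8)
```

Because the table depends only on its shape, rebuilding it gives the same values on every load, which is why a missing-keys report for such entries can be harmless.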

About Frame Pack & 3d Rope

Are you releasing any subsequent model soon? Does your current code include implementation of NaViT?

About Frame Pack & 3d Rope

It seems that, although the paper mentions NaViT, the open-sourced dataloader does not contain the relevant code: https://github.com/THUDM/CogVideo/blob/main/sat/data_video.py