[Question] CP and DP
Hi, this is a really great repo! Thanks for open-sourcing it!
I am reading the code for how torchtitan handles multi-dimensional parallelism. It seems that cp is one of the mesh dimensions interacting with dp_shard, dp_replicate, etc. My understanding of cp is that it is orthogonal to the other parallelisms. For example, dp_shard=8, dp_replicate=1, and cp=8 should be a valid configuration for an 8-GPU node. But according to the code, it will raise an error because dp_shard * cp != world_size.
https://github.com/pytorch/torchtitan/blob/6df8c8925bb2ba9b4e6aa88cece0e3f0633ab6ce/torchtitan/distributed/parallel_dims.py#L48
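For context, the check behind that line amounts to requiring that the product of all mesh dimensions equals the world size. Here is a minimal sketch of that invariant; `check_mesh` is an illustrative helper, not torchtitan's actual code, and the parameter names simply mirror the config fields discussed above:

```python
def check_mesh(dp_replicate: int, dp_shard: int, cp: int, tp: int, pp: int, world_size: int) -> None:
    """Require the product of all mesh dimensions to equal the world size."""
    # Each rank owns exactly one coordinate of the N-D device mesh, so the
    # mesh must have exactly world_size coordinates in total.
    mesh_size = dp_replicate * dp_shard * cp * tp * pp
    if mesh_size != world_size:
        raise ValueError(
            f"mesh size {mesh_size} "
            f"(dp_replicate={dp_replicate} * dp_shard={dp_shard} * cp={cp} "
            f"* tp={tp} * pp={pp}) != world_size {world_size}"
        )

# dp_replicate=1, dp_shard=8, cp=8 on an 8-GPU node: 1 * 8 * 8 = 64 != 8 -> error.
try:
    check_mesh(dp_replicate=1, dp_shard=8, cp=8, tp=1, pp=1, world_size=8)
except ValueError as e:
    print(e)
```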
@galalalala The configuration you suggested is not valid. With your proposed sharding strategy, a global batch (assuming batch size 8 and sequence length 8192) first needs to be sharded on the batch dimension, so each dp_shard group gets a local batch with batch size 1. Then the local batch is further sharded on the sequence dimension, so in the end each rank gets a sharded batch with batch size 1 and sequence length 1024. We cannot do this with only 8 GPUs; we need 64 GPUs.
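To make the arithmetic concrete, here is a small sketch with the numbers from the reply above (plain Python illustrating the two sharding steps, not torchtitan code):

```python
# Example sizes from the reply above.
global_batch, seq_len = 8, 8192
dp_shard, cp = 8, 8

# Step 1: shard the batch dimension across the dp_shard groups.
local_batch = global_batch // dp_shard   # 1 sample per dp_shard group

# Step 2: within each dp_shard group, CP shards the sequence dimension.
local_seq = seq_len // cp                # 1024 tokens per rank

# Each (dp_shard, cp) coordinate is a distinct rank, so the mesh needs
# dp_shard * cp GPUs (times dp_replicate, tp, pp if those are > 1).
ranks_needed = dp_shard * cp             # 64, not 8

print(local_batch, local_seq, ranks_needed)  # 1 1024 64
```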
Thank you for the quick reply. I see your point. In my config, I assumed the CP group shares the same batch of data, so each dp_shard rank takes 1/8 of the sequence length of the same batch. In your case, where each dp_shard rank has different data, that is indeed an invalid configuration. Does the implementation in torchtitan follow what you described above?
@galalalala Yes.