ColossalAI
The division by the local world size in sequence parallelism
I understand that grad_k should be the reduced sum over the local ranks, since each rank only holds a sub-sequence of Q.
However, I do not quite understand why a division by the local world size is needed in the following:
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L63
By comparison, column-wise tensor parallelism does not introduce any such division after the reduce-sum. What is the difference?
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_1d/_utils.py#L36
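For reference, a minimal sketch of what that plain reduce-sum amounts to (the function name is mine, not the actual API):

```python
import torch
import torch.distributed as dist

def reduce_sum_across_ranks(grad: torch.Tensor, group=None) -> torch.Tensor:
    # Plain reduce-sum: each rank contributes its partial gradient and gets
    # back the sum over all ranks; no division by the world size follows.
    if dist.is_initialized() and dist.get_world_size(group) > 1:
        dist.all_reduce(grad, group=group)
    return grad
```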
An additional (perhaps not that related) question: why not use ring all-reduce in the following?
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L61
Hi, thanks for your questions!
Firstly, gradients are summed up by the all_reduce() operation and need to be divided by the number of ranks to get the mean value (of course, you could set op=ReduceOp.AVG directly); a sketch of this sum-then-divide pattern follows below.
Secondly, that step is merely a thin wrapper around torch's all_reduce, so it only needs to perform the reduce-sum.
Thirdly, this is probably because torch did not have an API for ring all-reduce at the time this script was written, but it is definitely possible.
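A rough sketch of the sum-then-divide pattern described in the first point (illustrative code, not the actual implementation; all_reduce_mean is a made-up name):

```python
import torch
import torch.distributed as dist

def all_reduce_mean(grad: torch.Tensor, group=None) -> torch.Tensor:
    # all_reduce with SUM leaves the gradient multiplied by the number of
    # contributing ranks, so divide afterwards to recover the mean.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    grad /= dist.get_world_size(group)
    # Recent torch versions (with NCCL) can average directly instead:
    # dist.all_reduce(grad, op=dist.ReduceOp.AVG, group=group)
    return grad
```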
On the averaging question, I am actually wondering why averaging is correct rather than summing; there must be something I am not taking into account.
Here is my rationale, which is very similar to a reduce-scatter op (a rough sketch follows the list):
- compute the grad locally
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L59
- reduce the grad, since the Q being used is only a sub-sequence
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L61
- index the grad, since the K being used is a sub-sequence of the previously gathered full sequence
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L62
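In code, the pattern I have in mind looks roughly like this (a sketch only; the names, the (batch, seq, dim) tensor shapes, and the absence of any division are my assumptions for the argument, not the actual implementation):

```python
import torch
import torch.distributed as dist

def ring_qk_grad_k_sketch(grad_output: torch.Tensor,
                          q: torch.Tensor,
                          sub_seq_length: int,
                          local_rank: int,
                          group=None) -> torch.Tensor:
    # grad_output: gradient of the attention scores, (batch, sub_seq, full_seq)
    # q: this rank's Q sub-sequence, (batch, sub_seq, dim)

    # 1) compute the grad locally: each rank only sees its own Q sub-sequence,
    #    so this is a partial contribution to the full-sequence grad of K.
    grad_k_full = torch.matmul(grad_output.transpose(-2, -1), q)  # (batch, full_seq, dim)

    # 2) reduce the grad: summing the partial contributions over ranks gives
    #    the gradient of the full-sequence K.
    dist.all_reduce(grad_k_full, group=group)

    # 3) index the grad: keep only this rank's K sub-sequence out of the
    #    previously gathered full sequence.
    start = local_rank * sub_seq_length
    return grad_k_full[:, start:start + sub_seq_length]
```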
For the ring all-reduce question, it is actually already implemented in sequence parallelism, right?
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/communication/ring.py#L11
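As I understand that file, the ring exchange passes each rank's block to its neighbour, roughly like the following (a simplified sketch; ring_pass and the global-rank peer handling are my simplifications, not the actual helper):

```python
import torch
import torch.distributed as dist

def ring_pass(send_tensor: torch.Tensor) -> torch.Tensor:
    # Send this rank's block to the next rank in the ring and receive the
    # previous rank's block; after world_size - 1 such steps, every rank
    # has seen every block.
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    recv_tensor = torch.empty_like(send_tensor)

    ops = [
        dist.P2POp(dist.isend, send_tensor, (rank + 1) % world_size),
        dist.P2POp(dist.irecv, recv_tensor, (rank - 1) % world_size),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_tensor
```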
Yep, your reasoning makes sense to me. As for ring all-reduce, I believe it is a potential improvement. If you are interested in contributing to this project, you could benchmark the performance improvement and let us know if you have any findings. Thank you!
Hi @GeneZC, you are welcome to share your findings in a new issue or discussion. This issue was closed due to inactivity. Thanks.