ColossalAI
The division by the local world size in sequence parallelism
I understand that grad_k should be the reduced sum over the local ranks, since each rank only holds a sub-sequence of Q.
However, I do not quite understand why a division by the local world size is needed in the following:
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L63
By comparison, column-wise tensor parallelism does not introduce any such division after the reduce-sum. What is the difference?
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_1d/_utils.py#L36
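For reference, a minimal sketch of what that plain reduce-sum amounts to (the function name is mine, not the actual API):

```python
import torch
import torch.distributed as dist

def reduce_sum_across_ranks(grad: torch.Tensor, group=None) -> torch.Tensor:
    # Plain reduce-sum: each rank contributes its partial gradient and gets
    # back the sum over all ranks; no division by the world size follows.
    if dist.is_initialized() and dist.get_world_size(group) > 1:
        dist.all_reduce(grad, group=group)
    return grad
```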
An additional (perhaps not that related) question: why not use ring all-reduce in the following?
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L61
Hi, thanks for your questions!
Firstly, gradients are summed up by the all_reduce() operation and need to be divided by the number of ranks to get the mean value (of course, you could set op=ReduceOp.AVG directly); a sketch of this sum-then-divide pattern follows below.
Secondly, that step is merely a thin wrapper around torch's all_reduce, so it only needs to perform the reduce-sum.
Thirdly, this is probably because torch did not have an API for ring all-reduce at the time this script was written, but it is definitely possible.
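A rough sketch of the sum-then-divide pattern described in the first point (illustrative code, not the actual implementation; all_reduce_mean is a made-up name):

```python
import torch
import torch.distributed as dist

def all_reduce_mean(grad: torch.Tensor, group=None) -> torch.Tensor:
    # all_reduce with SUM leaves the gradient multiplied by the number of
    # contributing ranks, so divide afterwards to recover the mean.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
    grad /= dist.get_world_size(group)
    # Recent torch versions (with NCCL) can average directly instead:
    # dist.all_reduce(grad, op=dist.ReduceOp.AVG, group=group)
    return grad
```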
On the averaging question, I am actually wondering why averaging is correct rather than summing; there must be something I am not taking into account.
Here is my rationale, which is very similar to a reduce-scatter op (a rough sketch follows the list):
- compute the grad locally
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L59
- reduce the grad, since the Q being used is only a sub-sequence
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L61
- index the grad, since the K being used is a sub-sequence of the previously gathered full sequence
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/nn/layer/parallel_sequence/_operation.py#L62
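In code, the pattern I have in mind looks roughly like this (a sketch only; the names, the (batch, seq, dim) tensor shapes, and the absence of any division are my assumptions for the argument, not the actual implementation):

```python
import torch
import torch.distributed as dist

def ring_qk_grad_k_sketch(grad_output: torch.Tensor,
                          q: torch.Tensor,
                          sub_seq_length: int,
                          local_rank: int,
                          group=None) -> torch.Tensor:
    # grad_output: gradient of the attention scores, (batch, sub_seq, full_seq)
    # q: this rank's Q sub-sequence, (batch, sub_seq, dim)

    # 1) compute the grad locally: each rank only sees its own Q sub-sequence,
    #    so this is a partial contribution to the full-sequence grad of K.
    grad_k_full = torch.matmul(grad_output.transpose(-2, -1), q)  # (batch, full_seq, dim)

    # 2) reduce the grad: summing the partial contributions over ranks gives
    #    the gradient of the full-sequence K.
    dist.all_reduce(grad_k_full, group=group)

    # 3) index the grad: keep only this rank's K sub-sequence out of the
    #    previously gathered full sequence.
    start = local_rank * sub_seq_length
    return grad_k_full[:, start:start + sub_seq_length]
```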
For the ring all-reduce question, it is actually already implemented in sequence parallelism, right?
https://github.com/hpcaitech/ColossalAI/blob/8fedc8766a7fc0c072337ac348b02b5da1037861/colossalai/communication/ring.py#L11
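As I understand that file, the ring exchange passes each rank's block to its neighbour, roughly like the following (a simplified sketch; ring_pass and the global-rank peer handling are my simplifications, not the actual helper):

```python
import torch
import torch.distributed as dist

def ring_pass(send_tensor: torch.Tensor) -> torch.Tensor:
    # Send this rank's block to the next rank in the ring and receive the
    # previous rank's block; after world_size - 1 such steps, every rank
    # has seen every block.
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    recv_tensor = torch.empty_like(send_tensor)

    ops = [
        dist.P2POp(dist.isend, send_tensor, (rank + 1) % world_size),
        dist.P2POp(dist.irecv, recv_tensor, (rank - 1) % world_size),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_tensor
```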
Yep, your reasoning makes sense to me. As for ring all-reduce, I believe it is a potential improvement. If you are interested in contributing to this project, you could benchmark the performance improvement and let us know if you have any findings. Thank you!
Hi @GeneZC, you are welcome to share your findings in a new issue or discussion. This issue was closed due to inactivity. Thanks.