[Question] Why can tensor parallel communication/GEMM overlap happen only when sequence parallelism is enabled?
In Megatron, I found this check tying tp_comm_overlap to sequence_parallel:
```python
if args.tp_comm_overlap:
    assert args.sequence_parallel == True, \
        'Tensor parallel communication/GEMM overlap can happen only when sequence parallelism is enabled'
```
But why?
That is because we currently only support overlapping AllGather/ReduceScatter with GEMM, and those are the collectives used when sequence parallelism is enabled; without sequence parallelism, tensor parallelism performs an AllReduce instead, which is not overlapped. With sequence parallelism, the column-parallel linear is preceded by an AllGather of sequence-sharded activations and the row-parallel linear is followed by a ReduceScatter, and both of those decompose naturally into chunks that can be pipelined against slices of the GEMM.
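To make the pipelining idea concrete, here is a minimal PyTorch sketch, purely my own illustration and not TransformerEngine's actual userbuffers-based implementation: the row-parallel weight is split along the output-feature dimension, and each chunk's partial product is reduce-scattered asynchronously while the next chunk's GEMM runs. The function name, chunking scheme, and `num_chunks` parameter are assumptions for illustration.

```python
# Illustrative sketch only -- not TE's implementation.
import torch
import torch.distributed as dist

def row_parallel_linear_rs_overlap(x, w, num_chunks=4):
    """Row-parallel GEMM whose ReduceScatter overlaps with compute.

    x: [seq, in_features // tp]           local input shard (seq divisible by tp)
    w: [in_features // tp, out_features]  local weight shard
    Returns the sequence-parallel output shard: [seq // tp, out_features].
    """
    tp = dist.get_world_size()
    seq = x.shape[0]
    outs, handles = [], []
    # Split the GEMM along the output-feature dimension: each chunk's
    # partial sum can start reduce-scattering while the next chunk computes.
    for wc in w.chunk(num_chunks, dim=1):
        partial = (x @ wc).contiguous()          # [seq, out_chunk]
        out = torch.empty(seq // tp, wc.shape[1],
                          device=x.device, dtype=x.dtype)
        handles.append(dist.reduce_scatter_tensor(out, partial, async_op=True))
        outs.append(out)
    for h in handles:                            # drain outstanding comms
        h.wait()
    return torch.cat(outs, dim=1)                # [seq // tp, out_features]
```

The same decomposition works in the other direction for the AllGather that precedes a column-parallel GEMM (gather the next activation chunk while multiplying the current one). A plain AllReduce, by contrast, produces the full `[seq, out_features]` result on every rank rather than sequence-sharded pieces, so the overlap support in TE is tied to the AllGather/ReduceScatter path that sequence parallelism enables.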