[Question] Why can tensor parallel communication/GEMM overlap happen only when sequence parallelism is enabled?
In Megatron, I found this check tying tp_comm_overlap to sequence_parallel:
```python
if args.tp_comm_overlap:
    assert args.sequence_parallel == True, \
        'Tensor parallel communication/GEMM overlap can happen only when sequence parallelism is enabled'
```
But why?
That is because we currently only support overlapping AllGather/ReduceScatter with GEMM, and those are the collectives used when sequence parallelism is enabled; without sequence parallelism, tensor parallelism performs an AllReduce instead, which is not overlapped. With sequence parallelism, the column-parallel linear is preceded by an AllGather of sequence-sharded activations and the row-parallel linear is followed by a ReduceScatter, and both of those decompose naturally into chunks that can be pipelined against slices of the GEMM.
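To make the pipelining idea concrete, here is a minimal PyTorch sketch, purely my own illustration and not TransformerEngine's actual userbuffers-based implementation: the row-parallel weight is split along the output-feature dimension, and each chunk's partial product is reduce-scattered asynchronously while the next chunk's GEMM runs. The function name, chunking scheme, and `num_chunks` parameter are assumptions for illustration.

```python
# Illustrative sketch only -- not TE's implementation.
import torch
import torch.distributed as dist

def row_parallel_linear_rs_overlap(x, w, num_chunks=4):
    """Row-parallel GEMM whose ReduceScatter overlaps with compute.

    x: [seq, in_features // tp]           local input shard (seq divisible by tp)
    w: [in_features // tp, out_features]  local weight shard
    Returns the sequence-parallel output shard: [seq // tp, out_features].
    """
    tp = dist.get_world_size()
    seq = x.shape[0]
    outs, handles = [], []
    # Split the GEMM along the output-feature dimension: each chunk's
    # partial sum can start reduce-scattering while the next chunk computes.
    for wc in w.chunk(num_chunks, dim=1):
        partial = (x @ wc).contiguous()          # [seq, out_chunk]
        out = torch.empty(seq // tp, wc.shape[1],
                          device=x.device, dtype=x.dtype)
        handles.append(dist.reduce_scatter_tensor(out, partial, async_op=True))
        outs.append(out)
    for h in handles:                            # drain outstanding comms
        h.wait()
    return torch.cat(outs, dim=1)                # [seq // tp, out_features]
```

The same decomposition works in the other direction for the AllGather that precedes a column-parallel GEMM (gather the next activation chunk while multiplying the current one). A plain AllReduce, by contrast, produces the full `[seq, out_features]` result on every rank rather than sequence-sharded pieces, so the overlap support in TE is tied to the AllGather/ReduceScatter path that sequence parallelism enables.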