Megatron-LM
use _all_gather_base instead of all_gather
With all_gather, the output is a list of tensors. In torch's implementation, that list is flattened into a temporary contiguous buffer for the collective and then unflattened back into the list entries, which costs an additional GPU memory allocation and additional device-to-device (D2D) copies. In these call sites the all-gather output already lives in one flat GPU buffer, so replacing all_gather with _all_gather_base, which writes directly into a single output tensor, saves both the extra allocation and the extra D2D copies.
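To make the difference concrete, here is a minimal sketch of the two call patterns. It is not the code from this PR: the helper names (`gathered_shape`, `gather_with_list`, `gather_flat`) are hypothetical, and it assumes a process group has already been initialized (typically with the NCCL backend) and that each rank contributes a tensor of the same shape with at least one dimension.

```python
import torch
import torch.distributed as dist


def gathered_shape(shape, world_size):
    """Shape of the flat output: per-rank chunks stacked along dim 0."""
    return (shape[0] * world_size, *shape[1:])


def gather_with_list(local, group=None):
    """List-based pattern: all_gather into views of one flat buffer."""
    world_size = dist.get_world_size(group=group)
    output = torch.empty(gathered_shape(local.shape, world_size),
                         dtype=local.dtype, device=local.device)
    # chunk() returns views, so every list entry points into `output`.
    # all_gather only sees a list of tensors, so it may still stage the
    # data in a temporary flat buffer and copy it back out.
    chunks = list(output.chunk(world_size, dim=0))
    dist.all_gather(chunks, local.contiguous(), group=group)
    return output


def gather_flat(local, group=None):
    """Flat pattern: the collective writes straight into `output`."""
    world_size = dist.get_world_size(group=group)
    output = torch.empty(gathered_shape(local.shape, world_size),
                         dtype=local.dtype, device=local.device)
    # Newer PyTorch releases expose this collective publicly as
    # all_gather_into_tensor; older ones only have _all_gather_base.
    gather = getattr(dist, "all_gather_into_tensor", None) or dist._all_gather_base
    gather(output, local.contiguous(), group=group)
    return output
```

Both helpers return a tensor with the same layout (rank i's data occupies the i-th chunk along dim 0), so under these assumptions the flat variant can be swapped in wherever the output was already a single contiguous buffer.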
Marking as stale. No activity in 60 days.