taozhiwei

Results: 2 issues by taozhiwei

When using allgather, the output is a list of tensors. In torch's implementation, this list is flattened and then unflattened, which results in an additional GPU memory allocation...
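To put a rough number on that extra allocation, here is a back-of-the-envelope sketch. It assumes the temporary flattened buffer holds one shard per rank and ignores allocator rounding and caching; the function name `extra_allgather_bytes` and the example sizes are hypothetical and not taken from the issue.

```python
def extra_allgather_bytes(numel_per_rank: int, element_size: int, world_size: int) -> int:
    """Estimate the size of the temporary flattened output buffer.

    Assumption: gathering into a list of tensors stages the result in one
    contiguous buffer of world_size shards before copying each shard back
    into the list entries. Allocator rounding is ignored.
    """
    if numel_per_rank < 0 or element_size <= 0 or world_size <= 0:
        raise ValueError("sizes must be positive")
    return numel_per_rank * element_size * world_size


if __name__ == "__main__":
    # Hypothetical example: 1Mi fp16 elements per rank, gathered across 8 ranks.
    extra = extra_allgather_bytes(1024 * 1024, 2, 8)
    print(f"temporary buffer: {extra / 2**20:.0f} MiB")
```

Under this assumption, the overhead scales linearly with the number of ranks, which is why gathering directly into a single preallocated output tensor can matter for large shards.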

stale

[https://arxiv.org/abs/2406.06858v1](https://arxiv.org/abs/2406.06858v1) [https://github.com/bytedance/flux](https://github.com/bytedance/flux) Is Megatron planning to use Flux? It fuses communication and GEMM into a single operator to improve the overlap rate.