2 issues of taozhiwei
When using all-gather, the output is a list of tensors. In the torch implementation, that list is flattened and then unflattened, which results in additional GPU memory allocation...
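The extra allocation described above can be illustrated with a toy model. This is only a sketch in plain Python using `bytearray` in place of GPU tensors. It is not torch's actual code, and the function names `all_gather_list` and `all_gather_flat` are hypothetical. It assumes the list-style path stages data in one flat buffer and then copies it out into each list element, while the flat path writes into a single preallocated buffer and exposes per-rank views.

```python
# Toy model (NOT torch's implementation): compare bytes allocated by an
# all-gather that returns a list versus one that fills a single flat buffer.

def all_gather_list(chunks):
    """List-style output: stage into a flat buffer, then copy out per rank."""
    chunk_size = len(chunks[0])
    outputs = [bytearray(chunk_size) for _ in chunks]  # caller's output list
    flat = bytearray(chunk_size * len(chunks))         # extra staging buffer
    for rank, chunk in enumerate(chunks):              # stands in for the collective
        flat[rank * chunk_size:(rank + 1) * chunk_size] = chunk
    for rank, out in enumerate(outputs):               # "unflatten" copy
        out[:] = flat[rank * chunk_size:(rank + 1) * chunk_size]
    allocated = len(flat) + sum(len(o) for o in outputs)
    return outputs, allocated


def all_gather_flat(chunks):
    """Flat output: one preallocated buffer, per-rank views without copies."""
    chunk_size = len(chunks[0])
    flat = bytearray(chunk_size * len(chunks))         # the only allocation
    for rank, chunk in enumerate(chunks):              # stands in for the collective
        flat[rank * chunk_size:(rank + 1) * chunk_size] = chunk
    view = memoryview(flat)
    outputs = [view[r * chunk_size:(r + 1) * chunk_size] for r in range(len(chunks))]
    return outputs, len(flat)


# Three hypothetical ranks, each contributing a 4-byte chunk.
chunks = [bytes([rank]) * 4 for rank in range(3)]
list_out, list_bytes = all_gather_list(chunks)
flat_out, flat_bytes = all_gather_flat(chunks)
```

In this model the list-style path allocates twice as many bytes as the flat path for the same gathered data. In torch, `torch.distributed.all_gather_into_tensor` takes a single output tensor rather than a list. Whether it avoids the staging copy in a given setup depends on the backend, so treat that as an assumption to verify.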
stale
[https://arxiv.org/abs/2406.06858v1](https://arxiv.org/abs/2406.06858v1) [https://github.com/bytedance/flux](https://github.com/bytedance/flux) Is Megatron planning to use the flux technique? It integrates communication and GEMM into one operator to improve the overlap rate.