
MPI Dependency for Computation-Communication Overlapping in Tensor Parallelism

Open zhipeng93 opened this issue 11 months ago • 1 comments

Hi,

I've noticed that you have implemented a feature that allows computation and communication to overlap in tensor-parallel operations. This is a significant enhancement that can increase efficiency in distributed training workflows.

However, when I deployed jobs with torchrun on a Kubernetes (k8s) cluster, this overlapping feature did not work as expected. The current implementation appears to depend on MPI for certain initialization procedures, which is not readily compatible with a k8s environment.
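To illustrate the mismatch: mpirun (Open MPI) and torchrun advertise rank and world size through different environment variables, so a bootstrap that only reads the MPI variables finds nothing when the job is launched with torchrun on k8s. Here is a minimal, hypothetical sketch of launcher detection; `detect_launcher` is not part of Transformer Engine, just an illustration of the two environments:

```python
def detect_launcher(env):
    """Guess which launcher started this process from its environment.

    mpirun (Open MPI) exports OMPI_COMM_WORLD_RANK / OMPI_COMM_WORLD_SIZE,
    while torchrun exports RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
    A bootstrap that only checks the MPI variables will report "unknown"
    (or fail outright) under torchrun, even though a perfectly usable
    TCP rendezvous is described by the torchrun variables.
    """
    if "OMPI_COMM_WORLD_RANK" in env:
        return "mpi"
    if {"RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"} <= env.keys():
        return "torchrun"
    return "unknown"
```

In a torchrun pod, `detect_launcher(os.environ)` would return `"torchrun"`, and the rank/address information needed for initialization is already present without any MPI runtime.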

Given the growing trend of containerized deployments and the adoption of Kubernetes for distributed jobs, I was wondering if there are any plans to abstract away or remove the MPI dependency for this feature.

Thanks!

zhipeng93 avatar Mar 04 '24 08:03 zhipeng93

@denera is currently working on lifting the MPI requirement for that overlap.

ptrendx avatar Mar 26 '24 02:03 ptrendx