MPI Dependency for Computation-Communication Overlapping in Tensor Parallelism
Hi,
I've noticed that you have implemented a feature that allows computation and communication to be overlapped in tensor-parallel operations. This is a significant enhancement with the potential to improve the efficiency of distributed training workflows.
However, when deploying jobs with torchrun on a Kubernetes (k8s) cluster, I found that this overlapping feature does not work as expected. The current implementation appears to depend on MPI for some of its initialization, and no MPI environment is available when jobs are launched with torchrun in this setup.
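For context, here is roughly how I'm trying to enable the overlap under torchrun. This is a minimal sketch: the exact signature and module path of `initialize_ub` may differ across TE releases, the `ub_*` kwargs on the TE layers vary between versions (so I've omitted them), and the shapes are purely illustrative.

```python
# repro.py -- launched with: torchrun --nproc_per_node=<num_gpus> repro.py
import os

import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

# torchrun sets RANK / WORLD_SIZE / LOCAL_RANK, so NCCL init works without MPI.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

tp_size = dist.get_world_size()
seq_len, batch_size, hidden_size = 2048, 2, 4096  # illustrative sizes

# Enable the comm+GEMM overlap (userbuffers) path. In the release I'm testing,
# this call seems to bootstrap its communicators through MPI internally, so it
# fails under torchrun, which does not initialize an MPI environment.
te.module.base.initialize_ub(
    shape=[seq_len * batch_size, hidden_size],  # global shape of the comm buffer
    tp_size=tp_size,
)
```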
Given the growing trend of containerized deployments and the adoption of Kubernetes for distributed jobs, I was wondering if there are any plans to abstract away or remove the MPI dependency for this feature.
Thanks!
@denera is currently working on lifting the MPI requirement for that overlap.