# RFC-0030: Consolidate TorchElastic and TorchX
TorchX was originally created to help PyTorch users in OSS run their PyTorch applications on widely adopted infrastructure setups and schedulers. Today TorchX supports most AI infra setups that use SLURM, Kubernetes, Ray, Batch services (AWS, GCP, and Azure), and Kubernetes-MCAD (IBM). In recent months we've seen TorchX gain traction, as evidenced by several blog posts detailing how to run PyTorch on various platforms using TorchX:
- How to run PyTorch on Vertex AI using TorchX
- Large-scale distributed training with TorchX and Ray
- Scaling distributed training with AWS Trainium and Amazon EKS
- Rapidly deploy PyTorch applications on Batch using TorchX
While TorchX launches PyTorch jobs onto local and remote schedulers, TorchElastic (aka torchrun) is responsible for launching the PyTorch processes (ranks) within a job. From the user's perspective, however, both tools run their training script and appear to overlap in functionality.
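To make the overlap concrete, here is a minimal sketch of how the same training script is launched with each tool today. It assumes `torchrun` and `torchx` are installed and a local `train.py` exists; the flags shown are illustrative and version-dependent.

```python
import subprocess

# TorchElastic (torchrun): spawns and supervises the worker processes (ranks)
# for a single job on this machine.
subprocess.run(
    ["torchrun", "--nproc_per_node", "4", "train.py"],
    check=True,
)

# TorchX: submits the same script as a job to a scheduler (here the local
# scheduler; the same command can target SLURM, Kubernetes, Ray, etc.).
subprocess.run(
    ["torchx", "run", "--scheduler", "local_cwd",
     "dist.ddp", "-j", "1x4", "--script", "train.py"],
    check=True,
)
```

Both invocations take the same `train.py`, which is why users perceive the tools as overlapping even though one launches processes and the other launches jobs.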
This RFC proposes that:
- We consolidate TorchElastic and TorchX into a single module
- That we do so by:
  - Upstreaming TorchX as `torch.x` (under a new submodule called `x`)
  - Pulling `torch.distributed.elastic` and putting it under `torch.x.run` (see the sketch after this list)
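As a rough illustration of the proposed layout, the import surface would change roughly as follows. The `torch.x` names below do not exist today; they are the proposal itself, not a shipped API.

```python
# Today: two packages with overlapping responsibilities.
import torchx                      # job launcher onto local and remote schedulers
import torch.distributed.elastic   # process (rank) launcher, i.e. torchrun

# Proposed: one consolidated namespace (illustrative, pending this RFC).
# import torch.x       # TorchX upstreamed under a new submodule `x`
# import torch.x.run   # torch.distributed.elastic relocated here
```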
cc @soumith, @msaroufim, @d4l3k, @kurman, @priyaramani