
RFC-0030 Consolidate TorchElastic and TorchX

Open · kiukchung opened this issue on Apr 18, 2023 · 1 comment

TorchX was originally created to help OSS PyTorch users run their PyTorch applications on widely adopted infrastructure setups and schedulers. Today, TorchX supports most AI infra setups that use SLURM, Kubernetes, Ray, Batch services (AWS, GCP, and Azure), and Kubernetes-MCAD (IBM). In recent months we've seen TorchX gain traction, as evidenced by several blog posts detailing how to run PyTorch on a particular platform using TorchX:

  1. How to run PyTorch on Vertex AI using TorchX
  2. Large-scale distributed training with TorchX and Ray
  3. Scaling distributed training with AWS Trainium and Amazon EKS
  4. Rapidly deploy PyTorch applications on Batch using TorchX

While TorchX launches PyTorch jobs onto local and remote schedulers, TorchElastic (aka torchrun) is responsible for launching the PyTorch processes (ranks) within a job. From the user's perspective, however, both tools run their training scripts and appear to overlap in functionality.
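To make that overlap concrete, below is a minimal sketch (assuming a single-node setup; `train.py` and its `main()` function are hypothetical placeholders for a user's training script) of how the two tools are typically invoked today: torchrun / torch.distributed.elastic spawns the worker processes on a host, while TorchX submits the same script as a job to a local or remote scheduler.

```python
# Minimal sketch; `train.py` / `main()` are hypothetical placeholders.

# TorchElastic (torchrun) launches the worker processes (ranks) on a host:
#   $ torchrun --nnodes=1 --nproc-per-node=4 train.py
# or, roughly equivalently, via its Python API:
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def main():
    ...  # the user's training code (DDP setup, training loop, etc.)

if __name__ == "__main__":
    config = LaunchConfig(
        min_nodes=1, max_nodes=1, nproc_per_node=4, rdzv_backend="c10d"
    )
    elastic_launch(config, main)()  # spawns 4 local worker processes

# TorchX submits the same script as a *job* to a local or remote scheduler:
#   $ torchx run --scheduler local_cwd dist.ddp -j 1x4 --script train.py
# The dist.ddp component in turn wraps the script with torchrun on each node,
# which is where the apparent overlap between the two tools comes from.
```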

This RFC proposes that:

  1. We consolidate TorchElastic and TorchX into a single module
  2. We do so by:
    1. Upstreaming TorchX as torch.x (under a new submodule called x)
    2. Pulling torch.distributed.elastic out and putting it under torch.x.run (see the sketch below)
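
Purely as an illustration of the proposed layout (nothing below exists in PyTorch today; the module names are simply the ones proposed above), end-user code might eventually look like this:

```python
# Hypothetical sketch only: illustrates the namespace proposed by this RFC.
# Neither torch.x nor torch.x.run exists in PyTorch today.

# (1) TorchX upstreamed as torch.x -- today this is `import torchx`
import torch.x as tx                      # hypothetical

# (2) torch.distributed.elastic moved under torch.x.run -- today the launcher
#     API lives in torch.distributed.launcher.api
from torch.x.run import elastic_launch    # hypothetical
```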

kiukchung · Apr 18, 2023

cc @soumith, @msaroufim, @d4l3k, @kurman, @priyaramani

kiukchung · Apr 19, 2023