# RFC-0030: Consolidate TorchElastic and TorchX
TorchX was originally created to help PyTorch users in OSS run their PyTorch applications on widely adopted infrastructure setups and schedulers. Today TorchX supports most AI infra setups that use SLURM, Kubernetes, Ray, Batch services (AWS, GCP, and Azure), and Kubernetes-MCAD (IBM). In recent months we've seen TorchX gain traction, as evidenced by several blog posts detailing how to run PyTorch on various platforms using TorchX:
- How to run PyTorch on Vertex AI using TorchX
- Large-scale distributed training with TorchX and Ray
- Scaling distributed training with AWS Trainium and Amazon EKS
- Rapidly deploy PyTorch applications on Batch using TorchX
While TorchX launches PyTorch jobs onto local and remote schedulers, TorchElastic (aka torchrun) is responsible for launching the PyTorch processes (ranks) within a job. From the user's perspective, however, both tools run their training script and appear to overlap in functionality.
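To make the overlap concrete, here is a minimal sketch of how the same training script is launched with each tool today. It assumes `torchrun` and `torchx` are installed and a local `train.py` exists; the flags shown are illustrative and version-dependent.

```python
import subprocess

# TorchElastic (torchrun): spawns and supervises the worker processes (ranks)
# for a single job on this machine.
subprocess.run(
    ["torchrun", "--nproc_per_node", "4", "train.py"],
    check=True,
)

# TorchX: submits the same script as a job to a scheduler (here the local
# scheduler; the same command can target SLURM, Kubernetes, Ray, etc.).
subprocess.run(
    ["torchx", "run", "--scheduler", "local_cwd",
     "dist.ddp", "-j", "1x4", "--script", "train.py"],
    check=True,
)
```

Both invocations take the same `train.py`, which is why users perceive the tools as overlapping even though one launches processes and the other launches jobs.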
This RFC proposes that:
- We consolidate TorchElastic and TorchX into a single module
- That we do so by:
  - Upstreaming TorchX as `torch.x` (under a new submodule called `x`)
  - Pulling `torch.distributed.elastic` and putting it under `torch.x.run` (see the sketch after this list)
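As a rough illustration of the proposed layout, the import surface would change roughly as follows. The `torch.x` names below do not exist today; they are the proposal itself, not a shipped API.

```python
# Today: two packages with overlapping responsibilities.
import torchx                      # job launcher onto local and remote schedulers
import torch.distributed.elastic   # process (rank) launcher, i.e. torchrun

# Proposed: one consolidated namespace (illustrative, pending this RFC).
# import torch.x       # TorchX upstreamed under a new submodule `x`
# import torch.x.run   # torch.distributed.elastic relocated here
```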
cc @soumith, @msaroufim, @d4l3k, @kurman, @priyaramani