torchelastic Rendezvous Backend

Open • d4l3k opened this issue 7 months ago • 2 comments

We want to use torchft's fast Lighthouse quorum implementation to provide faster dynamic rendezvous for torchelastic.

Torchelastic has an entry-point-based mechanism for registering new backends: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/registry.py#L63-L64

Key features we want to support:

  • flexible lighthouse config: support for an external lighthouse, plus automatically starting a lighthouse from the address, similar to c10d's TCPStore
  • scale up / scale down operations
  • hot spares for fast restarts
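To make the first item more concrete, here is a minimal sketch of how the endpoint could decide between connecting to an external lighthouse and starting one locally. It mirrors the c10d idea of hosting the store when the endpoint's hostname refers to the local machine, with an explicit override. Everything here (`LighthousePlan`, `plan_lighthouse`, the `is_host` parameter, the default port 29510) is a hypothetical illustration, not an existing torchft or torchelastic API.

```python
import socket
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class LighthousePlan:
    """Hypothetical result type: where the lighthouse lives and who starts it."""
    host: str
    port: int
    start_local: bool  # True: this node starts the lighthouse itself


def _local_names() -> set:
    """Names that are treated as referring to this machine."""
    names = {"localhost", "127.0.0.1", "::1", socket.gethostname(), socket.getfqdn()}
    return {name.lower() for name in names}


def plan_lighthouse(
    endpoint: str,
    default_port: int = 29510,  # assumed default, adjust as needed
    is_host: Optional[bool] = None,
) -> LighthousePlan:
    """Parse `host[:port]` (optionally with a scheme) and decide who hosts.

    If `is_host` is given it wins; otherwise this node hosts the lighthouse
    only when the endpoint's hostname matches a local name.
    """
    parts = urlsplit(endpoint if "//" in endpoint else f"//{endpoint}")
    if not parts.hostname:
        raise ValueError(f"endpoint has no host: {endpoint!r}")
    port = parts.port or default_port
    if is_host is None:
        is_host = parts.hostname in _local_names()
    return LighthousePlan(parts.hostname, port, is_host)


if __name__ == "__main__":
    print(plan_lighthouse("localhost:29510"))
    print(plan_lighthouse("http://lighthouse.example.com:8080", is_host=False))
```

With this shape, an external lighthouse is just an endpoint that does not resolve to the local machine (or `is_host=False`), and the automatic case needs no extra configuration on the node that owns the address.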

References:

  • https://packaging.python.org/en/latest/specifications/entry-points/
  • https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/registry.py#L63-L64
  • https://pytorch.org/docs/stable/elastic/rendezvous.html
  • c10d rendezvous https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py#L214

d4l3k · Apr 25 '25 21:04