torchelastic Rendezvous Backend
We want to use torchft's fast Lighthouse quorum implementation to provide faster dynamic rendezvous for torchelastic.
torchelastic has an entry-point-based mechanism for registering new rendezvous backends: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/registry.py#L63-L64
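A minimal sketch of how such a registration could look from the plugin side, using only the standard library so it runs without torch. It assumes the entry-point group is `torchrun.handlers` and that each entry point resolves to a zero-argument function returning the handler creator; both should be verified against the linked `registry.py`. The names `create_handler`, `get_handler`, `load_handlers` and the module name `torchft_rdzv_sketch` are hypothetical.

```python
import sys
import types
from importlib.metadata import EntryPoint
from typing import Any, Callable, Dict

# Assumed entry-point group; verify against the registry.py lines linked above.
HANDLER_GROUP = "torchrun.handlers"


def create_handler(params: Any) -> str:
    """Hypothetical creator; a real one would return a RendezvousHandler."""
    return f"lighthouse-handler({params!r})"


def get_handler() -> Callable[[Any], Any]:
    """Entry-point target: returns the creator function to register."""
    return create_handler


def load_handlers(entry_points: list) -> Dict[str, Callable[[Any], Any]]:
    """Mimic discovery: load each entry point and call it to get the creator."""
    creators: Dict[str, Callable[[Any], Any]] = {}
    for ep in entry_points:
        if ep.group == HANDLER_GROUP:
            creators[ep.name] = ep.load()()
    return creators


# Stand-in for an installed package: expose get_handler under a module name
# so EntryPoint.load() can import it without any packaging step.
_module = types.ModuleType("torchft_rdzv_sketch")
_module.get_handler = get_handler
sys.modules[_module.__name__] = _module

ep = EntryPoint(
    name="torchft",
    value="torchft_rdzv_sketch:get_handler",
    group=HANDLER_GROUP,
)
handlers = load_handlers([ep])
```

In a real package the `EntryPoint` would not be built by hand; it would come from an entry in the project's packaging metadata under the same group, and the backend would then be selected by the entry-point name.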
Key features we want to support:
- flexible lighthouse config: support for an external lighthouse, plus automatically starting a lighthouse from the given address, similar to how c10d starts its TCPStore
- scale up / scale down operations
- hot spares for fast restarts
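
For the flexible lighthouse config, one possible shape of the "external vs. auto-start" decision is sketched below. This is an illustration under stated assumptions, not torchft's API: `parse_endpoint`, `is_local_host`, `plan_lighthouse`, `LighthousePlan` and the default port are all hypothetical. The idea mirrors what the c10d backend appears to do for its TCPStore, where a node hosts the store when the endpoint's host refers to the local machine.

```python
import socket
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical default port for this sketch; not a torchft constant.
DEFAULT_LIGHTHOUSE_PORT = 29510


def parse_endpoint(
    endpoint: str, default_port: int = DEFAULT_LIGHTHOUSE_PORT
) -> Tuple[str, int]:
    """Split 'host[:port]' into (host, port). IPv6 literals are not handled."""
    endpoint = endpoint.strip()
    host, sep, port = endpoint.rpartition(":")
    if not sep:
        # No colon present: the whole string is the host.
        return endpoint, default_port
    if not host or not port.isdigit():
        raise ValueError(f"invalid lighthouse endpoint: {endpoint!r}")
    return host, int(port)


def is_local_host(host: str) -> bool:
    """Rough check for whether `host` names this machine."""
    return host in {"localhost", "127.0.0.1", socket.gethostname()}


@dataclass(frozen=True)
class LighthousePlan:
    host: str
    port: int
    start_local: bool  # True: this node starts the lighthouse itself


def plan_lighthouse(endpoint: str, external: Optional[bool] = None) -> LighthousePlan:
    """Decide whether to start a lighthouse locally or connect to an external one.

    `external=None` means "infer from the address"; an explicit value overrides it.
    """
    host, port = parse_endpoint(endpoint)
    start_local = is_local_host(host) if external is None else not external
    return LighthousePlan(host, port, start_local)
```

With this shape, `plan_lighthouse("localhost:1234")` would ask the local node to start the lighthouse, while passing `external=True` forces a connection to an already running one regardless of the address.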
References:
- https://packaging.python.org/en/latest/specifications/entry-points/
- https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/registry.py#L63-L64
- https://pytorch.org/docs/stable/elastic/rendezvous.html
- c10d rendezvous https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py#L214