rendezvous: _matches_machine_hostname doesn't resolve hostnames fully

Open d4l3k opened this issue 2 years ago • 2 comments

🐛 Bug

Component (check all that applies):

  • [ ] state api
  • [ ] train_step api
  • [ ] train_loop
  • [x] rendezvous
  • [ ] checkpoint
  • [ ] rollback
  • [ ] metrics
  • [ ] petctl
  • [ ] examples
  • [ ] docker
  • [ ] other

To Reproduce

Steps to reproduce the behavior:

  1. Launch a 2 node job on Kubernetes+Volcano
  2. LOGLEVEL=INFO python -m torch.distributed.run --rdzv_backend c10d --rdzv_id 1 --rdzv_endpoint "$VC_SH_0_HOSTS" --nnodes 2 echo hello
  3. rendezvous times out since the rank 0 host doesn't realize it's the master due to insufficient hostname resolution
root@sh-db2kkt73p534vd-sh-0-0:/app# echo $VC_SH_0_HOSTS
sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd
root@sh-db2kkt73p534vd-sh-0-0:/app# hostname
sh-db2kkt73p534vd-sh-0-0
root@sh-db2kkt73p534vd-sh-0-0:/app# cat /etc/resolv.conf 
nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local us-west-2.compute.internal
options ndots:5
root@sh-db2kkt73p534vd-sh-0-0:/app# cat /etc/hosts    
# Kubernetes-managed hosts file.
127.0.0.1	localhost
::1	localhost ip6-localhost ip6-loopback
fe00::0	ip6-localnet
fe00::0	ip6-mcastprefix
fe00::1	ip6-allnodes
fe00::2	ip6-allrouters
192.168.15.246	sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd.default.svc.cluster.local	sh-db2kkt73p534vd-sh-0-0

The hostname is sh-db2kkt73p534vd-sh-0-0, but Volcano gives the address sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd. Between /etc/hosts, /etc/resolv.conf, and the hostname, there is all the information required to realize that these addresses are equivalent, but the current logic isn't sufficient.
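To illustrate why the two names are equivalent, here is a rough sketch of how a stub resolver expands a name using the `search` and `ndots` settings from resolv.conf. The helper `candidate_names` is hypothetical (it is not part of torch), and it only approximates the glibc lookup order; the values are copied from the output above.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the names a stub resolver would try for `name`, in order.

    Approximation of the glibc rule: a name with fewer than `ndots` dots
    is tried with each search domain first and as-is last; otherwise it
    is tried as-is first. A trailing dot marks the name as absolute.
    """
    if name.endswith("."):
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") < ndots:
        return expanded + [name]
    return [name] + expanded


# Values taken from the resolv.conf and $VC_SH_0_HOSTS shown above.
search = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "us-west-2.compute.internal",
]
endpoint = "sh-db2kkt73p534vd-sh-0-0.sh-db2kkt73p534vd"
candidates = candidate_names(endpoint, search, ndots=5)

# The first candidate is the fully qualified name listed in /etc/hosts
# next to the short hostname sh-db2kkt73p534vd-sh-0-0.
print(candidates[0])
```

With `ndots:5`, the endpoint has only one dot, so the search domains are tried first and the first candidate is the exact name that /etc/hosts maps to 192.168.15.246.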

https://github.com/pytorch/pytorch/blob/1b745efbe8ee0ac3bae594ea88ff27e71a734c88/torch/distributed/elastic/rendezvous/utils.py#L110

We may want to do a full DNS resolution on the address and check whether it matches any of the local IP addresses.
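A minimal sketch of that idea, using only the standard library. This is not the existing torch implementation; `resolve_ips`, `local_ips`, and `matches_local_machine` are hypothetical names, and collecting local addresses by resolving the machine's own hostname is an assumption that may miss interfaces not listed in DNS or /etc/hosts.

```python
import socket


def resolve_ips(host):
    """Return the set of IP addresses `host` resolves to (empty on failure)."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}


def local_ips():
    """Best-effort set of addresses that belong to this machine."""
    ips = {"127.0.0.1", "::1"}
    ips |= resolve_ips(socket.gethostname())
    ips |= resolve_ips(socket.getfqdn())
    return ips


def matches_local_machine(host, resolve=resolve_ips, local=local_ips):
    """True if any address `host` resolves to is a local address.

    `resolve` and `local` are injectable so the check can be exercised
    without real DNS.
    """
    return bool(resolve(host) & local())
```

In the environment above, the endpoint would resolve to 192.168.15.246 through the search domains, and the short hostname resolves to the same address through /etc/hosts, so the two sets would intersect.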

Expected behavior

The rank 0 host recognizes that the rendezvous endpoint refers to the current node and starts the c10d server.

Environment

  • torchelastic version (e.g. 0.1.0rc1):
  • OS (e.g., Linux): Linux sh-db2kkt73p534vd-sh-0-0 4.14.241-184.433.amzn2.x86_64 #1 SMP Wed Aug 4 14:35:15 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • How you installed torchelastic (conda, pip, source, docker): docker
  • Docker image and tag (if using docker): https://github.com/pytorch/torchx/pkgs/container/torchx/15644476?tag=0.1.2dev0
  • Build command you used (if compiling from source):
  • Git commit (if installed from source):
  • Python version: 3.7.11
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Execution environment (on-prem, aws, etc): EKS + Volcano
  • Any other relevant information:

Additional context

d4l3k avatar Feb 24 '22 19:02 d4l3k

is this from torch-1.10 or torchelastic-0.1.0rc1? if the former, then can you move this issue to pytorch and tag it with the "elastic" tag and assign it to me for now? thanks!

kiukchung avatar Feb 24 '22 20:02 kiukchung

1.10.0

d4l3k avatar Feb 24 '22 20:02 d4l3k