pytorch-operator
'host not found' error occurs during PyTorch distributed learning
During PyTorchJob distributed training, the worker sometimes cannot find the master and fails with the message below.
```
Traceback (most recent call last):
  File "/workspace/src/bert/benchmark.py", line 2248, in <module>
    main()
  File "/workspace/src/bert/benchmark.py", line 2212, in main
    torch.distributed.init_process_group(backend='nccl')
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known
```
In a PyTorchJob, the worker waits for the master using an `nslookup` command in an init container, as shown below, but the master may not be fully reachable even when `nslookup` succeeds.

```yaml
command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']
```
So I am using a `netcat` command instead of `nslookup`. The example below shows `netcat` still failing after `nslookup` has succeeded; in my environment, `netcat` only succeeds 4~10 seconds after `nslookup` does.
master address: `pytorch-bert-test-g16-master-0`
default port: `23456`
commands used:
- `nslookup pytorch-bert-test-g16-master-0`
- `nc -w 1 -z pytorch-bert-test-g16-master-0 23456`
```
nslookup: can't resolve 'pytorch-bert-test-g16-master-0': Name does not resolve   <-- nslookup failure
nc: bad address 'pytorch-bert-test-g16-master-0'
netcat 1                                                                          <-- netcat failure

Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local   <-- nslookup success!
netcat 1                                                                          <-- netcat failure

Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local   <-- nslookup success!
netcat 1                                                                          <-- netcat failure

(tried several times...)

Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local   <-- nslookup success!
netcat 0                                                                          <-- netcat success!
```
I suspect there is a short delay in Kubernetes between the Service being created and its Endpoints assigned, and the virtual IP actually accepting connections on the port. Could you please look into this issue?
Also, are there any plans to modify the code below so that the master port, as well as the master address, is passed as a parameter when creating the init container?
```go
// pytorch-operator/pkg/controller.v1/pytorch/pod.go
...
if !masterRole {
	masterAddr := jobcontroller.GenGeneralName(job.Name, strings.ToLower(string(pyv1.PyTorchReplicaTypeMaster)), strconv.Itoa(0))
	err := AddInitContainerForWorkerPod(podTemplate, InitContainerParam{
		MasterAddr:         masterAddr,
		InitContainerImage: pc.initContainerImage,
	})
	if err != nil {
		return err
	}
}
...
```
I ask because I am currently using a `netcat` command with a hard-coded port, since only `MasterAddr` is passed as a parameter when the init container is created.
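For illustration, a sketch of what passing the port could look like: extend the parameter struct with a `MasterPort` field and template it into a netcat-based readiness check. The `MasterPort` field, the `renderInitCommand` helper, and the command template are hypothetical, not the operator's actual API:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// InitContainerParam mirrors the operator's struct, extended with a
// hypothetical MasterPort field that the operator does not pass today.
type InitContainerParam struct {
	MasterAddr string
	MasterPort string
}

// A netcat-based wait loop that checks the port is actually open,
// instead of only resolving the name with nslookup.
const initCommand = `until nc -w 1 -z {{.MasterAddr}} {{.MasterPort}}; do echo waiting for master; sleep 2; done;`

// renderInitCommand fills the template with the master address and port.
func renderInitCommand(p InitContainerParam) (string, error) {
	t, err := template.New("init").Parse(initCommand)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	cmd, err := renderInitCommand(InitContainerParam{
		MasterAddr: "pytorch-bert-test-g16-master-0",
		MasterPort: "23456",
	})
	if err != nil {
		panic(err)
	}
	// prints: until nc -w 1 -z pytorch-bert-test-g16-master-0 23456; do echo waiting for master; sleep 2; done;
	fmt.Println(cmd)
}
```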
Best regards!
> And are there any plans to modify below code to pass the master port as a parameter as well as the master address when creating the init Container?
I think we should have it, thanks for the issue.
/kind feature