pytorch-operator
'host not found' error occurs during PyTorch distributed learning
During PyTorchJob distributed training, the worker sometimes cannot find the master and fails with the message below.
```
Traceback (most recent call last):
  File "/workspace/src/bert/benchmark.py", line 2248, in <module>
    main()
  File "/workspace/src/bert/benchmark.py", line 2212, in main
    torch.distributed.init_process_group(backend='nccl')
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known
```
In a PyTorchJob, the worker waits for the master using an `nslookup` command in an init container, as shown below, but the master may not be fully reachable even when `nslookup` succeeds.

```yaml
command: ['sh', '-c', 'until nslookup {{.MasterAddr}}; do echo waiting for master; sleep 2; done;']
```
So I am using a `netcat` command instead of `nslookup`. The example below shows `netcat` still failing after `nslookup` has succeeded; in my environment, `netcat` only succeeds 4~10 seconds after `nslookup` does.
master address: `pytorch-bert-test-g16-master-0`
default port: `23456`
commands used:
- `nslookup pytorch-bert-test-g16-master-0`
- `nc -w 1 -z pytorch-bert-test-g16-master-0 23456`
```
nslookup: can't resolve 'pytorch-bert-test-g16-master-0': Name does not resolve   <-- nslookup failure
nc: bad address 'pytorch-bert-test-g16-master-0'
netcat 1                                                                          <-- netcat failure

Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local   <-- nslookup success!
netcat 1                                                                          <-- netcat failure

Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local   <-- nslookup success!
netcat 1                                                                          <-- netcat failure

(tried several times...)

Name:      pytorch-bert-test-g16-master-0
Address 1: 172.30.0.42 172-30-0-42.pytorch-bert-test-g16-master-0.default.svc.cluster.local   <-- nslookup success!
netcat 0                                                                          <-- netcat success!
```
I suspect there is a short delay in Kubernetes between the Service being created and its Endpoints assigned, and the virtual IP actually accepting connections on the port. Could you please look into this issue?
Also, are there any plans to modify the code below so that the master port, as well as the master address, is passed as a parameter when creating the init container?
```go
// pytorch-operator/pkg/controller.v1/pytorch/pod.go
...
if !masterRole {
	masterAddr := jobcontroller.GenGeneralName(job.Name, strings.ToLower(string(pyv1.PyTorchReplicaTypeMaster)), strconv.Itoa(0))
	err := AddInitContainerForWorkerPod(podTemplate, InitContainerParam{
		MasterAddr:         masterAddr,
		InitContainerImage: pc.initContainerImage,
	})
	if err != nil {
		return err
	}
}
...
```
I ask because I am currently using a `netcat` command with a hard-coded port, since only `MasterAddr` is passed as a parameter when the init container is created.
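For illustration, a sketch of what passing the port could look like: extend the parameter struct with a `MasterPort` field and template it into a netcat-based readiness check. The `MasterPort` field, the `renderInitCommand` helper, and the command template are hypothetical, not the operator's actual API:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// InitContainerParam mirrors the operator's struct, extended with a
// hypothetical MasterPort field that the operator does not pass today.
type InitContainerParam struct {
	MasterAddr string
	MasterPort string
}

// A netcat-based wait loop that checks the port is actually open,
// instead of only resolving the name with nslookup.
const initCommand = `until nc -w 1 -z {{.MasterAddr}} {{.MasterPort}}; do echo waiting for master; sleep 2; done;`

// renderInitCommand fills the template with the master address and port.
func renderInitCommand(p InitContainerParam) (string, error) {
	t, err := template.New("init").Parse(initCommand)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	cmd, err := renderInitCommand(InitContainerParam{
		MasterAddr: "pytorch-bert-test-g16-master-0",
		MasterPort: "23456",
	})
	if err != nil {
		panic(err)
	}
	// prints: until nc -w 1 -z pytorch-bert-test-g16-master-0 23456; do echo waiting for master; sleep 2; done;
	fmt.Println(cmd)
}
```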
Best regards!
> And are there any plans to modify below code to pass the master port as a parameter as well as the master address when creating the init Container?
I think we should have it, thanks for the issue.
/kind feature