pytorch-operator
dist.init_process_group stuck
Hi, I'm trying to start distributed training with a Kubeflow PyTorchJob. However,

```
dist.init_process_group(backend="nccl", init_method='tcp://'+args.master_addr+':'+args.master_port, world_size=args.world_size, rank=args.rank)
```

doesn't work. If I use os.environ["MASTER_PORT"] as args.master_addr, it says
```
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/admin/aihub/pytorchjobDemo2.py", line 115, in main_worker
    dist.init_process_group(backend="nccl", init_method='tcp://'+args.master_addr+':'+args.master_port, world_size=args.world_size, rank=args.rank)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
ValueError: host not found: Name or service not known
```
If I use the pod IP instead (os.environ["RequestedIP"], which I think is the IP of the master pod), it just hangs there for a long time.
Do you know what I should use? Thanks!
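For context, pytorch-operator injects MASTER_ADDR and MASTER_PORT (along with WORLD_SIZE and RANK) into each pod, so the rendezvous URL can be built from those rather than hand-picked values. A minimal sketch of the URL construction, assuming those variables are present (the example hostname is taken from later in this thread and is illustrative):

```python
import os

def rendezvous_url() -> str:
    """Build the tcp:// rendezvous URL from the variables the operator sets.

    Note: the address part must be MASTER_ADDR (a hostname), not
    MASTER_PORT -- mixing them up yields
    "ValueError: host not found: Name or service not known".
    """
    addr = os.environ["MASTER_ADDR"]
    port = os.environ["MASTER_PORT"]
    return f"tcp://{addr}:{port}"

# Equivalently, init_method="env://" makes torch.distributed read the
# same MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK variables itself:
#   dist.init_process_group(backend="nccl", init_method="env://")

os.environ.setdefault("MASTER_ADDR", "demo2-224-pre-master-0")  # illustrative
os.environ.setdefault("MASTER_PORT", "23456")
print(rendezvous_url())
```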
Can you verify that the service works well in your environment?
We create a headless service and use that service name to run the training job. It seems that service discovery is broken in your cluster.
> Can you verify that the service works well in your environment?
Thanks for replying. How can I verify that?
> We create a headless service and use that service name to run the training job. It seems that service discovery is broken in your cluster.
Could you please provide more explanation? I'm not familiar with the terms... sorry about that!
https://kubernetes.io/docs/concepts/services-networking/service/#headless-services
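One concrete way to verify service discovery is a DNS lookup from inside a worker pod for the master's service name; if the name doesn't resolve, you get exactly the error in the traceback above. A minimal sketch (the service name to check, e.g. `demo2-224-pre-master-0`, is an assumption taken from this thread):

```python
import socket

def can_resolve(host: str) -> bool:
    """Return True if DNS resolves `host`; a failure here surfaces as
    "host not found: Name or service not known" in TCPStore."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

# Run inside a worker pod against the operator-created service, e.g.:
#   can_resolve("demo2-224-pre-master-0")
print(can_resolve("localhost"))  # sanity check; should print True
```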
I didn't create any service. So should I create a headless service first and then deploy the PyTorchJob YAML file? Or install an add-on like CoreDNS on the k8s cluster?
No, pytorch-operator will create such a headless service for you. But pytorch-operator does not guarantee that the service works well in your k8s cluster.
Thank you very much for the hint. When I deployed a job, it says
```
Events:
  Type     Reason                          Age              From               Message
  Warning  SettedPodTemplateRestartPolicy  6s (x2 over 6s)  PytorchController  Restart policy in pod template will be overwritten by restart policy in replica spec
  Normal   SuccessfulCreatePod             6s               PytorchController  Created pod: demo2-224-pre-master-0
  Normal   SuccessfulCreateService         6s               PytorchController  Created service: demo2-224-pre-master-0
  Normal   SuccessfulCreatePod             6s               PytorchController  Created pod: demo2-224-pre-worker-0
```
I checked demo2-224-pre-master-0; it is indeed not working properly.
Then I created a headless service:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pytorchjob-headless-service
spec:
  clusterIP: None
  selector:
    app: pytorchjob-headless-service-selector
  ports:
    - protocol: TCP
      port: 23456
      targetPort: 23456
```
and added

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  labels:
    jobtype: pytorchjob
    platform: k8s
    run: demo2-224-pre
  name: demo2-224-pre
  namespace: ailabs
spec:
  pytorchReplicaSpecs:
    Master:
      backoffLimit: 0
      replicas: 1
      template:
        metadata:
          annotations:
            ...
            sidecar.istio.io/inject: "false"
          labels:
            ...
            app: pytorchjob-headless-service-selector
          ...
```
and deployed it again. When checking pytorchjob-headless-service, there are still no pods attached to it. Where did I go wrong?
Thanks!!
Do you have WeChat, or are you on Slack?
Just sent a friend request. Thank you very much for your time!!