
ORTE does not know how to route a message to the specified daemon

Open kongjibai opened this issue 2 years ago • 7 comments

When I use Volcano to start a Horovod TF job, the lm-horovod-job-master-0 pod errors out and only reaches Running status after restarting 3 times, because by then it has "Permanently added 'lm-horovod-job-worker-0.lm-horovod-job,10.10.10.10' (ECDSA) to the list of known hosts." The output is below. Has anyone met or solved a problem like this? I am using Volcano 1.5.1, Horovod 0.24.3, TF 1.15, Open MPI 4.0.0, CUDA 10.0, and Ubuntu 18.04 in the Docker image on Kubernetes.

checkpoints  data  test.py  tf_mnist_lm.py  torch_mnist_lm.py
ssh: Could not resolve hostname lm-horovod-job-worker-5.lm-horovod-job: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default
  
* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   lm-horovod-job-master-0
  target node:  lm-horovod-job-worker-0

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[lm-horovod-job-master-0:00014] 6 more processes have sent help message help-errmgr-base.txt / no-path
[lm-horovod-job-master-0:00014] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
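A quick way to confirm what the log suggests (a sketch, assuming the pods are in the current namespace; the pod and host names are taken from the output above) is to check from the master pod whether the worker hostnames resolve and accept SSH:

# does the worker service hostname resolve inside the master pod?
kubectl exec lm-horovod-job-master-0 -- getent hosts lm-horovod-job-worker-0.lm-horovod-job
# can the master reach the worker over SSH non-interactively?
kubectl exec lm-horovod-job-master-0 -- ssh -o BatchMode=yes lm-horovod-job-worker-0.lm-horovod-job hostname

If the first command fails while the workers are still starting up, mpiexec hits the same "Could not resolve hostname" error shown above.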

kongjibai avatar May 12 '22 02:05 kongjibai

Please post the yaml file of the job

hwdef avatar May 12 '22 03:05 hwdef

Please post the yaml file of the job

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # name: lm-hvd-job-tf-mnist
  name: lm-horovod-job
  # namespace: vc-horovod-test
  # namespace: default
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 9
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  WORKER_HOST=`cat /home/etc-volcano/volcano/worker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  cd /home/vc-hvd-test;
                  mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed=direct -np 8 python torch_mnist_lm.py;
              image: vc-hvd-test:v1.0
              name: master
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:                       
              - name: vc-hvd-home
                mountPath: /home
              resources:
                requests:
                  cpu: "500m"
                  memory: "1024Mi"
                limits:
                  cpu: "500m"
                  memory: "1024Mi"
          volumes:
          - name: vc-hvd-home
            persistentVolumeClaim:                 
              claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: vc-hvd-test:v1.0
              name: worker
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:                         
              - name: vc-hvd-home
                mountPath: /home
              resources:
                limits:
                  nvidia.com/gpu: 1                 
          volumes:
          - name: vc-hvd-home
            persistentVolumeClaim:                 
              claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
---

kongjibai avatar May 12 '22 03:05 kongjibai

Please try delaying the start of the master.

https://github.com/volcano-sh/volcano/blob/2cfce7a1305e4ad6d3dcb1a11bf3dc528aee0701/example/task-start-dependency/mpi.yaml#L34

You can try using dependsOn.
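
For reference, a minimal sketch of what that could look like on the master task in the yaml above, following the linked example (the task name "worker" is taken from your job spec):

  tasks:
    - replicas: 1
      name: master
      dependsOn:
        name:
          - worker    # the master task waits for the worker task, as in the linked example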

hwdef avatar May 12 '22 03:05 hwdef

Please try delaying the start of the master.

https://github.com/volcano-sh/volcano/blob/2cfce7a1305e4ad6d3dcb1a11bf3dc528aee0701/example/task-start-dependency/mpi.yaml#L34

You can try using dependsOn.

I have tried using dependsOn, but then no master pod is created and the worker pods stay in Pending status. After I deleted the job and applied it again, the output of kubectl get pod was "No resources found in default namespace". I can only clean up Volcano and reinstall it; if I then try dependsOn and apply a job again, the master and worker pods behave the same as before. Below is the .yaml file.

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # name: lm-hvd-job-tf-mnist
  name: lm-horovod-job
  # namespace: vc-horovod-test
  # namespace: default
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 9
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  WORKER_HOST=`cat /home/etc-volcano/volcano/worker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  cd /home/vc-hvd-test;
                  mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed=direct -np 8 python torch_mnist_lm.py;
              image: vc-hvd-test:v1.0
              name: master
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:                       
              - name: vc-hvd-home
                mountPath: /home
              resources:
                requests:
                  cpu: "500m"
                  memory: "1024Mi"
                limits:
                  cpu: "500m"
                  memory: "1024Mi"
          volumes:
          - name: vc-hvd-home
            persistentVolumeClaim:                 
              claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
      dependsOn:
        name:
          - "worker"
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: vc-hvd-test:v1.0
              name: worker
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:                         
              - name: vc-hvd-home
                mountPath: /home
              resources:
                limits:
                  nvidia.com/gpu: 1                 
          volumes:
          - name: vc-hvd-home
            persistentVolumeClaim:                 
              claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
---
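
To see why the workers stay Pending, it may help to inspect the Volcano job and its podgroup (a sketch; the job name comes from the yaml above, and the vcjob/podgroup resource names assume a standard Volcano install):

kubectl get vcjob lm-horovod-job -o yaml       # task states as reported by the Volcano controller
kubectl get podgroup                           # gang info: minMember vs. pods actually created
kubectl describe pod <a-pending-worker-pod>    # scheduler events for a Pending pod (placeholder name)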

kongjibai avatar May 12 '22 10:05 kongjibai

It is speculated that there is a conflict with the gang plugin: with minAvailable: 9, the scheduler waits until all 9 pods (8 workers plus the master) can be created and scheduled, while dependsOn holds the master back until the workers are running, so neither side makes progress. You can disable the gang plugin or set minAvailable == worker.replicas.
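
For example, with 8 worker replicas the job could be adjusted like this (a sketch based on the yaml above):

spec:
  minAvailable: 8          # equal to worker.replicas, so the gang does not also wait for the master
  schedulerName: volcano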

hwdef avatar May 12 '22 11:05 hwdef

@hwdef Perhaps we should give more consideration to MPI requirements and provide more test results for the mpi plugin.

Thor-wl avatar May 13 '22 03:05 Thor-wl

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] avatar Aug 12 '22 22:08 stale[bot]