volcano
ORTE does not know how to route a message to the specified daemon
When I use Volcano to start a Horovod TF job, the lm-horovod-job-master-0 pod fails, and only after restarting 3 times does it reach Running status, because by then it has "Permanently added 'lm-horovod-job-worker-0.lm-horovod-job,10.10.10.10' (ECDSA) to the list of known hosts".
The output is below; has anyone met or solved a problem like this?
I use volcano 1.5.1, horovod 0.24.3, tf 1.15, Open MPI 4.0.0, CUDA 10.0, and Ubuntu 18.04 in the K8s Docker image.
checkpoints data test.py tf_mnist_lm.py torch_mnist_lm.py
ssh: Could not resolve hostname lm-horovod-job-worker-5.lm-horovod-job: Name or service not known
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: lm-horovod-job-master-0
target node: lm-horovod-job-worker-0
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
[lm-horovod-job-master-0:00014] 6 more processes have sent help message help-errmgr-base.txt / no-path
[lm-horovod-job-master-0:00014] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Please post the yaml file of the job
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # name: lm-hvd-job-tf-mnist
  name: lm-horovod-job
  # namespace: vc-horovod-test
  # namespace: default
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 9
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  WORKER_HOST=`cat /home/etc-volcano/volcano/worker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  cd /home/vc-hvd-test;
                  mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed=direct -np 8 python torch_mnist_lm.py;
              image: vc-hvd-test:v1.0
              name: master
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:
                - name: vc-hvd-home
                  mountPath: /home
              resources:
                requests:
                  cpu: "500m"
                  memory: "1024Mi"
                limits:
                  cpu: "500m"
                  memory: "1024Mi"
          volumes:
            - name: vc-hvd-home
              persistentVolumeClaim:
                claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: vc-hvd-test:v1.0
              name: worker
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:
                - name: vc-hvd-home
                  mountPath: /home
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: vc-hvd-home
              persistentVolumeClaim:
                claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
---
Please try a delayed start for the master:
https://github.com/volcano-sh/volcano/blob/2cfce7a1305e4ad6d3dcb1a11bf3dc528aee0701/example/task-start-dependency/mpi.yaml#L34
You can try using dependsOn (a rough sketch follows below).
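A minimal sketch of what the linked task-start-dependency example does, assuming the task names and layout from the job above (only the dependsOn field is new; the pod specs stay unchanged):

tasks:
  - replicas: 1
    name: master
    # Delay master creation until the worker task is up, so that
    # mpiexec does not ssh to worker hostnames that do not resolve yet.
    dependsOn:
      name:
        - "worker"
    template:
      # unchanged master pod spec
      ...
  - replicas: 8
    name: worker
    template:
      # unchanged worker pod spec
      ...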
I have tried using dependsOn, but no master pod is created and the worker pods stay in Pending status. After I deleted the job and applied it again, the output of kubectl get pod was "No resources found in default namespace". I can only clean up Volcano and reinstall it; if I then try dependsOn again and apply a job, the master and worker pods behave the same as before. Below is the .yaml file.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  # name: lm-hvd-job-tf-mnist
  name: lm-horovod-job
  # namespace: vc-horovod-test
  # namespace: default
  labels:
    "volcano.sh/job-type": Horovod
spec:
  minAvailable: 9
  schedulerName: volcano
  plugins:
    ssh: []
    svc: []
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  WORKER_HOST=`cat /home/etc-volcano/volcano/worker.host | tr "\n" ","`;
                  mkdir -p /var/run/sshd; /usr/sbin/sshd;
                  cd /home/vc-hvd-test;
                  mpiexec --allow-run-as-root --host ${WORKER_HOST} --mca routed=direct -np 8 python torch_mnist_lm.py;
              image: vc-hvd-test:v1.0
              name: master
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:
                - name: vc-hvd-home
                  mountPath: /home
              resources:
                requests:
                  cpu: "500m"
                  memory: "1024Mi"
                limits:
                  cpu: "500m"
                  memory: "1024Mi"
          volumes:
            - name: vc-hvd-home
              persistentVolumeClaim:
                claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
      dependsOn:
        name:
          - "worker"
    - replicas: 8
      name: worker
      template:
        spec:
          containers:
            - command:
                - /bin/sh
                - -c
                - |
                  mkdir -p /var/run/sshd; /usr/sbin/sshd -D;
              image: vc-hvd-test:v1.0
              name: worker
              ports:
                - containerPort: 22
                  name: job-port
              volumeMounts:
                - name: vc-hvd-home
                  mountPath: /home
              resources:
                limits:
                  nvidia.com/gpu: 1
          volumes:
            - name: vc-hvd-home
              persistentVolumeClaim:
                claimName: nfs-pvc02
          restartPolicy: OnFailure
          imagePullSecrets:
            - name: default-secret
---
It is speculated that this conflicts with the gang plugin.
You can disable the gang plugin, or set minAvailable equal to worker.replicas (see the sketch below).
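Presumably, with minAvailable: 9 the gang plugin will not admit the job until all 9 pods (master + 8 workers) exist, while dependsOn holds the master back until the workers run, so nothing ever starts and everything stays Pending. A rough sketch of the suggested change, using the values from the job above:

spec:
  # Gang-schedule only the 8 workers; the master is created afterwards
  # via dependsOn, so it is not counted in minAvailable here.
  minAvailable: 8
  ...
  tasks:
    - replicas: 1
      name: master
      dependsOn:
        name:
          - "worker"
      ...
    - replicas: 8
      name: worker
      ...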
@hwdef Perhaps we should give more consideration to MPI requirements and provide more test results for the mpi plugin.
Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).
Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗