SSH issue when trying to deploy the Horovod MNIST example
The following is the mpi-operator configuration I am trying to deploy on our Kubernetes cluster.
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: horovod/horovod:latest
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow2/tensorflow2_mnist.py
            # resources:
            #   limits:
            #     cpu: 1
            #     memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: horovod/horovod:latest
            name: mpi-worker
            # resources:
            #   limits:
            #     cpu: 2
            #     memory: 4Gi
The following is the error I am getting in the launcher pod:
Failed to add the host to the list of known hosts (/root/.ssh/known_hosts).
Failed to add the host to the list of known hosts (/root/.ssh/known_hosts).
Permission denied, please try again.
Permission denied, please try again.
[email protected]: Permission denied (publickey,password).
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: tensorflow-mnist-launcher
target node: tensorflow-mnist-worker-0.tensorflow-mnist-worker
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
What am I missing? I am unable to find anything in the Horovod docs or the mpi-operator docs.
kubectl version:
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:28:09Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.2", GitCommit:"faecb196815e248d3ecfb03c680a4507229c2a56", GitTreeState:"clean", BuildDate:"2021-01-13T13:20:00Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
mpi-operator version: 0.3.0
As per the official mpi-operator docs, the provided config (examples/v2beta1/tensorflow-benchmarks.yaml) is for TF v1.14, whereas the config above targets TF v2.5.0 based on the latest Horovod Docker image. Can anyone tell me what step I am missing?
The upstream Horovod image horovod/horovod:latest doesn't support the operator directly.
You can use this Dockerfile instead: https://github.com/kubeflow/mpi-operator/blob/master/examples/horovod/Dockerfile
You can use this file as a reference for the configuration the image needs: https://github.com/kubeflow/mpi-operator/blob/master/examples/base/Dockerfile
OK, so I modified the Docker image as follows:
FROM horovod/horovod:latest
RUN echo "UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config
CMD ["/bin/bash"]
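As a sanity check, the sed line can be exercised locally against a sample sshd_config entry to confirm it flips StrictModes off (the /tmp path here is only for the demonstration):

```shell
# Reproduce the Dockerfile's sed against a stock sshd_config line.
printf '#StrictModes yes\n' > /tmp/sshd_config_sample
sed -i 's/#\(StrictModes \).*/\1no/g' /tmp/sshd_config_sample
cat /tmp/sshd_config_sample
# -> StrictModes no
```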
And used the MPIJob config below:
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: ramakrishna1592/mnist-horovod:v1
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /horovod/examples/tensorflow2/tensorflow2_mnist.py
            # resources:
            #   limits:
            #     cpu: 1
            #     memory: 2Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: ramakrishna1592/mnist-horovod:v1
            name: mpi-worker
            # resources:
            #   limits:
            #     cpu: 2
            #     memory: 4Gi
It worked and completed execution in about 15 minutes.
A couple of things I noticed: after mpirun executes, the launcher pod goes into the CrashLoopBackOff state.
Following are the logs of the pod:

After some time it moves back into the Running state.

Is this an issue with the config? Is there any parameter I need to set?
Those errors are expected/acceptable; mpi-operator handles the retries for you. The thing is that, depending on your Kubernetes installation, the networking (DNS names) might take some time to set up.
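If the repeated launcher restarts are a concern, the retries can be capped at the job level. A sketch, assuming the v2beta1 runPolicy exposes the backoffLimit field from kubeflow/common (the value 10 is arbitrary):

```yaml
# Sketch only: limit how many times the operator retries a failed launcher.
# backoffLimit availability depends on the mpi-operator version in use.
spec:
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 10   # give up after 10 launcher failures
```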
@terrytangyuan I think we should have our own fork of the Horovod images here: https://hub.docker.com/u/mpioperator
We just need the changes that @ramakrishnamamidi identified:
FROM horovod/horovod:latest
RUN echo "UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && \
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config
CMD ["/bin/bash"]
Can you create a repo for it?
Although it would be good to also add the configuration needed to run as non-root, similar to this: https://github.com/kubeflow/mpi-operator/commit/fee9913c6c5ee657871cf8967ec7e8d773666ea5#diff-be50a3cb50e4eb471c7337dba6036a840f2cadb8faf1ab15c421e682dafd9842
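For reference, a hedged sketch of what the non-root setup might look like at the MPIJob level, assuming the image defines a non-root user with UID 1000 whose home is /home/mpiuser (as in the linked commit) and that the v2beta1 spec supports sshAuthMountPath:

```yaml
# Sketch only: run the launcher (and workers, analogously) as a non-root user.
# Both the UID and the home directory are assumptions about the image.
spec:
  sshAuthMountPath: /home/mpiuser/.ssh   # where the operator mounts the SSH keys
  mpiReplicaSpecs:
    Launcher:
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
```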
Actually, isn't the above what we have as the tensorflow benchmarks image?
- https://github.com/kubeflow/mpi-operator/blob/master/examples/tensorflow-benchmarks/Dockerfile
- https://hub.docker.com/r/mpioperator/mpi-operator
@alculquicondor Hi, I had the same problem with the default YAML:
kubectl apply -f examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml
kubectl logs -f tensorflow-benchmarks-imagenet-launcher-tnsxb

How do I set up password-free login between containers? Is it necessary to rebuild the images using the Dockerfile and replace the images in tensorflow-benchmarks.yaml?
Can you confirm which image these pods are using? If I remember correctly, the images on Docker Hub were built using the Dockerfile in the repo.
Thanks for your reply. I am using the default image:
containers:
- image: mpioperator/tensorflow-benchmarks:latest
I may have solved the problem after I fixed the calico-node state. Before:
kube-system calico-node-6c9mx 0/1 Running 0 10s
kube-system calico-node-dbndq 0/1 Running 0 10s
kube-system calico-node-qw6vv 0/1 Running 0 10s
After:
kube-system calico-node-h2pnc 1/1 Running 0 11m
kube-system calico-node-npzn6 1/1 Running 0 11m
kube-system calico-node-smwpb 1/1 Running 0 11m
But I got a new error, as below:

That error looks like a problem in the application, which is outside of the scope of the operator. Did you install GPU drivers?
Thanks, my GPU driver version is 510.47.03. I solved this problem by updating the TF and CUDA versions in the image.
That error looks like a problem in the application, which is outside of the scope of the operator. Did you install GPU drivers?
In the horovod image or the GPU daemonset?
If the horovod image, maybe it's worth upgrading our patched image.