mpi-operator
Cannot run MPIJobs on OpenShift: hosts_of_nodes: Cannot open: Permission denied
Hello,
I am unable to run MPIJobs with the latest container image:
$ oc logs pod/mpi-mesher-launcher-jx64z
+ POD_NAME=mpi-mesher-worker-0
+ '[' m = - ']'
+ shift
+ /opt/kube/kubectl cp /opt/kube/hosts mpi-mesher-worker-0:/etc/hosts_of_nodes
tar: hosts_of_nodes: Cannot open: Permission denied
tar: Exiting with failure status due to previous errors
command terminated with exit code 2
+ /opt/kube/kubectl exec mpi-mesher-worker-0 -- /bin/sh -c 'cat /etc/hosts_of_nodes >> /etc/hosts && orted -mca ess "env" -mca ess_base_jobid "421724160" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "mpi-mesher-launcher-jx[2:64]z,mpi-mesher-worker-[1:0]@0(2)" -mca orte_hnp_uri "421724160.0;tcp://10.129.2.27:41919" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "421724160.0;tcp://10.129.2.27:41919" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"'
/bin/sh: /etc/hosts: Permission denied
command terminated with exit code 1
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
It seems to be caused by this patch: https://github.com/kubeflow/mpi-operator/commit/07bbb45de95332c98b5787d846e8786f9670a5cb#diff-14bbcd1fe291a13ef9ceac0a33e03e0dR976 and indeed it was working properly earlier in July.
When I force the image tag to be v0.2.3 (for mpioperator/mpi-operator and mpioperator/kubectl-delivery), the job runs again as expected.
$ oc version
Client Version: 4.4.6
Server Version: 4.4.10
Kubernetes Version: v1.17.1+9d33dd3
@milkybird98 Could you help take a look?
The issue seems to be that in OpenShift the pods run with user privileges, so /etc/hosts cannot be modified this way.
A workaround would be commenting out the lines that modify /etc/hosts and then passing the correct args to mpirun manually.
When using OpenMPI, the lines that modify /etc/hosts are redundant: the ORTE process manager does not need to resolve the hostnames of the MPI nodes.
But the hydra process manager used by MPICH and Intel MPI must resolve the hostnames of the MPI nodes; it does not pass the IP addresses.
The reason I chose to modify the /etc/hosts file directly instead of using DNS is mainly that the DNS server that comes with Kubernetes uses the FQDN hostnames of pods, which is not convenient to use.
In addition, because the pods' IP addresses are only known at runtime, HostAliases cannot be used to maintain the /etc/hosts file either.
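For context, this is roughly what a hostAliases entry looks like: it takes IP addresses that must already be known when the Pod spec is written, which pod IPs are not (the values below are illustrative only):

spec:
  hostAliases:
  - ip: "10.129.2.27"          # would have to be fixed before the pod starts
    hostnames:
    - "mpi-mesher-worker-0"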
Yes, same issue with OpenMPI using the latest version. Reverting mpi-operator to v0.2.3 fixes the permission error.
I have a question about using MPICH. My YAML file is shown below. Why are all the MPI processes running on the launcher node? I believe what I need to add is MPIDistribution: "MPICH" in order to switch to MPICH.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: hello-mpich
spec:
  slotsPerWorker: 8
  MPIDistribution: "MPICH"
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: uk.icr.io/hf-roks-uk/ubuntu-18.04-devel-mpich:v0
            name: hello-mpich
            command:
            - mpirun
            - -np
            - "16"
            - -bind-to
            - none
            - -map-by
            - slot
            - -ppn
            - "8"
            - /codes/HelloMPI/hello
            imagePullPolicy: Always
          imagePullSecrets:
          - name: all-icr-io
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: uk.icr.io/hf-roks-uk/ubuntu-18.04-devel-mpich:v0
            name: hello-mpich
            resources:
              limits:
                nvidia.com/gpu: 0
          imagePullSecrets:
          - name: all-icr-io
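One hedged guess about why all ranks end up on the launcher: hydra's mpirun only spreads ranks across hosts when it is given a host list, so a launcher command without one runs everything locally. A sketch of passing the operator-generated hostfile explicitly, assuming it is mounted at /etc/mpi/hostfile as in the launcher log above:

            command:
            - mpirun
            - -f
            - /etc/mpi/hostfile   # assumption: same path as in the OpenMPI launcher log
            - -np
            - "16"
            - -ppn
            - "8"
            - /codes/HelloMPI/hello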
Any update on this?
Hi, have you tried version 0.3.0? Note that it might require recreating the Job.
For current kube 1.22, I guess I must deploy v1, but the documentation still points to v2beta1, which doesn't work with kube 1.22 nor with OCP 4.9 (which uses 1.22).
What's the current, updated way to deploy the operator?
v2beta1 is newer than v1 and it's supported in kubernetes 1.22. The README is up to date.
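For reference, a minimal v2beta1 skeleton; the image and command below are placeholders, not something from this thread:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: example
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: registry.example.com/my-mpi-app:latest   # placeholder
            command: ["mpirun", "-np", "2", "/app/hello"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: registry.example.com/my-mpi-app:latest   # placeholder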
FYI, I got the MPI-Operator v0.3.0 running on OpenShift (4.9). Here are some notes about how I proceeded (do not assume it's safe; it's merely working):
Deployment of the operator
oc apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.3.0/deploy/v2beta1/mpi-operator.yaml
oc set image -n mpi-operator deployment.apps/mpi-operator mpi-operator=docker.io/mpioperator/mpi-operator:0.3.0
Configuration of the namespace, to allow the default service account to run pods as root:
# oc new-project $MY_PROJECT
# oc adm policy add-scc-to-user privileged -z default # (from $MY_PROJECT namespace)
# oc adm policy add-scc-to-user anyuid -z default # (from $MY_PROJECT namespace)
Setup of the container image:
RUN dnf -y -q \
install sudo pkg-config vim make gdb \
curl git openssh-clients openssh-server \
gcc-gfortran gcc-c++ \
openmpi-devel openmpi \
iputils \
&& ln -s /usr/lib64/openmpi/bin/orted /usr/bin/orted \
&& ssh-keygen -A \
&& (echo "Host *"; echo " StrictHostKeyChecking no") >> /etc/ssh/ssh_config.d/no_StrictHostKeyChecking.conf \
&& echo "StrictModes no" >> /etc/ssh/sshd_config \
&& touch /var/log/lastlog \
&& chgrp utmp /var/log/lastlog \
&& chmod 664 /var/log/lastlog
Besides, in v0.3.0, the Launcher Pod starts its life by failing until the worker Pods are up and running (and waiting for incoming SSH connections).
It will fail because the <worker-pod>.<service> hostname is unavailable until the worker Pods are running.
At the first glance, this isn't an issue, as the launcher Pod retries forever, and eventually the worker Pods are running and the MPI execution can actually start. However, this makes automation impossible, as we cannot distinguish a genuine failure from a "not-ready" failure.
This init container works around the problem, by preventing the main container from running before the worker Pods can be reached.
The first loop waits for the host list file (/etc/mpi/discover_hosts.sh) to be populated. I'm not 100% sure that the exit 1 is mandatory; I thought ConfigMap files would be updated automatically in the Pods, but in my quick tries that didn't happen. To be confirmed.
The second loop waits for the ssh connection to each worker Pod to work. (A more readable form of the command is sketched after the manifest below.)
mpiReplicaSpecs:
  Launcher:
    replicas: 1
    template:
      spec:
        initContainers:
        - name: wait-hostfilename
          image: <any image>
          command:
          - bash
          - -cx
          - "[[ $(cat /etc/mpi/discover_hosts.sh | wc -l) != 1 ]] && (date; echo Ready; cat /etc/mpi/discover_hosts.sh) || (date; echo 'not ready ...'; sleep 10; exit 1) && while read host; do while ! ssh $host echo $host ; do date; echo \"Pod $host is not up ...\"; sleep 10; done; date; echo \"Pod $host is ready\"; done <<< \"$(/etc/mpi/discover_hosts.sh)\""
          volumeMounts:
          - mountPath: /etc/mpi
            name: mpi-job-config
          - mountPath: /root/.ssh
            name: ssh-auth
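For readability, the same one-liner can also be written as a block scalar. This is an untested, equivalent sketch of the command above, not a different workaround:

          command:
          - bash
          - -cx
          - |
            # First loop: fail (and let the init container restart) until the
            # generated host list contains more than one line.
            if [[ $(wc -l < /etc/mpi/discover_hosts.sh) == 1 ]]; then
              date; echo 'not ready ...'; sleep 10; exit 1
            fi
            date; echo Ready; cat /etc/mpi/discover_hosts.sh
            # Second loop: wait until every worker Pod accepts an ssh connection.
            while read host; do
              while ! ssh "$host" echo "$host"; do
                date; echo "Pod $host is not up ..."; sleep 10
              done
              date; echo "Pod $host is ready"
            done <<< "$(/etc/mpi/discover_hosts.sh)"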
In addition to what's above, the Worker Pod must run with privileges:
Worker:
  replicas: {{ .Nproc }}
  template:
    spec:
      containers:
      - name: mpi-worker
        image: {{ .Image }}
        imagePullPolicy: Always
        securityContext:
          privileged: true
otherwise the ssh connection will fail with this error:
chroot("/var/empty/sshd"): Operation not permitted [preauth]
linux_audit_write_entry failed: Operation not permitted
If you change the sshd port number, you might not need the container to be privileged.
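A hedged sketch of that alternative: run sshd in the Worker on an unprivileged port as a non-root user, so privileged: true is no longer needed. The user name, UID, port, and sshd_config path below are assumptions about the image, not something from this thread, and the launcher's ssh client config has to point at the same port:

Worker:
  replicas: {{ .Nproc }}
  template:
    spec:
      containers:
      - name: mpi-worker
        image: {{ .Image }}
        # Assumes the image ships /home/mpiuser/.sshd_config with "Port 2222"
        # and host keys readable by UID 1000.
        command: ["/usr/sbin/sshd", "-De", "-f", "/home/mpiuser/.sshd_config"]
        securityContext:
          runAsUser: 1000   # non-root; no privileged flag needed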