mpi-operator
Cannot run MPIJobs on OpenShift: hosts_of_nodes: Cannot open: Permission denied
Hello,
I am unable to run MPIJobs with the latest container image:
$ oc logs pod/mpi-mesher-launcher-jx64z
+ POD_NAME=mpi-mesher-worker-0
+ '[' m = - ']'
+ shift
+ /opt/kube/kubectl cp /opt/kube/hosts mpi-mesher-worker-0:/etc/hosts_of_nodes
tar: hosts_of_nodes: Cannot open: Permission denied
tar: Exiting with failure status due to previous errors
command terminated with exit code 2
+ /opt/kube/kubectl exec mpi-mesher-worker-0 -- /bin/sh -c 'cat /etc/hosts_of_nodes >> /etc/hosts && orted -mca ess "env" -mca ess_base_jobid "421724160" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "mpi-mesher-launcher-jx[2:64]z,mpi-mesher-worker-[1:0]@0(2)" -mca orte_hnp_uri "421724160.0;tcp://10.129.2.27:41919" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "421724160.0;tcp://10.129.2.27:41919" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"'
/bin/sh: /etc/hosts: Permission denied
command terminated with exit code 1
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
It seems to be caused by this patch: https://github.com/kubeflow/mpi-operator/commit/07bbb45de95332c98b5787d846e8786f9670a5cb#diff-14bbcd1fe291a13ef9ceac0a33e03e0dR976 and indeed it was working properly earlier in July.
When I force the image tag to be v0.2.3 (for mpioperator/mpi-operator and mpioperator/kubectl-delivery), the job runs again as expected.
$ oc version
Client Version: 4.4.6
Server Version: 4.4.10
Kubernetes Version: v1.17.1+9d33dd3
@milkybird98 Could you help take a look?
The issue seems to be that in OpenShift the pods run with user privileges, so /etc/hosts cannot be modified this way.
A workaround would be commenting out the lines that modify /etc/hosts and then passing the correct args to mpirun manually.
When using OpenMPI, the lines that modify /etc/hosts are redundant: the ORTE process manager does not need to resolve the hostnames of the MPI nodes.
But the hydra process manager used by MPICH and Intel MPI must resolve the hostnames of the MPI nodes; it does not pass the IP addresses.
The reason I chose to modify the /etc/hosts file directly instead of using DNS is mainly that the DNS server that comes with Kubernetes uses the FQDN hostnames of pods, which is not convenient to use.
In addition, because the pods' IP addresses are only known at runtime, HostAliases cannot be used to maintain the /etc/hosts file either.
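For context, this is roughly what a hostAliases entry looks like: it takes IP addresses that must already be known when the Pod spec is written, which pod IPs are not (the values below are illustrative only):

spec:
  hostAliases:
  - ip: "10.129.2.27"          # would have to be fixed before the pod starts
    hostnames:
    - "mpi-mesher-worker-0"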
Yes, same issue with OpenMPI using the latest version. Reverting mpi-operator to v0.2.3 fixes the permission error.
I have a question about using MPICH. My YAML file is shown below. Why are all the MPI processes running on the launcher node? I believe what I need to add is MPIDistribution: "MPICH" in order to switch to MPICH.
apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: hello-mpich
spec:
  slotsPerWorker: 8
  MPIDistribution: "MPICH"
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: uk.icr.io/hf-roks-uk/ubuntu-18.04-devel-mpich:v0
            name: hello-mpich
            command:
            - mpirun
            - -np
            - "16"
            - -bind-to
            - none
            - -map-by
            - slot
            - -ppn
            - "8"
            - /codes/HelloMPI/hello
            imagePullPolicy: Always
          imagePullSecrets:
          - name: all-icr-io
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: uk.icr.io/hf-roks-uk/ubuntu-18.04-devel-mpich:v0
            name: hello-mpich
            resources:
              limits:
                nvidia.com/gpu: 0
          imagePullSecrets:
          - name: all-icr-io
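One hedged guess about why all ranks end up on the launcher: hydra's mpirun only spreads ranks across hosts when it is given a host list, so a launcher command without one runs everything locally. A sketch of passing the operator-generated hostfile explicitly, assuming it is mounted at /etc/mpi/hostfile as in the launcher log above:

            command:
            - mpirun
            - -f
            - /etc/mpi/hostfile   # assumption: same path as in the OpenMPI launcher log
            - -np
            - "16"
            - -ppn
            - "8"
            - /codes/HelloMPI/hello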
Any update on this?
Hi, have you tried version 0.3.0? Note that it might require recreating the Job.
For current kube 1.22, I guess I must deploy v1, but the documentation still points to v2beta1, which doesn't work with kube 1.22 nor with OCP 4.9 (which uses 1.22).
What's the current, updated way to deploy the operator?
v2beta1 is newer than v1 and it's supported in kubernetes 1.22. The README is up to date.
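For reference, a minimal v2beta1 skeleton; the image and command below are placeholders, not something from this thread:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: example
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: registry.example.com/my-mpi-app:latest   # placeholder
            command: ["mpirun", "-np", "2", "/app/hello"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: registry.example.com/my-mpi-app:latest   # placeholder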
FYI, I got the MPI-Operator v0.3.0 running on OpenShift (4.9). Here are some notes about how I proceeded (do not assume it's safe; it's merely working):
Deployment of the operator
oc apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.3.0/deploy/v2beta1/mpi-operator.yaml
oc set image -n mpi-operator deployment.apps/mpi-operator mpi-operator=docker.io/mpioperator/mpi-operator:0.3.0
Configuration of the namespace, to allow the default service account to run pods as root:
# oc new-project $MY_PROJECT
# oc adm policy add-scc-to-user privileged -z default # (from $MY_PROJECT namespace)
# oc adm policy add-scc-to-user anyuid -z default # (from $MY_PROJECT namespace)
Setup of the container image:
RUN dnf -y -q \
install sudo pkg-config vim make gdb \
curl git openssh-clients openssh-server \
gcc-gfortran gcc-c++ \
openmpi-devel openmpi \
iputils \
&& ln -s /usr/lib64/openmpi/bin/orted /usr/bin/orted \
&& ssh-keygen -A \
&& (echo "Host *"; echo " StrictHostKeyChecking no") >> /etc/ssh/ssh_config.d/no_StrictHostKeyChecking.conf \
&& echo "StrictModes no" >> /etc/ssh/sshd_config \
&& touch /var/log/lastlog \
&& chgrp utmp /var/log/lastlog \
&& chmod 664 /var/log/lastlog
Besides, in v0.3.0, the Launcher Pod starts its life by failing until the worker Pods are up and running (and waiting for incoming SSH connections).
It will fail because the <worker-pod>.<service> hostname is unavailable until the worker Pods are running.
At the first glance, this isn't an issue, as the launcher Pod retries forever, and eventually the worker Pods are running and the MPI execution can actually start. However, this makes automation impossible, as we cannot distinguish a genuine failure from a "not-ready" failure.
This init container works around the problem, by preventing the main container from running before the worker Pods can be reached.
The first loop waits for the host list file (/etc/mpi/discover_hosts.sh) to be populated. I'm not 100% sure that the exit 1 is mandatory; I thought ConfigMap files would be updated automatically in the Pods, but in my quick tries that didn't happen. To be confirmed.
The second loop waits for the ssh connection to each worker Pod to work. (A more readable form of the command is sketched after the manifest below.)
mpiReplicaSpecs:
  Launcher:
    replicas: 1
    template:
      spec:
        initContainers:
        - name: wait-hostfilename
          image: <any image>
          command:
          - bash
          - -cx
          - "[[ $(cat /etc/mpi/discover_hosts.sh | wc -l) != 1 ]] && (date; echo Ready; cat /etc/mpi/discover_hosts.sh) || (date; echo 'not ready ...'; sleep 10; exit 1) && while read host; do while ! ssh $host echo $host ; do date; echo \"Pod $host is not up ...\"; sleep 10; done; date; echo \"Pod $host is ready\"; done <<< \"$(/etc/mpi/discover_hosts.sh)\""
          volumeMounts:
          - mountPath: /etc/mpi
            name: mpi-job-config
          - mountPath: /root/.ssh
            name: ssh-auth
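For readability, the same one-liner can also be written as a block scalar. This is an untested, equivalent sketch of the command above, not a different workaround:

          command:
          - bash
          - -cx
          - |
            # First loop: fail (and let the init container restart) until the
            # generated host list contains more than one line.
            if [[ $(wc -l < /etc/mpi/discover_hosts.sh) == 1 ]]; then
              date; echo 'not ready ...'; sleep 10; exit 1
            fi
            date; echo Ready; cat /etc/mpi/discover_hosts.sh
            # Second loop: wait until every worker Pod accepts an ssh connection.
            while read host; do
              while ! ssh "$host" echo "$host"; do
                date; echo "Pod $host is not up ..."; sleep 10
              done
              date; echo "Pod $host is ready"
            done <<< "$(/etc/mpi/discover_hosts.sh)"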
In addition to what's above, the Worker Pod must run with privileges:
Worker:
  replicas: {{ .Nproc }}
  template:
    spec:
      containers:
      - name: mpi-worker
        image: {{ .Image }}
        imagePullPolicy: Always
        securityContext:
          privileged: true
otherwise the ssh connection will fail with this error:
chroot("/var/empty/sshd"): Operation not permitted [preauth]
linux_audit_write_entry failed: Operation not permitted
If you change the sshd port number, you might not need the container to be privileged.
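A hedged sketch of that alternative: run sshd in the Worker on an unprivileged port as a non-root user, so privileged: true is no longer needed. The user name, UID, port, and sshd_config path below are assumptions about the image, not something from this thread, and the launcher's ssh client config has to point at the same port:

Worker:
  replicas: {{ .Nproc }}
  template:
    spec:
      containers:
      - name: mpi-worker
        image: {{ .Image }}
        # Assumes the image ships /home/mpiuser/.sshd_config with "Port 2222"
        # and host keys readable by UID 1000.
        command: ["/usr/sbin/sshd", "-De", "-f", "/home/mpiuser/.sshd_config"]
        securityContext:
          runAsUser: 1000   # non-root; no privileged flag needed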