mpi-operator
tensorflow-benchmarks worker authentication: Permissions 0640 for '/root/.ssh/id_rsa' are too open.
Hi, I'm trying to run the example job in /examples/v1/tensorflow-benchmarks.yaml. I get the error below for the launcher pod. Could you please help take a look?
[epwiann@node-10-210-152-99 kubeflow]$ k turing001 -n ping-wang logs tensorflow-benchmarks-launcher-62t7l
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/tensorflow-benchmarks-launcher-62t7l. Please use `kubectl.kubernetes.io/default-container` instead
Warning: Permanently added 'tensorflow-benchmarks-worker-1.tensorflow-benchmarks-worker,10.42.138.147' (ECDSA) to the list of known hosts.
Warning: Permanently added 'tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker,10.42.91.237' (ECDSA) to the list of known hosts.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0640 for '/root/.ssh/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "/root/.ssh/id_rsa": bad permissions
Permission denied, please try again.
Permission denied, please try again.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0640 for '/root/.ssh/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "/root/.ssh/id_rsa": bad permissions
[email protected]: Permission denied (publickey,password).
Permission denied, please try again.
Permission denied, please try again.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
[email protected]: Permission denied (publickey,password).
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: tensorflow-benchmarks-launcher
target node: tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
$ kubectl get po -n mynamespace |grep benchmar
tensorflow-benchmarks-launcher-62t7l 1/2 CrashLoopBackOff 5 4m34s
tensorflow-benchmarks-worker-0 2/2 Running 0 4m35s
tensorflow-benchmarks-worker-1 2/2 Running 0 4m35s
My mpioperator/tensorflow-benchmarks image is shown below; it is fairly recent ("2 weeks ago").
# docker images|grep tensorflow
mpioperator/tensorflow-benchmarks latest 840932631c4c 2 weeks ago 9.72GB
And I've checked that this tensorflow-benchmarks image has the following configuration: https://github.com/kubeflow/mpi-operator/blob/master/examples/tensorflow-benchmarks/Dockerfile#L3-L7
$ k turing001 -n ping-wang exec -it tensorflow-benchmarks-worker-0 sh
# ls -al
total 12
drwxr-sr-x 2 root 1337 100 Sep 29 06:27 .
drwxrwsrwt 3 root 1337 140 Sep 29 06:27 ..
-rw-r----- 1 root 1337 253 Sep 29 06:27 authorized_keys
-rw-r----- 1 root 1337 365 Sep 29 06:27 id_rsa
-rw-r----- 1 root 1337 253 Sep 29 06:27 id_rsa.pub
# cat /etc/ssh/ssh_config
# This is the ssh client system-wide configuration file. See
# ssh_config(5) for more information. This file provides defaults for
# users, and the values can be changed in per-user configuration files
# or on the command line.
# Configuration data is parsed as follows:
# 1. command line options
# 2. user-specific file
# 3. system-wide file
# Any configuration value is only changed the first time it is set.
# Thus, host-specific definitions should be at the beginning of the
# configuration file, and defaults at the end.
# Site-wide defaults for some commonly used options. For a comprehensive
# list of available options, their meanings and defaults, please see the
# ssh_config(5) man page.
Host *
# ForwardAgent no
# ForwardX11 no
# ForwardX11Trusted yes
# PasswordAuthentication yes
# HostbasedAuthentication no
# GSSAPIAuthentication no
# GSSAPIDelegateCredentials no
# GSSAPIKeyExchange no
# GSSAPITrustDNS no
# BatchMode no
# CheckHostIP yes
# AddressFamily any
# ConnectTimeout 0
# IdentityFile ~/.ssh/id_rsa
# IdentityFile ~/.ssh/id_dsa
# IdentityFile ~/.ssh/id_ecdsa
# IdentityFile ~/.ssh/id_ed25519
# Port 22
# Protocol 2
# Ciphers aes128-ctr,aes192-ctr,aes256-ctr,aes128-cbc,3des-cbc
# MACs hmac-md5,hmac-sha1,[email protected]
# EscapeChar ~
# Tunnel no
# TunnelDevice any:any
# PermitLocalCommand no
# VisualHostKey no
# ProxyCommand ssh -q -W %h:%p gateway.example.com
# RekeyLimit 1G 1h
SendEnv LANG LC_*
HashKnownHosts yes
GSSAPIAuthentication yes
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
This log line:
Permissions 0640 for '/root/.ssh/id_rsa' are too open.
implies that somehow this didn't run:
https://github.com/kubeflow/mpi-operator/blob/c5c0c3ef99ec9de948600766988ea7134d3d2af6/v2/pkg/controller/mpi_job_controller.go#L1526
Can you show the output of kubectl get -o yaml pod tensorflow-benchmarks-launcher-62t7l? I wonder if the volumes were set up correctly.
Hi @alculquicondor, thanks for taking the time to look at this issue.
I think I found that it may be related to the istio sidecar. If I deploy /examples/v1/tensorflow-benchmarks.yaml in a namespace without sidecar.istio.io/inject: "true", the launcher pod runs without problems.
If it is deployed in a namespace with automatic istio sidecar injection, the launcher pod fails with this error.
apiVersion: v1
kind: Pod
metadata:
annotations:
cni.projectcalico.org/podIP: 10.42.198.115/32
cni.projectcalico.org/podIPs: 10.42.198.115/32
kubectl.kubernetes.io/default-logs-container: tensorflow-benchmarks
prometheus.io/path: /stats/prometheus
prometheus.io/port: "15020"
prometheus.io/scrape: "true"
sidecar.istio.io/status: '{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null}'
creationTimestamp: "2021-09-30T01:05:12Z"
generateName: tensorflow-benchmarks-launcher-
labels:
controller-uid: 83a477d9-dab2-4cc9-a91b-e5b4ac946cc7
istio.io/rev: default
job-name: tensorflow-benchmarks-launcher
security.istio.io/tlsMode: istio
service.istio.io/canonical-name: tensorflow-benchmarks-launcher
service.istio.io/canonical-revision: latest
training.kubeflow.org/job-name: tensorflow-benchmarks
training.kubeflow.org/job-role: launcher
training.kubeflow.org/operator-name: mpi-operator
name: tensorflow-benchmarks-launcher-5jb5n
namespace: ping-wang
ownerReferences:
- apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: Job
name: tensorflow-benchmarks-launcher
uid: 83a477d9-dab2-4cc9-a91b-e5b4ac946cc7
resourceVersion: "249187576"
selfLink: /api/v1/namespaces/ping-wang/pods/tensorflow-benchmarks-launcher-5jb5n
uid: 007af6b3-fda2-485d-8f83-f0d829732764
spec:
containers:
- command:
- mpirun
- --allow-run-as-root
- -np
- "2"
- -bind-to
- none
- -map-by
- slot
- -x
- NCCL_DEBUG=INFO
- -x
- LD_LIBRARY_PATH
- -x
- PATH
- -mca
- pml
- ob1
- -mca
- btl
- ^openib
- python
- scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
- --model=resnet101
- --batch_size=64
- --variable_update=horovod
env:
- name: K_MPI_JOB_ROLE
value: launcher
- name: OMPI_MCA_orte_keep_fqdn_hostnames
value: "true"
- name: OMPI_MCA_orte_default_hostfile
value: /etc/mpi/hostfile
- name: OMPI_MCA_plm_rsh_args
value: -o ConnectionAttempts=10
- name: OMPI_MCA_orte_set_default_slots
value: "1"
- name: NVIDIA_VISIBLE_DEVICES
- name: NVIDIA_DRIVER_CAPABILITIES
image: mpioperator/tensorflow-benchmarks:latest
imagePullPolicy: Always
name: tensorflow-benchmarks
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /root/.ssh
name: ssh-auth
- mountPath: /etc/mpi
name: mpi-job-config
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-jsmkb
readOnly: true
- args:
- proxy
- sidecar
- --domain
- $(POD_NAMESPACE).svc.cluster.local
- --serviceCluster
- tensorflow-benchmarks-launcher.ping-wang
- --proxyLogLevel=warning
- --proxyComponentLogLevel=misc:error
- --log_output_level=default:info
- --concurrency
- "2"
env:
- name: JWT_POLICY
value: third-party-jwt
- name: PILOT_CERT_PROVIDER
value: istiod
- name: CA_ADDR
value: istiod.istio-system.svc:15012
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: INSTANCE_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: SERVICE_ACCOUNT
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.serviceAccountName
- name: HOST_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
- name: CANONICAL_SERVICE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['service.istio.io/canonical-name']
- name: CANONICAL_REVISION
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['service.istio.io/canonical-revision']
- name: PROXY_CONFIG
value: |
{"tracing":{}}
- name: ISTIO_META_POD_PORTS
value: |-
[
]
- name: ISTIO_META_APP_CONTAINERS
value: tensorflow-benchmarks
- name: ISTIO_META_CLUSTER_ID
value: Kubernetes
- name: ISTIO_META_INTERCEPTION_MODE
value: REDIRECT
- name: ISTIO_META_WORKLOAD_NAME
value: tensorflow-benchmarks-launcher
- name: ISTIO_META_OWNER
value: kubernetes://apis/batch/v1/namespaces/ping-wang/jobs/tensorflow-benchmarks-launcher
- name: ISTIO_META_MESH_ID
value: cluster.local
- name: TRUST_DOMAIN
value: cluster.local
image: gcr.io/istio-release/proxyv2:1.9.6
imagePullPolicy: Always
name: istio-proxy
ports:
- containerPort: 15090
name: http-envoy-prom
protocol: TCP
readinessProbe:
failureThreshold: 30
httpGet:
path: /healthz/ready
port: 15021
scheme: HTTP
initialDelaySeconds: 1
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 3
resources:
limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 10m
memory: 40Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
readOnlyRootFilesystem: true
runAsGroup: 1337
runAsNonRoot: true
runAsUser: 1337
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/istio
name: istiod-ca-cert
- mountPath: /var/lib/istio/data
name: istio-data
- mountPath: /etc/istio/proxy
name: istio-envoy
- mountPath: /var/run/secrets/tokens
name: istio-token
- mountPath: /etc/istio/pod
name: istio-podinfo
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-jsmkb
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: tensorflow-benchmarks-launcher
initContainers:
- args:
- istio-iptables
- -p
- "15001"
- -z
- "15006"
- -u
- "1337"
- -m
- REDIRECT
- -i
- '*'
- -x
- ""
- -b
- '*'
- -d
- 15090,15021,15020
image: gcr.io/istio-release/proxyv2:1.9.6
imagePullPolicy: Always
name: istio-init
resources:
limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 10m
memory: 40Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
add:
- NET_ADMIN
- NET_RAW
drop:
- ALL
privileged: false
readOnlyRootFilesystem: false
runAsGroup: 0
runAsNonRoot: false
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-jsmkb
readOnly: true
nodeName: node-10-120-220-137
priority: 0
restartPolicy: OnFailure
schedulerName: default-scheduler
securityContext:
fsGroup: 1337
serviceAccount: default
serviceAccountName: default
subdomain: tensorflow-benchmarks-worker
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- emptyDir:
medium: Memory
name: istio-envoy
- emptyDir: {}
name: istio-data
- downwardAPI:
defaultMode: 420
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.labels
path: labels
- fieldRef:
apiVersion: v1
fieldPath: metadata.annotations
path: annotations
- path: cpu-limit
resourceFieldRef:
containerName: istio-proxy
divisor: 1m
resource: limits.cpu
- path: cpu-request
resourceFieldRef:
containerName: istio-proxy
divisor: 1m
resource: requests.cpu
name: istio-podinfo
- name: istio-token
projected:
defaultMode: 420
sources:
- serviceAccountToken:
audience: istio-ca
expirationSeconds: 43200
path: istio-token
- configMap:
defaultMode: 420
name: istio-ca-root-cert
name: istiod-ca-cert
- name: ssh-auth
secret:
defaultMode: 384
items:
- key: ssh-privatekey
path: id_rsa
- key: ssh-publickey
path: id_rsa.pub
- key: ssh-publickey
path: authorized_keys
secretName: tensorflow-benchmarks-ssh
- configMap:
defaultMode: 420
items:
- key: hostfile
mode: 292
path: hostfile
- key: discover_hosts.sh
mode: 365
path: discover_hosts.sh
name: tensorflow-benchmarks-config
name: mpi-job-config
- name: default-token-jsmkb
secret:
defaultMode: 420
secretName: default-token-jsmkb
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2021-09-30T01:05:16Z"
status: "True"
type: Initialized
- lastProbeTime: null
lastTransitionTime: "2021-09-30T01:05:13Z"
message: 'containers with unready status: [tensorflow-benchmarks]'
reason: ContainersNotReady
status: "False"
type: Ready
- lastProbeTime: null
lastTransitionTime: "2021-09-30T01:05:13Z"
message: 'containers with unready status: [tensorflow-benchmarks]'
reason: ContainersNotReady
status: "False"
type: ContainersReady
- lastProbeTime: null
lastTransitionTime: "2021-09-30T01:05:13Z"
status: "True"
type: PodScheduled
containerStatuses:
- containerID: docker://ab162a1bfd50198374be638378a697307e3408662795d42c8fd1fbbd6fb828d9
image: docker-ecr001.rnd.gic.ericsson.se/istio/proxyv2:1.9.6
imageID: docker-pullable://docker-ecr001.rnd.gic.ericsson.se/istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
lastState: {}
name: istio-proxy
ready: true
restartCount: 0
started: true
state:
running:
startedAt: "2021-09-30T01:05:17Z"
- containerID: docker://44003d40af0b468db0d35cee8986d39197efe592328c8c885f7b1e6d2b3b91bc
image: mpioperator/tensorflow-benchmarks:latest
imageID: docker-pullable://mpioperator/tensorflow-benchmarks@sha256:476eb9df7a348a722f3c4e5e15e6c6f3fe9ed29749e8be98ac9447df3b1b5a54
lastState:
terminated:
containerID: docker://44003d40af0b468db0d35cee8986d39197efe592328c8c885f7b1e6d2b3b91bc
exitCode: 255
finishedAt: "2021-09-30T01:06:43Z"
reason: Error
startedAt: "2021-09-30T01:06:43Z"
name: tensorflow-benchmarks
ready: false
restartCount: 4
started: false
state:
waiting:
message: back-off 1m20s restarting failed container=tensorflow-benchmarks
pod=tensorflow-benchmarks-launcher-5jb5n_ping-wang(007af6b3-fda2-485d-8f83-f0d829732764)
reason: CrashLoopBackOff
hostIP: 10.120.220.137
initContainerStatuses:
- containerID: docker://8cf9f8f27f0df9f0f704b16b735eabef6ddf24f7dd8f8b2eaa824986f5d46fba
image: docker-ecr001.rnd.gic.ericsson.se/istio/proxyv2:1.9.6
imageID: docker-pullable://docker-ecr001.rnd.gic.ericsson.se/istio/proxyv2@sha256:87a9db561d2ef628deea7a4cbd0adf008a2f64355a2796e3b840d445b7e9cd3e
lastState: {}
name: istio-init
ready: true
restartCount: 0
state:
terminated:
containerID: docker://8cf9f8f27f0df9f0f704b16b735eabef6ddf24f7dd8f8b2eaa824986f5d46fba
exitCode: 0
finishedAt: "2021-09-30T01:05:15Z"
reason: Completed
startedAt: "2021-09-30T01:05:15Z"
phase: Running
podIP: 10.42.198.115
podIPs:
- ip: 10.42.198.115
qosClass: Burstable
startTime: "2021-09-30T01:05:13Z"
I wonder if this has something to do with the security context:
securityContext:
fsGroup: 1337
Perhaps that's changing the permissions of the volume mounted in /root/.ssh.
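For reference, here is a minimal sketch (hypothetical pod and secret names, not the operator-generated spec) of the interaction I mean: even when the secret volume asks for defaultMode 0600, setting fsGroup makes the kubelet chown the mounted files to that group and add group read, so sshd sees 0640 and rejects the key.
apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-ssh-demo            # hypothetical name, for illustration only
spec:
  securityContext:
    fsGroup: 1337                   # the value seen in the launcher pod above
  containers:
  - name: main
    image: busybox
    command: ["ls", "-l", "/root/.ssh"]   # shows -rw-r----- (0640) instead of 0600
    volumeMounts:
    - mountPath: /root/.ssh
      name: ssh-auth
  volumes:
  - name: ssh-auth
    secret:
      secretName: some-ssh-secret   # hypothetical secret holding id_rsa
      defaultMode: 384              # 0600 in octal, same as the operator sets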
Can you try running a different sample? https://github.com/kubeflow/mpi-operator/blob/master/examples/pi/pi.yaml
This sample runs as non-root, which is a good thing in general. Perhaps there will be no way to support running as root when using istio.
I haven't tested running the tensorflow-benchmarks as non-root. You might need to add the .ssh_config file like we do here: https://github.com/kubeflow/mpi-operator/blob/master/examples/base/Dockerfile#L28
The root cause does indeed seem to be the fsGroup. Here is the upstream issue: kubernetes/kubernetes#57923
And there is a proposal for a fix that hasn't been started yet: kubernetes/enhancements#2605
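Until that is fixed upstream, one workaround that might be worth trying (just a sketch, untested in this setup, and it would need to be expressible through the MPIJob pod template) is to copy the keys out of the secret volume into an emptyDir with an init container, fix the mode there, and mount that copy at /root/.ssh:
# Sketch only; container/volume names other than ssh-auth are hypothetical.
initContainers:
- name: fix-ssh-perms
  image: busybox
  command: ["sh", "-c", "cp /ssh-secret/* /ssh-fixed/ && chmod 600 /ssh-fixed/id_rsa"]
  volumeMounts:
  - mountPath: /ssh-secret
    name: ssh-auth              # the secret volume created by the operator
  - mountPath: /ssh-fixed
    name: ssh-fixed
containers:
- name: tensorflow-benchmarks
  volumeMounts:
  - mountPath: /root/.ssh
    name: ssh-fixed             # use the fixed copy instead of the secret mount
volumes:
- name: ssh-fixed
  emptyDir: {}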
@celiawa, could you try the other sample?
/retitle Can't run as root when using istio
Hi @alculquicondor, sorry for the late response; I was on holiday the last few days. I tried the sample https://github.com/kubeflow/mpi-operator/blob/master/examples/pi/pi.yaml in the istio-injection namespace. The launcher still fails:
$ k turing001 get po -n mynamespace
NAME READY STATUS RESTARTS AGE
pi-launcher-lb8bl 1/2 CrashLoopBackOff 5 4m59s
pi-worker-0 2/2 Running 0 5m
pi-worker-1 2/2 Running 0 5m
$ k turing001 -n mynamespace logs pi-launcher-lb8bl
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/pi-launcher-lb8bl. Please use `kubectl.kubernetes.io/default-container` instead
Warning: Permanently added 'pi-worker-0.pi-worker,10.42.116.83' (ECDSA) to the list of known hosts.
Warning: Permanently added 'pi-worker-1.pi-worker,10.42.88.141' (ECDSA) to the list of known hosts.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: pi-launcher
target node: pi-worker-1.pi-worker
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
Can you provide more information, like the worker logs and the YAML for the launcher and workers? Also consider adding the environment variable OMPI_MCA_orte_debug with value true to the launcher container.
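For example, in the MPIJob spec, roughly like this (a sketch of just the relevant fragment):
spec:
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: mpi-launcher
            image: mpioperator/mpi-pi
            env:
            - name: OMPI_MCA_orte_debug   # enables verbose ORTE daemon output
              value: "true"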
Hi, please refer to the info below. The logs were captured with OMPI_MCA_orte_debug set to true.
YAML for the launcher:
$ k turing001 -n mynamesoace get po pi-launcher-n5p27 -oyaml
apiVersion: v1
kind: Pod
metadata:
annotations:
cni.projectcalico.org/podIP: 10.42.198.107/32
cni.projectcalico.org/podIPs: 10.42.198.107/32
kubectl.kubernetes.io/default-logs-container: mpi-launcher
prometheus.io/path: /stats/prometheus
prometheus.io/port: "15020"
prometheus.io/scrape: "true"
sidecar.istio.io/status: '{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null}'
creationTimestamp: "2021-10-09T05:32:14Z"
generateName: pi-launcher-
labels:
controller-uid: 4230ce63-b837-4fb0-a852-2bd77c90f080
istio.io/rev: default
job-name: pi-launcher
security.istio.io/tlsMode: istio
service.istio.io/canonical-name: pi-launcher
service.istio.io/canonical-revision: latest
training.kubeflow.org/job-name: pi
training.kubeflow.org/job-role: launcher
training.kubeflow.org/operator-name: mpi-operator
name: pi-launcher-n5p27
namespace: ping-wang
ownerReferences:
- apiVersion: batch/v1
blockOwnerDeletion: true
controller: true
kind: Job
name: pi-launcher
uid: 4230ce63-b837-4fb0-a852-2bd77c90f080
resourceVersion: "270665573"
selfLink: /api/v1/namespaces/ping-wang/pods/pi-launcher-n5p27
uid: f5373e7f-4a2e-4f23-9005-f1fdcb80aa86
spec:
containers:
- args:
- -n
- "2"
- /home/mpiuser/pi
command:
- mpirun
env:
- name: OMPI_MCA_orte_debug
value: "true"
- name: K_MPI_JOB_ROLE
value: launcher
- name: OMPI_MCA_orte_keep_fqdn_hostnames
value: "true"
- name: OMPI_MCA_orte_default_hostfile
value: /etc/mpi/hostfile
- name: OMPI_MCA_plm_rsh_args
value: -o ConnectionAttempts=10
- name: OMPI_MCA_orte_set_default_slots
value: "1"
- name: NVIDIA_VISIBLE_DEVICES
- name: NVIDIA_DRIVER_CAPABILITIES
image: mpioperator/mpi-pi
imagePullPolicy: Always
name: mpi-launcher
resources:
limits:
cpu: "1"
memory: 1Gi
requests:
cpu: "1"
memory: 1Gi
securityContext:
runAsUser: 1000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /home/mpiuser/.ssh
name: ssh-auth
- mountPath: /etc/mpi
name: mpi-job-config
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-jsmkb
readOnly: true
- args:
- proxy
- sidecar
- --domain
- $(POD_NAMESPACE).svc.cluster.local
- --serviceCluster
- pi-launcher.ping-wang
- --proxyLogLevel=warning
- --proxyComponentLogLevel=misc:error
- --log_output_level=default:info
- --concurrency
- "2"
env:
- name: JWT_POLICY
value: third-party-jwt
- name: PILOT_CERT_PROVIDER
value: istiod
- name: CA_ADDR
value: istiod.istio-system.svc:15012
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: INSTANCE_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: SERVICE_ACCOUNT
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.serviceAccountName
- name: HOST_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
- name: CANONICAL_SERVICE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['service.istio.io/canonical-name']
- name: CANONICAL_REVISION
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['service.istio.io/canonical-revision']
- name: PROXY_CONFIG
value: |
{"tracing":{}}
- name: ISTIO_META_POD_PORTS
value: |-
[
]
- name: ISTIO_META_APP_CONTAINERS
value: mpi-launcher
- name: ISTIO_META_CLUSTER_ID
value: Kubernetes
- name: ISTIO_META_INTERCEPTION_MODE
value: REDIRECT
- name: ISTIO_META_WORKLOAD_NAME
value: pi-launcher
- name: ISTIO_META_OWNER
value: kubernetes://apis/batch/v1/namespaces/ping-wang/jobs/pi-launcher
- name: ISTIO_META_MESH_ID
value: cluster.local
- name: TRUST_DOMAIN
value: cluster.local
image: gcr.io/istio-release/proxyv2:1.9.6
imagePullPolicy: Always
name: istio-proxy
ports:
- containerPort: 15090
name: http-envoy-prom
protocol: TCP
readinessProbe:
failureThreshold: 30
httpGet:
path: /healthz/ready
port: 15021
scheme: HTTP
initialDelaySeconds: 1
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 3
resources:
limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 10m
memory: 40Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
readOnlyRootFilesystem: true
runAsGroup: 1337
runAsNonRoot: true
runAsUser: 1337
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/istio
name: istiod-ca-cert
- mountPath: /var/lib/istio/data
name: istio-data
- mountPath: /etc/istio/proxy
name: istio-envoy
- mountPath: /var/run/secrets/tokens
name: istio-token
- mountPath: /etc/istio/pod
name: istio-podinfo
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-jsmkb
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: pi-launcher
initContainers:
- args:
- istio-iptables
- -p
- "15001"
- -z
- "15006"
- -u
- "1337"
- -m
- REDIRECT
- -i
- '*'
- -x
- ""
- -b
- '*'
- -d
- 15090,15021,15020
image: gcr.io/istio-release/proxyv2:1.9.6
imagePullPolicy: Always
name: istio-init
resources:
limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 10m
memory: 40Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
add:
- NET_ADMIN
- NET_RAW
drop:
- ALL
privileged: false
readOnlyRootFilesystem: false
runAsGroup: 0
runAsNonRoot: false
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-jsmkb
readOnly: true
nodeName: node-10-120-220-137
priority: 0
restartPolicy: OnFailure
schedulerName: default-scheduler
securityContext:
fsGroup: 1337
serviceAccount: default
serviceAccountName: default
subdomain: pi-worker
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- emptyDir:
medium: Memory
name: istio-envoy
- emptyDir: {}
name: istio-data
- downwardAPI:
defaultMode: 420
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.labels
path: labels
- fieldRef:
apiVersion: v1
fieldPath: metadata.annotations
path: annotations
- path: cpu-limit
resourceFieldRef:
containerName: istio-proxy
divisor: 1m
resource: limits.cpu
- path: cpu-request
resourceFieldRef:
containerName: istio-proxy
divisor: 1m
resource: requests.cpu
name: istio-podinfo
- name: istio-token
projected:
defaultMode: 420
sources:
- serviceAccountToken:
audience: istio-ca
expirationSeconds: 43200
path: istio-token
- configMap:
defaultMode: 420
name: istio-ca-root-cert
name: istiod-ca-cert
- name: ssh-auth
secret:
defaultMode: 420
items:
- key: ssh-privatekey
path: id_rsa
- key: ssh-publickey
path: id_rsa.pub
- key: ssh-publickey
path: authorized_keys
secretName: pi-ssh
- configMap:
defaultMode: 420
items:
- key: hostfile
mode: 292
path: hostfile
- key: discover_hosts.sh
mode: 365
path: discover_hosts.sh
name: pi-config
name: mpi-job-config
- name: default-token-jsmkb
secret:
defaultMode: 420
secretName: default-token-jsmkb
YAML for the worker:
$ k turing001 -n mynamespace get po pi-worker-0 -oyaml
apiVersion: v1
kind: Pod
metadata:
annotations:
cni.projectcalico.org/podIP: 10.42.116.68/32
cni.projectcalico.org/podIPs: 10.42.116.68/32
kubectl.kubernetes.io/default-logs-container: mpi-worker
prometheus.io/path: /stats/prometheus
prometheus.io/port: "15020"
prometheus.io/scrape: "true"
sidecar.istio.io/status: '{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-ca-cert"],"imagePullSecrets":null}'
creationTimestamp: "2021-10-09T05:40:51Z"
labels:
istio.io/rev: default
security.istio.io/tlsMode: istio
service.istio.io/canonical-name: pi-worker-0
service.istio.io/canonical-revision: latest
training.kubeflow.org/job-name: pi
training.kubeflow.org/job-role: worker
training.kubeflow.org/operator-name: mpi-operator
training.kubeflow.org/replica-index: "0"
name: pi-worker-0
namespace: ping-wang
ownerReferences:
- apiVersion: kubeflow.org/v2beta1
blockOwnerDeletion: true
controller: true
kind: MPIJob
name: pi
uid: 88b20aa4-49ce-4625-985a-4c5cf2dabb91
resourceVersion: "270678724"
selfLink: /api/v1/namespaces/ping-wang/pods/pi-worker-0
uid: cd1f2e0c-7208-46aa-9719-3197a94b9ae6
spec:
containers:
- args:
- -De
- -f
- /home/mpiuser/.sshd_config
command:
- /usr/sbin/sshd
env:
- name: OMPI_MCA_orte_debug
value: "true"
- name: K_MPI_JOB_ROLE
value: worker
image: mpioperator/mpi-pi
imagePullPolicy: Always
name: mpi-worker
resources:
limits:
cpu: "1"
memory: 1Gi
requests:
cpu: "1"
memory: 1Gi
securityContext:
runAsUser: 1000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /home/mpiuser/.ssh
name: ssh-auth
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-jsmkb
readOnly: true
- args:
- proxy
- sidecar
- --domain
- $(POD_NAMESPACE).svc.cluster.local
- --serviceCluster
- pi-worker-0.ping-wang
- --proxyLogLevel=warning
- --proxyComponentLogLevel=misc:error
- --log_output_level=default:info
- --concurrency
- "2"
env:
- name: JWT_POLICY
value: third-party-jwt
- name: PILOT_CERT_PROVIDER
value: istiod
- name: CA_ADDR
value: istiod.istio-system.svc:15012
- name: POD_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: INSTANCE_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.podIP
- name: SERVICE_ACCOUNT
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.serviceAccountName
- name: HOST_IP
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: status.hostIP
- name: CANONICAL_SERVICE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['service.istio.io/canonical-name']
- name: CANONICAL_REVISION
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.labels['service.istio.io/canonical-revision']
- name: PROXY_CONFIG
value: |
{"tracing":{}}
- name: ISTIO_META_POD_PORTS
value: |-
[
]
- name: ISTIO_META_APP_CONTAINERS
value: mpi-worker
- name: ISTIO_META_CLUSTER_ID
value: Kubernetes
- name: ISTIO_META_INTERCEPTION_MODE
value: REDIRECT
- name: ISTIO_META_WORKLOAD_NAME
value: pi-worker-0
- name: ISTIO_META_OWNER
value: kubernetes://apis/v1/namespaces/ping-wang/pods/pi-worker-0
- name: ISTIO_META_MESH_ID
value: cluster.local
- name: TRUST_DOMAIN
value: cluster.local
image: gcr.io/istio-release/proxyv2:1.9.6
imagePullPolicy: Always
name: istio-proxy
ports:
- containerPort: 15090
name: http-envoy-prom
protocol: TCP
readinessProbe:
failureThreshold: 30
httpGet:
path: /healthz/ready
port: 15021
scheme: HTTP
initialDelaySeconds: 1
periodSeconds: 2
successThreshold: 1
timeoutSeconds: 3
resources:
limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 10m
memory: 40Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
readOnlyRootFilesystem: true
runAsGroup: 1337
runAsNonRoot: true
runAsUser: 1337
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/istio
name: istiod-ca-cert
- mountPath: /var/lib/istio/data
name: istio-data
- mountPath: /etc/istio/proxy
name: istio-envoy
- mountPath: /var/run/secrets/tokens
name: istio-token
- mountPath: /etc/istio/pod
name: istio-podinfo
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-jsmkb
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: pi-worker-0
initContainers:
- args:
- istio-iptables
- -p
- "15001"
- -z
- "15006"
- -u
- "1337"
- -m
- REDIRECT
- -i
- '*'
- -x
- ""
- -b
- '*'
- -d
- 15090,15021,15020
image: gcr.io/istio-release/proxyv2:1.9.6
imagePullPolicy: Always
name: istio-init
resources:
limits:
cpu: "2"
memory: 1Gi
requests:
cpu: 10m
memory: 40Mi
securityContext:
allowPrivilegeEscalation: false
capabilities:
add:
- NET_ADMIN
- NET_RAW
drop:
- ALL
privileged: false
readOnlyRootFilesystem: false
runAsGroup: 0
runAsNonRoot: false
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-jsmkb
readOnly: true
nodeName: node-10-120-220-132
priority: 0
restartPolicy: Never
schedulerName: default-scheduler
securityContext:
fsGroup: 1337
serviceAccount: default
serviceAccountName: default
subdomain: pi-worker
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- emptyDir:
medium: Memory
name: istio-envoy
- emptyDir: {}
name: istio-data
- downwardAPI:
defaultMode: 420
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.labels
path: labels
- fieldRef:
apiVersion: v1
fieldPath: metadata.annotations
path: annotations
- path: cpu-limit
resourceFieldRef:
containerName: istio-proxy
divisor: 1m
resource: limits.cpu
- path: cpu-request
resourceFieldRef:
containerName: istio-proxy
divisor: 1m
resource: requests.cpu
name: istio-podinfo
- name: istio-token
projected:
defaultMode: 420
sources:
- serviceAccountToken:
audience: istio-ca
expirationSeconds: 43200
path: istio-token
- configMap:
defaultMode: 420
name: istio-ca-root-cert
name: istiod-ca-cert
- name: ssh-auth
secret:
defaultMode: 420
items:
- key: ssh-privatekey
path: id_rsa
- key: ssh-publickey
path: id_rsa.pub
- key: ssh-publickey
path: authorized_keys
secretName: pi-ssh
- name: default-token-jsmkb
secret:
defaultMode: 420
secretName: default-token-jsmkb
Launcher logs
k turing001 -n mynamespace logs pi-launcher-7b75v
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/pi-launcher-7b75v. Please use `kubectl.kubernetes.io/default-container` instead
[pi-launcher:00001] procdir: /tmp/ompi.pi-launcher.1000/pid.1/0/0
[pi-launcher:00001] jobdir: /tmp/ompi.pi-launcher.1000/pid.1/0
[pi-launcher:00001] top: /tmp/ompi.pi-launcher.1000/pid.1
[pi-launcher:00001] top: /tmp/ompi.pi-launcher.1000
[pi-launcher:00001] tmp: /tmp
[pi-launcher:00001] sess_dir_cleanup: job session dir does not exist
[pi-launcher:00001] sess_dir_cleanup: top session dir does not exist
[pi-launcher:00001] procdir: /tmp/ompi.pi-launcher.1000/pid.1/0/0
[pi-launcher:00001] jobdir: /tmp/ompi.pi-launcher.1000/pid.1/0
[pi-launcher:00001] top: /tmp/ompi.pi-launcher.1000/pid.1
[pi-launcher:00001] top: /tmp/ompi.pi-launcher.1000
[pi-launcher:00001] tmp: /tmp
Warning: Permanently added 'pi-worker-1.pi-worker,10.42.130.76' (ECDSA) to the list of known hosts.
Warning: Permanently added 'pi-worker-0.pi-worker,10.42.116.68' (ECDSA) to the list of known hosts.
[pi-worker-1:00026] procdir: /tmp/ompi.pi-worker-1.1000/jf.49480/0/2
[pi-worker-1:00026] jobdir: /tmp/ompi.pi-worker-1.1000/jf.49480/0
[pi-worker-1:00026] top: /tmp/ompi.pi-worker-1.1000/jf.49480
[pi-worker-1:00026] top: /tmp/ompi.pi-worker-1.1000
[pi-worker-1:00026] tmp: /tmp
[pi-worker-1:00026] sess_dir_cleanup: job session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: top session dir does not exist
[pi-worker-1:00026] procdir: /tmp/ompi.pi-worker-1.1000/jf.49480/0/2
[pi-worker-1:00026] jobdir: /tmp/ompi.pi-worker-1.1000/jf.49480/0
[pi-worker-1:00026] top: /tmp/ompi.pi-worker-1.1000/jf.49480
[pi-worker-1:00026] top: /tmp/ompi.pi-worker-1.1000
[pi-worker-1:00026] tmp: /tmp
[pi-worker-0:00027] procdir: /tmp/ompi.pi-worker-0.1000/jf.49480/0/1
[pi-worker-0:00027] jobdir: /tmp/ompi.pi-worker-0.1000/jf.49480/0
[pi-worker-0:00027] top: /tmp/ompi.pi-worker-0.1000/jf.49480
[pi-worker-0:00027] top: /tmp/ompi.pi-worker-0.1000
[pi-worker-0:00027] tmp: /tmp
[pi-worker-0:00027] sess_dir_cleanup: job session dir does not exist
[pi-worker-0:00027] sess_dir_cleanup: top session dir does not exist
[pi-worker-0:00027] procdir: /tmp/ompi.pi-worker-0.1000/jf.49480/0/1
[pi-worker-0:00027] jobdir: /tmp/ompi.pi-worker-0.1000/jf.49480/0
[pi-worker-0:00027] top: /tmp/ompi.pi-worker-0.1000/jf.49480
[pi-worker-0:00027] top: /tmp/ompi.pi-worker-0.1000
[pi-worker-0:00027] tmp: /tmp
[pi-worker-1:00026] sess_dir_finalize: proc session dir does not exist
[pi-worker-1:00026] sess_dir_finalize: job session dir does not exist
[pi-worker-1:00026] sess_dir_finalize: jobfam session dir does not exist
[pi-worker-1:00026] sess_dir_finalize: jobfam session dir does not exist
[pi-worker-1:00026] sess_dir_finalize: top session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: job session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: top session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: job session dir does not exist
[pi-worker-1:00026] sess_dir_cleanup: top session dir does not exist
exiting with status 1
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
-
not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
-
lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
-
the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
-
compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
-
an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: pi-launcher
target node: pi-worker-0.pi-worker
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
[pi-launcher:00001] Job UNKNOWN has launched
[pi-launcher:00001] [[49480,0],0] Releasing job data for [49480,1]
[pi-launcher:00001] sess_dir_finalize: proc session dir does not exist
[pi-launcher:00001] sess_dir_finalize: job session dir does not exist
[pi-launcher:00001] sess_dir_finalize: jobfam session dir does not exist
[pi-launcher:00001] sess_dir_finalize: jobfam session dir does not exist
[pi-launcher:00001] sess_dir_finalize: top session dir does not exist
[pi-launcher:00001] sess_dir_cleanup: job session dir does not exist
[pi-launcher:00001] sess_dir_cleanup: top session dir does not exist
[pi-launcher:00001] [[49480,0],0] Releasing job data for [49480,0]
[pi-launcher:00001] sess_dir_cleanup: job session dir does not exist
[pi-launcher:00001] sess_dir_cleanup: top session dir does not exist
exiting with status 1
Worker logs
$ k turing001 -n mynamespace logs pi-worker-0
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/pi-worker-0. Please use `kubectl.kubernetes.io/default-container` instead
Server listening on 0.0.0.0 port 22.
Server listening on :: port 22.
Accepted publickey for mpiuser from 10.42.198.120 port 33588 ssh2: ECDSA SHA256:wvQe8CVOl3BQ+7ig4VHapc8rLl6GNIWCiZKxJkFvPNU
Received disconnect from 10.42.198.120 port 33588:11: disconnected by user
Disconnected from user mpiuser 10.42.198.120 port 33588
Accepted publickey for mpiuser from 10.42.198.120 port 35306 ssh2: ECDSA SHA256:wvQe8CVOl3BQ+7ig4VHapc8rLl6GNIWCiZKxJkFvPNU
Received disconnect from 10.42.198.120 port 35306:11: disconnected by user
Disconnected from user mpiuser 10.42.198.120 port 35306
Accepted publickey for mpiuser from 10.42.198.120 port 38536 ssh2: ECDSA SHA256:wvQe8CVOl3BQ+7ig4VHapc8rLl6GNIWCiZKxJkFvPNU
Accepted publickey for mpiuser from 10.42.198.120 port 43876 ssh2: ECDSA SHA256:wvQe8CVOl3BQ+7ig4VHapc8rLl6GNIWCiZKxJkFvPNU
BTW, the sample https://github.com/kubeflow/mpi-operator/blob/master/examples/pi/pi.yaml worked well in the namespace without istio injection.
From the worker logs, it seems that:
- DNS resolution is working
- SSH is working: no file permission or handshake errors.
Maybe the orte protocol is failing to establish a connection over an istio service? Perhaps the TCP traffic needs to be authorized or something; I'm not at all familiar with istio.
@xhejtman you were also running over istio. Did you find issues similar to this? cc @ahg-g for ideas around istio networking.
@xhejtman you were also running over istio. Did you find issues similar to this?
No, but mainly because I don't run as root; I avoid it as much as possible. Istio had an issue with init containers, which is not the case here. But I can give it a try on our cluster.
@celiawa's latest attempt (https://github.com/kubeflow/mpi-operator/issues/429#issuecomment-939233000) is running as non-root but still fails. So I suspect that there is a different configuration in the istio proxies that is preventing the communication over the orte protocol from happening.
@alculquicondor I can confirm that the tensorflow benchmark does not work if run in an istio-enabled namespace. I ended up with: ssh_exchange_identification: Connection closed by remote host
It seems that the istio proxy is causing this, as I saw that it changes peer IPs.
Actually, the problem is that istio creates an HTTP proxy:
telnet tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker 2222
Trying 10.42.5.15...
Connected to tensorflow-benchmarks-worker-0.tensorflow-benchmarks-worker.xhejtman.svc.cluster.local.
Escape character is '^]'.
GET / HTTP/1.0
HTTP/1.1 426 Upgrade Required
date: Tue, 12 Oct 2021 21:07:45 GMT
server: istio-envoy
connection: close
content-length: 0
Connection closed by foreign host.
which obviously will never work with SSH. I'm not sure how to instruct istio to create a generic TCP proxy instead of an HTTP one.
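If I understand istio's protocol selection correctly, Envoy guesses the protocol from the Service port name or appProtocol, so one thing worth checking (a sketch, not verified in this cluster) is whether the worker Service declares the SSH port as plain TCP, e.g.:
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-benchmarks-worker   # the headless Service the workers sit behind
spec:
  clusterIP: None
  # selector and other fields omitted; the port declaration is the point of this sketch
  ports:
  - name: tcp-ssh       # the tcp- prefix makes istio treat this port as raw TCP
    port: 2222          # assuming sshd listens on 2222, as in the telnet test above
    protocol: TCP
    appProtocol: tcp    # alternative explicit protocol declaration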
That makes sense. Although in the case of this comment (https://github.com/kubeflow/mpi-operator/issues/429#issuecomment-939233000), it looks like SSH works (because of the line Accepted publickey for mpiuser from 10.42.198.120 port 33588 ssh2:).
But then it fails afterwards, maybe in the orte protocol.
Also, @xhejtman, does the mpioperator/mpi-pi image work for you under istio, or do you see the same problem? We made some changes there based on the debugging you provided. Or was that just about the security policy?
If no images work, I feel like this is a problem of istio misconfiguration, outside of our scope.
But another question is: why do you want istio for HPC? It would just slow down the job.
But another question is: why do you want istio for HPC? It would just slow down the job.
Just a note here: this is a bit of a catch-22, since the mpi-operator is used in Kubeflow (and is also hosted by Kubeflow), and Kubeflow itself requires istio to be installed. It does not work without istio (at least to my knowledge). So that is the reason many of us use istio.
The first problem with mpi-pi was that it used an init container that accessed the network, which does not work in an istio-enabled namespace. This is already fixed in mpi-operator. But there is a new problem with the istio HTTP proxy. Maybe a proper annotation can fix it? Maybe traffic.sidecar.istio.io/excludeInboundPorts: [2222], but maybe excludeOutboundPorts is needed as well? I will try to debug it a bit more. I tried setting the port protocol to TCP explicitly, but no luck.
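For completeness, the annotations I mean would go on the launcher/worker pod templates, roughly like this (a sketch; 2222 assumed to be the sshd port):
metadata:
  annotations:
    # bypass the Envoy sidecar for SSH traffic in both directions
    traffic.sidecar.istio.io/excludeInboundPorts: "2222"
    traffic.sidecar.istio.io/excludeOutboundPorts: "2222"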
@terrytangyuan is it possible to install kubeflow without istio? Is that something documented?
@alculquicondor I think other users have reported istio issues before. I'd search the kubeflow issues instead. Most of the existing MPI users use the standalone MPI Operator installation.
Now that we are merging the operator into the training operator, we should include in the docs that MPI only runs properly in non-istio namespaces.
Yes, it would definitely be good to add more docs on this. Let's leave this open to track that.
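One thing the docs could show as a concrete escape hatch (a sketch, assuming the cluster's injection webhook honors the standard annotation) is opting the MPIJob pods out of sidecar injection even in an injection-enabled namespace:
spec:
  mpiReplicaSpecs:
    Launcher:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"   # skip the Envoy sidecar for the launcher
    Worker:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"   # and for the workers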
@alculquicondor I'm facing the same issue. Is there any solution to the permission issue on istio?
I don't think anyone found a workaround.