
PyTorchJob worker pods crashloop in non-default namespace

Open jobvarkey opened this issue 5 years ago • 7 comments

Hello,

I am running Kubernetes v1.15.7 and Kubeflow 0.7.0 on a 6-node on-prem cluster; each node has 2 GPUs.

The provided mnist.py works fine when run in the default namespace (kubectl apply -f pytorch_job_mnist_gloo.yaml).

But the worker pod(s) crash-loop when the job is submitted to a non-default namespace (for example: kubectl apply -f pytorch_job_mnist_gloo.yaml -n i70994). The master pod is in the Running state.

    root@0939-jdeml-m01:/tmp# kubectl get pods -n i70994
    NAME                               READY   STATUS             RESTARTS   AGE
    jp-nb1-0                           2/2     Running            0          18h
    pytorch-dist-mnist-gloo-master-0   2/2     Running            1          33m
    pytorch-dist-mnist-gloo-worker-0   1/2     CrashLoopBackOff   11         33m

kubectl_describe_pod_pytorch-dist-mnist-gloo-master-0.txt
kubectl_describe_pod_pytorch-dist-mnist-gloo-worker-0.txt
kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_istio-system.txt
kubectl_logs_pytorch-dist-mnist-gloo-worker-0_container_pytorch.txt

pytorch_job_mnist_gloo.yaml.txt
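For reference, the describe/logs attachments above can be regenerated with commands along these lines (pod, namespace, and container names are the ones appearing in this issue; the sidecar container is typically named istio-proxy):

    kubectl describe pod pytorch-dist-mnist-gloo-master-0 -n i70994
    kubectl describe pod pytorch-dist-mnist-gloo-worker-0 -n i70994
    kubectl logs pytorch-dist-mnist-gloo-worker-0 -n i70994 -c istio-proxy   # Istio sidecar logs
    kubectl logs pytorch-dist-mnist-gloo-worker-0 -n i70994 -c pytorch       # training container logs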

Can anyone please help with this issue?

Thanks, Job

jobvarkey avatar Feb 05 '20 15:02 jobvarkey

Issue-Label Bot is automatically applying the labels:

Label Probability
bug 0.74


issue-label-bot[bot] avatar Feb 05 '20 15:02 issue-label-bot[bot]

It seems that the Istio proxy is injected into the training pods. Are you running the job in the kubeflow namespace?

gaocegege avatar Feb 06 '20 01:02 gaocegege

The job is running in namespace 'i70994'. This namespace was created when I logged in to the Kubeflow UI for the first time. Thanks

jobvarkey avatar Feb 06 '20 04:02 jobvarkey

Can you show me the result of kubectl describe ns i70994?

gaocegege avatar Feb 06 '20 06:02 gaocegege

    root@0939-jdeml-m01:~# kubectl describe ns i70994
    Name:         i70994
    Labels:       istio-injection=enabled
                  katib-metricscollector-injection=enabled
                  serving.kubeflow.org/inferenceservice=enabled
    Annotations:  owner: [email protected]
    Status:       Active

    No resource quota.

    No resource limits.

jobvarkey avatar Feb 06 '20 14:02 jobvarkey

Hi @jobvarkey, I guess the cause is istio-injection being enabled on your namespace. Could you try appending the code below to the template section in pytorch_job_mnist_gloo.yaml? It disables Istio sidecar injection for your PyTorchJob.

        metadata:
          annotations:
            sidecar.istio.io/inject: "false"

see: https://istio.io/docs/setup/additional-setup/sidecar-injection/
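For context, here is a minimal sketch of where that annotation sits in the full manifest. The field layout follows the standard PyTorchJob spec; the apiVersion, image, and args below are placeholders rather than the exact contents of pytorch_job_mnist_gloo.yaml, so adjust them to your environment:

    apiVersion: kubeflow.org/v1        # may be v1beta2 on older pytorch-operator releases
    kind: PyTorchJob
    metadata:
      name: pytorch-dist-mnist-gloo
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"   # keep the Istio proxy out of the master pod
            spec:
              containers:
                - name: pytorch                    # pytorch-operator expects this container name
                  image: <your-mnist-image>        # placeholder
                  args: ["--backend", "gloo"]
        Worker:
          replicas: 1
          restartPolicy: OnFailure
          template:
            metadata:
              annotations:
                sidecar.istio.io/inject: "false"   # same annotation on the worker template
            spec:
              containers:
                - name: pytorch
                  image: <your-mnist-image>        # placeholder
                  args: ["--backend", "gloo"]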

But I don't know if this change affects anything else. Could anyone explain?

With this change, I was able to run mnist_gloo.

636 avatar Apr 08 '20 04:04 636

> But I don't know if this change affects anything else. Could anyone explain?

This comment provides details: https://github.com/kubeflow/kubeflow/issues/4935#issuecomment-615256808

Basically, if Istio sidecar injection is disabled, ANY pod within the cluster can access your PyTorchJob pods by pod name without mTLS, e.g., pytorch-dist-mnist-gloo-master-0.<namespace>.
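As a rough illustration (assuming the per-replica Service names that pytorch-operator creates, and using busybox purely as a throwaway client), any pod in the cluster can then resolve, and without mTLS reach, that name:

    # run from anywhere in the cluster; names taken from this issue
    kubectl run nettest -it --rm --restart=Never --image=busybox -- \
      nslookup pytorch-dist-mnist-gloo-master-0.i70994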

By default, when running a PyTorchJob in a user namespace profile that has Istio sidecar injection enabled, the worker pods fail with an error like RuntimeError: Connection reset by peer.

shawnzhu avatar Jun 23 '20 02:06 shawnzhu