pytorch-operator

Example PytorchJob is not starting

Open natalytvinova opened this issue 4 years ago • 8 comments

I set up Kubeflow v0.6.0 on MicroK8s v1.17. After running kubectl create -f pytorch_job_mnist_gloo.yaml from the example, I can see that the PyTorchJob was created, but no events are recorded for it and no new pods are created. Is this example still relevant?

natalytvinova avatar Mar 20 '20 13:03 natalytvinova

Issue-Label Bot is automatically applying the labels:

Label Probability
bug 0.69


issue-label-bot[bot] avatar Mar 20 '20 13:03 issue-label-bot[bot]

I have the same issue with Kubeflow v1.0 and Microk8s v1.18

maartenpants avatar Mar 31 '20 18:03 maartenpants

I hit the same issue in my environment (Kubeflow v1.0.1 with MicroK8s 1.15). The status is as follows.

$ kubectl get pods 
NAME                               READY   STATUS    RESTARTS   AGE
pytorch-dist-mnist-gloo-master-0   0/1     Pending   0          27m
pytorch-dist-mnist-gloo-worker-0   0/1     Pending   0          27m

sakaia avatar Apr 20 '20 01:04 sakaia

In my environment, I edited v1/pytorch_job_mnist_gloo.yaml to use gcr.io/kubeflow-ci/pytorch_dist_mnist:latest as the image and commented out the GPU settings. The logs are as follows. Is anything else needed to run the sample?

$ kubectl logs pytorch-dist-mnist-gloo-master-0
Error from server (BadRequest): container "pytorch" in pod "pytorch-dist-mnist-gloo-master-0" is waiting to start: trying and failing to pull image
$ kubectl logs pytorch-dist-mnist-gloo-worker-0
Error from server (BadRequest): container "pytorch" in pod "pytorch-dist-mnist-gloo-worker-0" is waiting to start: PodInitializing

sakaia avatar Apr 21 '20 07:04 sakaia

I edited the image attribute in pytorch_job_mnist_gloo.yaml, and it now works fine.

image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
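The same edit can be scripted. This is only a rough sketch: set_job_image is a hypothetical helper name, not something shipped with pytorch-operator, and it assumes each container image in the manifest is declared on its own indented "image: ..." line.

```shell
#!/bin/sh
# Hypothetical helper: point every "image:" line of a manifest at a new image.
# Assumes each image is declared on its own indented "image: ..." line.
set_job_image() {
    manifest="$1"
    new_image="$2"
    tmp="${manifest}.tmp"
    # Keep the original indentation and key, replace only the value.
    sed "s|^\([[:space:]]*image:\).*|\1 ${new_image}|" "$manifest" > "$tmp" &&
        mv "$tmp" "$manifest"
}

# Usage (file name and tag are the ones mentioned in this thread):
# set_job_image pytorch_job_mnist_gloo.yaml gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
```

Note that this rewrites all image lines, so the master and worker replicas end up on the same image.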

sakaia avatar Apr 22 '20 04:04 sakaia

You should check the PyTorchJob status with:

kubectl get events
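For a pod that stays in Pending, it may help to pair the event listing with a describe of the pod itself. The sketch below only wraps two standard kubectl commands; diagnose_pod is a hypothetical helper name used for illustration.

```shell
#!/bin/sh
# Sketch: gather the usual evidence for a pod stuck in Pending.
# Nothing here is specific to pytorch-operator; the pod name is an argument.
diagnose_pod() {
    pod="$1"
    # Events of the current namespace in chronological order (newest last).
    kubectl get events --sort-by=.metadata.creationTimestamp
    # The Events section at the end of this output usually names the cause,
    # for example a failed image pull or an unschedulable resource request.
    kubectl describe pod "$pod"
}

# Usage, with a pod name taken from the listing earlier in this thread:
# diagnose_pod pytorch-dist-mnist-gloo-master-0
```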

sakaia avatar Apr 22 '20 04:04 sakaia

After running docker build, the image that gets created is on the local registry. This image should be loaded into the microk8s cluster before creating the job. The instructions for this are available here.
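The load step might look roughly like the sketch below. It assumes the "microk8s ctr" subcommand is available (older MicroK8s snaps expose it as "microk8s.ctr"), and both the helper name load_image_into_microk8s and the image tag in the usage line are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch: copy a locally built Docker image into MicroK8s' containerd store.
# Assumes "microk8s ctr" exists; older snaps use "microk8s.ctr" instead.
load_image_into_microk8s() {
    image="$1"
    archive="${2:-image.tar}"
    # Step 1: export the image from the local Docker daemon to a tar archive.
    # Step 2: import the archive so pods can use the image without pulling it.
    docker save -o "$archive" "$image" &&
        microk8s ctr image import "$archive"
}

# Usage (the tag stands in for whatever docker build produced):
# load_image_into_microk8s pytorch-dist-mnist:local
```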

jvujjini avatar Apr 30 '20 01:04 jvujjini

Thank you for your suggestion. My expectation was that the example would work just by following the steps in README.md. I hope the documentation will be updated to say that the image: field needs to be edited.

sakaia avatar Apr 30 '20 03:04 sakaia