pytorch-operator
Example PytorchJob is not starting
I set up Kubeflow v0.6.0 on Microk8s v1.17. After running kubectl create -f pytorch_job_mnist_gloo.yaml from the example, I can see that the PytorchJob is created, but no events are recorded for it and no new pods are created. Is this example still relevant?
Issue-Label Bot is automatically applying the labels:
Label | Probability |
---|---|
bug | 0.69 |
I have the same issue with Kubeflow v1.0 and Microk8s v1.18.
I also hit the same issue in my environment (Kubeflow v1.0.1 with Microk8s 1.15). The status is as follows.
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
pytorch-dist-mnist-gloo-master-0 0/1 Pending 0 27m
pytorch-dist-mnist-gloo-worker-0 0/1 Pending 0 27m
In my environment, the logs are as follows. I edited v1/pytorch_job_mnist_gloo.yaml to set the image to gcr.io/kubeflow-ci/pytorch_dist_mnist:latest and commented out the GPU lines.
Is anything else needed to run the sample?
$ kubectl logs pytorch-dist-mnist-gloo-master-0
Error from server (BadRequest): container "pytorch" in pod "pytorch-dist-mnist-gloo-master-0" is waiting to start: trying and failing to pull image
$ kubectl logs pytorch-dist-mnist-gloo-worker-0
Error from server (BadRequest): container "pytorch" in pod "pytorch-dist-mnist-gloo-worker-0" is waiting to start: PodInitializing
I edited the image attribute in pytorch_job_mnist_gloo.yaml, and now it works fine.
image: gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0
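For anyone who wants to script that change, here is a minimal sketch. It assumes the manifest keeps each image: key on its own line, and set_job_image is a hypothetical helper name, not something shipped with the operator. It rewrites every image: value in the file, so check the result if the manifest uses different images for master and worker.

```shell
# Replace every "image:" value in a job manifest with the given image reference.
# Hypothetical helper. Usage: set_job_image <manifest.yaml> <image>
set_job_image() {
  manifest="$1"
  image="$2"
  # Keep the original indentation (and a leading "- " if present).
  # "|" is used as the sed delimiter because image names contain "/".
  # A backup of the original file is written to <manifest>.bak.
  sed -i.bak -E "s|^([[:space:]-]*image:[[:space:]]*).*|\1${image}|" "$manifest"
}
```

After running set_job_image pytorch_job_mnist_gloo.yaml gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0, delete the old job with kubectl delete -f pytorch_job_mnist_gloo.yaml and create it again with kubectl create -f pytorch_job_mnist_gloo.yaml.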
You should check the PytorchJob status with
kubectl get events
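A slightly fuller version of that check, as a rough sketch. It assumes the master pod is named <job>-master-0, as in the kubectl get pods output earlier in this thread, and that the job lives in the namespace you pass in (falling back to default, which may not match your setup). diagnose_job is a hypothetical helper name.

```shell
# Print the job object, the master pod details, and recent events for one PytorchJob.
# Hypothetical helper. Usage: diagnose_job <job-name> [namespace]
diagnose_job() {
  job="$1"
  ns="${2:-default}"  # assumption: use "default" when no namespace is given
  # Status conditions reported by the operator for this job
  kubectl get pytorchjobs "$job" -n "$ns" -o yaml
  # Pod-level events, where image pull and scheduling failures usually show up
  kubectl describe pod "${job}-master-0" -n "$ns"
  # All events in the namespace, oldest first
  kubectl get events -n "$ns" --sort-by=.metadata.creationTimestamp
}
```

For the job in this thread that would be diagnose_job pytorch-dist-mnist-gloo. The describe output is more useful than kubectl logs here, because the container has not started yet and has no logs to show.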
After running docker build, the image that gets created is on the local registry. This image should be loaded into the microk8s cluster before creating the job. The instructions for this are available here.
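In case the link is not handy, the load step might look roughly like this. It is a sketch that assumes a MicroK8s release where the microk8s ctr image import subcommand is available (older releases expose the command as microk8s.ctr instead). load_image_into_microk8s is a hypothetical helper name.

```shell
# Export a locally built Docker image and import it into MicroK8s' containerd.
# Hypothetical helper. Usage: load_image_into_microk8s <image:tag>
load_image_into_microk8s() {
  image="$1"
  tarball="$(mktemp)"
  # Save the image from the local Docker daemon, then import the tarball;
  # the import is skipped if the save fails.
  docker save "$image" -o "$tarball" && microk8s ctr image import "$tarball"
  status=$?
  rm -f "$tarball"
  return "$status"
}
```

Note that Kubernetes defaults imagePullPolicy to Always for images tagged :latest, so the pod will still try to pull from the remote registry unless you use another tag or set imagePullPolicy to IfNotPresent or Never in the job spec.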
Thank you for your suggestion. I expected the example to work just by following the steps in README.md. I hope the documentation gets updated to say that the image needs to be edited: