
container "pytorch" is waiting to start: PodInitializing

Open gogogwwb opened this issue 4 years ago • 20 comments

When the master is finished running, the worker is still initializing.

[screenshot]

worker log: Error from server (BadRequest): container "pytorch" in pod "xxx-jxosi-worker-0" is waiting to start: PodInitializing

What is the reason for this?

gogogwwb avatar Aug 15 '21 09:08 gogogwwb

Could you please run kubectl describe and post the result here?
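For example (the pod name and namespace are placeholders for your own):

```sh
# Describe the stuck worker pod; the Events section at the bottom is the interesting part
kubectl describe pod xxx-jxosi-worker-0 -n <namespace>
```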

gaocegege avatar Aug 16 '21 02:08 gaocegege

Can you show more about it? Especially the events section.

gaocegege avatar Aug 16 '21 03:08 gaocegege

Seems that the init container is pending. Can you show its log?
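You can pull the init container's log directly with the `-c` flag (pod name, namespace, and init container name are placeholders):

```sh
# The worker pod's init container is shown as init-pytorch below
kubectl logs xxx-jxosi-worker-0 -c init-pytorch -n <namespace>
```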

gaocegege avatar Aug 16 '21 06:08 gaocegege

init-pytorch log: [screenshot]

gogogwwb avatar Aug 16 '21 06:08 gogogwwb

master svc: [screenshot]

gogogwwb avatar Aug 16 '21 06:08 gogogwwb

Can you try kubectl debug to run an ephemeral container, then run ping xxx-master-0?
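Something along these lines should work (pod and container names are placeholders; `--target` assumes the init container `init-pytorch` is still running):

```sh
# Attach an ephemeral debug container to the stuck worker pod
kubectl debug -it xxx-jxosi-worker-0 -n <namespace> \
  --image=busybox:1.28 --target=init-pytorch -- sh

# Inside the ephemeral container, check whether the master service resolves
ping xxx-master-0
```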

gaocegege avatar Aug 16 '21 12:08 gaocegege

ping master-0 in an ephemeral container: [screenshot]

gogogwwb avatar Aug 16 '21 12:08 gogogwwb

I ran kubectl get ep -A and found that the endpoint appeared, but disappeared again after a while.

[screenshots]

gogogwwb avatar Aug 16 '21 12:08 gogogwwb

It's weird.

gaocegege avatar Aug 17 '21 02:08 gaocegege

I put the program to sleep for a while and found that the worker can then run. Is there any restriction on the creation order of the Service and Pods in a PyTorchJob?
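For reference, a minimal sketch of that workaround (the `train()` body and the sleep duration are placeholders, not the actual code):

```python
import time

def train():
    # placeholder for the actual, non-distributed training code
    print("training finished")

if __name__ == "__main__":
    train()
    # Keep the master pod in the Running state for a while after training,
    # so its Service still has an endpoint when the worker's init container
    # tries to resolve the master.
    time.sleep(600)
```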

gogogwwb avatar Aug 17 '21 02:08 gogogwwb

It seems the master finishes too quickly, so the Service's endpoints eventually become none and the worker cannot obtain the master's IP address.

gogogwwb avatar Aug 17 '21 03:08 gogogwwb

> It seems the master finishes too quickly, so the Service's endpoints eventually become none and the worker cannot obtain the master's IP address.

Interesting. /cc @johnugeorge

gaocegege avatar Aug 17 '21 03:08 gaocegege

But in that case, the master should not start the job until the workers are up. Are you using the distributed setup in your code?

johnugeorge avatar Aug 17 '21 04:08 johnugeorge

I did not use the distributed APIs in the code. After the master runs, it reaches the Completed state; running "kubectl get ep -n test" then shows that the ep is none.

gogogwwb avatar Aug 17 '21 07:08 gogogwwb

> But in that case, the master should not start the job until the workers are up. Are you using the distributed setup in your code?

Without the sleep, the master is only in the Running state for a short time, while the worker stays in the Init state.

gogogwwb avatar Aug 17 '21 07:08 gogogwwb

If you are not using distributed PyTorch in the code, this can happen: the master can start executing and complete before the worker starts. Can you confirm whether you are using the distributed APIs?

johnugeorge avatar Aug 18 '21 11:08 johnugeorge

> If you are not using distributed PyTorch in the code, this can happen: the master can start executing and complete before the worker starts. Can you confirm whether you are using the distributed APIs?

I don't think I use the distributed APIs. The code:

[screenshot]

gogogwwb avatar Sep 13 '21 11:09 gogogwwb

That is the issue. Is there any reason for using pytorch-operator without using the distributed version?

Example:

https://github.com/kubeflow/tf-operator/blob/1aa44a68cd364ed6e30c0841e6daf1d93a29f146/examples/pytorch/mnist/mnist.py#L72
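Roughly, the linked example boils down to something like this (a sketch, not the exact code; the backend choice depends on your setup). dist.init_process_group blocks until every replica has joined, which is what keeps the master from completing before the workers come up:

```python
import torch.distributed as dist

def main():
    # The operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE into
    # every replica, so the default env:// init method picks them up.
    dist.init_process_group(backend="gloo")
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} joined the process group")

    # ... build the model, wrap it in torch.nn.parallel.DistributedDataParallel,
    # and run the training loop here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```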

johnugeorge avatar Sep 14 '21 13:09 johnugeorge

How do I keep the pods created by a PyTorchJob from automatically disappearing after completion? Is that what cleanPodPolicy is for? How do I set it? Thanks.

gogogwwb avatar Sep 29 '21 11:09 gogogwwb

Set cleanPodPolicy to None:

https://github.com/kubeflow/common/blob/f7c41a08761ff3b215553a051fd529efd22782a1/pkg/apis/common/v1/types.go#L136
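A sketch of where that goes in a PyTorchJob manifest (field placement assumes the kubeflow.org/v1 API served by pytorch-operator; newer training-operator releases nest it under spec.runPolicy, and the image name is a placeholder):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-example
spec:
  cleanPodPolicy: None        # keep the pods around after the job finishes
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: your-training-image:latest   # placeholder
    Worker:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: your-training-image:latest   # placeholder
```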

johnugeorge avatar Oct 09 '21 11:10 johnugeorge