
container "pytorch" is waiting to start: PodInitializing

Open gogogwwb opened this issue 4 years ago • 20 comments

When the master is finished running, the worker is still initializing.

[screenshot]

worker log: Error from server (BadRequest): container "pytorch" in pod "xxx-jxosi-worker-0" is waiting to start: PodInitializing

What is the reason for this?

gogogwwb avatar Aug 15 '21 09:08 gogogwwb

Could you please run kubectl describe and post the result here?
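For example (the pod name and namespace are placeholders for your own):

```sh
# Describe the stuck worker pod; the Events section at the bottom is the interesting part
kubectl describe pod xxx-jxosi-worker-0 -n <namespace>
```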

gaocegege avatar Aug 16 '21 02:08 gaocegege

Can you show more about it? Especially the events section.

gaocegege avatar Aug 16 '21 03:08 gaocegege

Seems that the init container is pending. Can you show its log?
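You can pull the init container's log directly with the `-c` flag (pod name, namespace, and init container name are placeholders):

```sh
# The worker pod's init container is shown as init-pytorch below
kubectl logs xxx-jxosi-worker-0 -c init-pytorch -n <namespace>
```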

gaocegege avatar Aug 16 '21 06:08 gaocegege

init-pytorch log: [screenshot]

gogogwwb avatar Aug 16 '21 06:08 gogogwwb

master svc: [screenshot]

gogogwwb avatar Aug 16 '21 06:08 gogogwwb

Can you try kubectl debug to run an ephemeral container, then run ping xxx-master-0?
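Something along these lines should work (pod and container names are placeholders; `--target` assumes the init container `init-pytorch` is still running):

```sh
# Attach an ephemeral debug container to the stuck worker pod
kubectl debug -it xxx-jxosi-worker-0 -n <namespace> \
  --image=busybox:1.28 --target=init-pytorch -- sh

# Inside the ephemeral container, check whether the master service resolves
ping xxx-master-0
```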

gaocegege avatar Aug 16 '21 12:08 gaocegege

ping master-0 in an ephemeral container: [screenshot]

gogogwwb avatar Aug 16 '21 12:08 gogogwwb

I ran kubectl get ep -A and found that the endpoint appeared, but disappeared again after a while.

[screenshots]

gogogwwb avatar Aug 16 '21 12:08 gogogwwb

It's weird.

gaocegege avatar Aug 17 '21 02:08 gaocegege

I put the program to sleep for a while and found that the worker can then run. Is there any restriction on the creation order of the Service and Pods in a PyTorchJob?
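For reference, a minimal sketch of that workaround (the `train()` body and the sleep duration are placeholders, not the actual code):

```python
import time

def train():
    # placeholder for the actual, non-distributed training code
    print("training finished")

if __name__ == "__main__":
    train()
    # Keep the master pod in the Running state for a while after training,
    # so its Service still has an endpoint when the worker's init container
    # tries to resolve the master.
    time.sleep(600)
```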

gogogwwb avatar Aug 17 '21 02:08 gogogwwb

It seems the master finishes too quickly, so the Service's endpoints eventually become none and the worker cannot obtain the master's IP address.

gogogwwb avatar Aug 17 '21 03:08 gogogwwb

> It seems the master finishes too quickly, so the Service's endpoints eventually become none and the worker cannot obtain the master's IP address.

Interesting. /cc @johnugeorge

gaocegege avatar Aug 17 '21 03:08 gaocegege

But in that case, the master should not start the job until the workers are up. Are you using the distributed setup in your code?

johnugeorge avatar Aug 17 '21 04:08 johnugeorge

I did not use the distributed APIs in the code. After the master runs, it reaches the Completed state; running "kubectl get ep -n test" then shows that the ep is none.

gogogwwb avatar Aug 17 '21 07:08 gogogwwb

> But in that case, the master should not start the job until the workers are up. Are you using the distributed setup in your code?

Without the sleep, the master is only in the Running state for a short time, while the worker stays in the Init state.

gogogwwb avatar Aug 17 '21 07:08 gogogwwb

If you are not using distributed PyTorch in the code, this can happen: the master can start executing and complete before the worker starts. Can you confirm whether you are using the distributed APIs?

johnugeorge avatar Aug 18 '21 11:08 johnugeorge

> If you are not using distributed PyTorch in the code, this can happen: the master can start executing and complete before the worker starts. Can you confirm whether you are using the distributed APIs?

I don't think I use the distributed APIs. The code:

[screenshot]

gogogwwb avatar Sep 13 '21 11:09 gogogwwb

That is the issue. Is there any reason for using pytorch-operator without using the distributed version?

Example:

https://github.com/kubeflow/tf-operator/blob/1aa44a68cd364ed6e30c0841e6daf1d93a29f146/examples/pytorch/mnist/mnist.py#L72
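Roughly, the linked example boils down to something like this (a sketch, not the exact code; the backend choice depends on your setup). dist.init_process_group blocks until every replica has joined, which is what keeps the master from completing before the workers come up:

```python
import torch.distributed as dist

def main():
    # The operator injects MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE into
    # every replica, so the default env:// init method picks them up.
    dist.init_process_group(backend="gloo")
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} joined the process group")

    # ... build the model, wrap it in torch.nn.parallel.DistributedDataParallel,
    # and run the training loop here ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```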

johnugeorge avatar Sep 14 '21 13:09 johnugeorge

How do I keep the pods created by a PyTorchJob from automatically disappearing after completion? Is that what cleanPodPolicy is for? How do I set it? Thanks.

gogogwwb avatar Sep 29 '21 11:09 gogogwwb

Set cleanPodPolicy to None:

https://github.com/kubeflow/common/blob/f7c41a08761ff3b215553a051fd529efd22782a1/pkg/apis/common/v1/types.go#L136
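A sketch of where that goes in a PyTorchJob manifest (field placement assumes the kubeflow.org/v1 API served by pytorch-operator; newer training-operator releases nest it under spec.runPolicy, and the image name is a placeholder):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-example
spec:
  cleanPodPolicy: None        # keep the pods around after the job finishes
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: your-training-image:latest   # placeholder
    Worker:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: your-training-image:latest   # placeholder
```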

johnugeorge avatar Oct 09 '21 11:10 johnugeorge