Intermittent Pod Restarts with PyTorch Plugin in Volcano Jobs

Open daniel-hutao opened this issue 2 years ago • 2 comments

What happened:

I encountered an issue while following the instructions in the PyTorch plugin user guide (https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md).

After applying the job YAML as per the documentation, the pods were repeatedly restarting. Interestingly, they eventually succeeded after numerous restarts (dozens of times). This behavior seems like a potential bug.

However, when I added the svc plugin to the job configuration, the pods started successfully on the first attempt. This leads me to believe that there might be an issue with the PyTorch plugin.

How to reproduce it (as minimally and precisely as possible):

Apply the job YAML given in the user guide linked above.

Environment:

  • Volcano Version: latest
  • Kubernetes version (use kubectl version): v1.24.10

daniel-hutao avatar Jan 04 '24 07:01 daniel-hutao

After many more retries, I found that with a worker count of 10 the job still fails even with the svc plugin configured. So the issue does not depend on whether the svc plugin is present; it is a problem with the code logic inside the container, which should have its own retry mechanism instead of always relying on restarts.

daniel-hutao avatar Jan 04 '24 08:01 daniel-hutao

Yeah, it seems the ps and worker should guarantee the startup order themselves; otherwise the kubelet will keep restarting them until they succeed.

Monokaix avatar Jan 05 '24 06:01 Monokaix
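
The retry mechanism suggested in the comments above could look roughly like the following. This is only a minimal sketch, not code from the Volcano example image: the helper name `wait_for_master` and its parameters are hypothetical. It assumes the worker can learn the master endpoint from `MASTER_ADDR` and `MASTER_PORT` (the variables PyTorch's `env://` initialization reads), and that a plain TCP probe of that endpoint is an acceptable readiness check.

```python
import socket
import time


def wait_for_master(host, port, retries=30, delay=2.0, timeout=3.0):
    """Hypothetical helper: return True once a TCP connection to
    host:port succeeds, or False after `retries` failed attempts."""
    for attempt in range(1, retries + 1):
        try:
            # DNS failures (socket.gaierror), refused connections and
            # timeouts are all subclasses of OSError.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError as exc:
            print(f"attempt {attempt}/{retries}: {host}:{port} not ready ({exc})")
            if attempt < retries:
                time.sleep(delay)
    return False


# Possible use in a worker entrypoint (assumption: env vars are set):
#   import os
#   ok = wait_for_master(os.environ["MASTER_ADDR"],
#                        int(os.environ["MASTER_PORT"]))
#   if not ok:
#       raise SystemExit("master never became reachable")
#   # ...then call torch.distributed.init_process_group(...)
```

With a wait like this inside the container, a worker that starts before the master would block and retry in place rather than exit and be restarted by the kubelet.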