Intermittent Pod Restarts with PyTorch Plugin in Volcano Jobs
What happened:
I've encountered an issue while following the instructions in the PyTorch plugin user guide (https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_pytorch_plugin.md).
After applying the job YAML from the documentation, the pods restarted repeatedly. Interestingly, they eventually succeeded, but only after dozens of restarts. This behavior looks like a potential bug.
However, when I added the svc plugin to the job configuration, the pods started successfully on the first attempt. This leads me to believe that there might be an issue with the PyTorch plugin.
How to reproduce it (as minimally and precisely as possible):
Apply the yaml given by the document above.
Environment:
- Volcano Version: latest
- Kubernetes version (use `kubectl version`): v1.24.10
I retried many times and eventually found that with a worker count of 10 and the `svc` plugin configured, it still fails. So the issue is not determined by whether the `svc` plugin is present or not; it's a problem with the code logic inside the container, which should have a retry mechanism instead of restarting until it succeeds.
Yeah, it seems the ps and worker roles should guarantee their startup order themselves; otherwise the kubelet will keep restarting them until they succeed.
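As a sketch of the retry mechanism suggested above: instead of letting the worker process crash when the master is not up yet (and relying on kubelet restarts), the container entrypoint could poll the master address until it accepts connections. The host/port handling below is a generic illustration, not Volcano's actual generated names or the plugin's real code.

```python
import socket
import time

def wait_for_master(host, port, timeout_s=120, interval_s=2.0):
    """Retry TCP connections to the master until it accepts one.

    Returns True once the master is reachable, False if the overall
    timeout expires. Called before launching training so the worker
    tolerates the master starting later, with no pod restarts needed.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # Succeeds only once the master's rendezvous port is open.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Master not resolvable / not listening yet; back off and retry.
            time.sleep(interval_s)
    return False
```

A worker entrypoint would then call something like `wait_for_master(os.environ["MASTER_ADDR"], int(os.environ["MASTER_PORT"]))` and only exec the training script on success, exiting with a clear error on timeout.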