PyTorchJob restartPolicy: ExitCode does not honor backoffLimit for retryable errors
Steps to reproduce:
- Set the PyTorchJob restartPolicy: ExitCode
- Set backoffLimit > 1
- Have a container exit with a retryable exit code (greater than 128); see the manifest sketch below
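For reference, a minimal PyTorchJob manifest sketch that should reproduce this (the name, image, and command are placeholders; the layout assumes the v1 API with backoffLimit under runPolicy):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: exitcode-backoff-repro        # placeholder name
spec:
  runPolicy:
    backoffLimit: 3                   # expected to cap retries, but ignored with ExitCode
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: ExitCode         # restart only on retryable exit codes (> 128)
      template:
        spec:
          containers:
            - name: pytorch           # the operator expects this container name
              image: python:3.11      # placeholder image
              command: ["python", "-c", "import sys; sys.exit(137)"]  # simulate a retryable failure
```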
Observed Behavior:
When the pod errors out, the controller deletes the pod and recreates it. This repeats indefinitely until the pod completes successfully (if it ever does!), ignoring the backoffLimit.
Analysis:
The restart behavior differs between "OnFailure" and "ExitCode". For "OnFailure", the Pod's actual restart policy is set to "OnFailure", leaving Kubernetes itself to restart the containers. For "ExitCode", the pod's restart policy is set to "Never", so restarts are handled by deleting and recreating the pod. As far as I can tell, this is at least partly because the code that checks whether the backoff limit has been exceeded keys off of container restart counts; since pods are deleted and recreated here, there are no container restarts to count.
On reviewing this again, there are some possible solutions I can think of:
- In the code that checks whether the backoff limit is exceeded, look for job restart events instead of container restarts. From looking at the code, though, I don't see that information about the job's events is currently available in that context.
- Shelve this effort until the incorporation of batch/v1 Jobs is complete. That way, we can rely on the Job's restart policy (similar to how OnFailure relies on the pod's own restart policy) instead of inventing a mechanism proprietary to the training operator.
My understanding is that the timeline for (2) is unknown and is pending the Job success policy becoming beta in Kubernetes (no date yet). Selfishly, I'd like to use this feature sooner rather than later, so any other short-term fix would be great. What do you think about (1)?
@andreyvelich @tenzen-y
This is actually a specification, not a bug. You can see a similar feature in the batch/v1 Job with PodFailurePolicy action=Ignore. So, I think that this is a feature request similar to the batch/v1 Job with PodFailurePolicy action=Count.
Since selecting (1) would be a breaking change, we cannot take that approach. So adding a new field, like the batch/v1 Job action field, might be better.
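For context, a sketch of what that action knob looks like on a batch/v1 Job (illustrative only; the name, image, and exit code are placeholders). Ignore retries the pod without counting the failure against backoffLimit, while Count counts it toward backoffLimit:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-failure-policy-example    # placeholder name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
      - action: Count                 # or Ignore: retry without counting against backoffLimit
        onExitCodes:
          containerName: main
          operator: In
          values: [137]               # exit codes to treat as retryable
  template:
    spec:
      restartPolicy: Never            # podFailurePolicy requires restartPolicy: Never
      containers:
        - name: main
          image: busybox:1.36         # placeholder image
          command: ["sh", "-c", "exit 137"]
```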
/kind feature
Hi @tenzen-y, are you going to work on this API change, or how far have you gotten so far?
TBH, I would prefer not to add this feature to the v1 API because we would need to re-implement the same feature in the training-operator. But we have started the v2 API design (using JobSet).
@tenzen-y Do you think we should make this logic work? It does not require an API change. The reason the logic is failing is that this previousRetry variable always returns 0, as a result of this fakeWorkQueue.
As I mentioned here https://github.com/kubeflow/training-operator/issues/2045#issuecomment-2061390914, changing the current behavior without another knob would bring breaking changes. So I don't think we can do that.
If I set restartPolicy to OnFailure in the PyTorchJob, it restarts until the backoffLimit is reached. If I set restartPolicy to ExitCode in the PyTorchJob, it ignores the backoffLimit and restarts indefinitely.
To me, the current behavior seems like a bug rather than a feature as far as user experience goes. Changing how restarts are counted for ExitCode seems like a fix for unexpected behavior, not the introduction of another knob.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.