PyTorchJob restartPolicy: ExitCode does not honor backoffLimit for retryable errors
Steps to reproduce:
- Set the PyTorchJob restartPolicy: ExitCode
- Set backoffLimit > 1
- Have a container exit with a retryable exit code (greater than 128); see the manifest sketch below
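For reference, a minimal PyTorchJob manifest sketch that should reproduce this (the name, image, and command are placeholders; the layout assumes the v1 API with backoffLimit under runPolicy):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: exitcode-backoff-repro        # placeholder name
spec:
  runPolicy:
    backoffLimit: 3                   # expected to cap retries, but ignored with ExitCode
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: ExitCode         # restart only on retryable exit codes (> 128)
      template:
        spec:
          containers:
            - name: pytorch           # the operator expects this container name
              image: python:3.11      # placeholder image
              command: ["python", "-c", "import sys; sys.exit(137)"]  # simulate a retryable failure
```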
Observed Behavior:
When the pod errors out, the controller deletes the pod and recreates it. This repeats indefinitely until the pod completes successfully (if it ever does!), ignoring the backoffLimit.
Analysis:
The restart behavior differs between "OnFailure" and "ExitCode". For "OnFailure", the Pod's actual restart policy is set to "OnFailure", leaving Kubernetes itself to restart the containers. For "ExitCode", the pod's restart policy is set to "Never", so restarts are handled by deleting and recreating the pod. As far as I can tell, this is at least partly because the code that checks whether the backoff limit has been exceeded keys off of container restart counts; since pods are deleted and recreated here, there are no container restarts to count.
On reviewing this again, there are some possible solutions I can think of:
- In the code that checks whether the backoff limit is exceeded, look for job restart events instead of container restarts. From looking at the code, though, I don't see that information about the job's events is currently available in that context.
- Shelve this effort until the incorporation of batch/v1 Jobs is complete. That way, we can rely on the Job's restart policy (similar to how OnFailure relies on the pod's own restart policy) instead of inventing a mechanism proprietary to the training operator.
My understanding is that the timeline for (2) is unknown and is pending the Job success policy becoming beta in Kubernetes (no date yet). Selfishly, I'd like to use this feature sooner rather than later, so any other short-term fix would be great. What do you think about (1)?
@andreyvelich @tenzen-y
This is actually a specification, not a bug. You can see a similar feature in the batch/v1 Job with PodFailurePolicy action=Ignore. So, I think that this is a feature request similar to the batch/v1 Job with PodFailurePolicy action=Count.
Since selecting (1) would be a breaking change, we cannot take that approach. So adding a new field, like the batch/v1 Job action field, might be better.
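For context, a sketch of what that action knob looks like on a batch/v1 Job (illustrative only; the name, image, and exit code are placeholders). Ignore retries the pod without counting the failure against backoffLimit, while Count counts it toward backoffLimit:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: pod-failure-policy-example    # placeholder name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
      - action: Count                 # or Ignore: retry without counting against backoffLimit
        onExitCodes:
          containerName: main
          operator: In
          values: [137]               # exit codes to treat as retryable
  template:
    spec:
      restartPolicy: Never            # podFailurePolicy requires restartPolicy: Never
      containers:
        - name: main
          image: busybox:1.36         # placeholder image
          command: ["sh", "-c", "exit 137"]
```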
/kind feature
Hi @tenzen-y, are you going to work on this API change, or how far have you gotten so far?
TBH, I would prefer not to add this feature to the v1 API because we would need to re-implement the same feature in the training-operator. But we have started the v2 API design (using JobSet).
@tenzen-y Do you think we should make this logic work? It does not require an API change. The reason the logic is failing is that this previousRetry variable always returns 0, as a result of this fakeWorkQueue.
As I mentioned here https://github.com/kubeflow/training-operator/issues/2045#issuecomment-2061390914, changing the current behavior without another knob would bring breaking changes. So I don't think we can do that.
If I set restartPolicy to OnFailure in the PyTorchJob, it restarts until the backoffLimit is reached. If I set restartPolicy to ExitCode in the PyTorchJob, it ignores the backoffLimit and restarts indefinitely.
To me, the current behavior seems like a bug rather than a feature as far as user experience goes. Changing how restarts are counted for ExitCode seems like a fix for unexpected behavior, not the introduction of another knob.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.