Kelly A comments



Kelly A

PytorchJob restartPolicy: ExitCode does not honor backoffLimit for retryable errors

On reviewing this again, there are a few possible solutions I can think of: 1. [In this code that checks whether the backoff limit is exceeded](https://github.com/kubeflow/training-operator/blob/5b2c6c8943fe6a1f8803f268f71ca714316fa6bc/pkg/core/job.go#L95), have it look for job restart events...
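The idea in option 1 can be illustrated with a minimal Python sketch (the operator itself is written in Go). It assumes the existing check only sums in-place container restart counts, which would stay at zero when pods are deleted and recreated, so recorded job restart events are added to the total. `JobEvent`, `past_backoff_limit`, and the `"JobRestarting"` reason string are hypothetical names for illustration, not the operator's API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class JobEvent:
    """Hypothetical, simplified stand-in for a Kubernetes Event recorded on the job."""
    reason: str


def past_backoff_limit(
    container_restarts: Iterable[int],
    events: Iterable[JobEvent],
    backoff_limit: Optional[int],
    restart_reason: str = "JobRestarting",  # placeholder reason, an assumption
) -> bool:
    """Return True once in-place restarts plus recorded job restarts reach the limit."""
    if backoff_limit is None:
        # No limit configured, so it can never be exceeded.
        return False
    # Count events that mark a job-level restart (pod deleted and recreated).
    job_restarts = sum(1 for event in events if event.reason == restart_reason)
    return sum(container_restarts) + job_restarts >= backoff_limit
```

With this shape, a job whose pods always show a restart count of 0 would still hit the limit after enough restart events have been recorded.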

PytorchJob restartPolicy: ExitCode does not honor backoffLimit for retryable errors

If I set the restartPolicy to OnFailure in the PyTorchJob, it restarts until backoffLimit is met. If I set the restartPolicy to ExitCode in the PyTorchJob, it ignores the backoffLimit...
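For reference, a reproduction manifest for the two cases above might look like the following sketch, written as a Python dict that could be dumped to YAML. The job name, image, and command are hypothetical placeholders. The exit-code helper assumes the operator treats codes of 128 and above as retryable and 1 to 127 as permanent, which should be checked against the operator version in use.

```python
def is_retryable_exit_code(exit_code: int) -> bool:
    """Assumed ExitCode semantics: 128 and above is retryable, 1 to 127 is permanent."""
    return exit_code >= 128


def build_pytorchjob(restart_policy: str, backoff_limit: int) -> dict:
    """Build a minimal PyTorchJob manifest whose container always exits with 137."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": "backoff-repro"},  # hypothetical name
        "spec": {
            "runPolicy": {"backoffLimit": backoff_limit},
            "pytorchReplicaSpecs": {
                "Master": {
                    "replicas": 1,
                    # "OnFailure" or "ExitCode", the two cases compared above
                    "restartPolicy": restart_policy,
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": "pytorch",
                                    "image": "example.invalid/train:latest",  # placeholder
                                    # Exit 137 falls in the assumed retryable range.
                                    "command": ["sh", "-c", "exit 137"],
                                }
                            ]
                        }
                    },
                }
            },
        },
    }
```

Building the manifest once with `"OnFailure"` and once with `"ExitCode"`, with the same `backoffLimit`, keeps everything else identical between the two runs.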