Kelly A
Results: 2 comments by Kelly A
On reviewing this again, there are some possible solutions I can think of: 1. [In this code that checks whether the backoff limit is exceeded](https://github.com/kubeflow/training-operator/blob/5b2c6c8943fe6a1f8803f268f71ca714316fa6bc/pkg/core/job.go#L95), have it look for job restart events...
If I set the FailurePolicy to OnFailure in the PyTorchJob, it restarts until backoffLimit is met. If I set the FailurePolicy to ExitCode in the PyTorchJob, it ignores the backoffLimit...
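
To make the reported behavior concrete, here is a minimal Go sketch of a backoff check that would produce it. It is a simplified model built only from the description above, not the operator's actual code: the `Replica` type, the `RestartPolicy` constants, and the `pastBackoffLimit` function are all hypothetical names, and the assumption that only `OnFailure` restarts are counted is inferred from the observed behavior.

```go
package main

import "fmt"

// RestartPolicy is a hypothetical stand-in for the policy value set on the job.
type RestartPolicy string

const (
	OnFailure RestartPolicy = "OnFailure"
	ExitCode  RestartPolicy = "ExitCode"
)

// Replica is a hypothetical, simplified view of one replica group.
type Replica struct {
	Policy   RestartPolicy
	Restarts int32 // restarts observed so far
}

// pastBackoffLimit models the behavior described above (an assumption, not
// the real implementation): restarts are summed only for OnFailure replicas,
// so replicas using ExitCode never push the job past the limit.
func pastBackoffLimit(backoffLimit *int32, replicas []Replica) bool {
	if backoffLimit == nil {
		return false // no limit configured
	}
	var total int32
	for _, r := range replicas {
		if r.Policy != OnFailure {
			continue // ExitCode restarts are not counted in this model
		}
		total += r.Restarts
	}
	return total >= *backoffLimit
}

func main() {
	limit := int32(3)
	onFailure := []Replica{{Policy: OnFailure, Restarts: 3}}
	exitCode := []Replica{{Policy: ExitCode, Restarts: 10}}

	fmt.Println(pastBackoffLimit(&limit, onFailure)) // limit reached
	fmt.Println(pastBackoffLimit(&limit, exitCode))  // limit never reached
}
```

If the real check works along these lines, counting restarts for `ExitCode` replicas as well (for example from job restart events, as suggested above) would be one way to make backoffLimit apply to both policies.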