training-operator
When setting restartPolicy to OnFailure in PyTorchJob, is there something like maxRestartCount?
When restartPolicy is set to OnFailure in a PyTorchJob, a worker that keeps failing is restarted continuously.
I would like to know if there is a configuration like maxRestartCount, so that when the worker restart count reaches the limit, the PyTorchJob fails directly and releases its resources.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Maybe we can support this feature once https://github.com/kubeflow/training-operator/issues/1718 is done (batch/v1 Job `backoffLimitPerIndex`).
/lifecycle frozen
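For context, `backoffLimitPerIndex` is the upstream batch/v1 Job feature the comment above refers to: it caps pod failures per index rather than for the whole Job. A minimal sketch of how it is configured on a plain Kubernetes Job (the job name, image, and command here are placeholders):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: per-index-backoff-example   # placeholder name
spec:
  completionMode: Indexed           # backoffLimitPerIndex requires Indexed completion mode
  completions: 4
  parallelism: 4
  backoffLimitPerIndex: 2           # each index may fail at most 2 times
  maxFailedIndexes: 1               # Job fails once more than 1 index exhausts its limit
  template:
    spec:
      restartPolicy: Never          # per-index counting is based on failed pods, not in-place restarts
      containers:
        - name: worker
          image: busybox            # placeholder image
          command: ["sh", "-c", "echo hello"]
```

If the training-operator adopted this, a per-worker restart limit could fail the job once an individual worker exceeded its budget, which is essentially the maxRestartCount behavior requested above.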
Currently, we can apply `backoffLimit` to the entire Job: https://github.com/kubeflow/training-operator/blob/afba76bc5a168cbcbc8685c7661f36e9b787afd1/pkg/apis/kubeflow.org/v1/common_types.go#L204-L206
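As a workaround today, that job-wide limit can be set via `runPolicy.backoffLimit` in the PyTorchJob spec. A minimal sketch (job name, image, and training command are placeholders):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-example             # placeholder name
spec:
  runPolicy:
    backoffLimit: 3                 # total retries across the whole job before it is marked Failed
  pytorchReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch/pytorch:latest   # placeholder image
              command: ["python", "train.py"] # placeholder command
```

Note this budget is shared by all replicas, so one flaky worker can exhaust it for the whole job; it does not give the per-worker limit asked about in the original question.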