
When setting restartPolicy to OnFailure in PyTorchJob, is there something like maxRestartCount?

Open zhiyxu opened this issue 3 years ago • 3 comments

When restartPolicy is set to OnFailure in a PyTorchJob, a worker that always fails is restarted continuously. Is there a configuration option like maxRestartCount, so that once a worker's restart count reaches the limit, the PyTorchJob fails immediately and releases its resources?

zhiyxu avatar May 07 '22 05:05 zhiyxu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 14 '23 05:09 github-actions[bot]

Maybe we can support this feature once https://github.com/kubeflow/training-operator/issues/1718 (batch/v1 Job backoffLimitPerIndex) is done.

/lifecycle frozen

tenzen-y avatar Sep 14 '23 05:09 tenzen-y

Currently, backoffLimit can only be applied to the Job as a whole, not per worker: https://github.com/kubeflow/training-operator/blob/afba76bc5a168cbcbc8685c7661f36e9b787afd1/pkg/apis/kubeflow.org/v1/common_types.go#L204-L206

tenzen-y avatar Sep 14 '23 06:09 tenzen-y
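
For anyone landing here, a minimal sketch of the Job-wide backoffLimit mentioned above, assuming it is set under `spec.runPolicy` of the `kubeflow.org/v1` PyTorchJob. The job name, image, replica counts, and the `build_pytorchjob` helper are hypothetical and only for illustration; exactly how the operator counts restarts against the limit should be checked against your training-operator version.

```python
import json


def build_pytorchjob(name, image, workers, backoff_limit):
    """Build a PyTorchJob manifest as a plain dict (illustrative helper, not an official API)."""
    # Pod template shared by Master and Worker; name and image are placeholders.
    template = {"spec": {"containers": [{"name": "pytorch", "image": image}]}}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # Job-wide retry limit: it applies to the whole Job, not to each worker.
            "runPolicy": {"backoffLimit": backoff_limit},
            "pytorchReplicaSpecs": {
                "Master": {
                    "replicas": 1,
                    "restartPolicy": "OnFailure",
                    "template": template,
                },
                "Worker": {
                    "replicas": workers,
                    "restartPolicy": "OnFailure",
                    "template": template,
                },
            },
        },
    }


# Hypothetical values; replace with your own job name and image.
manifest = build_pytorchjob("example-job", "example.com/train:latest", 2, 3)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`. Because the limit is Job-wide, it does not give a per-worker maxRestartCount, which is what the backoffLimitPerIndex work referenced above would address.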