training-operator: When setting restartPolicy to OnFailure in PyTorchJob, is there something like maxRestartCount?

Open zhiyxu opened this issue 3 years ago • 3 comments

When restartPolicy is set to OnFailure in a PyTorchJob and a worker keeps failing, the worker is restarted continuously. Is there a configuration like maxRestartCount, so that once a worker's restart count reaches the limit, the PyTorchJob fails immediately and releases its resources?

May 07 '22 05:05 zhiyxu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Sep 14 '23 05:09 github-actions[bot]

Maybe we can support this feature once https://github.com/kubeflow/training-operator/issues/1718 (batch/v1 Job backoffLimitPerIndex) is done.

/lifecycle frozen

Sep 14 '23 05:09 tenzen-y

Currently, we can apply backoffLimit (under runPolicy) to the entire Job: https://github.com/kubeflow/training-operator/blob/afba76bc5a168cbcbc8685c7661f36e9b787afd1/pkg/apis/kubeflow.org/v1/common_types.go#L204-L206

Sep 14 '23 06:09 tenzen-y
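
For anyone landing here, a minimal sketch of what the job-wide limit could look like in a manifest, assuming the kubeflow.org/v1 PyTorchJob API with spec.runPolicy.backoffLimit. The job name, image, replica counts, and helper function names are placeholders chosen for illustration, and how restarts are counted toward the limit may depend on the operator version.

```python
import json


def replica_spec(replicas: int, image: str) -> dict:
    """One replica group whose failed containers are restarted in place."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    # The training container is conventionally named "pytorch".
                    {"name": "pytorch", "image": image}
                ]
            }
        },
    }


def build_pytorchjob(name: str, image: str, workers: int, backoff_limit: int) -> dict:
    """Build a PyTorchJob manifest with a retry limit for the whole job."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            # Applies to the entire job, not to each worker individually.
            "runPolicy": {"backoffLimit": backoff_limit},
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1, image),
                "Worker": replica_spec(workers, image),
            },
        },
    }


# Placeholder values; replace them with your own job name and image.
manifest = build_pytorchjob(
    "demo-job", "example.com/train:latest", workers=2, backoff_limit=3
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to kubectl apply -f. Note that this is not a per-worker maxRestartCount: the limit is shared by the whole job, which is the gap the backoffLimitPerIndex discussion above is about.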
