add job suspend runPolicy
add job partial success status
/assign @terrytangyuan
I am not sure if this is a common use case. Could you elaborate?
The ability to suspend and resume Jobs is often desired when cluster resources are limited and a higher-priority Job needs to execute in place of another Job. Given the kubeflow/training-operator project architecture, the kubeflow/common project needs to be modified first.
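For context, here is a minimal sketch of what the kubeflow/common change could look like, assuming the new field mirrors batch/v1 Job's spec.suspend (the field name and placement are assumptions, not the final API):

```go
package v1

// RunPolicy sketch: only the proposed field is shown; the existing
// fields (CleanPodPolicy, BackoffLimit, TTLSecondsAfterFinished, ...)
// are omitted for brevity.
type RunPolicy struct {
	// Suspend tells the operator not to create pods for the job (and to
	// delete any running ones) while it is true; clearing it resumes the
	// job. Semantics mirror batch/v1 Job's spec.suspend. This is an
	// assumed shape, not the merged API.
	// +optional
	Suspend *bool `json:"suspend,omitempty"`
}
```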
/ok-to-test
What are the changes you are trying to make to training operator?
> What are the changes you are trying to make to training operator?
Add some logic to the PyTorchJob lifecycle: delete pods when the job is suspended and recreate them when it is resumed. Also optimize the PyTorchJob status management module so that it keeps working correctly after the suspend/resume states are added.
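Roughly, a sketch of where that check could sit in the reconcile loop (the types and helpers here are simplified stand-ins, not the actual training-operator implementation):

```go
package sketch

// Simplified stand-ins for the real kubeflow/common and
// training-operator API types.
type RunPolicy struct {
	Suspend *bool
}

type PyTorchJob struct {
	RunPolicy RunPolicy
}

type Reconciler struct{}

// isSuspended reports whether the job asked to be suspended.
func isSuspended(rp RunPolicy) bool {
	return rp.Suspend != nil && *rp.Suspend
}

// reconcilePods shows where the suspend branch could live. The helpers
// below are hypothetical placeholders for the operator's real logic.
func (r *Reconciler) reconcilePods(job *PyTorchJob) error {
	if isSuspended(job.RunPolicy) {
		// Suspended: free cluster resources by deleting the job's pods
		// and services, then record the state so the status management
		// module stays consistent.
		if err := r.deletePodsAndServices(job); err != nil {
			return err
		}
		return r.markSuspended(job)
	}
	// Not suspended (or just resumed): the normal path recreates any
	// missing pods, which is what makes resume work.
	return r.createMissingPods(job)
}

// Placeholder helpers; the real operator already has equivalents.
func (r *Reconciler) deletePodsAndServices(job *PyTorchJob) error { return nil }
func (r *Reconciler) markSuspended(job *PyTorchJob) error         { return nil }
func (r *Reconciler) createMissingPods(job *PyTorchJob) error     { return nil }
```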
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please ask for approval from gaocegege after the PR has been reviewed.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
> What are the changes you are trying to make to training operator?
> Add some logic to the PyTorchJob lifecycle: delete pods when the job is suspended and recreate them when it is resumed. Also optimize the PyTorchJob status management module so that it keeps working correctly after the suspend/resume states are added.
I am not sure if suspend is common in distributed training jobs. There will be side effects depending on the training framework, especially when pods are deleted and recreated.
This is not about the training job itself.
This is about a cluster having scarce resources. If there is a higher-priority job that needs the resources, suspend provides a way to free them. The training job will have a chance to checkpoint if it supports that; otherwise it will just fail and be retried later.