
add job suspend RunPolicy

Open PeterChg opened this issue 3 years ago • 8 comments

add job partial success status

PeterChg avatar May 17 '22 02:05 PeterChg

/assign @terrytangyuan

PeterChg avatar May 20 '22 06:05 PeterChg

I am not sure if this is a common use case. Could you elaborate?

The ability to suspend and resume Jobs is often desired when cluster resources are limited and a higher priority Job needs to execute in the place of another Job. Given the kubeflow/training-operator project architecture, the kubeflow/common project needs to be modified first.
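For reference, a minimal sketch of what the kubeflow/common change could look like, assuming a new optional Suspend field on the shared RunPolicy type (the field name, JSON tag, and doc comment are illustrative, not the merged API):

```go
// Sketch only: an optional Suspend flag on the shared RunPolicy type in
// kubeflow/common. Field name and JSON tag are assumptions for illustration.
package v1

// RunPolicy encapsulates runtime policies of a distributed training job,
// such as how to clean up resources and how long the job can stay active.
type RunPolicy struct {
	// ... existing fields such as CleanPodPolicy, BackoffLimit, etc.

	// Suspend specifies whether the job controller should keep pods running
	// for this job. Setting it to true on a running job deletes its pods;
	// setting it back to false recreates them.
	// +optional
	Suspend *bool `json:"suspend,omitempty"`
}
```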

PeterChg avatar May 23 '22 02:05 PeterChg

/ok-to-test

gaocegege avatar May 23 '22 02:05 gaocegege

What are the changes you are trying to make to training operator?

terrytangyuan avatar May 23 '22 14:05 terrytangyuan

What are the changes you are trying to make to training operator?

Add some logic to the PyTorchJob lifecycle: delete pods when the job is suspended and create pods when it is resumed. Also optimize the PyTorchJob status management module so that it keeps working correctly after the suspend/resume state is added.
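A rough sketch of that lifecycle branch, where Job, PodManager, DeletePodsForJob, and CreatePodsForJob are hypothetical stand-ins for the controller's existing types and pod handling, not actual training-operator APIs:

```go
// Illustrative sketch of the suspend/resume branch a job reconciler could
// take; all names here are assumptions for the sake of the example.
package controller

import (
	"context"
	"fmt"
)

// Job is a minimal stand-in for a PyTorchJob carrying the proposed
// RunPolicy.Suspend flag.
type Job struct {
	Name    string
	Suspend *bool
}

// PodManager abstracts the pod create/delete operations the controller
// already performs for a job's replicas.
type PodManager interface {
	DeletePodsForJob(ctx context.Context, jobName string) error
	CreatePodsForJob(ctx context.Context, jobName string) error
}

// reconcileSuspend deletes the job's pods while it is suspended and
// recreates them once the flag is cleared.
func reconcileSuspend(ctx context.Context, pm PodManager, job *Job) error {
	if job.Suspend != nil && *job.Suspend {
		if err := pm.DeletePodsForJob(ctx, job.Name); err != nil {
			return fmt.Errorf("suspend %s: %w", job.Name, err)
		}
		// The job stays in a suspended state until the flag is cleared.
		return nil
	}
	if err := pm.CreatePodsForJob(ctx, job.Name); err != nil {
		return fmt.Errorf("resume %s: %w", job.Name, err)
	}
	return nil
}
```

The status-management change mentioned above would then need to record this branch as a deliberate suspended condition rather than treating the missing pods as a failure.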

PeterChg avatar May 24 '22 02:05 PeterChg

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

To complete the pull request process, please ask for approval from gaocegege after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

google-oss-prow[bot] avatar May 24 '22 06:05 google-oss-prow[bot]

What are the changes you are trying to make to training operator?

Add some logic to the PyTorchJob lifecycle: delete pods when the job is suspended and create pods when it is resumed. Also optimize the PyTorchJob status management module so that it keeps working correctly after the suspend/resume state is added.

I am not sure if suspend is common in distributed training jobs. There will be side effects depending on the training framework, especially when pods are deleted and recreated.

terrytangyuan avatar May 26 '22 15:05 terrytangyuan

This is not about the training job itself. This is about a cluster having scarce resources. If there is a higher priority job that needs the resources, suspend provides a way to free those resources. The training job will have a chance to checkpoint, if it supports that; otherwise it will just fail and be retried later.

alculquicondor avatar Jan 27 '23 16:01 alculquicondor