add job suspend runPolicy
add job partial success status
/assign @terrytangyuan
I am not sure if this is a common use case. Could you elaborate?
The ability to suspend and resume Jobs is often desired when cluster resources are limited and a higher-priority Job needs to execute in place of another Job. Given the kubeflow/training-operator project architecture, the kubeflow/common project needs to be modified first.
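For context, here is a minimal sketch of what the kubeflow/common change could look like, assuming the new field mirrors batch/v1 Job's spec.suspend (the field name and placement are assumptions, not the final API):

```go
package v1

// RunPolicy sketch: only the proposed field is shown; the existing
// fields (CleanPodPolicy, BackoffLimit, TTLSecondsAfterFinished, ...)
// are omitted for brevity.
type RunPolicy struct {
	// Suspend tells the operator not to create pods for the job (and to
	// delete any running ones) while it is true; clearing it resumes the
	// job. Semantics mirror batch/v1 Job's spec.suspend. This is an
	// assumed shape, not the merged API.
	// +optional
	Suspend *bool `json:"suspend,omitempty"`
}
```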
/ok-to-test
What are the changes you are trying to make to training operator?
> What are the changes you are trying to make to training operator?
Add some logic to the PyTorchJob lifecycle: delete pods when the job is suspended and recreate them when it is resumed. Also optimize the PyTorchJob status management module so that it keeps working correctly after the suspend/resume states are added.
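Roughly, a sketch of where that check could sit in the reconcile loop (the types and helpers here are simplified stand-ins, not the actual training-operator implementation):

```go
package sketch

// Simplified stand-ins for the real kubeflow/common and
// training-operator API types.
type RunPolicy struct {
	Suspend *bool
}

type PyTorchJob struct {
	RunPolicy RunPolicy
}

type Reconciler struct{}

// isSuspended reports whether the job asked to be suspended.
func isSuspended(rp RunPolicy) bool {
	return rp.Suspend != nil && *rp.Suspend
}

// reconcilePods shows where the suspend branch could live. The helpers
// below are hypothetical placeholders for the operator's real logic.
func (r *Reconciler) reconcilePods(job *PyTorchJob) error {
	if isSuspended(job.RunPolicy) {
		// Suspended: free cluster resources by deleting the job's pods
		// and services, then record the state so the status management
		// module stays consistent.
		if err := r.deletePodsAndServices(job); err != nil {
			return err
		}
		return r.markSuspended(job)
	}
	// Not suspended (or just resumed): the normal path recreates any
	// missing pods, which is what makes resume work.
	return r.createMissingPods(job)
}

// Placeholder helpers; the real operator already has equivalents.
func (r *Reconciler) deletePodsAndServices(job *PyTorchJob) error { return nil }
func (r *Reconciler) markSuspended(job *PyTorchJob) error         { return nil }
func (r *Reconciler) createMissingPods(job *PyTorchJob) error     { return nil }
```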
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please ask for approval from gaocegege after the PR has been reviewed.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
> What are the changes you are trying to make to training operator?
> Add some logic to the PyTorchJob lifecycle: delete pods when the job is suspended and recreate them when it is resumed. Also optimize the PyTorchJob status management module so that it keeps working correctly after the suspend/resume states are added.
I am not sure if suspend is common in distributed training jobs. There will be side effects depending on the training framework, especially when pods are deleted and recreated.
This is not about the training job itself.
This is about a cluster having scarce resources. If there is a higher-priority job that needs the resources, suspend provides a way to free them. The training job will have a chance to checkpoint if it supports that; otherwise it will just fail and be retried later.