Add job suspend semantics

Open xiaoxubeii opened this issue 2 years ago • 16 comments

Adds support for job suspend semantics, similar to the Kubernetes batch Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job
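For readers unfamiliar with the semantics in the linked Kubernetes documentation, here is a rough sketch of what "suspend" means for a batch Job. It is not code from this PR or from any Kubeflow repository. `JobSketch` and `reconcile` are hypothetical names invented for illustration, and the model is heavily simplified. It assumes only the documented behaviour: a suspended Job has its active Pods terminated and no new ones created, and its start time is reset when it is resumed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobSketch:
    """Hypothetical, simplified stand-in for a job object (not a real API type)."""
    parallelism: int
    suspend: bool = False                      # mirrors the idea of `.spec.suspend`
    active_pods: List[str] = field(default_factory=list)
    start_time: Optional[float] = None         # mirrors the idea of `.status.startTime`
    suspended_condition: bool = False          # mirrors a "Suspended" status condition
    pods_created: int = 0                      # counter used only to name pods


def reconcile(job: JobSketch, now: float) -> None:
    """One illustrative reconcile pass over the job."""
    if job.suspend:
        # Suspended: terminate all active pods and do not create new ones.
        job.active_pods.clear()
        job.suspended_condition = True
        return

    # Running: the start time is (re)set on first start and on every resume,
    # so any deadline timer effectively restarts after a suspension.
    if job.start_time is None or job.suspended_condition:
        job.start_time = now
    job.suspended_condition = False

    # Create pods until the desired parallelism is reached.
    while len(job.active_pods) < job.parallelism:
        job.active_pods.append(f"pod-{job.pods_created}")
        job.pods_created += 1
```

Setting `suspend = True` and reconciling drops the job to zero active pods. Setting it back to `False` recreates the pods and resets `start_time`, which is the behaviour a queueing system relies on to hold jobs until resources are available.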

xiaoxubeii avatar Jul 04 '22 09:07 xiaoxubeii

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

To complete the pull request process, please assign gaocegege after the PR has been reviewed. You can assign the PR to them by writing /assign @gaocegege in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

google-oss-prow[bot] avatar Jul 04 '22 09:07 google-oss-prow[bot]

/ok-to-test

gaocegege avatar Jul 04 '22 09:07 gaocegege

Thanks for the PR. Is it ready for review?

gaocegege avatar Jul 04 '22 09:07 gaocegege

@gaocegege Ready for review. Thanks :)

xiaoxubeii avatar Jul 14 '22 00:07 xiaoxubeii

How is this PR going now?

ggaaooppeenngg avatar Oct 17 '22 08:10 ggaaooppeenngg

Is this actively being worked on? Or will we get rid of the common repo first?

alculquicondor avatar Jan 05 '23 13:01 alculquicondor

@alculquicondor We will probably work on the Job suspend feature in the next Kubeflow release cycle (maybe Kubeflow v1.8?), since we didn't add this feature to the enhancement list for the upcoming release (v1.7) and the v1.7 feature freeze is coming up.

https://github.com/kubeflow/training-operator/issues/1683

Wed Jan 25th 2023 Week 18 Release Team Feature Freeze

https://github.com/kubeflow/community/blob/6ba2e0e754166989d2f0d06aae827ceafdb65b29/releases/release-1.7/README.md

tenzen-y avatar Jan 06 '23 19:01 tenzen-y

Agreed. We will take this up in the next release, after merging kubeflow/common as planned in https://github.com/kubeflow/training-operator/issues/1714#issuecomment-1374537434

johnugeorge avatar Jan 10 '23 16:01 johnugeorge

@tenzen-y how do you feel about starting with the integration for mpi-operator v2 and following up with training-operator later? It might give us a better chance to iterate faster and learn.

alculquicondor avatar Jan 20 '23 19:01 alculquicondor

@alculquicondor Yes, that is a good idea. I was thinking the same. However, we need to make progress on https://github.com/kubernetes-sigs/kueue/issues/369 before we adapt mpi-operator to Kueue.

tenzen-y avatar Jan 20 '23 20:01 tenzen-y

Excellent! We can work on the kueue side in parallel, while we add support for suspend in the mpi-operator.

alculquicondor avatar Jan 20 '23 20:01 alculquicondor

You are right. I will work on the following steps after the Kubeflow feature freeze date (1/25), since I don't have enough bandwidth for mpi-operator v2 right now:

  1. https://github.com/kubeflow/mpi-operator/pull/502
  2. https://github.com/kubeflow/mpi-operator/issues/500
  3. Support suspend in mpi-operator

That said, anyone else can take step 3 once step 1 is completed.

tenzen-y avatar Jan 20 '23 20:01 tenzen-y

@mimowo will help with suspend in mpi-operator https://github.com/kubeflow/mpi-operator/issues/504

alculquicondor avatar Jan 24 '23 16:01 alculquicondor

Great! Thanks to @mimowo!

tenzen-y avatar Jan 24 '23 16:01 tenzen-y

Agreed. We could try to work on the Job suspend feature for Kubeflow v1.8.

xiaoxubeii avatar Mar 14 '23 02:03 xiaoxubeii

@johnugeorge how are we doing with the branch creation? Can we proceed with this PR or move it to training-operator?

alculquicondor avatar Mar 29 '23 15:03 alculquicondor