Add job suspend semantics
Support job suspend semantics like those of the Kubernetes batch Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please assign gaocegege after the PR has been reviewed.
You can assign the PR to them by writing /assign @gaocegege in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/ok-to-test
Thanks for the PR, is it ready to review?
@gaocegege Ready for review. Thanks :)
How is this PR going now?
Is this actively being worked on? Or will we get rid of the common repo first?
@alculquicondor Maybe we will work on the Job suspend feature in the next kubeflow release cycle (perhaps kubeflow v1.8?), since we didn't add this feature to the enhancement list for the next kubeflow release (v1.7) and the v1.7 feature freeze is coming up.
https://github.com/kubeflow/training-operator/issues/1683
Wed Jan 25th 2023 Week 18 Release Team Feature Freeze
https://github.com/kubeflow/community/blob/6ba2e0e754166989d2f0d06aae827ceafdb65b29/releases/release-1.7/README.md
Agreed. We will take this up in the next release, after merging kubeflow/common as planned in https://github.com/kubeflow/training-operator/issues/1714#issuecomment-1374537434
@tenzen-y how do you feel about starting with the integration for mpi-operator v2 and follow through with training-operator later? It might give us a better chance to iterate faster and learn.
@alculquicondor Yes, that is a good idea. I was thinking the same. However, we need to move https://github.com/kubernetes-sigs/kueue/issues/369 forward before we adapt mpi-operator to Kueue.
Excellent! We can work on the kueue side in parallel, while we add support for suspend in the mpi-operator.
You are right. I will work on the following steps after the kubeflow feature freeze date (1/25), since I don't have enough bandwidth for mpi-operator v2 right now:
1. https://github.com/kubeflow/mpi-operator/pull/502
2. https://github.com/kubeflow/mpi-operator/issues/500
3. Support suspend in mpi-operator
That said, anyone else can take step 3 once step 1 is completed.
@mimowo will help with suspend in mpi-operator https://github.com/kubeflow/mpi-operator/issues/504
Great! Thanks to @mimowo!
Agreed. We could try to work on the Job suspend feature for kubeflow v1.8.
@johnugeorge how are we doing with the branch creation? Can we proceed with this PR or move it to training-operator?