jobset
jobset copied to clipboard
support elastic jobset
Fixes https://github.com/kubernetes-sigs/jobset/issues/463
Add elastic jobset support (downscale/upscaling replicas of a Replicated Job).
Before this change, we treated ReplicateJobs (in JobSet.Spec) as immutable. This relaxes that restriction.
I choose to validate that names cannot be changed and the job templates must not change during updates. I don't see a reason for names to change and we also don't allow order change of the replicated job list so we can detect differences easily.
I am reopening this to resolve the rebase labels..
Deploy Preview for kubernetes-sigs-jobset ready!
| Name | Link |
|---|---|
| Latest commit | bf458c6386b610e49d236f0e38585fe9087553cd |
| Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/67000c7d0434c600085e7192 |
| Deploy Preview | https://deploy-preview-622--kubernetes-sigs-jobset.netlify.app |
| Preview on mobile | Toggle QR Code...Use your smartphone camera to open QR code link. |
To edit notification comments on pull requests, go to your Netlify site configuration.
/assign @danielvegamyhre
@mimowo wouls you be open for a review also?
/cc @tenzen-y
/hold
Missed the conflicts..
@mimowo wouls you be open for a review also?
Ack, I will try this week, but don't wait for my review unnecessarily if you get enough input from others.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: kannon92 Once this PR has been reviewed and has the lgtm label, please ask for approval from danielvegamyhre. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@danielvegamyhre do you know how to remove the rebase label? I did resolve conflicts in my branch and pushes those up. I think the prow removal of it is not working.
ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1708018595891739
Looks like rebase label can have issues. So I think removing is fine. It also doesn't block the merge so this is ready for review.
/remove needs-rebase
@danielvegamyhre do you know how to remove the rebase label? I did resolve conflicts in my branch and pushes those up. I think the prow removal of it is not working.
ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1708018595891739
Looks like rebase label can have issues. So I think removing is fine. It also doesn't block the merge so this is ready for review.
I removed it via the github UI, the "labels" button on the top right.
I think since you have admin rights to the repo you have access to that.
Thank you though!
/hold cancel
Pushed up some e2e tests for this. I need to work on the downscale case.
What should be the behavior for elastic jobs on a downscale?
If more jobs exists, do we delete the extra jobs only? Or should we delete and recreate all the jobs?
On upscale, the extra jobs are created and I think the original ones are left alone.
On downscale, I guess we could delete the extra replicas.
@kannon92 Looks like e2e tests are failing, can you take a look?
Yes, my hope is to get some clarity on the design/code.
Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?
Yes, my hope is to get some clarity on the design/code.
Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?
Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.
Yes, my hope is to get some clarity on the design/code. Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?
Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.
Maybe Iām overthinking it.
Say I have 4 jobs and I downscale to 2. Should I only delete the 3rd and 4th replicated job via index? Say I want to keep the 1st and 2nd job running and I remove the excess ones?
Yes, my hope is to get some clarity on the design/code. Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?
Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.
Maybe Iām overthinking it.
Say I have 4 jobs and I downscale to 2. Should I only delete the 3rd and 4th replicated job via index? Say I want to keep the 1st and 2nd job running and I remove the excess ones?
If I recall correctly the Job controller works this way for Elastic Indexed Jobs, removing indexes from largest to smallest until there are no more excess pods, so it would make sense to follow that pattern here for consistency I think.
@danielvegamyhre @mimowo
I got the downscale to work so I think this PR is ready for review.
@ahg-g id thinking we should clear use cases with this in mind. I am not a ML engineer so Iām not really able to give clear concrete cases for this. Let me ask around and see. Otherwise I may close this and wait until we have a customer ask for this.
/close
We need more design/use cases on this. Going to close for now.
@kannon92: Closed this PR.
In response to this:
/close
We need more design/use cases on this. Going to close for now.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.