jobset icon indicating copy to clipboard operation
jobset copied to clipboard

support elastic jobset

Open kannon92 opened this issue 1 year ago • 18 comments

Fixes https://github.com/kubernetes-sigs/jobset/issues/463

Add elastic jobset support (downscale/upscaling replicas of a Replicated Job).

Before this change, we treated ReplicateJobs (in JobSet.Spec) as immutable. This relaxes that restriction.

I choose to validate that names cannot be changed and the job templates must not change during updates. I don't see a reason for names to change and we also don't allow order change of the replicated job list so we can detect differences easily.

I am reopening this to resolve the rebase labels..

kannon92 avatar Jul 22 '24 18:07 kannon92

Deploy Preview for kubernetes-sigs-jobset ready!

Name Link
Latest commit bf458c6386b610e49d236f0e38585fe9087553cd
Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/67000c7d0434c600085e7192
Deploy Preview https://deploy-preview-622--kubernetes-sigs-jobset.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Jul 22 '24 18:07 netlify[bot]

/assign @danielvegamyhre

@mimowo wouls you be open for a review also?

kannon92 avatar Jul 24 '24 12:07 kannon92

/cc @tenzen-y

kannon92 avatar Jul 25 '24 13:07 kannon92

/hold

Missed the conflicts..

kannon92 avatar Jul 27 '24 00:07 kannon92

@mimowo wouls you be open for a review also?

Ack, I will try this week, but don't wait for my review unnecessarily if you get enough input from others.

mimowo avatar Jul 29 '24 07:07 mimowo

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kannon92 Once this PR has been reviewed and has the lgtm label, please ask for approval from danielvegamyhre. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jul 29 '24 13:07 k8s-ci-robot

@danielvegamyhre do you know how to remove the rebase label? I did resolve conflicts in my branch and pushes those up. I think the prow removal of it is not working.

ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1708018595891739

Looks like rebase label can have issues. So I think removing is fine. It also doesn't block the merge so this is ready for review.

kannon92 avatar Jul 29 '24 13:07 kannon92

/remove needs-rebase

kannon92 avatar Jul 29 '24 19:07 kannon92

@danielvegamyhre do you know how to remove the rebase label? I did resolve conflicts in my branch and pushes those up. I think the prow removal of it is not working.

ref: https://kubernetes.slack.com/archives/C09QZ4DQB/p1708018595891739

Looks like rebase label can have issues. So I think removing is fine. It also doesn't block the merge so this is ready for review.

I removed it via the github UI, the "labels" button on the top right.

danielvegamyhre avatar Jul 29 '24 20:07 danielvegamyhre

I think since you have admin rights to the repo you have access to that.

Thank you though!

kannon92 avatar Jul 29 '24 21:07 kannon92

/hold cancel

kannon92 avatar Jul 30 '24 13:07 kannon92

Pushed up some e2e tests for this. I need to work on the downscale case.

What should be the behavior for elastic jobs on a downscale?

If more jobs exists, do we delete the extra jobs only? Or should we delete and recreate all the jobs?

On upscale, the extra jobs are created and I think the original ones are left alone.

On downscale, I guess we could delete the extra replicas.

kannon92 avatar Aug 04 '24 14:08 kannon92

@kannon92 Looks like e2e tests are failing, can you take a look?

danielvegamyhre avatar Aug 06 '24 17:08 danielvegamyhre

Yes, my hope is to get some clarity on the design/code.

Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

kannon92 avatar Aug 06 '24 17:08 kannon92

Yes, my hope is to get some clarity on the design/code.

Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.

danielvegamyhre avatar Aug 07 '24 17:08 danielvegamyhre

Yes, my hope is to get some clarity on the design/code. Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.

Maybe I’m overthinking it.

Say I have 4 jobs and I downscale to 2. Should I only delete the 3rd and 4th replicated job via index? Say I want to keep the 1st and 2nd job running and I remove the excess ones?

kannon92 avatar Aug 08 '24 02:08 kannon92

Yes, my hope is to get some clarity on the design/code. Downscaling doesn't work for sure. What should we do on a downscale of replicas in ReplicatedJobs? Do we delete all the jobs and recreate? Or do we terminate the jobs that are not in the list of replicas?

Hmm, I think if replicatedJob.replicas is downscaled from N to N-2, we should just delete 2 jobs so the state matches the spec.

Maybe I’m overthinking it.

Say I have 4 jobs and I downscale to 2. Should I only delete the 3rd and 4th replicated job via index? Say I want to keep the 1st and 2nd job running and I remove the excess ones?

If I recall correctly the Job controller works this way for Elastic Indexed Jobs, removing indexes from largest to smallest until there are no more excess pods, so it would make sense to follow that pattern here for consistency I think.

danielvegamyhre avatar Aug 08 '24 04:08 danielvegamyhre

@danielvegamyhre @mimowo

I got the downscale to work so I think this PR is ready for review.

kannon92 avatar Aug 17 '24 14:08 kannon92

@ahg-g id thinking we should clear use cases with this in mind. I am not a ML engineer so I’m not really able to give clear concrete cases for this. Let me ask around and see. Otherwise I may close this and wait until we have a customer ask for this.

kannon92 avatar Oct 14 '24 12:10 kannon92

/close

We need more design/use cases on this. Going to close for now.

kannon92 avatar Oct 14 '24 15:10 kannon92

@kannon92: Closed this PR.

In response to this:

/close

We need more design/use cases on this. Going to close for now.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Oct 14 '24 15:10 k8s-ci-robot