[WIP][DRAFT] Add WorkloadSlice Support to Enable Mutable Workloads
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces the foundational implementation of WorkloadSlices in Kueue, as proposed in KEP-77. WorkloadSlices enable controlled scaling of admitted workloads (e.g., scale-up) while preserving Kueue's scheduling guarantees and resource tracking semantics.
📌 Summary
- Introduces the WorkloadSlice concept as a transient workload object representing a logical scale-up request.
- Enables mutable workload behavior using a dual-Workload model:
- Original admitted workload.
- A new WorkloadSlice requesting the additional capacity (a temporary 1 Job : 2 Workloads state during the transition).
- Admission of the new WorkloadSlice triggers preemption of the original Workload, even if additional capacity is available, to enforce consistent admission-state transitions.
- Uses Pod scheduling gates (instead of `spec.suspend`) to gate new pods until the slice is admitted (see the sketch after this list).
- Defaulting logic enables the feature automatically for supported jobs (e.g., batch/v1 Job, RayJob).
- Ensures all new pods created during the transition are gated until the corresponding Workload is admitted.
- Aggregates admission state and lifecycle management into the core Workload controller flow.
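To make the scheduling-gate approach concrete, below is a minimal sketch of how a job integration could keep newly created pods unschedulable until the corresponding slice is admitted. The gate name `kueue.x-k8s.io/workload-slice` is a placeholder chosen for illustration, not necessarily the constant used by this PR.

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
)

// sliceSchedulingGate is a placeholder gate name used for illustration only;
// the actual constant is defined by the Kueue implementation.
const sliceSchedulingGate = "kueue.x-k8s.io/workload-slice"

// gatePodTemplate appends the scheduling gate to a pod template unless it is
// already present. Pods created from a gated template stay in the
// SchedulingGated state until the gate is removed, so pods added during a
// scale-up cannot be scheduled before the WorkloadSlice is admitted.
func gatePodTemplate(tpl *corev1.PodTemplateSpec) {
	for _, g := range tpl.Spec.SchedulingGates {
		if g.Name == sliceSchedulingGate {
			return
		}
	}
	tpl.Spec.SchedulingGates = append(tpl.Spec.SchedulingGates,
		corev1.PodSchedulingGate{Name: sliceSchedulingGate})
}
```

Unlike `spec.suspend`, which pauses the whole Job, gating applies per pod, so already-running pods keep running while only the newly added pods wait for admission.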
📎 Additional Notes
- Fully backward-compatible with existing single-Workload flow.
- Includes tests for:
- Slice creation logic.
- Admission/preemption interaction.
- Scheduling gate behavior.
- Documentation and KEP link updates to follow in separate PR.
⚠️ Known Limitations
- Multi-cluster support for WorkloadSlices is still a work in progress and will be addressed either in this PR or in follow-up PRs.
Which issue(s) this PR fixes:
Fixes #5528
Special notes for your reviewer:
Does this PR introduce a user-facing change?
This change introduces the foundational support for WorkloadSlices in accordance with KEP-77. The initial implementation targets batch/v1.Job, enabling horizontal scaling through slice-based workload admission, scheduling gate control, and slice preemption handling.
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: ichekrygin / name: Illya Chekrygin
Hi @ichekrygin. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
/ok-to-test
/retest
/retitle Support Elastic Jobs via WorkloadSlices
Just to align with the recent naming in the KEP.
/retest
@mimowo, thank you for the detailed and insightful feedback. I reviewed, remedied, and/or replied to all your comments. PTAL when you get a moment.
Thank you @ichekrygin. I'm pretty confident about the PR as far as Alpha goes. Going forward, I have a couple of thoughts we may tackle in future iterations, or even in the first one if you have time:
- What happens if the old workload is preempted by another workload? Then we will end up with two Pending workloads. I'm wondering if it would be better to mark the old one as Finished in this case.
- Would the scale-down support work on day one for all other Job types, like RayCluster or JobSet?
- I would like to invert control of ungating the pods and move it to a dedicated controller. It would observe Workloads and, for an admitted workload, find all the associated pods and ungate a given number of them. This could work for arbitrary Job types; otherwise we need a lot of integration-specific code. A similar problem is solved in TopologyUngater, and I'm happy to drive the work on commonizing the approaches (a rough sketch of the idea follows below).
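A rough, hedged sketch of that dedicated-ungater idea, assuming (hypothetically) that pods carry a workload label and a single well-known scheduling gate; neither identifier below is necessarily what Kueue or this PR uses:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Assumed for illustration only: both names below are placeholders, not the
// identifiers used by Kueue.
const (
	workloadLabel = "kueue.x-k8s.io/workload"    // hypothetical pod label
	elasticGate   = "kueue.x-k8s.io/elastic-job" // hypothetical gate name
)

// ungatePods removes the scheduling gate from up to "count" gated pods that
// belong to the given admitted workload. A dedicated controller could call
// this for any Job type, avoiding integration-specific ungating code.
func ungatePods(ctx context.Context, c client.Client, namespace, workload string, count int) error {
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{workloadLabel: workload}); err != nil {
		return err
	}
	for i := range pods.Items {
		if count == 0 {
			return nil
		}
		pod := &pods.Items[i]
		kept := pod.Spec.SchedulingGates[:0]
		removed := false
		for _, g := range pod.Spec.SchedulingGates {
			if g.Name == elasticGate {
				removed = true
				continue
			}
			kept = append(kept, g)
		}
		if !removed {
			continue
		}
		pod.Spec.SchedulingGates = kept
		if err := c.Update(ctx, pod); err != nil {
			return err
		}
		count--
	}
	return nil
}
```

Wired into a controller watching Workloads, this would ungate pods generically for any Job type, along the lines of how TopologyUngater handles its analogous problem.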
> What happens if the old workload is preempted by another workload? Then we'll end up with two Pending workloads. I'm wondering if it would be better to mark the old one as Finished in this case.
Good question. The old workload slice shows up as a preemption target because of how flavor assignment works. And yeah, it’s possible that multiple workloads being scheduled could try to preempt the same old slice.
That said, Kueue already handles overlapping preemption targets. If a workload’s preemption target was already finalized earlier in the same scheduling cycle, that workload just gets skipped, whether it’s a new slice or something else.
So if the old slice gets evicted by some workload other than the new one, the new slice will get skipped too, leaving it pending with the new (scaled-up) definition but not admitted. And the reverse is also true: if the new slice evicts the old one, any other workloads that were planning to preempt the old slice will get skipped.
This keeps us from ending up with multiple workloads depending on the same preemption target, and avoids running into conflicting Pending states.
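Purely as an illustration of that skip behavior (not the actual scheduler code), a scheduling cycle can be thought of as claiming preemption targets and skipping any later candidate whose targets were already claimed:

```go
package example

// candidate pairs a pending workload with the workloads it would need to
// preempt (for a new slice, that includes its own old slice).
type candidate struct {
	name    string
	targets []string
}

// admitCycle sketches one scheduling cycle: once a preemption target has been
// claimed by an earlier candidate, any later candidate that also depends on
// it is skipped and stays Pending until the next cycle.
func admitCycle(candidates []candidate) (admitted, skipped []string) {
	claimed := map[string]bool{}
	for _, c := range candidates {
		conflict := false
		for _, t := range c.targets {
			if claimed[t] {
				conflict = true
				break
			}
		}
		if conflict {
			skipped = append(skipped, c.name)
			continue
		}
		for _, t := range c.targets {
			claimed[t] = true
		}
		admitted = append(admitted, c.name)
	}
	return admitted, skipped
}
```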
I looked at https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_kueue/5510/pull-kueue-test-integration-baseline-main/1946075894124646400 and it seems like an unrelated flake. Note that the integration test does not enable the new feature gate.
Let me retry; if confirmed, I will open an issue.
/test pull-kueue-test-integration-baseline-main
Opened: https://github.com/kubernetes-sigs/kueue/issues/6018
@ichekrygin I think this is very close to being mergeable. I left a bunch of comments, mostly renames to use "replacing" terminology consistently rather than preemption, because the mechanism only marginally relies on preemptions.
It would also be great to add integration tests for the happy path. The release is on Friday, so I think we still have a bit of time to address the comments.
Feel free to also squash the commits. There are 33 of them; I highly doubt anyone would like to traverse them :)
LGTM, but please address the remaining comments
Let's make the note a bit more user-oriented; I think the workload-slice replacement is more of a technical detail. Putting a link to KEP-77 is probably enough for interested readers.
/release-note-edit
Support for Elastic Jobs (dynamically sized Jobs) in Alpha, as designed in [KEP-77](https://github.com/kubernetes-sigs/kueue/tree/main/keps/77-dynamically-sized-jobs).
The implementation supports resizing (scale up and down) of batch/v1.Job and is behind the Alpha
`ElasticJobsViaWorkloadSlices` feature gate. Jobs which are subject to resizing need to have the
`kueue.x-k8s.io/elastic-job` annotation added at creation time.
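For illustration, here is a sketch of such a Job built with the client-go API types. The `kueue.x-k8s.io/queue-name` label is Kueue's standard queue label and the annotation key comes from the note above; the annotation value, names, and image are assumptions made for the example.

```go
package example

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// elasticJob sketches a batch/v1 Job that opts into resizing. The annotation
// must be present at creation time, and the ElasticJobsViaWorkloadSlices
// feature gate must be enabled on the Kueue controller. The annotation value
// "true" is an assumption for this example.
func elasticJob() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "sample-elastic-job",
			Namespace: "default",
			Labels:    map[string]string{"kueue.x-k8s.io/queue-name": "user-queue"},
			Annotations: map[string]string{
				"kueue.x-k8s.io/elastic-job": "true",
			},
		},
		Spec: batchv1.JobSpec{
			Parallelism: ptr.To[int32](2),
			Completions: ptr.To[int32](4),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "worker",
						Image:   "busybox",
						Command: []string{"sleep", "60"},
					}},
				},
			},
		},
	}
}
```

Resizing would then amount to patching `spec.parallelism` (e.g., from 2 to 4), with the extra pods gated until the new slice is admitted.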
/lgtm
/approve
Thank you for your relentless work on KEP-77 and this implementation PR. This is one of the oldest and most anticipated KEPs in Kueue. While we still have a long way to go (e.g., support for other Job CRDs, MultiKueue, TAS), this is a huge milestone, and I'm very happy to get this in.
FYI @tenzen-y: Since the release is approaching and all of my comments have been addressed, I am merging this now to avoid potential conflicts with other PRs. I've taken extra care to ensure all new code is behind the alpha feature gate. Please feel free to add any further comments or open a new issue for follow-up items. I'm confident we can address them.
LGTM label has been added.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ichekrygin, mimowo