karpenter
feat: disruption controls by reason
Fixes #924
Description
This PR implements the design for disruption-controls-by-reason by introducing a new field, Reasons, into the v1beta1 NodePool API. This change allows users to define lists of reasons alongside their node disruption budgets.
The reasons that can be defined are "underutilized", "drifted", "emptied", and "expired", each of which corresponds to a different disruption method.
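For illustration, here is a minimal sketch of what a NodePool using per-reason budgets might look like. The serialized field name (`reasons`) and the casing of the reason values are assumed from the design doc and may differ in the merged API:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    budgets:
      # Applies only to nodes disrupted because they drifted from
      # the NodePool spec: at most 10% of nodes at a time.
      - nodes: "10%"
        reasons:
          - drifted
      # No reasons specified: behaves like today's budgets and
      # applies to all disruption reasons.
      - nodes: "5"
```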
How was this change tested?
- Kwok provider scale-up and scale-down, validating that budgets restricted disruption as expected
- Parity with the existing Budgets feature: the unmodified budgets tests still pass and the code behaves as before
- Tests for DisruptionBudgetMapping structure
- make presubmit
TODO
- PR To Docs
- Test on AKS Provider
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
Pull Request Test Coverage Report for Build 9520161962
Warning: This coverage report may be inaccurate.
This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
- For more information on this, see Tracking coverage changes with pull request builds.
- To avoid this issue with future PRs, see these Recommended CI Configurations.
- For a quick fix, rebase this PR at GitHub. Your next report should be accurate.
Details
- 103 of 112 (91.96%) changed or added relevant lines in 12 files are covered.
- 7 unchanged lines in 3 files lost coverage.
- Overall coverage decreased (-0.09%) to 81.197%
Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
---|---|---|---
pkg/apis/v1beta1/nodepool.go | 14 | 15 | 93.33%
pkg/controllers/disruption/multinodeconsolidation.go | 5 | 6 | 83.33%
pkg/controllers/disruption/singlenodeconsolidation.go | 4 | 5 | 80.0%
pkg/controllers/disruption/helpers.go | 48 | 50 | 96.0%
pkg/apis/v1beta1/zz_generated.deepcopy.go | 1 | 5 | 20.0%
Totals | 103 | 112 | 91.96%

Files with Coverage Reduction | New Missed Lines | %
---|---|---
pkg/test/expectations/expectations.go | 2 | 93.69%
pkg/controllers/disruption/expiration.go | 2 | 89.66%
pkg/apis/v1beta1/nodepool.go | 3 | 95.24%
Totals | 7 |

Totals |
---|---
Change from base Build 9487761453: | -0.09%
Covered Lines: | 8313
Relevant Lines: | 10238
💛 - Coveralls
However, I don't think the existing "Default" shared budget idea meshes well with that. I'm wondering if it could be rephrased as something like MaxNodePoolDisruption, and always taken into account as an overall shared maximum disruption regardless of what the individual reasons' disruptions were. I think this area needs a bit more design to finalize.
"Default" now just means a budget whose reasons are unspecified, which meshes with the idea fine imo. I updated the design doc to explain this relationship a bit better. LMK if you have any strong opinions.
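To make that relationship concrete, here is a hedged sketch, assuming that (as with the existing Budgets feature) the most restrictive applicable budget wins: a reason-less budget acts as a shared cap across all reasons, and a reason-specific budget further constrains only its listed reasons.

```yaml
spec:
  disruption:
    budgets:
      # Shared default: no reasons listed, so this caps disruption
      # for every reason at 5 nodes at a time.
      - nodes: "5"
      # Additionally caps drift-driven disruption at 10% of nodes;
      # drift is limited by the stricter of the two budgets.
      - nodes: "10%"
        reasons:
          - drifted
```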
@Bryce-Soghigian Another use-case that I've heard and something to consider here: Users who have image updates that are rolled out pretty aggressively and, thus, would experience high churn for their nodes if they didn't specifically block drift, but are open to having nodes still be rolled by expiry. This is the first use-case that I've heard where someone wants to decouple the expiration and drift reasons.
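Sketching that use case under this PR's API (field names assumed as above): a zero-node budget scoped to drift blocks drift-driven disruption entirely, while expiration can still roll nodes under the general budget.

```yaml
spec:
  disruption:
    budgets:
      # Block drift-driven disruption entirely.
      - nodes: "0"
        reasons:
          - drifted
      # Expiration (and all other reasons) may still disrupt up to
      # 10% of nodes at a time.
      - nodes: "10%"
```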
/retest
/assign @njtran
~/dev/focus/karpenter-core/pkg/controllers/disruption (bsoghigian/disruption-by-method-poc) » K8S_VERSION="1.25.x" ginkgo -focus="Consolidation Timeout"
Running Suite: Disruption - /Users/sillygoose/dev/focus/karpenter-core/pkg/controllers/disruption
Random Seed: 1717713256
Will run 2 of 192 specs
Ran 2 of 192 Specs in 14.999 seconds
SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 190 Skipped
PASS
Ginkgo ran 1 suite in 20.270983625s
Test Suite Passed

The timeout suite passes locally, but it fails in CI at https://github.com/kubernetes-sigs/karpenter/actions/runs/9408422622/job/25916368437?pr=991#step:5:196. Let me investigate whether it's just a flake and why it's flaky.
/retest
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Bryce-Soghigian, njtran
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [njtran]
Approvers can indicate their approval by writing /approve
in a comment
Approvers can cancel approval by writing /approve cancel
in a comment
🥂
@jonathan-innis which release is this scheduled for? This behavior is exactly what we need also :)