karpenter icon indicating copy to clipboard operation
karpenter copied to clipboard

feat: disruption controls by reason

Open Bryce-Soghigian opened this issue 1 year ago • 7 comments

Fixes #924

Description This PR implements the design for disruption-controls-by-reason by introducing a new field Reasons into the v1beta1 Nodepool API. This change allows users to define lists of reasons alongside their node disruption budgets.

These reasons that can be defined are "underutilized", "drifted", "emptied" and "expired" which all correlate with different disruption methods.

How was this change tested?

  • Kwok Provider Scale up and scale down, validated that budgets were restricted
  • Parity with existing Budgets feature via no modified budgets tests, and code still behaves properly
  • Tests for DisruptionBudgetMapping structure
  • make presubmit

TODO

  • PR To Docs
  • Test on AKS Provider

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Bryce-Soghigian avatar Feb 06 '24 08:02 Bryce-Soghigian

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

k8s-ci-robot avatar Feb 06 '24 08:02 k8s-ci-robot

Pull Request Test Coverage Report for Build 9520161962

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 103 of 112 (91.96%) changed or added relevant lines in 12 files are covered.
  • 7 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.09%) to 81.197%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/apis/v1beta1/nodepool.go 14 15 93.33%
pkg/controllers/disruption/multinodeconsolidation.go 5 6 83.33%
pkg/controllers/disruption/singlenodeconsolidation.go 4 5 80.0%
pkg/controllers/disruption/helpers.go 48 50 96.0%
pkg/apis/v1beta1/zz_generated.deepcopy.go 1 5 20.0%
<!-- Total: 103 112
Files with Coverage Reduction New Missed Lines %
pkg/test/expectations/expectations.go 2 93.69%
pkg/controllers/disruption/expiration.go 2 89.66%
pkg/apis/v1beta1/nodepool.go 3 95.24%
<!-- Total: 7
Totals Coverage Status
Change from base Build 9487761453: -0.09%
Covered Lines: 8313
Relevant Lines: 10238

💛 - Coveralls

coveralls avatar Feb 11 '24 23:02 coveralls

However, I don't think the existing "Default" shared budget idea meshes well with that. Wondering if that could be rephased as something like MaxNodePoolDisruption, and always taken into account as an overall shared max disruption regardless of what the individual reasons disruptions were. I think this area need a bit more design to finalize.

Default now is just when its unspecified and meshes with the idea fine imo. I updated the design doc to explain this relationship a bit better. LMK if you have any strong opinions

Bryce-Soghigian avatar Feb 18 '24 02:02 Bryce-Soghigian

@Bryce-Soghigian Another use-case that I've heard and something to consider here: Users who have image updates that are rolled out pretty aggressively and, thus, would experience high churn for their nodes if they didn't specifically block drift, but are open to having nodes still be rolled by expiry. This is the first use-case that I've heard where someone wants to decouple the expiration and drift reasons.

jonathan-innis avatar Feb 29 '24 17:02 jonathan-innis

/retest

Bryce-Soghigian avatar Apr 06 '24 19:04 Bryce-Soghigian

/assign @njtran

jonathan-innis avatar Apr 16 '24 18:04 jonathan-innis

~/dev/focus/karpenter-core/pkg/controllers/disruption (bsoghigian/disruption-by-method-poc) » K8S_VERSION="1.25.x" ginkgo -focus="Consolidation Timeout" sillygoose@Bryces-MacBook-Pro Running Suite: Disruption - /Users/sillygoose/dev/focus/karpenter-core/pkg/controllers/disruption

Random Seed: 1717713256

Will run 2 of 192 specs SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS••SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS

Ran 2 of 192 Specs in 14.999 seconds

Timeout suite passes locally, https://github.com/kubernetes-sigs/karpenter/actions/runs/9408422622/job/25916368437?pr=991#step:5:196 fails in CI. Let me investigate to see if its just flake and why its flakey SUCCESS! -- 2 Passed | 0 Failed | 0 Pending | 190 Skipped PASS

Ginkgo ran 1 suite in 20.270983625s Test Suite Passed

Bryce-Soghigian avatar Jun 06 '24 22:06 Bryce-Soghigian

/retest

Bryce-Soghigian avatar Jun 06 '24 22:06 Bryce-Soghigian

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jun 08 '24 06:06 k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Bryce-Soghigian, njtran

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Jun 17 '24 21:06 k8s-ci-robot

🥂

ellistarn avatar Jun 17 '24 21:06 ellistarn

@jonathan-innis which release is this scheduled for? This behavior is exactly what we need also :)

jeff-1amstudios avatar Aug 12 '24 23:08 jeff-1amstudios