Stop (un)tainting nodes from unselected node groups.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a bug where, at the start of RunOnce and during scale down, the code cleans DeletionCandidate taints from all nodes, including those that are not part of the selected node groups, i.e. the ones passed through the --nodes flag. This is clearly undesired behaviour, but I'd like to give a bit more context on how we came across this issue:
We are currently running our cluster on GKE, which comes with its own cluster autoscaler. It lacks many options, so we have deployed our own CA and disabled the managed one on all node pools, as there is no way to remove it completely. This means we have two CAs running at the same time, even though one is not supposed to do anything. Given the bug described above, the two conflict: while one (ours) tries to scale down some nodes, the other (GKE's) removes the DeletionCandidate taint from them because, from its perspective, those nodes should not be scaled down, even though their node groups have not been selected. This makes scale down very slow.
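Conceptually, the fix is to filter the node list down to the selected node groups before touching DeletionCandidate taints. Here is a minimal sketch of that idea, built on the existing cloudprovider.CloudProvider interface; the helper name and exact signature are illustrative, not the code in this PR:

```go
package sketch

import (
	"reflect"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// filterNodesFromSelectedNodeGroups keeps only the nodes that resolve to a node
// group known to the cloud provider, i.e. one of the groups selected via --nodes.
// Illustrative sketch only; names do not match the PR exactly.
func filterNodesFromSelectedNodeGroups(cp cloudprovider.CloudProvider, nodes []*apiv1.Node) ([]*apiv1.Node, error) {
	filtered := make([]*apiv1.Node, 0, len(nodes))
	for _, node := range nodes {
		ng, err := cp.NodeGroupForNode(node)
		if err != nil {
			// See "Special notes for your reviewer" below: it is not obvious
			// whether to abort here or carry on with the remaining nodes.
			return nil, err
		}
		// A nil (or typed-nil) node group means the node is not part of any
		// selected node group, so its taints should be left alone.
		if ng == nil || reflect.ValueOf(ng).IsNil() {
			continue
		}
		filtered = append(filtered, node)
	}
	return filtered, nil
}
```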
To give you a better idea: with the test case I've added, master fails with the following:
Error Trace: /home/fische/local/autoscaler/cluster-autoscaler/core/static_autoscaler_test.go:372
Error: Not equal:
expected: []v1.Taint{v1.Taint{Key:"DeletionCandidateOfClusterAutoscaler", Value:"1699914797", Effect:"PreferNoSchedule", TimeAdded:<nil>}}
actual : []v1.Taint{}
Diff:
--- Expected
+++ Actual
@@ -1,8 +1,2 @@
-([]v1.Taint) (len=1) {
- (v1.Taint) {
- Key: (string) (len=36) "DeletionCandidateOfClusterAutoscaler",
- Value: (string) (len=10) "1699914797",
- Effect: (v1.TaintEffect) (len=16) "PreferNoSchedule",
- TimeAdded: (*v1.Time)(<nil>)
- }
+([]v1.Taint) {
}
Test: TestStaticAutoscalerRunOnce
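For context, the assertion behind that failure is roughly of the following shape. This is a simplified, self-contained sketch using a fake clientset, not the actual TestStaticAutoscalerRunOnce setup; the node name is made up and the autoscaler iteration itself is elided:

```go
package sketch

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

// A node outside the selected node groups must keep its DeletionCandidate
// taint after the autoscaler loop has run.
func TestUnselectedNodeKeepsDeletionCandidateTaint(t *testing.T) {
	taint := apiv1.Taint{
		Key:    "DeletionCandidateOfClusterAutoscaler",
		Value:  "1699914797",
		Effect: apiv1.TaintEffectPreferNoSchedule,
	}
	node := &apiv1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: "unselected-node"},
		Spec:       apiv1.NodeSpec{Taints: []apiv1.Taint{taint}},
	}
	client := fake.NewSimpleClientset(node)

	// ... run the autoscaler iteration against this fake cluster (omitted) ...

	got, err := client.CoreV1().Nodes().Get(context.Background(), "unselected-node", metav1.GetOptions{})
	assert.NoError(t, err)
	assert.Equal(t, []apiv1.Taint{taint}, got.Spec.Taints)
}
```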
Which issue(s) this PR fixes:
I haven't raised any issue for this. Shall I?
Special notes for your reviewer:
I'm just not sure what we should do if there's an error while filtering the nodes during scale down. Should we continue anyway? I've added a comment in the code in case you don't see what I'm talking about (see the sketch below).
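To make the question concrete, here is roughly the call site I mean, as a sketch with illustrative names (cleanDeletionCandidateTaints is hypothetical, and this is not the PR's actual code):

```go
// If filtering fails, should the scale-down iteration abort, or continue
// with the unfiltered node list?
func cleanTaintsForSelectedGroups(cp cloudprovider.CloudProvider, allNodes []*apiv1.Node) error {
	filtered, err := filterNodesFromSelectedNodeGroups(cp, allNodes)
	if err != nil {
		// Option A: abort this iteration and surface the error.
		return err
		// Option B (alternative): log the error and fall back to allNodes,
		// accepting that unselected nodes may be (un)tainted this iteration.
	}
	cleanDeletionCandidateTaints(filtered) // hypothetical helper for the actual taint removal
	return nil
}
```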
Does this PR introduce a user-facing change?
This does somewhat introduce a user-facing change, yes, so I've added a release note, even though it is behaviour that users should not rely on.
Fixed undesired behaviour at the beginning of the autoscaler's main loop and during scale down, where DeletionCandidate taints would be cleaned from nodes that are not part of the selected node groups (--nodes flag).
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Welcome @fische!
It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes/autoscaler has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: fische / name: Maxime Fischer (4901303bef7fe5659db2193ac14ed100d3a45ac5, 91477aca4a7f97bdb3ef309130300a081490e47e, cc6ecec2d3e0412d4c3b1199ea9bbd4e7742f99c, 7383034c082215abc10d4ff4182e3452aa4f5512, c3810ec199d5d16314ef433023ecd560906300b9, 95ad6c99a290ac1f0ef5f7858d3c4e8b3bf6db68, 368441dcd0e4dc93fcc7e5774537fb53940dbee3, 5f6eedc8f7b9d7621cc760392c87d4af4cc26eb0, 90e24ac528955fe4a4840b81d36918753518cb9f, a47ef89f5a6eefa991858524ca017bfec1900503, e8e3ad0b1f16ae770f3e52c084d8430bf2104e8d)
@fische, you have to sign the CLA before the PR can be reviewed. See the following document to sign the CLA: Signing Contributor License Agreements (CLA)
To check EasyCLA
/easycla
Yup, sorry, I forgot about that.
@Shubham82 Would you mind taking a look now that the CLA has been signed please? :pray:
/assign
@x13n Could you please have another look?
Apologies for the slow review. LGTM now.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fische, x13n
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~cluster-autoscaler/OWNERS~~ [x13n]
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.