Stop (un)tainting nodes from unselected node groups.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a bug where, at the start of RunOnce and during scale down, the code cleans DeletionCandidate taints from all nodes, including those that are not part of the selected node groups, i.e. the ones passed through the --nodes flag. This is clearly undesired behaviour, but I'd like to give a bit more context on how we came across this issue:
We are currently running our cluster on GKE, which comes with its own cluster autoscaler. It lacks many options, so we have deployed our own CA and disabled the managed one on all node pools, as there is no way to remove it completely. This means we have two CAs running at the same time, even though one is not supposed to do anything. Given the bug described above, the two conflict: while one (ours) tries to scale down some nodes, the other (GKE's) removes the DeletionCandidate taint from them because, from its perspective, those nodes should not be scaled down, even though their node groups have not been selected. This makes scale down very slow.
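Conceptually, the fix is to filter the node list down to the selected node groups before touching DeletionCandidate taints. Here is a minimal sketch of that idea, built on the existing cloudprovider.CloudProvider interface; the helper name and exact signature are illustrative, not the code in this PR:

```go
package sketch

import (
	"reflect"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// filterNodesFromSelectedNodeGroups keeps only the nodes that resolve to a node
// group known to the cloud provider, i.e. one of the groups selected via --nodes.
// Illustrative sketch only; names do not match the PR exactly.
func filterNodesFromSelectedNodeGroups(cp cloudprovider.CloudProvider, nodes []*apiv1.Node) ([]*apiv1.Node, error) {
	filtered := make([]*apiv1.Node, 0, len(nodes))
	for _, node := range nodes {
		ng, err := cp.NodeGroupForNode(node)
		if err != nil {
			// See "Special notes for your reviewer" below: it is not obvious
			// whether to abort here or carry on with the remaining nodes.
			return nil, err
		}
		// A nil (or typed-nil) node group means the node is not part of any
		// selected node group, so its taints should be left alone.
		if ng == nil || reflect.ValueOf(ng).IsNil() {
			continue
		}
		filtered = append(filtered, node)
	}
	return filtered, nil
}
```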
To give you a better idea: with the test case I've added, master fails with the following:
Error Trace: /home/fische/local/autoscaler/cluster-autoscaler/core/static_autoscaler_test.go:372
Error: Not equal:
expected: []v1.Taint{v1.Taint{Key:"DeletionCandidateOfClusterAutoscaler", Value:"1699914797", Effect:"PreferNoSchedule", TimeAdded:<nil>}}
actual : []v1.Taint{}
Diff:
--- Expected
+++ Actual
@@ -1,8 +1,2 @@
-([]v1.Taint) (len=1) {
- (v1.Taint) {
- Key: (string) (len=36) "DeletionCandidateOfClusterAutoscaler",
- Value: (string) (len=10) "1699914797",
- Effect: (v1.TaintEffect) (len=16) "PreferNoSchedule",
- TimeAdded: (*v1.Time)(<nil>)
- }
+([]v1.Taint) {
}
Test: TestStaticAutoscalerRunOnce
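For context, the assertion behind that failure is roughly of the following shape. This is a simplified, self-contained sketch using a fake clientset, not the actual TestStaticAutoscalerRunOnce setup; the node name is made up and the autoscaler iteration itself is elided:

```go
package sketch

import (
	"context"
	"testing"

	"github.com/stretchr/testify/assert"
	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

// A node outside the selected node groups must keep its DeletionCandidate
// taint after the autoscaler loop has run.
func TestUnselectedNodeKeepsDeletionCandidateTaint(t *testing.T) {
	taint := apiv1.Taint{
		Key:    "DeletionCandidateOfClusterAutoscaler",
		Value:  "1699914797",
		Effect: apiv1.TaintEffectPreferNoSchedule,
	}
	node := &apiv1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: "unselected-node"},
		Spec:       apiv1.NodeSpec{Taints: []apiv1.Taint{taint}},
	}
	client := fake.NewSimpleClientset(node)

	// ... run the autoscaler iteration against this fake cluster (omitted) ...

	got, err := client.CoreV1().Nodes().Get(context.Background(), "unselected-node", metav1.GetOptions{})
	assert.NoError(t, err)
	assert.Equal(t, []apiv1.Taint{taint}, got.Spec.Taints)
}
```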
Which issue(s) this PR fixes:
I haven't raised any issue for this. Shall I?
Special notes for your reviewer:
I'm just not sure what we should do if there's an error while filtering the nodes during scale down. Should we continue anyway? I've added a comment in the code in case you don't see what I'm talking about (see the sketch below).
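To make the question concrete, here is roughly the call site I mean, as a sketch with illustrative names (cleanDeletionCandidateTaints is hypothetical, and this is not the PR's actual code):

```go
// If filtering fails, should the scale-down iteration abort, or continue
// with the unfiltered node list?
func cleanTaintsForSelectedGroups(cp cloudprovider.CloudProvider, allNodes []*apiv1.Node) error {
	filtered, err := filterNodesFromSelectedNodeGroups(cp, allNodes)
	if err != nil {
		// Option A: abort this iteration and surface the error.
		return err
		// Option B (alternative): log the error and fall back to allNodes,
		// accepting that unselected nodes may be (un)tainted this iteration.
	}
	cleanDeletionCandidateTaints(filtered) // hypothetical helper for the actual taint removal
	return nil
}
```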
Does this PR introduce a user-facing change?
This does somewhat introduce a user-facing change, yes, so I've added a release note, even though it is behaviour that users should not rely on.
Fixed undesired behaviour at the beginning of the autoscaler's main loop and during scale down, where DeletionCandidate taints would be cleaned from nodes that are not part of the selected node groups (--nodes flag).
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:
Welcome @fische!
It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.
You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.
You can also check if kubernetes/autoscaler has its own contribution guidelines.
You may want to refer to our testing guide if you run into trouble with your tests not passing.
If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!
Thank you, and welcome to Kubernetes. :smiley:
The committers listed above are authorized under a signed CLA.
- :white_check_mark: login: fische / name: Maxime Fischer (4901303bef7fe5659db2193ac14ed100d3a45ac5, 91477aca4a7f97bdb3ef309130300a081490e47e, cc6ecec2d3e0412d4c3b1199ea9bbd4e7742f99c, 7383034c082215abc10d4ff4182e3452aa4f5512, c3810ec199d5d16314ef433023ecd560906300b9, 95ad6c99a290ac1f0ef5f7858d3c4e8b3bf6db68, 368441dcd0e4dc93fcc7e5774537fb53940dbee3, 5f6eedc8f7b9d7621cc760392c87d4af4cc26eb0, 90e24ac528955fe4a4840b81d36918753518cb9f, a47ef89f5a6eefa991858524ca017bfec1900503, e8e3ad0b1f16ae770f3e52c084d8430bf2104e8d)
@fische, you have to sign the CLA before the PR can be reviewed. See the following document to sign the CLA: Signing Contributor License Agreements (CLA)
To check EasyCLA
/easycla
Yup, sorry, I forgot about that.
@Shubham82 Would you mind taking a look now that the CLA has been signed please? :pray:
/assign
@x13n Could you please have another look?
Apologies for the slow review. LGTM now.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: fische, x13n
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~cluster-autoscaler/OWNERS~~ [x13n]
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.