DeletionCandidateOfClusterAutoscaler soft taint remaining on nodes indefinitely

Open rakechill opened this issue 8 months ago • 19 comments

Which component are you using?: cluster-autoscaler

What version of the component are you using?:

Component version: Observed on CAS 1.30+

What k8s version are you using (kubectl version)?: 1.30+

What environment is this in?: Azure, but it was observed by OpenShift customers as well.

What did you expect to happen?: I expected that after the soft taint DeletionCandidateOfClusterAutoscaler:PreferNoSchedule was added to a node, either 1) the node would eventually be deleted or 2) the taint would eventually be removed.

What happened instead?: The taint gets added to underutilized nodes + then never gets removed in future loops. As far as I can tell, the only way that these taints can get removed is if 1) you have soft tainting disabled or 2) you have an autoscaling loop where scale down is considered, but no node is deleted.

This leads to high CPU/memory utilization on nodes without this taint + little to no workloads being scheduled on the tainted nodes.

How to reproduce it (as minimally and precisely as possible): It seems to happen most often during upgrades for us, since another component besides CAS is creating and deleting nodes.

Anything else we need to know?: This change -- https://github.com/kubernetes/autoscaler/pull/6273 -- updated some of the soft tainting logic, but we weren't able to definitively say it created this behavior. For now, it seems like a red herring.
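
For anyone trying to confirm they are hitting this, a minimal client-go sketch along these lines can list the nodes that still carry the soft taint. The kubeconfig path and error handling here are assumptions for illustration, not part of CAS:

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const deletionCandidateTaint = "DeletionCandidateOfClusterAutoscaler"

func main() {
	// Build a client from the local kubeconfig (path is an assumption; adjust as needed).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, node := range nodes.Items {
		for _, taint := range node.Spec.Taints {
			if taint.Key == deletionCandidateTaint {
				// The taint value is the Unix timestamp at which it was applied,
				// so an old value suggests the taint is stuck.
				fmt.Printf("%s tainted since %s (effect %s)\n", node.Name, taint.Value, taint.Effect)
			}
		}
	}
}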

rakechill avatar Mar 24 '25 03:03 rakechill

I also encountered a similar problem here. During automatic scale down, the DeletionCandidateOfClusterAutoscaler taint is applied to multiple nodes, and after the scale down completes, some of those taints are still retained. For example: when 3 nodes are scaled down to 2, the DeletionCandidateOfClusterAutoscaler taint is applied to 2 nodes, and after scaling down to 2 nodes, 1 node still retains the taint. When 5 nodes are scaled down to 4, the taint is applied to 3 nodes, and after scaling down to 4 nodes, 2 nodes still retain it. This happens on both k8s 1.30.5 and 1.31.5.

zetuchen avatar Mar 24 '25 09:03 zetuchen

@zetuchen what cloud provider are you using?

rakechill avatar Mar 24 '25 13:03 rakechill

/area cluster-autoscaler

adrianmoisey avatar Mar 24 '25 15:03 adrianmoisey

Tagging a few other folks related to this issue:

  • @jackfrancis from Azure
  • @elmiko, @maxcao13 from OpenShift
  • @x13n from Google. Can you tag the Google engineers who were working on this?

rakechill avatar Mar 24 '25 15:03 rakechill

I did a cursory code audit and it looks like the soft taints are being sanely managed (taints are both created and deleted, as appropriate) based on a calculation of whether or not nodes in the cluster are needed. That calculation occurs with every CAS loop, so if something changes between the initial soft-tainting of a node (for example, it was running a small number of workloads that could allow a rebalancing optimization) and the next time the CAS loop runs (for example, more workloads landed, or VPA scaled up and the node is now less idle), then a follow-up calculation should determine that the node is in fact needed, and the soft taint should be removed.

Is it possible that an edge case is being reached in the logic that determines unneeded nodes?

Does that high level analysis sound right @x13n?

Also, a near-term fix could be setting the --max-bulk-soft-taint-count runtime flag to 0 (the default is 10). I'm not sure if we were aware there is already a flag for disabling the soft taint feature altogether.

Just to be clear, the purpose of soft tainting is to optimize for minimal node count over time. We only soft taint a node after an attempt to scale it down has not succeeded for any reason. This is to anticipate that the next CAS loop will execute in the near future, and a follow-up scale down will be attempted against the same node — and, importantly, we want any new workloads to land on other nodes in the cluster so that we don't have unnecessary cordon/drain thrashing.

To answer these questions from the issue description:

The taint gets added to underutilized nodes + then never gets removed in future loops. As far as I can tell, the only way that these taints can get removed is if 1) you have soft tainting disabled or 2) you have an autoscaling loop where scale down is considered, but no node is deleted.

In fact the taints are removed only when the "UnneededNodes" logic determines that a previously soft-tainted node is in fact needed. If a scale down is attempted but the node is not deleted, then that node will be soft-tainted.
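
To make the intended per-loop behavior concrete, here is a rough Go sketch of the reconciliation described above. It is not the actual static_autoscaler.go code; plannerSaysUnneeded, addTaint, and removeTaint are hypothetical stand-ins for the real planner and actuation calls:

package sketch

import apiv1 "k8s.io/api/core/v1"

// reconcileSoftTaints illustrates the behavior described above: unneeded nodes
// keep (or gain) the DeletionCandidate taint, while nodes that became needed
// again should have it removed on the next loop.
func reconcileSoftTaints(
	allNodes []*apiv1.Node,
	plannerSaysUnneeded func(*apiv1.Node) bool,
	addTaint func(*apiv1.Node),
	removeTaint func(*apiv1.Node),
) {
	for _, node := range allNodes {
		if plannerSaysUnneeded(node) {
			// Still a scale-down candidate: soft taint it so that new workloads
			// prefer other nodes (PreferNoSchedule).
			addTaint(node)
		} else {
			// Needed again: drop the taint. The bug discussed in this issue is that
			// some loops (e.g. scale-down cooldown) never reach this branch, so the
			// taint lingers indefinitely.
			removeTaint(node)
		}
	}
}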

jackfrancis avatar Mar 24 '25 16:03 jackfrancis

In fact the taints are removed only when the "UnneededNodes" logic determines that a previously soft-tainted node is in fact needed

Agreed that this is how things should operate. However, we're seeing an edge case:

Let's say min count is 2 and node group curr size is 3. All three nodes are underutilized, but one is unremovable due to a PDB and another is "not found" because it's currently being deleted (not by CAS).

CAS determines that we are above min count, so this node group will not be skipped. The "not found" node is skipped and the node w/ blocking PDB is determined as needed. The third node is determined to be removable.

When StartDeletion() is called, some (indiscernible to me) calculation is happening where this node cannot be deleted due to min count consideration. So, the resulting status is ScaleDownNoNodeDeleted.

We enter the soft tainting logic + add a DeletionCandidate taint to the third node that was unneeded.

In the next loop, we are at min count (skipped node has finished deleting) and we skip this node group for scale down. The taint is still there.

rakechill avatar Mar 24 '25 16:03 rakechill

I think this is an edge case caused by the fact that when the only autoscaled node group is at min size, scale down is on cooldown. https://github.com/kubernetes/autoscaler/pull/7954 should fix this edge case, hopefully. It was just merged today. Can you check if it solves this issue as well?

CC @abdelrahman882 @BigDarkClown

x13n avatar Mar 24 '25 17:03 x13n

That does look like the correct solution to me. Can we cherry-pick it into specific upstream release branches? Say, 1.30+?

rakechill avatar Mar 24 '25 17:03 rakechill

@zetuchen what cloud provider are you using?

I have this problem in the Azure cloud.

zetuchen avatar Mar 25 '25 03:03 zetuchen

I have a feeling this might be related to the counting changes I have been making in the Nodes and DecreaseTargetSize interface functions.

edit: if we are seeing this on other platforms, then it isn't related to my clusterapi changes.

elmiko avatar Mar 25 '25 15:03 elmiko

I do think the ScaleDownInCoolDown change should fix our issues, but we need tests that exercise this code path. Currently, the only testing that looks at these taints is from this change in 1.30.0: https://github.com/kubernetes/autoscaler/pull/6273

But it doesn't properly exercise the case where taints are added + never removed even if we're in cool down + thus can't actually scale down nodes.

A simple test could be (sketched below):

  • starting RunOnce() at min count w/ some tainted nodes,
  • asserting we go into cool down
  • asserting that the taints are released
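
A rough outline of that test, assuming hypothetical helpers in place of the real CAS test fixtures (runSingleLoopAtMinCount would wire up a fake cloud provider with the node group already at min size, pre-apply the soft taint to one node, call RunOnce(), and report whether scale down was in cooldown):

package sketch

import (
	"testing"

	apiv1 "k8s.io/api/core/v1"
)

const deletionCandidateTaint = "DeletionCandidateOfClusterAutoscaler"

// hasSoftTaint reports whether a node still carries the DeletionCandidate taint.
func hasSoftTaint(node *apiv1.Node) bool {
	for _, taint := range node.Spec.Taints {
		if taint.Key == deletionCandidateTaint {
			return true
		}
	}
	return false
}

func TestSoftTaintReleasedAtMinCount(t *testing.T) {
	// Hypothetical helper: runs one autoscaling iteration at min count with one
	// pre-tainted node and returns the resulting nodes plus the cooldown status.
	nodes, inCooldown := runSingleLoopAtMinCount(t)

	if !inCooldown {
		t.Errorf("expected scale down to be in cooldown at min count")
	}
	for _, node := range nodes {
		if hasSoftTaint(node) {
			t.Errorf("node %s still carries %s", node.Name, deletionCandidateTaint)
		}
	}
}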

rakechill avatar Mar 25 '25 15:03 rakechill

Chiming in to mention I'm running into this issue on version 9.46.6 of the Helm Chart in AWS.

The timeline of events is the same as what's been described here: our current node count is close to the minimum size, but occasionally more nodes are underutilized than can be removed, so we end up with extra soft-tainted nodes.

Is disabling the soft taint logic still the recommended way to deal with this? It looks like the consequence of this is that if we have multiple underutilized nodes, there's a chance for the evicted pods to end up on nodes that will be deleted in the near future.

alexambarch avatar May 07 '25 17:05 alexambarch

The fix is encapsulated in 4 PRs on the master branch. We're currently cherry-picking a few PRs to autoscaler release branches (1.30+). We'll release these changes for managed customers of Azure.

@alexambarch you may need to reach out to AWS folks directly to get it fixed on their end.

rakechill avatar May 07 '25 17:05 rakechill

I believe we have another scenario where the DeletionCandidateOfClusterAutoscaler soft taint poses a problem that doesn’t involve cool down.

We use the --cores-total option to set a minimum number of cores in a cluster to preserve a specific amount of excess capacity for bursting and failover.

Similar to what @rakechill describes, the CA applies the DeletionCandidateOfClusterAutoscaler soft taint to nodes that it determines are underutilized - including nodes that, if the CA were to scale them down, would bring the total CPU count below the --cores-total minimum value.

Even though the CA skips scaling these nodes down, it does not remove the DeletionCandidateOfClusterAutoscaler taint.

Skipping 10.0.17.228 - minimal limit exceeded for [cpu]
Skipping 10.0.17.105 - minimal limit exceeded for [cpu]
Skipping 10.0.17.123 - minimal limit exceeded for [cpu]

The net effect of the current behavior is that it seems to defeat the purpose of specifying a minimum number of cores since they are effectively unusable on tainted nodes.

jlamillan avatar Jun 06 '25 17:06 jlamillan

@jlamillan is something as simple as this sufficient?

$ git diff
diff --git a/cluster-autoscaler/core/static_autoscaler.go b/cluster-autoscaler/core/static_autoscaler.go
index 95285b828..c94f8120b 100644
--- a/cluster-autoscaler/core/static_autoscaler.go
+++ b/cluster-autoscaler/core/static_autoscaler.go
@@ -681,6 +681,9 @@ func (a *StaticAutoscaler) updateSoftDeletionTaints(allNodes []*apiv1.Node) {
                // nodes from selected nodes.
                taintableNodes = intersectNodes(selectedNodes, taintableNodes)
                untaintableNodes := subtractNodes(selectedNodes, taintableNodes)
+               for _, unremovableNode := range a.scaleDownPlanner.UnremovableNodes() {
+                       untaintableNodes = subtractNodes(untaintableNodes, []*apiv1.Node{unremovableNode.Node})
+               }
                actuation.UpdateSoftDeletionTaints(a.AutoscalingContext, taintableNodes, untaintableNodes)
        }
 }

jackfrancis avatar Jun 06 '25 18:06 jackfrancis

@jlamillan is something as simple as this sufficient?

I don't believe so. The minimum cpu/memory scenario I described above seems to be about the taintable (unneeded) node list.

Specifically, nodes whose removal would cause the cluster to go below the specified minimum CPU or memory are considered taintable, not unremovable.

jlamillan avatar Jun 07 '25 00:06 jlamillan

I think to address this one we would need scaledown.Planner to do resource-related checks in addition to re-scheduling simulation when determining unneeded nodes. Right now it only checks them afterwards, so some nodes can stay unneeded indefinitely.
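
A minimal sketch of that idea, assuming the planner had access to the cluster's current core total and the parsed --cores-total minimum; the function name and signature are illustrative, not the real scaledown.Planner API:

package sketch

import apiv1 "k8s.io/api/core/v1"

// filterByCoreLimit drops scale-down candidates whose removal would take the
// cluster below the --cores-total minimum, so they are never marked unneeded
// (and therefore never soft tainted) in the first place.
func filterByCoreLimit(candidates []*apiv1.Node, totalCores, minCores int64) []*apiv1.Node {
	remaining := totalCores
	kept := make([]*apiv1.Node, 0, len(candidates))
	for _, node := range candidates {
		cores := node.Status.Capacity.Cpu().Value()
		if remaining-cores < minCores {
			// Removing this node would violate the minimum, so treat it as needed
			// instead of leaving it tainted but unremovable.
			continue
		}
		remaining -= cores
		kept = append(kept, node)
	}
	return kept
}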

x13n avatar Jun 09 '25 15:06 x13n

I think to address this one we would need scaledown.Planner to do resource-related checks in addition to re-scheduling simulation when determining unneeded nodes. Right now it only checks them afterwards, so some nodes can stay unneeded indefinitely.

@x13n Is there a tracking issue for this?

jlamillan avatar Jun 18 '25 23:06 jlamillan

Not that I'm aware of, would you mind creating one? It seems like a different error leading to the same symptoms as this one, so it may be worth tracking separately.

x13n avatar Jun 20 '25 04:06 x13n

Another option for mitigation if you're able to update new + existing workloads is to add this taint toleration:

tolerations:
  - key: "DeletionCandidateOfClusterAutoscaler"
    operator: "Exists"
    effect: "PreferNoSchedule"

This toleration will essentially disable the effects of this taint, which cluster-autoscaler uses to slowly shift workloads from these nodes.

The only caveat is that this taint, when applied properly, is used by CAS to expedite scale down when underutilization is observed. The goal of this taint is to slowly shift workloads from a node such that eventually it will be empty and removable if the nodepool is able to be scaled down.

rakechill avatar Jul 23 '25 18:07 rakechill

@rakechill I've noticed this taint on the nodes. I'm not sure using a toleration is the best workaround since it requires modifying user workloads. What's the current status of this issue? Is there a proper fix planned for this?

chuirang avatar Aug 21 '25 02:08 chuirang

@rakechill One of my customers continues to experience issues after upgrading their AKS cluster from version 1.31.5 to 1.32.6. Specifically, 10 out of 5 nodes are unexpectedly tainted with DeletionCandidateOfClusterAutoscaler=1755145124:PreferNoSchedule, which prevents new pods from being scheduled on those nodes and leads to resource overcommitment on others.

So as of now, are the workarounds below the only options?

  • Set the scale-down utilization threshold to 0
  • Manually remove the taint

When do you think the permanent fix will be coming?

wayden88 avatar Aug 27 '25 14:08 wayden88

@rakechill - our application faced the same issue once, with Azure VMSS.

We have the default min node count of 2, but during memory pressure on node 1, pods failed to move to node 2 due to this taint.

Anyway, this issue hasn't happened again so far, but we're waiting for a permanent fix. Thanks

ramsreenoj avatar Aug 28 '25 15:08 ramsreenoj

We are experiencing the same issue. We worked around it by applying the toleration in our deployment, but that is very patchy of course. Hoping for a fix soon!

ecare-matthias avatar Sep 04 '25 11:09 ecare-matthias

For all AKS customers, this bug fix is included in the AKS release currently rolling out: https://github.com/Azure/AKS/releases/tag/2025-08-29

The release notes don't currently reflect this, but they're being added in a PR here: https://github.com/Azure/AKS/pull/5254

No action will be required by customers. AKS will update the cluster-autoscaler deployment to be running the new image in the background.

rakechill avatar Sep 04 '25 16:09 rakechill

We encountered the same issue on AWS EKS. When the number of nodes in an EKS managed node group was reduced to 2 (the minimum size for EKS managed node groups), one node persistently retained the soft taint "DeletionCandidateOfClusterAutoscaler". Is there a fix for AWS EKS? Cluster Autoscaler version: 1.32.1. EKS cluster version: 1.33.

normalzzz avatar Sep 30 '25 07:09 normalzzz