Race condition leading to undesired scale down of non-empty nodes (without pod eviction)
Which component are you using?: Cluster-autoscaler
What version of the component are you using?: 1.18.1
What k8s version are you using (kubectl version)?:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-dispatcher", GitCommit:"2e298c7e992f83f47af60cf4830b11c7370f6668", GitTreeState:"clean", BuildDate:"2019-09-19T22:26:40Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.15-eks-ad4801", GitCommit:"ad4801fd44fe0f125c8d13f1b1d4827e8884476d", GitTreeState:"clean", BuildDate:"2020-10-20T23:27:12Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
What environment is this in?: AWS
What happened?: CA lists the nodes, finds an empty node, and decides to scale it down.
The race condition occurs when Kubernetes schedules a new pod on the node after CA made the API call to list all the nodes, but before the node is tainted as Unschedulable.
The consequence is that CA removes the node even though the condition for scale down isn't met. The newly placed pod isn't evicted, because CA only evicts pods it knows about (and here it thinks the node is empty).
What did you expect to happen?: I would expect CA not to remove the node in this scenario. If it did remove it, I would expect it to evict the newly placed pod.
Potential fixes:
- After a node is tainted as Unschedulable and ready for removal, make a new API call to re-list the node's pods and verify that the condition for scale down is still met (see the sketch below).
- Or, if we can't do this, it would be nice if all pods were evicted, which we could do by making a new API call after the node is tainted as Unschedulable to retrieve the freshest list of pods to evict.
- If we don't want to add a new API call, another option could be to delete a node over 2 executions of the main loop. During the first execution, we taint the node as unschedulable / mark it for deletion. Only during the second execution (if the conditions are still met) do we trigger its effective deletion.
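For illustration, here is a minimal client-go sketch of the first option (re-checking the node after it has been tainted). The function and helper names (confirmStillEmpty, isDaemonSetPod, isMirrorPod) are hypothetical, not part of the actual cluster-autoscaler code:

```go
// Hypothetical re-check, not the actual cluster-autoscaler code: after the
// node has been tainted, list its pods again and only allow deletion if
// nothing new (other than DaemonSet / mirror pods) was scheduled in between.
package scaledown

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// confirmStillEmpty re-lists the pods bound to nodeName and reports whether
// the node still satisfies the empty-node scale-down condition.
func confirmStillEmpty(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return false, fmt.Errorf("re-listing pods on %s: %w", nodeName, err)
	}
	for i := range pods.Items {
		if isDaemonSetPod(&pods.Items[i]) || isMirrorPod(&pods.Items[i]) {
			continue // these never block empty-node scale down
		}
		return false, nil // a pod landed after the first listing: abort deletion
	}
	return true, nil
}

func isDaemonSetPod(pod *corev1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

func isMirrorPod(pod *corev1.Pod) bool {
	_, ok := pod.Annotations["kubernetes.io/config.mirror"]
	return ok
}
```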
How to reproduce: In our setting, there's about a 1-second time window where this race condition is possible. It happens a few times a day (we schedule a lot of short-lived ETL applications, on the order of 10k pods a day). I'm not sure how to reproduce this in a test environment.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I ran into this issue. To consistently replicate it, I introduced a 10-second delay before my nodes are tainted ToBeDeleted and terminated. Without adding those 10 seconds, below is the sequence of events that resulted in the issue. Once the unneeded-time threshold of 3 minutes was breached, CA got the list of empty nodes, started scheduleDeleteEmptyNodes, and terminated 1 of the 3 nodes identified. The operation to delete node-1 took approx. 2 seconds. During that delay, the K8s scheduler placed a spark-driver pod on node-2, which was also about to be deleted, causing our driver to be abruptly terminated without any eviction and resulting in a spark job failure.
2021-09-13 09:48:45.897 - cpu utilization 0.019509
2021-09-13 09:48:45.871308291 - unneeded since
2021-09-13 09:51:48.068 - unneeded for 3m2.164596122s
2021-09-13 09:51:53 - Successfully assigned pod-xxxxx/spark-xxxxx-driver to ip-192-168-237-84
2021-09-13 09:51:54.685 - Scale-down: removing empty node ip-192-168-237-84
2021-09-13 09:51:54.885 - Successfully added ToBeDeletedTaint on node ip-192-168-237-84
2021-09-13 09:51:58 - Started image pulling
2021-09-13 09:52:07.758 - Terminating EC2 instance: i-xxxxxxxx
2021-09-13 09:53:13 - Node is not ready
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@alculquicondor: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @x13n
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen /remove-lifecycle rotten
@x13n: Reopened this issue.
In response to this:
/reopen /remove-lifecycle rotten
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I wonder if using NoExecute taint effect instead of NoSchedule would be sufficient to fix this. Perhaps with some configurable delay between tainting and actually removing the node.
What does it currently do? Add NoSchedule taint and manually evict the pods? Does it use the eviction API to do that? Would changing to NoExecute change the behavior WRT disruption budgets?
Today it applies a NoSchedule taint, manually evicts pods using the eviction API and then deletes the node. In the case of empty nodes, it just applies the taint and deletes the node right away. Manual use of the eviction API is actually necessary so that CA can manage timeouts on its own, but after all evictions are done, we could change the taint effect to NoExecute to clean up any race-condition leftover pods. In the case of empty nodes, we would apply NoExecute only.
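For illustration, a minimal client-go sketch of what "switch to NoExecute" could look like; the setNoExecuteTaint helper is hypothetical and not the autoscaler's actual code path, while the taint key matches the ToBeDeletedByClusterAutoscaler taint seen later in this thread:

```go
// Sketch only: replace the NoSchedule ToBeDeleted taint with a NoExecute one
// so that any pod that slipped in after the first tainting is evicted by the
// node lifecycle controller instead of being silently killed with the VM.
package scaledown

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

func setNoExecuteTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting node %s: %w", nodeName, err)
	}
	taints := make([]corev1.Taint, 0, len(node.Spec.Taints)+1)
	for _, t := range node.Spec.Taints {
		if t.Key == toBeDeletedTaint {
			continue // drop the existing NoSchedule variant
		}
		taints = append(taints, t)
	}
	taints = append(taints, corev1.Taint{
		Key:    toBeDeletedTaint,
		Value:  fmt.Sprint(time.Now().Unix()),
		Effect: corev1.TaintEffectNoExecute,
	})
	node.Spec.Taints = taints
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```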
I like that plan: switch to NoExecute once CA thinks it's done cleaning.
Ok, I think with that this becomes a fairly well-defined task, let's see if someone would be able to pick it up.
Hopefully the change can be confined just to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaledown/actuation/actuator.go
/help
@x13n: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @x13n! I would like to help with this.
I tried to recreate the scenario with the latest versions of related components and got slightly different results, though. I am not sure about the benefit of the NoExecute taint.
I used my testing EKS cluster with Kubernetes v1.22.10 and cluster-autoscaler v1.23.1.
What I did:
- scaled down deployment to initiate deletion of 1 (empty) node
- introduced sleep before taint
- during the pause, rescaled deployment back up to the original size
- introduced sleep [before AWS SDK delete call](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaledown/actuation/actuator.go#L282)
After step 3, I had a node tainted with ToBeDeletedByClusterAutoscaler=1659355901:NoSchedule waiting for removal and several running pods on it. They executed the following script:
```bash
#!/bin/bash
# On SIGTERM, simulate some graceful-termination work, then exit cleanly.
trap "echo 'Gracefully terminating'; sleep 10; echo '--== RIP ==--'; exit 0" SIGTERM
# Otherwise just keep logging so it is easy to see when the pod stops.
while true; do echo 'tick'; sleep 2; done
```
During the second sleep, they ran undisturbed. After the SDK delete call, the node was tainted with node.kubernetes.io/unschedulable:NoSchedule and .spec.unschedulable=true appeared.
That triggered pod deletion; they received SIGTERM and ended as expected. (Btw., is there an easier/cleaner way to see whether pods terminated within their grace period?)
I realize this is not classic eviction and it does not respect PDBs. But that is also the case with a NoExecute taint, no? The reasons to manually set the NoExecute taint anyway would be a) a custom delay period and b) a safeguard against more aggressive deletes from other cloud providers. But is that good enough?
Or am I missing something? Thanks.
/assign
Hi @jan-skarupa, thanks for looking into this!
Interesting, so it looks like VM deletion is just causing the OS to send SIGTERM to kubelet, which then initiates graceful shutdown in 1.21+ clusters. This issue was originally filed for 1.18, which didn't have this feature. I agree that in that case the only reason to introduce a NoExecute taint would be to delay node deletion further, beyond the VM timeout configured by the cloud provider. I'm not sure how useful that would be though.
To actually avoid the race condition at all, we would have to separate tainting from drain&deletion. That would require changes to the ScaleDown interface so that the Actuator would have two separate methods for this, and the Planner would have to be aware of the taints to decide when to taint and when to actually drain&delete.
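Roughly, the split could look something like this; this is an illustrative sketch only, not the actual ScaleDown/Actuator interfaces in the codebase:

```go
// Rough illustration of splitting actuation so that tainting and
// drain&delete become separate steps the Planner drives across iterations.
package scaledown

import apiv1 "k8s.io/api/core/v1"

// Actuator (hypothetical shape): tainting and deletion exposed as two calls.
type Actuator interface {
	// TaintNodes applies the ToBeDeleted taint but does not drain or delete.
	TaintNodes(nodes []*apiv1.Node) error
	// DrainAndDelete is invoked later, once the Planner has confirmed the
	// already-tainted nodes are still eligible for removal.
	DrainAndDelete(nodes []*apiv1.Node) error
}
```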
@jan-skarupa are you up for this? It is definitely a bigger change than just adding a NoExecute taint.
Interesting, so it looks like VM deletion is just causing OS to send SIGTERM to kubelet, which then initiates graceful shutdown in 1.21+ clusters.
I see. That makes sense.
@jan-skarupa are you up for this? It is definitely a bigger change than just adding a NoExecute taint.
Yeah, I will do it. I need to read through the code some more. I will then outline the solution to check we are on the same page. I should get to it by the end of this week.
I talked offline about this with @MaciekPytel. The conclusion we came to was that it should be both simpler and less risky if the Actuator treated an empty node becoming non-empty as an error. Triggering a drain on the node would be problematic because it is possible that the node being deleted is the only place the new pod can run. If we trigger a drain without re-running the scale-down simulation, we may prevent a workload from running and it will have to wait for another scale up. The alternative of tainting in one loop iteration and then deleting in the next one would solve this problem, but would also significantly slow down scale down of empty nodes in large clusters (CA iterations tend to take more time there).
So, I think a reasonable compromise here would be to change actuation of empty nodes to work as follows:
- Taint the nodes.
- Sleep for a predefined time (5s? the point here is to wait for the pod cache to become up-to-date after tainting the nodes)
- Check that there are no non-DS pods on the node; error out (& untaint) if something got scheduled
- If there was no error, proceed with VM deletion.
Note: steps 2-4 should be done in a separate goroutine; only the tainting is synchronous. A rough sketch of this flow is below.
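A minimal sketch of the proposed flow, assuming hypothetical hooks (taint, untaint, nonDaemonSetPods, deleteVM) in place of the real actuator plumbing:

```go
// Sketch only: the tainting runs on the caller's goroutine, while the wait,
// re-check and deletion happen asynchronously, as proposed above.
package scaledown

import (
	"context"
	"time"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

type emptyNodeDeleter struct {
	taint            func(ctx context.Context, node *apiv1.Node) error
	untaint          func(ctx context.Context, node *apiv1.Node) error
	nonDaemonSetPods func(ctx context.Context, nodeName string) ([]*apiv1.Pod, error)
	deleteVM         func(ctx context.Context, node *apiv1.Node) error
	recheckDelay     time.Duration // e.g. 5s, to let the pod cache catch up
}

func (d *emptyNodeDeleter) scaleDownEmptyNode(ctx context.Context, node *apiv1.Node) error {
	// Step 1: taint synchronously so the scheduler stops placing pods here.
	if err := d.taint(ctx, node); err != nil {
		return err
	}
	// Steps 2-4 run in a separate goroutine.
	go func() {
		// Step 2: wait for the pod cache to become up-to-date after tainting.
		time.Sleep(d.recheckDelay)
		// Step 3: check that nothing except DS pods got scheduled meanwhile.
		pods, err := d.nonDaemonSetPods(ctx, node.Name)
		if err != nil || len(pods) > 0 {
			// Treat "empty node became non-empty" as an actuation error:
			// untaint and let the next loop iteration re-evaluate the node.
			klog.Warningf("aborting empty scale-down of %s: %d non-DS pods, err=%v", node.Name, len(pods), err)
			if uerr := d.untaint(ctx, node); uerr != nil {
				klog.Errorf("failed to untaint %s: %v", node.Name, uerr)
			}
			return
		}
		// Step 4: still empty, proceed with VM deletion.
		if err := d.deleteVM(ctx, node); err != nil {
			klog.Errorf("failed to delete VM backing %s: %v", node.Name, err)
		}
	}()
	return nil
}
```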