Race condition leading to undesired scale down of non-empty nodes (without pod eviction)
Which component are you using?: Cluster-autoscaler
What version of the component are you using?: 1.18.1
What k8s version are you using (kubectl version)?:
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.11-dispatcher", GitCommit:"2e298c7e992f83f47af60cf4830b11c7370f6668", GitTreeState:"clean", BuildDate:"2019-09-19T22:26:40Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.15-eks-ad4801", GitCommit:"ad4801fd44fe0f125c8d13f1b1d4827e8884476d", GitTreeState:"clean", BuildDate:"2020-10-20T23:27:12Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
What environment is this in?: AWS
What happened?: CA lists the nodes, finds an empty node, and decides to scale it down.
The race condition occurs when Kubernetes schedules a new pod on the node after CA made the API call to list all the nodes, but before the node is tainted as Unschedulable.
The consequence is that CA removes the node even though the condition for scale down isn't met. The newly placed pod isn't evicted, because CA only evicts pods it knows about (and here it thinks the node is empty).
What did you expect to happen?: I would expect CA not to remove the node in this scenario. If it did remove it, I would expect it to evict the newly placed pod.
Potential fixes:
- After a node is tainted as Unschedulable and ready for removal, make a new API call to re-list the node's pods and verify that the condition for scale down is still met (see the sketch below).
- Or, if we can't do this, it would be nice if all pods were evicted, which we could do by making a new API call after the node is tainted as Unschedulable to retrieve the freshest list of pods to evict.
- If we don't want to add a new API call, another option could be to delete a node over 2 executions of the main loop. During the first execution, we taint the node as unschedulable / mark it for deletion. Only during the second execution (if the conditions are still met) do we trigger its effective deletion.
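For illustration, here is a minimal client-go sketch of the first option (re-checking the node after it has been tainted). The function and helper names (confirmStillEmpty, isDaemonSetPod, isMirrorPod) are hypothetical, not part of the actual cluster-autoscaler code:

```go
// Hypothetical re-check, not the actual cluster-autoscaler code: after the
// node has been tainted, list its pods again and only allow deletion if
// nothing new (other than DaemonSet / mirror pods) was scheduled in between.
package scaledown

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// confirmStillEmpty re-lists the pods bound to nodeName and reports whether
// the node still satisfies the empty-node scale-down condition.
func confirmStillEmpty(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return false, fmt.Errorf("re-listing pods on %s: %w", nodeName, err)
	}
	for i := range pods.Items {
		if isDaemonSetPod(&pods.Items[i]) || isMirrorPod(&pods.Items[i]) {
			continue // these never block empty-node scale down
		}
		return false, nil // a pod landed after the first listing: abort deletion
	}
	return true, nil
}

func isDaemonSetPod(pod *corev1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

func isMirrorPod(pod *corev1.Pod) bool {
	_, ok := pod.Annotations["kubernetes.io/config.mirror"]
	return ok
}
```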
How to reproduce: In our setting, there's about a 1-second time window where this race condition is possible. It happens a few times a day (we schedule a lot of short-lived ETL applications, on the order of 10k pods a day). I'm not sure how to reproduce this in a test environment.
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I ran into this issue. To consistently replicate it, I introduced a 10-second delay before my nodes are tainted ToBeDeleted and terminated. Without adding those 10 seconds, below is the sequence of events that resulted in the issue. Once the unneeded-time threshold of 3 minutes was breached, CA got the list of empty nodes, started scheduleDeleteEmptyNodes, and terminated 1 of the 3 nodes identified. The operation to delete node-1 took approx. 2 seconds. During that delay, the K8s scheduler placed a spark-driver pod on node-2, which was also about to be deleted, causing our driver to be abruptly terminated without any eviction and resulting in a spark job failure.
2021-09-13 09:48:45.897 - cpu utilization 0.019509
2021-09-13 09:48:45.871308291 - unneeded since
2021-09-13 09:51:48.068 - unneeded for 3m2.164596122s
2021-09-13 09:51:53 - Successfully assigned pod-xxxxx/spark-xxxxx-driver to ip-192-168-237-84
2021-09-13 09:51:54.685 - Scale-down: removing empty node ip-192-168-237-84
2021-09-13 09:51:54.885 - Successfully added ToBeDeletedTaint on node ip-192-168-237-84
2021-09-13 09:51:58 - Started image pulling
2021-09-13 09:52:07.758 - Terminating EC2 instance: i-xxxxxxxx
2021-09-13 09:53:13 - Node is not ready
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@alculquicondor: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @x13n
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue or PR with /reopen
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen /remove-lifecycle rotten
@x13n: Reopened this issue.
In response to this:
/reopen /remove-lifecycle rotten
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I wonder if using NoExecute taint effect instead of NoSchedule would be sufficient to fix this. Perhaps with some configurable delay between tainting and actually removing the node.
What does it currently do? Add NoSchedule taint and manually evict the pods? Does it use the eviction API to do that? Would changing to NoExecute change the behavior WRT disruption budgets?
Today it applies a NoSchedule taint, manually evicts pods using the eviction API and then deletes the node. In the case of empty nodes, it just applies the taint and deletes the node right away. Manual use of the eviction API is actually necessary so that CA can manage timeouts on its own, but after all evictions are done, we could change the taint effect to NoExecute to clean up any race-condition leftover pods. In the case of empty nodes, we would apply NoExecute only.
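For illustration, a minimal client-go sketch of what "switch to NoExecute" could look like; the setNoExecuteTaint helper is hypothetical and not the autoscaler's actual code path, while the taint key matches the ToBeDeletedByClusterAutoscaler taint seen later in this thread:

```go
// Sketch only: replace the NoSchedule ToBeDeleted taint with a NoExecute one
// so that any pod that slipped in after the first tainting is evicted by the
// node lifecycle controller instead of being silently killed with the VM.
package scaledown

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

func setNoExecuteTaint(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return fmt.Errorf("getting node %s: %w", nodeName, err)
	}
	taints := make([]corev1.Taint, 0, len(node.Spec.Taints)+1)
	for _, t := range node.Spec.Taints {
		if t.Key == toBeDeletedTaint {
			continue // drop the existing NoSchedule variant
		}
		taints = append(taints, t)
	}
	taints = append(taints, corev1.Taint{
		Key:    toBeDeletedTaint,
		Value:  fmt.Sprint(time.Now().Unix()),
		Effect: corev1.TaintEffectNoExecute,
	})
	node.Spec.Taints = taints
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```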
I like that plan: switch to NoExecute once CA thinks it's done cleaning.
Ok, I think with that this becomes a fairly well-defined task, let's see if someone would be able to pick it up.
Hopefully the change can be confined just to https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaledown/actuation/actuator.go
/help
@x13n: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @x13n! I would like to help with this.
I tried to recreate the scenario with the latest versions of related components and got slightly different results, though. I am not sure about the benefit of the NoExecute taint.
I used my testing EKS cluster with Kubernetes v1.22.10 and cluster-autoscaler v1.23.1.
What I did:
- scaled down deployment to initiate deletion of 1 (empty) node
- introduced sleep before taint
- during the pause, rescaled deployment back up to the original size
- introduced sleep [before AWS SDK delete call](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaledown/actuation/actuator.go#L282)
After step 3, I had a node tainted with ToBeDeletedByClusterAutoscaler=1659355901:NoSchedule waiting for removal and several running pods on it. They executed the following script:
```bash
#!/bin/bash
# On SIGTERM, simulate some graceful-termination work, then exit cleanly.
trap "echo 'Gracefully terminating'; sleep 10; echo '--== RIP ==--'; exit 0" SIGTERM
# Otherwise just keep logging so it is easy to see when the pod stops.
while true; do echo 'tick'; sleep 2; done
```
During the second sleep, they ran undisturbed. After the SDK delete call, the node was tainted with node.kubernetes.io/unschedulable:NoSchedule and .spec.unschedulable=true appeared.
That triggered pod deletion; they received SIGTERM and ended as expected. (Btw., is there an easier/cleaner way to see whether pods terminated within their grace period?)
I realize this is not classic eviction and it does not respect PDBs. But that is also the case with a NoExecute taint, no? The reasons to manually set the NoExecute taint anyway would be a) a custom delay period and b) a safeguard against more aggressive deletes from other cloud providers. But is that good enough?
Or am I missing something? Thanks.
/assign
Hi @jan-skarupa, thanks for looking into this!
Interesting, so it looks like VM deletion is just causing the OS to send SIGTERM to kubelet, which then initiates graceful shutdown in 1.21+ clusters. This issue was originally filed for 1.18, which didn't have this feature. I agree that in that case the only reason to introduce a NoExecute taint would be to delay node deletion further, beyond the VM timeout configured by the cloud provider. I'm not sure how useful that would be though.
To actually avoid the race condition at all, we would have to separate tainting from drain&deletion. That would require changes to the ScaleDown interface so that the Actuator would have two separate methods for this, and the Planner would have to be aware of the taints to decide when to taint and when to actually drain&delete.
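Roughly, the split could look something like this; this is an illustrative sketch only, not the actual ScaleDown/Actuator interfaces in the codebase:

```go
// Rough illustration of splitting actuation so that tainting and
// drain&delete become separate steps the Planner drives across iterations.
package scaledown

import apiv1 "k8s.io/api/core/v1"

// Actuator (hypothetical shape): tainting and deletion exposed as two calls.
type Actuator interface {
	// TaintNodes applies the ToBeDeleted taint but does not drain or delete.
	TaintNodes(nodes []*apiv1.Node) error
	// DrainAndDelete is invoked later, once the Planner has confirmed the
	// already-tainted nodes are still eligible for removal.
	DrainAndDelete(nodes []*apiv1.Node) error
}
```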
@jan-skarupa are you up for this? It is definitely a bigger change than just adding a NoExecute taint.
Interesting, so it looks like VM deletion is just causing OS to send SIGTERM to kubelet, which then initiates graceful shutdown in 1.21+ clusters.
I see. That makes sense.
@jan-skarupa are you up for this? It is definitely a bigger change than just adding a NoExecute taint.
Yeah, I will do it. I need to read through the code some more. I will then outline the solution to check we are on the same page. I should get to it by the end of this week.
I talked offline about this with @MaciekPytel. The conclusion we came to was that it should be both simpler and less risky if the Actuator treated an empty node becoming non-empty as an error. Triggering a drain on the node would be problematic because it is possible that the node being deleted is the only place the new pod can run. If we trigger a drain without re-running the scale-down simulation, we may prevent a workload from running and it will have to wait for another scale up. The alternative of tainting in one loop iteration and then deleting in the next one would solve this problem, but would also significantly slow down scale down of empty nodes in large clusters (CA iterations tend to take more time there).
So, I think a reasonable compromise here would be to change actuation of empty nodes to work as follows:
- Taint the nodes.
- Sleep for a predefined time (5s? the point here is to wait for the pod cache to become up-to-date after tainting the nodes)
- Check that there are no non-DS pods on the node; error out (& untaint) if something got scheduled
- If there was no error, proceed with VM deletion.
Note: steps 2-4 should be done in a separate goroutine; only the tainting is synchronous. A rough sketch of this flow is below.
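A minimal sketch of the proposed flow, assuming hypothetical hooks (taint, untaint, nonDaemonSetPods, deleteVM) in place of the real actuator plumbing:

```go
// Sketch only: the tainting runs on the caller's goroutine, while the wait,
// re-check and deletion happen asynchronously, as proposed above.
package scaledown

import (
	"context"
	"time"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

type emptyNodeDeleter struct {
	taint            func(ctx context.Context, node *apiv1.Node) error
	untaint          func(ctx context.Context, node *apiv1.Node) error
	nonDaemonSetPods func(ctx context.Context, nodeName string) ([]*apiv1.Pod, error)
	deleteVM         func(ctx context.Context, node *apiv1.Node) error
	recheckDelay     time.Duration // e.g. 5s, to let the pod cache catch up
}

func (d *emptyNodeDeleter) scaleDownEmptyNode(ctx context.Context, node *apiv1.Node) error {
	// Step 1: taint synchronously so the scheduler stops placing pods here.
	if err := d.taint(ctx, node); err != nil {
		return err
	}
	// Steps 2-4 run in a separate goroutine.
	go func() {
		// Step 2: wait for the pod cache to become up-to-date after tainting.
		time.Sleep(d.recheckDelay)
		// Step 3: check that nothing except DS pods got scheduled meanwhile.
		pods, err := d.nonDaemonSetPods(ctx, node.Name)
		if err != nil || len(pods) > 0 {
			// Treat "empty node became non-empty" as an actuation error:
			// untaint and let the next loop iteration re-evaluate the node.
			klog.Warningf("aborting empty scale-down of %s: %d non-DS pods, err=%v", node.Name, len(pods), err)
			if uerr := d.untaint(ctx, node); uerr != nil {
				klog.Errorf("failed to untaint %s: %v", node.Name, uerr)
			}
			return
		}
		// Step 4: still empty, proceed with VM deletion.
		if err := d.deleteVM(ctx, node); err != nil {
			klog.Errorf("failed to delete VM backing %s: %v", node.Name, err)
		}
	}()
	return nil
}
```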