[EKS] [bug]: auto-scaling group ends up in a bad state after `kubectl delete node`
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Tell us about your request
Currently, whenever the `kubectl delete node` command is run in the cluster, the node is removed from k8s, but the EC2 instance behind the node is not terminated. As a result, the AWS auto-scaling group behind the k8s node group does not create new EC2 instances, which also breaks things like the cluster auto-scaler.
An example would look like this:
- In the initial setup you have an ASG with 5 EC2 instances (desired size is 5), all onboarded as nodes in the k8s cluster.
- The `kubectl delete node` command runs in the cluster, removing a single node - the ASG still has "desired size = 5", yet opening the "nodes" tab you can see only 4 nodes.
- Since a node was removed, the auto-scaling controller may decide to ask for an additional node to be created (e.g. to handle scale-up)
- Yet this request would not be handled by the ASG, because according to it there are already 5 instances available.
The only way I know of to resolve the situation is to MANUALLY find the EC2 instance that is no longer mapped to a node in the k8s cluster and terminate it; the ASG then picks this information up and continues handling auto-scaler requests.
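For anyone hitting this, a minimal sketch of that manual cleanup (assumes the AWS CLI is configured for the cluster's account; `my-nodegroup-asg` and the instance ID are placeholders for your actual ASG name and the orphaned instance):

```bash
# List the EC2 instance IDs of the nodes Kubernetes still knows about.
kubectl get nodes -o jsonpath='{range .items[*]}{.spec.providerID}{"\n"}{end}' \
  | awk -F/ '{print $NF}' | sort > k8s-instance-ids.txt

# List the instance IDs the ASG believes are in service.
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-nodegroup-asg \
  --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text \
  | tr '\t' '\n' | sort > asg-instance-ids.txt

# Instances in the ASG but no longer registered as Kubernetes nodes are the orphans.
comm -13 k8s-instance-ids.txt asg-instance-ids.txt

# Terminating an orphan lets the ASG notice the gap and launch a replacement
# (desired capacity stays the same, so a fresh node comes back).
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```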
Which service(s) is this request for? EKS, ASG
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
There is no particular need to use `kubectl delete node`, but having this behavior in the system is very dangerous. I ended up in this situation because I wanted to get rid of nodes that seemed to be poisoned (pods running on them were performing worse than pods of the same service running on all other nodes in the cluster). It turned out the issue was totally unrelated, but by running `kubectl delete node` I messed up the cluster and put it into a bad state that required a fair amount of effort to get to the bottom of.
Are you currently working around this issue? Yes, manually terminating the EC2 instance is a viable workaround
Additional context You can see more details in:
- this StackOverflow thread where a different person stumbled on this problem before: https://stackoverflow.com/questions/57554812/my-nodes-got-deleted-in-eks-how-can-i-recover
- this support request, where a support engineer reproduced the problem and documented the steps: https://support.console.aws.amazon.com/support/home?region=us-east-1#/case/?displayId=10511080081
Should we expect Managed Node Groups to be able to figure out that difference and cycle (terminate and create a new one) the deleted nodes?
I believe yes, the deleted node should be terminated in this scenario. To be clear - there are already ways in the ecosystem to safely remove an EC2 instance from the cluster by cordoning the node and then detaching it from the ASG; so if the user explicitly asked to delete a node, it seems totally reasonable that the EC2 instance behind it is also terminated and, most importantly, that the node group itself remains "healthy" (e.g. does not prevent scaling up).
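For reference, a rough sketch of that safe-removal path (the node name, instance ID, and ASG name are placeholders; assumes the AWS CLI is configured for the cluster's account):

```bash
# Stop new pods from landing on the node and evict the existing ones.
kubectl cordon ip-10-0-1-23.ec2.internal
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data

# Option A: terminate via the ASG and shrink desired capacity by one.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --should-decrement-desired-capacity

# Option B: keep desired capacity, detach the instance (the ASG launches a
# replacement), then terminate the detached instance yourself.
aws autoscaling detach-instances \
  --instance-ids i-0123456789abcdef0 \
  --auto-scaling-group-name my-nodegroup-asg \
  --no-should-decrement-desired-capacity
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```

Either way the ASG's desired capacity and its in-service instances stay consistent, which is exactly what `kubectl delete node` on its own does not guarantee.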
Yes, there should be; my question was more about AWS adding this feature, which I think could be optionally enabled.
Hitting the same issue here.
For completeness, a kubectl delete node xxx on either GCP or Azure will actually terminate the backing VM as well, allowing for complete node management from within kubernetes.
In our case, the culprit was a setting of ASG Desired = 1, where we had the described unwelcome behavior. It appears we did not have this behavior with Desired = 0
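In case it helps others debugging the same thing, a quick way to spot the mismatch between what the ASG thinks it has and what Kubernetes sees (the ASG name is a placeholder):

```bash
# Desired capacity vs. instances the ASG considers in service.
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-nodegroup-asg \
  --query 'AutoScalingGroups[0].{Desired:DesiredCapacity,InService:length(Instances[?LifecycleState==`InService`])}'

# Number of nodes Kubernetes actually sees.
kubectl get nodes --no-headers | wc -l
```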
I have a node group with "auto repair" enabled.
In the AWS web UI it says: "The node auto repair feature reacts to the Ready condition of the kubelet and any node object manual deletions".
But still, the EC2 instance is not terminated when I delete the node in Kubernetes.