Node is terminated too early when scale-down-unneeded-time is set to 10m

Open mohanisch-sixt opened this issue 1 year ago • 11 comments

Which component are you using?: cluster-autoscaler

What version of the component are you using?: 1.27.1 / Chart 9.29.0

Component version:

What k8s version are you using (kubectl version)?: v1.24.14-eks-c12679a

What environment is this in?:

AWS EKS

What did you expect to happen?: The node is terminated only 10 minutes after it has been marked as no longer needed

What happened instead?: The node is terminated earlier than expected, e.g. after 2 minutes

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: In our cluster we run Jenkins with K8s agents. Some jobs consume almost no resources because they are waiting for other jobs or only doing lightweight work. We monitored this for a long time and found that a value of 0.06 for scale-down-utilization-threshold works well for us, since a node that has nothing to do shows a utilization of about 0.053. When a pod that is "just running" is scheduled on a node, the node shows the same utilization, so it can get marked as unneeded. In some cases these nodes are terminated after less than 10 minutes, although a waiting time of 10 minutes is configured.

One example:

    I0712 06:30:07.697717       1 klogx.go:87] Node ip-172-25-12-34.eu-central-1.compute.internal - cpu utilization 0.053729
    I0712 06:30:07.697837       1 cluster.go:155] ip-172-25-16-25.eu-central-1.compute.internal for removal
    I0712 06:31:49.738129       1 nodes.go:126] ip-172-25-12-34.eu-central-1.compute.internal was unneeded for 1m42.382742246s

After the last line there is no further information such as "node terminated". The node is just gone.
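
To make the timing in the log excerpt explicit, here is a minimal Python sketch that parses the klog timestamps of the first and last quoted lines and compares the elapsed time with the configured 10 minutes. The helper name `klog_time` is hypothetical and not part of cluster-autoscaler. It assumes the usual klog header layout (severity letter, month and day, then the time with microseconds).

```python
from datetime import datetime, timedelta


def klog_time(line: str) -> datetime:
    """Parse the klog header 'Lmmdd hh:mm:ss.uuuuuu' at the start of a log line."""
    severity_and_date, clock = line.split()[:2]  # e.g. "I0712", "06:30:07.697717"
    # klog omits the year, so strptime falls back to 1900; fine for same-day deltas.
    return datetime.strptime(severity_and_date[1:] + " " + clock, "%m%d %H:%M:%S.%f")


# First and last log lines quoted in this report.
first_line = "I0712 06:30:07.697717       1 klogx.go:87] Node ip-172-25-12-34.eu-central-1.compute.internal - cpu utilization 0.053729"
last_line = "I0712 06:31:49.738129       1 nodes.go:126] ip-172-25-12-34.eu-central-1.compute.internal was unneeded for 1m42.382742246s"

elapsed: timedelta = klog_time(last_line) - klog_time(first_line)
configured = timedelta(minutes=10)  # scale-down-unneeded-time: 10m

print(f"time between the two log lines: {elapsed.total_seconds():.6f} s")
print(f"configured unneeded time:       {configured.total_seconds():.0f} s")
print(f"still below the configured time: {elapsed < configured}")
```

The two lines are roughly 102 seconds apart, which matches the `1m42.38s` reported by the autoscaler itself and is far below the configured 600 seconds.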

CA is configured as follows:

    skip-nodes-with-local-storage: true
    skip-nodes-with-custom-controller-pods: true
    cordon-node-before-terminating: true
    scale-down-utilization-threshold: 0.06
    scan-interval: 10s
    scale-down-unneeded-time: 10m
    skip-nodes-with-system-pods: true
    max-empty-bulk-delete: 2
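
The expectation behind this report can be written down as a small check. This is only a simplified model of what we expect from the two settings `scale-down-utilization-threshold` and `scale-down-unneeded-time`, not the actual cluster-autoscaler logic. The function `go_duration_seconds` and the variable names are hypothetical helpers for illustration, and the parser only handles the `h`, `m`, `s` and `ms` units.

```python
import re

# Seconds per unit for the subset of Go duration units handled here.
_UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}


def go_duration_seconds(text: str) -> float:
    """Convert a Go-style duration such as '10m' or '1m42.38s' to seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", text)
    # Reject input that contains anything besides number+unit pairs.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(number) * _UNITS[unit] for number, unit in parts)


# Values taken from the configuration and the log lines in this report.
utilization_threshold = 0.06                              # scale-down-utilization-threshold
unneeded_time = go_duration_seconds("10m")                # scale-down-unneeded-time
node_utilization = 0.053729                               # from the klogx.go:87 line
unneeded_for = go_duration_seconds("1m42.382742246s")     # from the nodes.go:126 line

below_threshold = node_utilization < utilization_threshold
waited_long_enough = unneeded_for >= unneeded_time

# Expected behaviour: remove the node only if both conditions hold.
expected_removal = below_threshold and waited_long_enough
print(f"below threshold: {below_threshold}, waited long enough: {waited_long_enough}")
print(f"removal expected at this point: {expected_removal}")
```

Under this model the node counts as a scale-down candidate (utilization below 0.06), but removal is not expected yet because it has been unneeded for far less than 10 minutes.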

mohanisch-sixt, Jul 14 '23 07:07