autoscaler
                        Pods keep getting evicted when they shouldn't
We have the following scenario:
We run gitlab-runner in Kubernetes (EKS, K8s version 1.31), using the Kubernetes executor, with all job pods running on a dedicated, autoscaling node group. This node group is managed by Cluster Autoscaler (CA, version 1.32).
All Gitlab job pods have the following properties:
- not backed by a controller object
- local storage (an `emptyDir` volume)
- the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict` set to `"false"`
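
For reference, a job pod looks roughly like this (a minimal sketch; the pod name and image are placeholders, not our actual values):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: runner-job-example            # placeholder name
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: build
      image: example/ci-image:latest  # placeholder image
      volumeMounts:
        - name: builds
          mountPath: /builds
  volumes:
    - name: builds
      emptyDir: {}                    # local storage; no controller owns this pod
```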
If I am reading the FAQ correctly, any one of these properties should stop a pod from being evicted. I would therefore assume that a node holding such a pod is not taken into account when CA determines which nodes are unneeded. Yet we keep seeing pods get evicted, which is particularly frustrating because many of them run Terraform configurations, which are then left state-locked when the job is disrupted.
We have experimented with tweaking the parameters; the current configuration in the Helm values file is:

```yaml
extraArgs:
  scale-down-utilization-threshold: 0.01
  scale-down-unneeded-time: 15m
  cordon-node-before-terminating: true
  ignore-daemonsets-utilization: true
```
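
If our understanding of the chart is right, these values end up rendered as command-line flags on the CA container, something like this (illustrative excerpt, assuming the standard cluster-autoscaler Helm chart):

```yaml
# Rendered Deployment container spec (illustrative excerpt)
command:
  - ./cluster-autoscaler
  - --scale-down-utilization-threshold=0.01
  - --scale-down-unneeded-time=15m
  - --cordon-node-before-terminating=true
  - --ignore-daemonsets-utilization=true
```

We have verified that the flags do appear on the running deployment.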
We tried extending `node-delete-delay-after-taint`, but that just leaves nodes sitting around hard-tainted and unusable, which blocks the node group from scaling up and prevents new jobs from being scheduled.
Our best guess at the moment is that CA does not take pending pods into account when marking a node as unneeded; these pods then stay on the node and get evicted when the node is finally scaled down. I know that the soft taint does not prevent new pods from being scheduled on a node, but I still don't understand why none of the above properties, which according to the documentation should stop a pod from being evicted, seem to have any effect.
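
For context, this is what we mean by soft vs. hard taint, as far as we understand CA's behavior (the `value` timestamps below are placeholders):

```yaml
# Soft taint: node is a scale-down candidate; new pods *can* still schedule here
- key: DeletionCandidateOfClusterAutoscaler
  value: "1699999999"          # placeholder unix timestamp
  effect: PreferNoSchedule
# Hard taint: scale-down has started; nothing new can schedule
- key: ToBeDeletedByClusterAutoscaler
  value: "1699999999"          # placeholder unix timestamp
  effect: NoSchedule
```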
There are no errors in the cluster-autoscaler pod logs.
Is there anything else we can consider? This is starting to cause frustration among devs, as they have to keep rerunning CI/CD jobs.