autoscaler
                        Pods keep getting evicted when they shouldn't
We have the following scenario:
We run gitlab-runner in Kubernetes (EKS, K8s version 1.31), using the Kubernetes executor, with all job pods running on a dedicated, autoscaling node group. This node group is managed by Cluster Autoscaler (CA, version 1.32).
All Gitlab job pods have the following properties:
- not backed by a controller object
- local storage (an `emptyDir` volume)
- the annotation `cluster-autoscaler.kubernetes.io/safe-to-evict` set to `"false"`
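
For reference, a job pod looks roughly like this (a minimal sketch; the pod name and image are placeholders, not our actual values):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: runner-job-example            # placeholder name
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - name: build
      image: example/ci-image:latest  # placeholder image
      volumeMounts:
        - name: builds
          mountPath: /builds
  volumes:
    - name: builds
      emptyDir: {}                    # local storage; no controller owns this pod
```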
If I am reading the FAQ correctly, any one of these properties should stop a pod from being evicted. I would therefore assume that a node holding such a pod is not taken into account when CA determines which nodes are unneeded. Yet we keep seeing pods get evicted, which is particularly frustrating because many of them run Terraform configurations, which are then left state-locked when the job is disrupted.
We have experimented with tweaking the parameters; the current configuration in the Helm values file is:

```yaml
extraArgs:
  scale-down-utilization-threshold: 0.01
  scale-down-unneeded-time: 15m
  cordon-node-before-terminating: true
  ignore-daemonsets-utilization: true
```
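
If our understanding of the chart is right, these values end up rendered as command-line flags on the CA container, something like this (illustrative excerpt, assuming the standard cluster-autoscaler Helm chart):

```yaml
# Rendered Deployment container spec (illustrative excerpt)
command:
  - ./cluster-autoscaler
  - --scale-down-utilization-threshold=0.01
  - --scale-down-unneeded-time=15m
  - --cordon-node-before-terminating=true
  - --ignore-daemonsets-utilization=true
```

We have verified that the flags do appear on the running deployment.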
We tried extending `node-delete-delay-after-taint`, but that just leaves nodes sitting around hard-tainted and unusable, which blocks the node group from scaling up and prevents new jobs from being scheduled.
Our best guess at the moment is that CA does not take pending pods into account when marking a node as unneeded; these pods then stay on the node and get evicted when the node is finally scaled down. I know that the soft taint does not prevent new pods from being scheduled on a node, but I still don't understand why none of the above properties, which according to the documentation should stop a pod from being evicted, seem to have any effect.
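
For context, this is what we mean by soft vs. hard taint, as far as we understand CA's behavior (the `value` timestamps below are placeholders):

```yaml
# Soft taint: node is a scale-down candidate; new pods *can* still schedule here
- key: DeletionCandidateOfClusterAutoscaler
  value: "1699999999"          # placeholder unix timestamp
  effect: PreferNoSchedule
# Hard taint: scale-down has started; nothing new can schedule
- key: ToBeDeletedByClusterAutoscaler
  value: "1699999999"          # placeholder unix timestamp
  effect: NoSchedule
```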
There are no errors in the cluster-autoscaler pod logs.
Is there anything else we can consider? This is starting to cause frustration among devs, as they have to keep rerunning CI/CD jobs.