Cluster autoscaler deleting nodes containing pods with `safe-to-evict: false` annotation

Open blueprismo opened this issue 1 year ago • 4 comments

Which component are you using?: Cluster autoscaler

What version of the component are you using?: v1.27.1

Component version: v1.27.1

What k8s version are you using (kubectl version)?:

```sh
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.13", GitCommit:"96b450c75ae3c48037f651b4777646dcca855ed0", GitTreeState:"clean", BuildDate:"2024-04-16T15:03:38Z", GoVersion:"go1.21.9", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27+", GitVersion:"v1.27.16-eks-2f46c53", GitCommit:"c1665482a8b066c35d81db51f8d8cc92aa598040", GitTreeState:"clean", BuildDate:"2024-07-25T04:23:25Z", GoVersion:"go1.22.5", Compiler:"gc", Platform:"linux/amd64"}
```

What environment is this in?: EKS - AWS

What did you expect to happen?: The autoscaler sees the pod annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` and respects it, waiting for the pod to complete before removing the node it is running on.
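For reference, the annotation's presence on the running pod can be double-checked with something like this (pod name is a placeholder):

```sh
# Confirm the safe-to-evict annotation is actually set on the pod
kubectl -n awx get pod <pod-name> -o yaml | grep safe-to-evict
```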

What happened instead?: The scale-down did NOT respect the `cluster-autoscaler.kubernetes.io/safe-to-evict: false` annotation and deleted the node, killing my very important running pod.

How to reproduce it (as minimally and precisely as possible): Run the autoscaler at v1.27 with these flags:

```sh
./cluster-autoscaler
      --cloud-provider=aws
      --namespace=kube-system
      --node-group-auto-discovery=tagstagstags
      --logtostderr=true
      --stderrthreshold=info
      --v=4

## ASG configs:
# Desired: 2
# minimum: 1
```

Spawn your very important pod that shouldn't be killed:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: very-important-pod   # illustrative name added so the manifest applies cleanly
  namespace: awx
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
spec:
  containers:
    - image: 'quay.io/ansible/awx-ee:23.1.0'
      name: worker
      args:
        - ansible-runner
        - worker
        - '--private-data-dir=/runner'
      resources:
        limits:
          memory: 2Gi
          cpu: 2
        requests:
          memory: 500Mi
          cpu: 500m
  tolerations:
  - key: nodegroup-type
    operator: "Equal"
    value: on-demand
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
```

Afterwards, add a resource-hogging Deployment like this:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: exhaust-resources
  namespace: awx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: exhaust-resources
  template:
    metadata:
      labels:
        app: exhaust-resources
    spec:
      tolerations:
      - key: nodegroup-type
        operator: "Equal"
        value: on-demand
      nodeSelector:
        eks.amazonaws.com/capacityType: ON_DEMAND
      containers:
      - name: exhaust-resources
        image: busybox
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        command: ["sh", "-c", "while true; do echo 'Running...'; sleep 30; done;"]
```

This will trigger a scale-up. When the scale-down happens, cross your fingers that the initial pod is not killed on the way: its annotation won't be respected at all.
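To observe the scale-down as it happens, something along these lines works (the deployment name and namespace are assumptions about a typical install, and the exact log phrasing varies by autoscaler version):

```sh
# Watch nodes come and go while the Deployment scales up and back down
kubectl get nodes -w

# Follow the autoscaler's scale-down reasoning
kubectl -n kube-system logs -f deployment/cluster-autoscaler | grep -iE "scale.?down|removing node"
```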

Anything else we need to know?:

I have a couple of hypotheses. One is that instance scale-in protection on the ASG is disabled by default, and that this takes precedence over anything the autoscaler decides.
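I haven't confirmed this is the cause, but for what it's worth the setting can be inspected and toggled from the AWS CLI; the ASG and instance names below are placeholders:

```sh
# Check whether scale-in protection is enabled for new instances in the ASG
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-node-group-asg \
  --query 'AutoScalingGroups[0].NewInstancesProtectedFromScaleIn'

# Protect instances that are already running
aws autoscaling set-instance-protection \
  --auto-scaling-group-name my-node-group-asg \
  --instance-ids i-0123456789abcdef0 \
  --protected-from-scale-in
```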

Another is that the annotation should be set at the Deployment level, because my very important workload runs as a bare pod (no ReplicaSet/Deployment on top of it).
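For reference, even with a Deployment the autoscaler reads the annotation from the pods themselves, so it would go on the pod template rather than on the Deployment's own metadata. A minimal sketch, with illustrative names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: very-important-workload   # illustrative name
  namespace: awx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: very-important-workload
  template:
    metadata:
      labels:
        app: very-important-workload
      annotations:
        # Must live on the pod template so it ends up on the pods the autoscaler inspects
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      containers:
        - name: worker
          image: quay.io/ansible/awx-ee:23.1.0
```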

blueprismo avatar Sep 05 '24 15:09 blueprismo

/area cluster-autoscaler

adrianmoisey avatar Sep 06 '24 19:09 adrianmoisey

Does anyone know the latest version of cluster-autoscaler that doesn't have this bug?

erdincmemsource avatar Oct 22 '24 19:10 erdincmemsource

Does anyone know the latest version of cluster-autoscaler that doesn't have this bug?

I don't know, but I managed to work around it by setting a PodDisruptionBudget with an absurdly high minAvailable.
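For anyone wanting the same workaround, a minimal sketch of such a PDB, assuming the pod carries a label like `app: awx-worker` to match on:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: protect-awx-worker   # illustrative name
  namespace: awx
spec:
  # With minAvailable at or above the number of matching pods,
  # the eviction API refuses voluntary evictions of them
  minAvailable: 1
  selector:
    matchLabels:
      app: awx-worker        # assumes the pod is labelled like this
```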

blueprismo avatar Oct 23 '24 14:10 blueprismo

@blueprismo we have experienced a similar issue; we suspended the AZRebalance process (under ASG -> Advanced Configuration) on the ASG itself. We suspected this process was killing nodes to rebalance across availability zones (outside of the autoscaler's control), causing behavior that looked as though the autoscaler wasn't respecting the safe-to-evict annotation.
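In case it's useful to others, the same suspension can be done from the AWS CLI (the ASG name is a placeholder):

```sh
# Stop the ASG from terminating nodes just to rebalance across AZs
aws autoscaling suspend-processes \
  --auto-scaling-group-name my-node-group-asg \
  --scaling-processes AZRebalance
```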

SeanZG8 avatar Oct 23 '24 18:10 SeanZG8

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 21 '25 18:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 20 '25 19:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 22 '25 19:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Mar 22 '25 19:03 k8s-ci-robot