Cluster autoscaler needs to respect topologySpreadConstraints
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: v1.25.0
What k8s version are you using (kubectl version)?:
```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:16:20Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"24+", GitVersion:"v1.24.8-eks-ffeb93d", GitCommit:"abb98ec0631dfe573ec5eae40dc48fd8f2017424", GitTreeState:"clean", BuildDate:"2022-11-29T18:45:03Z", GoVersion:"go1.18.8", Compiler:"gc", Platform:"linux/amd64"}
```
What environment is this in?:
EKS
What did you expect to happen?:
We've set topology spread constraints on our workload, and we expect nodes to be created across multiple AZs.
What happened instead?:
Nodes are being created under a single AZ.
You can see from the snapshot above that workloads are all scheduled to nodegroup m6g-2xlarge-tidbcloud-system-eks-us-west-2-4b35b408-us-west-2c. No nodes are created under m6g-2xlarge-tidbcloud-system-eks-us-west-2-4b35b408-us-west-2a and m6g-2xlarge-tidbcloud-system-eks-us-west-2-4b35b408-us-west-2b.
How to reproduce it (as minimally and precisely as possible):
Create three nodegroups that span three AZs, with minimum node count set to 0, desired node count set to 0, and maximum node count set to 400. Create a workload with the following topology spread constraint:
```yaml
topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/component: tikv
        app.kubernetes.io/instance: db
        app.kubernetes.io/managed-by: tidb-operator
        app.kubernetes.io/name: tidb-cluster
    maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
```
And what we found is that workloads are being created under a single AZ.
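For reference, a hypothetical eksctl sketch of the node-group layout described above (cluster name, node-group names, and instance type are placeholders):
```yaml
# Hypothetical eksctl ClusterConfig excerpt illustrating the reproduction setup:
# three single-AZ node groups, each allowed to scale from 0 up to 400 nodes.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: repro-cluster              # placeholder
  region: us-west-2
nodeGroups:
  - name: m6g-2xlarge-us-west-2a   # placeholder names
    instanceType: m6g.2xlarge
    availabilityZones: ["us-west-2a"]
    minSize: 0
    desiredCapacity: 0
    maxSize: 400
  - name: m6g-2xlarge-us-west-2b
    instanceType: m6g.2xlarge
    availabilityZones: ["us-west-2b"]
    minSize: 0
    desiredCapacity: 0
    maxSize: 400
  - name: m6g-2xlarge-us-west-2c
    instanceType: m6g.2xlarge
    availabilityZones: ["us-west-2c"]
    minSize: 0
    desiredCapacity: 0
    maxSize: 400
```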
Anything else we need to know?:
We suspect that it's related to the fact that we're scaling all node groups from zero.
You need to specify the label topology.kubernetes.io/zone as a tag on your VMSS: k8s.io_cluster-autoscaler_node-template_label_topology.kubernetes.io_zone.
I think we are seeing this as well on EKS.
The screenshot above is also from AWS, so we are not using VMSS but an ASG instead.
During scheduling, I see this:
```
Events:
  Type     Reason             Age    From                Message
  ----     ------             ----   ----                -------
  Warning  FailedScheduling   7m17s  default-scheduler   0/4 nodes are available: 4 node(s) didn't satisfy existing pods anti-affinity rules.
  Normal   NotTriggerScaleUp  7m15s  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't satisfy existing pods anti-affinity rules
  Normal   Scheduled          7m1s   default-scheduler   Successfully assigned namespace/pod-6cc7699fdd-dxfxs to ip-1-1-1-1.region.compute.internal
```
I have been scratching my head as to why topologySpreadConstraints isn't working on our cluster. Is this the reason?
You currently have 0 nodes in some of your node groups, which means that the autoscaler has no idea how many zones exist in your cluster. In your case, it only sees 1 zone and keeps spinning up all nodes in a single zone to satisfy the constraint.
Until EKS supports minDomains, you should ensure that at least 1 node is running in each zone so that the autoscaler is aware of the number of zones and can scale properly.
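As a workaround for scaling from zero, the cluster-autoscaler FAQ also documents a node-template ASG tag that tells the autoscaler which zone label nodes from an empty group will carry. A hypothetical eksctl excerpt (node-group name and zone are placeholders; on Azure the equivalent VMSS tag uses underscores, as noted above):
```yaml
# Hypothetical node-group excerpt: the ASG tag below lets the autoscaler build a
# node template with topology.kubernetes.io/zone even while the group has 0 nodes.
nodeGroups:
  - name: m6g-2xlarge-us-west-2a       # placeholder
    availabilityZones: ["us-west-2a"]
    minSize: 0
    maxSize: 400
    tags:
      k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: us-west-2a
```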
The minDomains field is enabled by default in 1.27, but CA (tested with cluster-autoscaler:v1.27.2) does not spin up a new node.
Spread constraint definition:
```yaml
topologySpreadConstraints:
  - maxSkew: 1
    minDomains: 2
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: nginx
```
CA logs:
```
I0607 19:39:51.897194 1 static_autoscaler.go:289] Starting main loop
I0607 19:39:51.897545 1 aws_manager.go:185] Found multiple availability zones for ASG "eks-managed_ng_with_lt-62c44bae-1bd0-05aa-d82d-7f4643c608a3"; using eu-central-1a for failure-domain.beta.kubernetes.io/zone label
I0607 19:39:51.898019 1 aws_manager.go:185] Found multiple availability zones for ASG "eks-managed_ng_with_lt-62c44bae-1bd0-05aa-d82d-7f4643c608a3"; using eu-central-1a for failure-domain.beta.kubernetes.io/zone label
I0607 19:39:51.898205 1 filter_out_schedulable.go:63] Filtering out schedulables
I0607 19:39:51.898311 1 klogx.go:87] failed to find place for smash/nginx-deployment-66fb68bbf8-pjsxp: cannot put pod nginx-deployment-66fb68bbf8-pjsxp on any node
I0607 19:39:51.898322 1 filter_out_schedulable.go:120] 0 pods marked as unschedulable can be scheduled.
I0607 19:39:51.898333 1 filter_out_schedulable.go:83] No schedulable pods
I0607 19:39:51.898338 1 filter_out_daemon_sets.go:40] Filtering out daemon set pods
I0607 19:39:51.898343 1 filter_out_daemon_sets.go:49] Filtered out 0 daemon set pods, 1 unschedulable pods left
I0607 19:39:51.898355 1 klogx.go:87] Pod smash/nginx-deployment-66fb68bbf8-pjsxp is unschedulable
I0607 19:39:51.898370 1 orchestrator.go:109] Upcoming 0 nodes
I0607 19:39:51.898473 1 orchestrator.go:466] Pod nginx-deployment-66fb68bbf8-pjsxp can't be scheduled on eks-managed_ng_with_lt-62c44bae-1bd0-05aa-d82d-7f4643c608a3, predicate checking error: node(s) didn't match pod topology spread constraints; predicateName=PodTopologySpread; reasons: node(s) didn't match pod topology spread constraints; debugInfo=
I0607 19:39:51.898489 1 orchestrator.go:167] No pod can fit to eks-managed_ng_with_lt-62c44bae-1bd0-05aa-d82d-7f4643c608a3
I0607 19:39:51.898500 1 orchestrator.go:172] No expansion options
I0607 19:39:51.898532 1 static_autoscaler.go:575] Calculating unneeded nodes
I0607 19:39:51.898571 1 eligibility.go:154] Node ip-10-0-1-17.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.582941)
I0607 19:39:51.898607 1 static_autoscaler.go:623] Scale down status: lastScaleUpTime=2023-06-07 18:39:31.714030436 +0000 UTC m=-3576.680943426 lastScaleDownDeleteTime=2023-06-07 18:39:31.714030436 +0000 UTC m=-3576.680943426 lastScaleDownFailTime=2023-06-07 18:39:31.714030436 +0000 UTC m=-3576.680943426 scaleDownForbidden=false scaleDownInCooldown=false
I0607 19:39:51.898638 1 static_autoscaler.go:632] Starting scale down
I0607 19:39:51.898657 1 legacy.go:296] No candidates for scale down
```
The ASG currently has only one node in the eu-central-1a AZ but can add instances in three AZs (eu-central-1a, eu-central-1c and eu-central-1b), and "Capacity rebalance" is also enabled for the ASG.
Pretty sure you have to create a managed nodegroup/ASG per zone. Don't assign multiple zones to an ASG. It is documented in the autoscaler FAQ.
It works with one nodegroup per zone.
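When running one node group per zone, the autoscaler is typically also started with --balance-similar-node-groups so that otherwise-identical per-zone groups are scaled evenly. A minimal sketch of the relevant container args, assuming AWS ASG auto-discovery (image tag and cluster name are placeholders):
```yaml
# Hypothetical excerpt of the cluster-autoscaler Deployment: with one node group per
# zone, --balance-similar-node-groups keeps the per-zone groups at similar sizes.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.2   # placeholder tag
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --balance-similar-node-groups=true
      - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
```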
Running into the same issue with AKS. But as others have said, it's probably because I use multiple zones (3) within one node pool. Too bad this doesn't work, but it's actually logical, as the autoscaler can't guarantee that a node from a specific zone will be spun up.
https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.27.1/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0
You might need to attach a tag/label to your node pool as instructed ^
I have the same issue, but I'm not using the zone as the topology key; I'm using the hostname of the node:
```yaml
topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/instance: nginx
    maxSkew: 1
    minDomains: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
```
@rasta-rocket I believe you should change your maxSkew to more than 1. Using kubernetes.io/hostname as the topologyKey means each node may only hold up to maxSkew more matching pods than the least-loaded node. With maxSkew set to, say, 3, each node will take a maximum of 3 such pods; otherwise the pod is not scheduled and the autoscaler is triggered.
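For illustration, the suggestion above would look like this (purely an example; whether a larger maxSkew is actually desirable depends on how many replicas per node you can tolerate):
```yaml
# Example of the suggestion above: with kubernetes.io/hostname as the topology key,
# the skew between the most- and least-loaded nodes may reach 3 before the
# constraint blocks scheduling.
topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/instance: nginx
    maxSkew: 3
    minDomains: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
```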
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.