
Cluster Autoscaler does not work with mixed instance weights

fcosta-td opened this issue 2 years ago · 3 comments

Which component are you using?: cluster-autoscaler

What version of the component are you using?: 1.23.0

What k8s version are you using (kubectl version)?: 1.21.5

What environment is this in?: AWS

We have an EKS ASG with mixed instances and different weights:

asg = {
  general = {
    instance_type           = "c5.2xlarge"
    desired_capacity        = null
    max_size                = 10
    min_size                = 2
    on_demand_base_capacity = 2
    spot_instances          = true
    mixed_instances = [
      {
        type   = "c5.2xlarge"
        weight = 1
      },
      {
        type   = "c5a.2xlarge"
        weight = 1
      },
      {
        type   = "c5.4xlarge"
        weight = 2
      },
      {
        type   = "c5a.4xlarge"
        weight = 2
      },
    ]
    labels = {
      "node.mycompany.com/local-gpu"  = false
      "node.mycompany.com/local-nvme" = false
      "node.mycompany.com/nodegroup"  = "general"
    }
  }
}

What did you expect to happen?:

CA would launch the necessary nodes.

What happened instead?:

CA did not launch additional nodes because of the instance weights: one instance with a weight of 2 plus two instances with a weight of 1 are reported as 4 capacity units, while only 3 nodes are actually ready, so Cluster Autoscaler assumes a fourth node is still coming up and skips the scale-up.

ScaleUp:     NoActivity (ready=3 cloudProviderTarget=4)
             LastProbeTime:      2022-05-04 10:28:04.758669894 +0000 UTC m=+134.153075785
             LastTransitionTime: 2022-05-04 10:26:24.013344295 +0000 UTC m=+33.407750119

Log:

I0504 10:13:15.501163       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2022-05-04 10:14:15.501159899 +0000 UTC m=+3798493.211866322
I0504 10:13:15.501924       1 static_autoscaler.go:319] 1 unregistered nodes present
I0504 10:13:15.502103       1 filter_out_schedulable.go:65] Filtering out schedulables
I0504 10:13:15.502122       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0504 10:13:15.502169       1 filter_out_schedulable.go:157] Pod logstash-logging.logstash-logging-8 marked as unschedulable can be scheduled on node template-node-for-eks-general01-ops-stg-us-east-1-<MY_ASG>-upcoming-0. Ignoring in scale up.
I0504 10:13:15.502188       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0504 10:13:15.502199       1 filter_out_schedulable.go:171] 1 pods marked as unschedulable can be scheduled.
I0504 10:13:15.502210       1 filter_out_schedulable.go:79] Schedulable pods present
I0504 10:13:15.502233       1 static_autoscaler.go:401] No unschedulable pods
I0504 10:13:15.502251       1 static_autoscaler.go:448] Calculating unneeded nodes
Events:
  Type     Reason            Age                From                                 Message
  ----     ------            ----               ----                                 -------
  Warning  FailedScheduling  26s (x2 over 28s)  default-scheduler                    0/3 nodes are available: 3 Insufficient memory.
  Normal   Synced            16s (x3 over 28s)  ops-stg-aws-us-east-1-eks-general01  Pod synced successfully
  Normal   Synced            14s (x3 over 28s)  ops-stg-aws-us-east-1-eks-general01  Pod synced successfully

fcosta-td · May 04 '22 13:05

This is as-designed. Cluster Autoscaler expects that every node in a node group (an ASG in this case) is identical, so using different weights (or, more fundamentally, different instance types) in a single ASG will not work.

drmorr0 · Jun 09 '22 21:06
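For anyone hitting the same symptom, the usual way around it is to make each ASG homogeneous so that the ready node count and cloudProviderTarget line up one to one: drop the per-type weights and split the group by instance size, keeping only a single type (or types with identical vCPU and memory) per group. Below is a rough sketch of how the asg map from the report above could be split; the group names, the 4xlarge sizing, and the label values are illustrative assumptions, not taken from this issue.

# Illustrative sketch: one node group per instance size, uniform weight of 1,
# so Cluster Autoscaler can treat every node in a group as identical.
asg = {
  general_2xlarge = {
    instance_type           = "c5.2xlarge"
    desired_capacity        = null
    max_size                = 10
    min_size                = 2
    on_demand_base_capacity = 2
    spot_instances          = true
    mixed_instances = [
      # c5.2xlarge and c5a.2xlarge have the same vCPU/memory, so the node
      # template CA builds for this group still matches real capacity.
      { type = "c5.2xlarge", weight = 1 },
      { type = "c5a.2xlarge", weight = 1 },
    ]
    labels = {
      "node.mycompany.com/nodegroup" = "general-2xlarge" # assumed label value
    }
  }

  general_4xlarge = {
    instance_type           = "c5.4xlarge"
    desired_capacity        = null
    max_size                = 5 # assumed; size the group to your workload
    min_size                = 0
    on_demand_base_capacity = 0
    spot_instances          = true
    mixed_instances = [
      { type = "c5.4xlarge", weight = 1 },
      { type = "c5a.4xlarge", weight = 1 },
    ]
    labels = {
      "node.mycompany.com/nodegroup" = "general-4xlarge" # assumed label value
    }
  }
}

If you go further and give each instance type its own ASG, cluster-autoscaler's --balance-similar-node-groups flag can keep scale-ups spread evenly across groups that share the same shape and labels.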

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Sep 07 '22 22:09

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Oct 07 '22 22:10

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Nov 06 '22 22:11

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

[quotes the k8s-triage-robot's /close not-planned comment above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Nov 06 '22 22:11