
Pods not triggering scale up, but no reason given

jwalton opened this issue 3 years ago · 11 comments

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: v1.21.1

What k8s version are you using (kubectl version)?:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS

What did you expect to happen?:

I created some new pods that can't be scheduled because there isn't sufficient memory on nodes in the node group. There's plenty of room in the node group (currently at 1 node, max 8 nodes), so I expected the autoscaler to scale up.

What happened instead?:

The autoscaler is not scaling up and isn't giving any reason why:

I0114 17:07:43.794420       1 static_autoscaler.go:228] Starting main loop
I0114 17:07:44.280909       1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: []
I0114 17:07:44.280932       1 auto_scaling.go:199] 0 launch configurations already in cache
I0114 17:07:44.280941       1 aws_manager.go:269] Refreshed ASG list, next refresh after 2022-01-14 17:08:44.280937364 +0000 UTC m=+485973.490954361
I0114 17:07:44.281150       1 filter_out_schedulable.go:65] Filtering out schedulables
I0114 17:07:44.281166       1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0114 17:07:44.281584       1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0114 17:07:44.281594       1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0114 17:07:44.281601       1 filter_out_schedulable.go:82] No schedulable pods
I0114 17:07:44.281615       1 klogx.go:86] Pod default/server-int-8bfcfccd-tm4v6 is unschedulable
I0114 17:07:44.281625       1 klogx.go:86] Pod default/server-int-8bfcfccd-pphd6 is unschedulable
I0114 17:07:44.281629       1 klogx.go:86] Pod default/server-int-8bfcfccd-5g8sn is unschedulable
I0114 17:07:44.281634       1 klogx.go:86] Pod default/server-int-6858f54b59-v8pm5 is unschedulable
I0114 17:07:44.281639       1 klogx.go:86] Pod default/server-int-6858f54b59-nhbzv is unschedulable
I0114 17:07:44.281644       1 klogx.go:86] Pod default/server-int-6858f54b59-lxs5b is unschedulable
I0114 17:07:44.281650       1 klogx.go:86] Pod default/server-int-8bfcfccd-jhm2q is unschedulable
I0114 17:07:44.281655       1 klogx.go:86] Pod default/server-int-6858f54b59-ggvww is unschedulable
I0114 17:07:44.281664       1 klogx.go:86] Pod default/server-int-8bfcfccd-bj6vp is unschedulable
I0114 17:07:44.281699       1 scale_up.go:376] Upcoming 0 nodes
I0114 17:07:44.282038       1 scale_up.go:453] No expansion options
I0114 17:07:44.282102       1 static_autoscaler.go:448] Calculating unneeded nodes
I0114 17:07:44.282120       1 pre_filtering_processor.go:57] Skipping ip-10-0-0-19.us-west-2.compute.internal - no node group config
I0114 17:07:44.282126       1 pre_filtering_processor.go:57] Skipping ip-10-0-0-162.us-west-2.compute.internal - no node group config
I0114 17:07:44.282149       1 static_autoscaler.go:502] Scale down status: unneededOnly=false lastScaleUpTime=2022-01-09 02:09:19.449955836 +0000 UTC m=+8.659972824 lastScaleDownDeleteTime=2022-01-09 02:09:19.449955917 +0000 UTC m=+8.659972906 lastScaleDownFailTime=2022-01-09 02:09:19.449956001 +0000 UTC m=+8.659972987 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0114 17:07:44.282178       1 static_autoscaler.go:515] Starting scale down
I0114 17:07:44.282220       1 scale_down.go:917] No candidates for scale down

Lots of unschedulable pods, but "Upcoming 0 nodes", "No expansion options", and no other meaningful logs here. If I kubectl describe one of the pods in question:

Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Normal   NotTriggerScaleUp  77s (x121 over 21m)  cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling   64s (x19 over 21m)   default-scheduler   0/2 nodes are available: 2 Insufficient memory.

Note the "pod didn't trigger scale-up:" event, but no reason is given after the colon.
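Since the event carries no reason, the only remaining signal is the autoscaler's own log at higher verbosity. A minimal sketch of raising it through the chart (an assumption: the chart forwards its extraArgs values to the container as CLI flags; the release name and chart URL are the ones from the helm command further down, and the deployment name may differ depending on the release):

# Raise klog verbosity (sketch; assumes extraArgs entries become --<key>=<value> flags)
helm upgrade --install auto-scaler \
  https://github.com/kubernetes/autoscaler/releases/download/cluster-autoscaler-chart-9.19.0/cluster-autoscaler-9.19.0.tgz \
  --namespace kube-system \
  --reuse-values \
  --set extraArgs.v=6

# Then follow the scale-up decisions as they are made (adjust the deployment name to your release)
kubectl -n kube-system logs deploy/auto-scaler-aws-cluster-autoscaler --tail=100 -f | grep -i scale_up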

Here's the status configmap:

$ kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
apiVersion: v1
data:
  status: |
    Cluster-autoscaler status at 2022-01-14 17:10:35.397165135 +0000 UTC:
    Cluster-wide:
      Health:      Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=0)
                   LastProbeTime:      2022-01-14 17:10:35.395769547 +0000 UTC m=+486084.605786538
                   LastTransitionTime: 2022-01-09 02:09:29.450169002 +0000 UTC m=+18.660185990
      ScaleUp:     NoActivity (ready=2 registered=2)
                   LastProbeTime:      2022-01-14 17:10:35.395769547 +0000 UTC m=+486084.605786538
                   LastTransitionTime: 2022-01-09 02:09:29.450169002 +0000 UTC m=+18.660185990
      ScaleDown:   NoCandidates (candidates=0)
                   LastProbeTime:      2022-01-14 17:10:35.395769547 +0000 UTC m=+486084.605786538
                   LastTransitionTime: 2022-01-09 02:09:29.450169002 +0000 UTC m=+18.660185990

I have the correct tags configured on my nodegroups:

[screenshot of the cluster-autoscaler tags on the node groups]
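For reference, auto-discovery keys off two ASG tags, k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster-name>. A quick way to double-check they are really on the ASGs (a sketch; run it in the region the node groups live in, and CLUSTER_NAME is the same variable used in the helm command below):

# List the auto-discovery tags on the ASGs (sketch; the region here is an assumption)
aws autoscaling describe-tags --region us-west-2 \
  --filters "Name=key,Values=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${CLUSTER_NAME}" \
  --output table

Each node group's ASG should show up with both keys; an empty result means either the tags or the region is wrong.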

How to reproduce it (as minimally and precisely as possible):

This turned out to be caused by a misconfiguration; I installed the autoscaler via helm with:

  helm repo add autoscaler https://kubernetes.github.io/autoscaler && \
    helm upgrade --install auto-scaler \
      https://github.com/kubernetes/autoscaler/releases/download/cluster-autoscaler-chart-9.19.0/cluster-autoscaler-9.19.0.tgz \
      --namespace kube-system \
      --set 'image.tag'=v1.22.2 \
+     --set 'awsRegion'=us-west-2 \
      --set 'autoDiscovery.clusterName'=${CLUSTER_NAME} \
      --set 'rbac.serviceAccount.create'=false \
      --set 'rbac.serviceAccount.name'=cluster-autoscaler \

Without the --set 'awsRegion' line, this reproduces the problem above. Adding it resolved the problem, but the error messages here could definitely be clearer.
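One way to confirm the region actually reached the running container after the upgrade (a sketch; it assumes the chart surfaces awsRegion as an AWS_REGION environment variable, and the deployment name depends on the release name, so adjust it):

# Verify which AWS region the deployed autoscaler was given (sketch; adjust the deployment name)
kubectl -n kube-system describe deploy auto-scaler-aws-cluster-autoscaler | grep -i 'AWS_REGION'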

jwalton · Jan 14 '22

I have exactly the same issue. Even though many of my nodes are under disk pressure and lots of pods cannot be scheduled, the autoscaler is still not scaling up:

I0211 15:35:46.498813       1 scale_up.go:288] Pod restic-2r6cr can't be scheduled on eks-d6bf7003-b4ee-d0a4-ad23-a23ac5c64c8e, predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=
I0211 15:35:46.498845       1 scale_up.go:437] No pod can fit to eks-d6bf7003-b4ee-d0a4-ad23-a23ac5c64c8e
I0211 15:35:46.498872       1 scale_up.go:441] No expansion options
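The predicate error above is about node affinity rather than capacity, so it is worth comparing what the pod demands with the labels nodes from that group actually carry. A minimal kubectl sketch (the pod name is taken from the log above; add -n <namespace> for whichever namespace the restic pods run in):

# What node affinity / node selector does the pod actually ask for?
kubectl get pod restic-2r6cr -o jsonpath='{.spec.affinity}{"\n"}{.spec.nodeSelector}{"\n"}'

# Which labels do the existing nodes expose? Compare against the affinity terms above.
kubectl get nodes --show-labels

One possible cause: if the required label is only applied by the kubelet at join time and is not advertised on the ASG (for groups at zero, the AWS provider reads k8s.io/cluster-autoscaler/node-template/label/<label> tags), the scale-up simulation cannot see it.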

All the annotations are OK, and the autoscaler was already able to scale down multiple nodes, but it never scales up...

This is quite a big issue, as the autoscaler is barely usable in these conditions. Any plan to investigate it?

headyj · Feb 11 '22

I have the same issue; for me it's neither scaling down nor up.

vikas-shaw · Feb 13 '22

Still an issue on 1.21.2. I also downgraded to 1.21.0 and hit the same issue.

pareeohnos · Feb 19 '22

Any update on this?

ntman4real · Mar 28 '22

I eventually got this working. I can't remember exactly which change solved it, but I think it was related to the amount of resources I'd assigned to the pods. I had set up an over-provisioning pod which wasn't being picked up: I had told it to request basically all of the CPU of a node, but the autoscaler wasn't able to provision for that, since no node could actually allocate that much. Possibly related?

pareeohnos · Mar 29 '22

Hi, is there any update on this issue?

PhilipPenquitt · May 12 '22

Faced with a similar, unclear issue.

We use 1.20.2 and are facing an issue where scale-up is not working:

We have 5 ASGs, each in a different AWS AZ, all with the same instance type list:

m5ad.xlarge 4 vCPUs
m5.xlarge 4 vCPUs
m4.xlarge 4 vCPUs
m6i.xlarge 4 vCPUs
m5d.xlarge 4 vCPUs
m5dn.xlarge 4 vCPUs
m5a.xlarge 4 vCPUs
m5n.xlarge 4 vCPUs
m5zn.xlarge 4 vCPUs

but scheduling a Pod that requests 3 CPU cores is not working :(

resources:
  limits:
    cpu: 3
    memory: 128Mi
  requests:
    cpu: 3
    memory: 128Mi

The same issue occurs with "cpu: 2900m".

Error 🚫

pod didn't trigger scale-up: 2 max node group size reached, 3 Insufficient cpu

but "cpu: 2800m" request works and triggers scale-up ✅

pod triggered scale-up: [{eks-XXXX 5->6 (max: 7)}]

It's strange.

I see that all of these instance types exist here: https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.20.2/cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go

but I don't understand why scale-up is not working. Maybe someone can explain?

azhurbilo · May 25 '22

Looks like my use case is resolved: I checked Node Allocatable CPU minus the CPU requested by all DaemonSets, and the maximum schedulable CPU request comes out close to the ~2800m from my previous message.
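For anyone hitting the same ceiling, a rough way to see it (a minimal sketch; <node-name> is a placeholder for one of the group's existing nodes):

# Allocatable CPU per node, i.e. what the scheduler can hand out after system reservations
kubectl get nodes -o custom-columns='NODE:.metadata.name,CPU:.status.allocatable.cpu'

# What is already requested on a given node (DaemonSets included); the difference from the
# allocatable figure is roughly the largest CPU request a new pod can make on a node this size
kubectl describe node <node-name> | grep -A 8 'Allocated resources'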

azhurbilo · May 26 '22

Aha!

Skipping ip-10-0-0-19.us-west-2.compute.internal - no node group config

It's finding the nodes, but it can't find any node group configuration for them. Why? In my case, because my node groups are in us-west-2, but when I installed the autoscaler via helm I didn't add --set 'awsRegion'='us-west-2', so it defaulted to looking for them in us-east-1. -_-
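That wrong-region case is also visible in the log excerpt at the top of this issue: the ASG refresh comes back empty because nothing is discovered in us-east-1. A hedged way to spot it quickly (the deployment name depends on the helm release, so adjust):

# An empty ASG list here means auto-discovery found nothing in the region the autoscaler queries
kubectl -n kube-system logs deploy/auto-scaler-aws-cluster-autoscaler | grep 'Regenerating instance to ASG map'
# e.g. "Regenerating instance to ASG map for ASGs: []"   <- wrong region or missing tags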

jwalton · Jun 03 '22

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · Sep 01 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · Oct 01 '22

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot · Oct 31 '22

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · Oct 31 '22

Same issue with Kubernetes 1.24 (OCP 4.11) and Nutanix.

REF: https://www.nutanix.dev/2022/08/16/red-hat-openshift-ipi-on-nutanix-cloud-platform/

abdennour · Dec 08 '22