Pods not triggering scale up, but no reason given
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: v1.21.1
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-0389ca3", GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
What environment is this in?:
AWS
What did you expect to happen?:
I created some new pods that can't be scheduled because there isn't sufficient memory on nodes in the node group. There's plenty of room in the node group (currently at 1 node, max 8 nodes), so I expected the autoscaler to scale up.
What happened instead?:
The autoscaler is not scaling up and isn't giving any reason why:
I0114 17:07:43.794420 1 static_autoscaler.go:228] Starting main loop
I0114 17:07:44.280909 1 auto_scaling_groups.go:351] Regenerating instance to ASG map for ASGs: []
I0114 17:07:44.280932 1 auto_scaling.go:199] 0 launch configurations already in cache
I0114 17:07:44.280941 1 aws_manager.go:269] Refreshed ASG list, next refresh after 2022-01-14 17:08:44.280937364 +0000 UTC m=+485973.490954361
I0114 17:07:44.281150 1 filter_out_schedulable.go:65] Filtering out schedulables
I0114 17:07:44.281166 1 filter_out_schedulable.go:132] Filtered out 0 pods using hints
I0114 17:07:44.281584 1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I0114 17:07:44.281594 1 filter_out_schedulable.go:171] 0 pods marked as unschedulable can be scheduled.
I0114 17:07:44.281601 1 filter_out_schedulable.go:82] No schedulable pods
I0114 17:07:44.281615 1 klogx.go:86] Pod default/server-int-8bfcfccd-tm4v6 is unschedulable
I0114 17:07:44.281625 1 klogx.go:86] Pod default/server-int-8bfcfccd-pphd6 is unschedulable
I0114 17:07:44.281629 1 klogx.go:86] Pod default/server-int-8bfcfccd-5g8sn is unschedulable
I0114 17:07:44.281634 1 klogx.go:86] Pod default/server-int-6858f54b59-v8pm5 is unschedulable
I0114 17:07:44.281639 1 klogx.go:86] Pod default/server-int-6858f54b59-nhbzv is unschedulable
I0114 17:07:44.281644 1 klogx.go:86] Pod default/server-int-6858f54b59-lxs5b is unschedulable
I0114 17:07:44.281650 1 klogx.go:86] Pod default/server-int-8bfcfccd-jhm2q is unschedulable
I0114 17:07:44.281655 1 klogx.go:86] Pod default/server-int-6858f54b59-ggvww is unschedulable
I0114 17:07:44.281664 1 klogx.go:86] Pod default/server-int-8bfcfccd-bj6vp is unschedulable
I0114 17:07:44.281699 1 scale_up.go:376] Upcoming 0 nodes
I0114 17:07:44.282038 1 scale_up.go:453] No expansion options
I0114 17:07:44.282102 1 static_autoscaler.go:448] Calculating unneeded nodes
I0114 17:07:44.282120 1 pre_filtering_processor.go:57] Skipping ip-10-0-0-19.us-west-2.compute.internal - no node group config
I0114 17:07:44.282126 1 pre_filtering_processor.go:57] Skipping ip-10-0-0-162.us-west-2.compute.internal - no node group config
I0114 17:07:44.282149 1 static_autoscaler.go:502] Scale down status: unneededOnly=false lastScaleUpTime=2022-01-09 02:09:19.449955836 +0000 UTC m=+8.659972824 lastScaleDownDeleteTime=2022-01-09 02:09:19.449955917 +0000 UTC m=+8.659972906 lastScaleDownFailTime=2022-01-09 02:09:19.449956001 +0000 UTC m=+8.659972987 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0114 17:07:44.282178 1 static_autoscaler.go:515] Starting scale down
I0114 17:07:44.282220 1 scale_down.go:917] No candidates for scale down
Lots of unschedulable pods, but "Upcoming 0 nodes", "No expansion options", and no other meaningful logs here. If I kubectl describe one of the pods in question:
Events:
  Type     Reason             Age                  From                Message
  ----     ------             ----                 ----                -------
  Normal   NotTriggerScaleUp  77s (x121 over 21m)  cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling   64s (x19 over 21m)   default-scheduler   0/2 nodes are available: 2 Insufficient memory.
Note the "pod didn't trigger scale-up:" message, but no reason is given.
Here's the status configmap:
$ kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml
apiVersion: v1
data:
  status: |
    Cluster-autoscaler status at 2022-01-14 17:10:35.397165135 +0000 UTC:
    Cluster-wide:
      Health: Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=0)
        LastProbeTime: 2022-01-14 17:10:35.395769547 +0000 UTC m=+486084.605786538
        LastTransitionTime: 2022-01-09 02:09:29.450169002 +0000 UTC m=+18.660185990
      ScaleUp: NoActivity (ready=2 registered=2)
        LastProbeTime: 2022-01-14 17:10:35.395769547 +0000 UTC m=+486084.605786538
        LastTransitionTime: 2022-01-09 02:09:29.450169002 +0000 UTC m=+18.660185990
      ScaleDown: NoCandidates (candidates=0)
        LastProbeTime: 2022-01-14 17:10:35.395769547 +0000 UTC m=+486084.605786538
        LastTransitionTime: 2022-01-09 02:09:29.450169002 +0000 UTC m=+18.660185990
I have the correct tags configured on my nodegroups:
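(The screenshot isn't reproduced here. For reference, these are the standard auto-discovery tags the autoscaler looks for on the ASGs; the ASG name and cluster name below are placeholders, not my actual values:)
$ aws autoscaling create-or-update-tags --tags \
    "ResourceId=<my-asg>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true" \
    "ResourceId=<my-asg>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/<my-cluster>,Value=owned,PropagateAtLaunch=true"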

How to reproduce it (as minimally and precisely as possible):
This was because of misconfiguration; I installed autoscaler via helm with:
helm repo add autoscaler https://kubernetes.github.io/autoscaler && \
helm upgrade --install auto-scaler \
https://github.com/kubernetes/autoscaler/releases/download/cluster-autoscaler-chart-9.19.0/cluster-autoscaler-9.19.0.tgz \
--namespace kube-system \
--set 'image.tag'=v1.22.2 \
--set 'awsRegion'=us-west-2 \
--set 'autoDiscovery.clusterName'=${CLUSTER_NAME} \
--set 'rbac.serviceAccount.create'=false \
--set 'rbac.serviceAccount.name'=cluster-autoscaler
Without the --set 'awsRegion' flag, this reproduces the problem above. Adding it resolved the problem, but the error messages here could definitely be clearer.
I have exactly the same issue. Even though many of my nodes are under disk pressure and lots of pods cannot be scheduled, the autoscaler is still not scaling up:
I0211 15:35:46.498813 1 scale_up.go:288] Pod restic-2r6cr can't be scheduled on eks-d6bf7003-b4ee-d0a4-ad23-a23ac5c64c8e, predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=
I0211 15:35:46.498845 1 scale_up.go:437] No pod can fit to eks-d6bf7003-b4ee-d0a4-ad23-a23ac5c64c8e
I0211 15:35:46.498872 1 scale_up.go:441] No expansion options
All the annotations are OK, and the autoscaler was already able to scale down multiple nodes, but it never scaled up...
It is quite a big issue, as the autoscaler is barely usable in these conditions. Any plans to investigate it?
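For context on that predicate failure, here is not the actual manifest from the cluster above but a hypothetical pod fragment with the kind of required nodeAffinity that fails the NodeAffinity check when the node group's template nodes don't carry the label:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: workload-type   # hypothetical label, assumed absent from the node group's launch template
          operator: In
          values:
          - backup
If no candidate node group can produce a node matching the affinity, the autoscaler's simulation rejects them all and logs "No expansion options".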
I have the same issue; for me it's neither scaling down nor up.
Still an issue on 1.21.2. I also downgraded to 1.21.0 and hit the same issue.
Any update on this?
I eventually got this working. I can't remember exactly which change solved it, but I think it was related to the amount of resources I'd assigned to the pods. I had set up an over-provisioning pod which wasn't being picked up: I had told it to request basically all of a node's CPU, and the autoscaler couldn't provision for that, since no node could actually allocate it. Possibly related?
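For anyone trying the same thing, here is a minimal sketch of the usual overprovisioning pattern from the cluster-autoscaler FAQ (names, image, and sizes are illustrative, not my actual config); the key point, as noted above, is keeping the placeholder pod's CPU request below a single node's allocatable CPU:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning        # illustrative name
value: -10                      # negative priority so real workloads preempt the placeholder
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"            # keep well below a node's allocatable CPU, not the whole node
            memory: 64Mi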
Hi, is there any update on this issue?
Faced a similar unclear issue.
We use 1.20.2 and are hitting an issue where scale-up is not working.
We have 5 ASGs, each in a different AWS AZ, with the same instance type list:
m5ad.xlarge 4 vCPUs
m5.xlarge 4 vCPUs
m4.xlarge 4 vCPUs
m6i.xlarge 4 vCPUs
m5d.xlarge 4 vCPUs
m5dn.xlarge 4 vCPUs
m5a.xlarge 4 vCPUs
m5n.xlarge 4 vCPUs
m5zn.xlarge 4 vCPUs
but scheduling a Pod that requests 3 CPU cores does not work :(
resources:
  limits:
    cpu: 3
    memory: 128Mi
  requests:
    cpu: 3
    memory: 128Mi
The same issue occurs for "cpu: 2900m".
Error 🚫
pod didn't trigger scale-up: 2 max node group size reached, 3 Insufficient cpu
but "cpu: 2800m" request works and triggers scale-up ✅
pod triggered scale-up: [{eks-XXXX 5->6 (max: 7)}]
It's strange.
I see that all of these instance types exist in https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.20.2/cluster-autoscaler/cloudprovider/aws/ec2_instance_types.go
but I don't understand why scale-up is not working. Maybe someone can explain?
Looks like my use case is resolved: I checked the Node Allocatable CPU minus the CPU requested by all DaemonSets, and the maximum schedulable CPU request comes out close to the ~2800m from my previous message.
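Roughly, that headroom can be sanity-checked like this (<node-name> is a placeholder):
$ kubectl describe node <node-name> | grep -A 7 'Allocatable:'
$ kubectl describe node <node-name> | grep -A 12 'Allocated resources:'
# The allocatable cpu minus the cpu already requested by DaemonSet pods is roughly
# the largest CPU request a fresh node of this instance type can still satisfy.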
Aha!
Skipping ip-10-0-0-19.us-west-2.compute.internal - no node group config
It's finding the node groups, but it can't find any configuration for them. Why? In my case, because my node groups are in us-west-2, but when I installed autoscaler via helm I didn't add --set 'awsRegion'='us-west-2', so it's defaulting to looking for config info in us-east-1. -_-
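One way to catch this: the helm chart wires its awsRegion value into an AWS_REGION env var on the deployment (at least in the chart versions I've checked), so you can inspect what the pod actually got; the deployment name below is a placeholder:
$ kubectl -n kube-system get deploy | grep cluster-autoscaler
$ kubectl -n kube-system get deploy <autoscaler-deployment-name> -o yaml | grep -A 1 AWS_REGION
# If this doesn't show the region the node group ASGs live in, auto-discovery
# queries the wrong region and finds no ASGs.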
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this:
  The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
  This bot triages issues according to the following rules:
  - After 90d of inactivity, lifecycle/stale is applied
  - After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  - After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
  You can:
  - Reopen this issue with /reopen
  - Mark this issue as fresh with /remove-lifecycle rotten
  - Offer to help out with Issue Triage
  Please send feedback to sig-contributor-experience at kubernetes/community.
  /close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Same issue with Kubernetes 1.24 (OCP 4.11) and Nutanix.
REF: https://www.nutanix.dev/2022/08/16/red-hat-openshift-ipi-on-nutanix-cloud-platform/