Scale up not triggered when a pod attached to a PV in a zone not belonging to any nodes
Which component are you using?:
cluster-autoscaler installed with helm chart using image registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.2
What version of the component are you using?:
Component version: Cluster autoscaler v1.26.2 (also tried with v1.27.3)
What k8s version are you using (kubectl version)?:
kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"archive", BuildDate:"2023-06-15T08:14:06Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.6-eks-a5565ad", GitCommit:"895ed80e0cdcca657e88e56c6ad64d4998118590", GitTreeState:"clean", BuildDate:"2023-06-16T17:34:03Z", GoVersion:"go1.19.10", Compiler:"gc", Platform:"linux/amd64"}
What environment is this in?:
Using AWS EKS
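For context, the install corresponds roughly to Helm values like the following sketch (the cluster name, region, and the use of ASG auto-discovery are placeholders/assumptions, not the exact values):

```yaml
# Rough sketch of the Helm values for the cluster-autoscaler chart
# (https://github.com/kubernetes/autoscaler/tree/master/charts/cluster-autoscaler).
# Cluster name and region are placeholders; auto-discovery is an assumption.
cloudProvider: aws
awsRegion: us-east-1
autoDiscovery:
  clusterName: my-eks-cluster
image:
  repository: registry.k8s.io/autoscaling/cluster-autoscaler
  tag: v1.26.2
```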
What did you expect to happen?:
I was expecting cluster-autoscaler to trigger scale up of the cluster to try to spawn a new node in the zone matching the PV node-affinity.
Description: I deployed a statefulset which created a PV in the given zone (us-east-1b):
Name:            pvc-70216a96-7bba-4b3c-b6d3-8716f1c2c208
Labels:          topology.kubernetes.io/region=us-east-1
                 topology.kubernetes.io/zone=us-east-1b
Annotations:     pv.kubernetes.io/migrated-to: ebs.csi.aws.com
                 pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
                 volume.kubernetes.io/provisioner-deletion-secret-name:
                 volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers:      [kubernetes.io/pv-protection external-attacher/ebs-csi-aws-com]
StorageClass:    aws-ebs-gp2-0
Status:          Bound
Claim:           z9c5249e0-ze6867494/data-postgresqlz6c64e486-0
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        10Gi
Node Affinity:
  Required Terms:
    Term 0:      topology.kubernetes.io/zone in [us-east-1b]
                 topology.kubernetes.io/region in [us-east-1]
Message:
Source:
    Type:        AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:    vol-0d2e7a39a667d6bc0
    FSType:      ext4
    Partition:   0
    ReadOnly:    false
I scaled down the statefulset to 0 replicas to save resources on the cluster, and after a while all nodes belonging to the zone us-east-1b were out of the cluster.
When I increased the replica count of the statefulset back to 1, the pod failed to be scheduled, with the following errors:
0/6 nodes are available: 2 Insufficient cpu, 4 node(s) had volume node affinity conflict. preemption: 0/6 nodes are available: 2 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling.
pod didn't trigger scale-up: 1 node(s) had volume node affinity conflict
When looking at the cluster-autoscaler logs, I was seeing these messages in a loop:
I0731 12:20:48.284837 1 binder.go:803] "PersistentVolume and node mismatch for pod" PV="pvc-70216a96-7bba-4b3c-b6d3-8716f1c2c208" node="ip-10-1-2-86.ec2.internal" pod="z9c5249e0-ze6867494/postgresqlz6c64e486-0" err="no matching NodeSelectorTerms"
I0731 12:20:48.284914 1 binder.go:803] "PersistentVolume and node mismatch for pod" PV="pvc-70216a96-7bba-4b3c-b6d3-8716f1c2c208" node="ip-10-1-77-181.ec2.internal" pod="z9c5249e0-ze6867494/postgresqlz6c64e486-0" err="no matching NodeSelectorTerms"
I0731 12:20:48.284986 1 binder.go:803] "PersistentVolume and node mismatch for pod" PV="pvc-70216a96-7bba-4b3c-b6d3-8716f1c2c208" node="ip-10-1-10-14.ec2.internal" pod="z9c5249e0-ze6867494/postgresqlz6c64e486-0" err="no matching NodeSelectorTerms"
I0731 12:20:48.285052 1 binder.go:803] "PersistentVolume and node mismatch for pod" PV="pvc-70216a96-7bba-4b3c-b6d3-8716f1c2c208" node="ip-10-1-12-82.ec2.internal" pod="z9c5249e0-ze6867494/postgresqlz6c64e486-0" err="no matching NodeSelectorTerms"
I0731 12:20:48.285067 1 hinting_simulator.go:116] failed to find place for z9c5249e0-ze6867494/postgresqlz6c64e486-0: cannot put pod postgresqlz6c64e486-0 on any node
I0731 12:20:48.285077 1 filter_out_schedulable.go:120] 0 pods marked as unschedulable can be scheduled.
I0731 12:20:48.285092 1 filter_out_schedulable.go:83] No schedulable pods
I0731 12:20:48.285104 1 klogx.go:87] Pod z9c5249e0-ze6867494/postgresqlz6c64e486-0 is unschedulable
I0731 12:20:48.285119 1 scale_up.go:194] Upcoming 0 nodes
I0731 12:20:48.285250 1 binder.go:783] "Could not get a CSINode object for the node" node="template-node-for-eks-qovery-2023032216162151640000000f-fac3850d-b51f-6b02-9d44-c12ebf06670e-5256576258169125047" err="csinode.storage.k8s.io \"template-node-for-eks-qovery-2023032216162151640000000f-fac3850d-b51f-6b02-9d44-c12ebf06670e-5256576258169125047\" not found"
I0731 12:20:48.285287 1 binder.go:803] "PersistentVolume and node mismatch for pod" PV="pvc-70216a96-7bba-4b3c-b6d3-8716f1c2c208" node="template-node-for-eks-qovery-2023032216162151640000000f-fac3850d-b51f-6b02-9d44-c12ebf06670e-5256576258169125047" pod="z9c5249e0-ze6867494/postgresqlz6c64e486-0" err="no matching NodeSelectorTerms"
I0731 12:20:48.285305 1 scale_up.go:93] Pod postgresqlz6c64e486-0 can't be scheduled on eks-qovery-2023032216162151640000000f-fac3850d-b51f-6b02-9d44-c12ebf06670e, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
I0731 12:20:48.285328 1 scale_up.go:262] No pod can fit to eks-qovery-2023032216162151640000000f-fac3850d-b51f-6b02-9d44-c12ebf06670e
I0731 12:20:48.285342 1 scale_up.go:267] No expansion options
I was expecting the cluster-autoscaler to trigger a scale-up and spawn a node in the correct zone so that the pod could be scheduled.
What happened instead?:
The cluster-autoscaler did nothing instead of triggering a cluster scale-up.
How to reproduce it (as minimally and precisely as possible):
- Create a statefulset with 1 replica.
- Note which zone the PV is located in.
- Scale the statefulset down to 0.
- Remove all nodes of the cluster belonging to that zone.
- Scale the statefulset back up to 1 replica.
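For illustration, a minimal manifest sketch that reproduces the situation; the names are placeholders, while the storage class matches the one from the PV above:

```yaml
# Minimal reproduction sketch: a 1-replica StatefulSet with an EBS-backed PVC.
# Once the PV is provisioned in one zone, scale the StatefulSet to 0, let the
# autoscaler remove all nodes in that zone, then scale back to 1.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pv-zone-repro            # placeholder name
spec:
  serviceName: pv-zone-repro
  replicas: 1
  selector:
    matchLabels:
      app: pv-zone-repro
  template:
    metadata:
      labels:
        app: pv-zone-repro
    spec:
      containers:
        - name: app
          image: registry.k8s.io/pause:3.9   # any image works for the repro
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: aws-ebs-gp2-0      # storage class from the PV above
        resources:
          requests:
            storage: 10Gi
```

Scaling can then be done with kubectl scale statefulset pv-zone-repro --replicas=0 (and back to --replicas=1).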
Anything else we need to know?:
This issue appeared after a recent Kubernetes upgrade to 1.26; before that we were running cluster-autoscaler 1.23.0, which did not have the issue. I tried upgrading cluster-autoscaler to 1.27.3 and it showed the same problem. I then downgraded back to 1.23.0 and it worked as expected, so I suppose it is a regression introduced in newer versions. For reference, here are logs where the scale-up is triggered as expected:
I0731 12:44:16.257430 1 scale_up.go:468] Best option to resize: eks-qovery-2023032216162151640000000f-fac3850d-b51f-6b02-9d44-c12ebf06670e
I0731 12:44:16.257439 1 scale_up.go:472] Estimated 1 nodes needed in eks-qovery-2023032216162151640000000f-fac3850d-b51f-6b02-9d44-c12ebf06670e
I0731 12:44:16.257454 1 scale_up.go:595] Final scale-up plan: [{eks-qovery-2023032216162151640000000f-fac3850d-b51f-6b02-9d44-c12ebf06670e 8->9 (max: 14)}]
I0731 12:44:16.257462 1 scale_up.go:691] Scale-up: setting group eks-qovery-2023032216162151640000000f-fac3850d-b51f-6b02-9d44-c12ebf06670e size to 9
Just to keep the ticket updated: we just hit the issue with the autoscaler in version 1.23.0 as well, so the behavior seems to be unreliable at best.
I0818 14:08:41.167157 1 klogx.go:86] Pod z9c5249e0-ze6867494/postgresqlz6c64e486-0 is unschedulable
I0818 14:08:41.167199 1 scale_up.go:376] Upcoming 0 nodes
There is still capacity in the specified AZ: I tried to manually launch an EC2 instance in this specific zone and the machine came up, so it is not an issue with instance capacity.
@erebe Hi, are you using a multi-zone node group? Your issue seems to be similar to this.
I ran into the same issue (v1.26.4). Here are the logs:
I0905 16:04:56.135237 1 binder.go:783] "Could not get a CSINode object for the node" node="template-node-for-eks-hydra-prod-euc1-m6-large-a20230421175014009500000040-c0c3d278-1229-38ef-8fd6-13698cd3e934-5886348301106870267" err="csinode.storage.k8s.io \"template-node-for-eks-hydra-prod-euc1-m6-large-a20230421175014009500000040-c0c3d278-1229-38ef-8fd6-13698cd3e934-5886348301106870267\" not found"
I0905 16:04:56.135313 1 binder.go:803] "PersistentVolume and node mismatch for pod" PV="pvc-ddb417c2-2a75-49a3-8fb9-37bfe4e467f3" node="template-node-for-eks-hydra-prod-euc1-m6-large-a20230421175014009500000040-c0c3d278-1229-38ef-8fd6-13698cd3e934-5886348301106870267" pod="pg-pod/hy-fbe0fd95-2451-4ca3-8ee3-4397ab8fefaa-0" err="no matching NodeSelectorTerms"
I'm using multiple node groups, one node group per zone, and have the --balance-similar-node-groups flag enabled. Cluster autoscaler could scale a node group up from 0 so that a pod with a PV could run the first time. But after the pod was scaled down (waiting for the node group to be scaled down to 0) and then scaled back up, cluster autoscaler couldn't scale the matching node group back up due to the above-mentioned PV mismatch. Cluster autoscaler seems unaware that the pod has the PV and should scale the corresponding node group. Here are the reproduction steps (a sketch of the flag setup follows the steps below):
- Scale down the pod to 0
- Wait for the corresponding node group to be scaled down to 0 by cluster autoscaler
- Scale up the pod to 1
- Cluster autoscaler doesn't know which node group to scale and the pod is stuck at pending
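For context, a per-zone node group setup with --balance-similar-node-groups typically wires the flags into the cluster-autoscaler container roughly like this; the cluster name, image tag, and auto-discovery tags below are placeholders rather than the exact configuration described above:

```yaml
# Fragment of a cluster-autoscaler Deployment pod spec for AWS with one ASG per zone.
# "my-cluster" and the discovery tags are placeholders.
spec:
  containers:
    - name: cluster-autoscaler
      image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.4
      command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        - --balance-similar-node-groups=true
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
```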
@owenthereal As the FAQ said, you should use separate node groups per zone in this case.
@fgksgf:
@owenthereal As the FAQ said, you should use separate node groups per zone in this case.
I'm using separate node groups per zone (I updated my comment to clarify my setup)
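For reference, "separate node groups per zone" means a layout roughly like the following eksctl sketch; the cluster name, region, instance type, and sizes are placeholders:

```yaml
# One managed node group per AZ, as the CA FAQ recommends for zonal workloads.
# All names, the region, the instance type, and the sizes are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: eu-central-1
managedNodeGroups:
  - name: workers-euc1a
    availabilityZones: ["eu-central-1a"]
    instanceType: m6i.large
    minSize: 0
    maxSize: 10
  - name: workers-euc1b
    availabilityZones: ["eu-central-1b"]
    instanceType: m6i.large
    minSize: 0
    maxSize: 10
  - name: workers-euc1c
    availabilityZones: ["eu-central-1c"]
    instanceType: m6i.large
    minSize: 0
    maxSize: 10
```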
@erebe Hi, are you using a multi-zone node group? Your issue seems to be similar to this.
Sorry for the late response. You are right: in my case we are using a single nodegroup for all the zones. But we are a bit hesitant to change our setup, as @owenthereal is still seeing this behavior in the recommended topology.
I can confirm this behavior. We use v1.23.0. When a pod with a pre-bound PV is looking for a node from an ASG which scales from 0, even if it's part of the same AZ, it does not scale up. My theory would be that the hypothetical node created by cluster autoscaler to match does not have this label when scaling from 0 (see the tag sketch below): [topology.ebs.csi.aws.com/zone, Operator: In, Values: [eu-west-1a]]
Logs from the autoscaler for the failure (log verbosity -v=12):
Pod scale-up-az-1a-0 can't be scheduled on euw1-az1-xx, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
Match for Required node selector terms &NodeSelector{NodeSelectorTerms:[]NodeSelectorTerm{NodeSelectorTerm{MatchExpressions:[]NodeSelectorRequirement{NodeSelectorRequirement{Key:topology.ebs.csi.aws.com/zone,Operator:In,Values:[eu-west-1a],},},MatchFields:[]NodeSelectorRequirement{},},},}
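For the scale-from-0 case specifically, the cluster-autoscaler AWS provider documents node-template tags on the ASG so that the simulated template node carries labels the PV node affinity selects on. A sketch of the tags, with example zone values; whether this resolves the mismatch reported here is untested:

```yaml
# ASG tags (key: value) so the template node for a scale-from-0 group carries
# the zone labels that the PV's node affinity requires. Values are examples.
k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone: eu-west-1a
k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone: eu-west-1a
```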
I can confirm this behavior. We use v1.23.0. When a pod with a pre-bound PV is looking for a node from an ASG which scales from 0, even if it's part of the same AZ, it does not scale up.
Similar configuration and issue here. All nodes in the cluster are in the same AZ, so no multi-AZ. However, a StatefulSet with a PVC cannot trigger new nodes when scaling from 0.
We ran into this issue where a node group spanning multiple AZs cannot reliably place pods [in the same zone where the PV is provisioned]. A workaround we implemented is to create a preemptive pod deployment with a lower priority class.
The pods are scheduled into each AZ, creating a buffer that allows the stateful workload to be placed in any AZ. There is unused capacity, but this could be utilized by lower-priority workloads. This could also be extended with controller logic that dynamically scales the preemptive deployment (or its HPA) based on FailedScheduling events in the namespace.
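A rough sketch of that workaround; the priority value, replica count, image, and resource requests are placeholders:

```yaml
# Low-priority "buffer" pods spread across zones. A pending stateful pod with
# default (higher) priority can preempt one, freeing capacity in the right AZ.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: az-buffer
value: -10
globalDefault: false
description: "Placeholder pods that real workloads may preempt"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: az-buffer
spec:
  replicas: 3                              # roughly one per AZ
  selector:
    matchLabels:
      app: az-buffer
  template:
    metadata:
      labels:
        app: az-buffer
    spec:
      priorityClassName: az-buffer
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: az-buffer
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: 500m        # size requests to at least the stateful pod's requests
              memory: 512Mi
```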
Same issue here. We use K8s v1.29.4-eks-036c24b and CA v1.29.0.
We manually scaled a statefulset with PVCs down to 0 for several hours, and when we scaled it back up to 6 pods, CA was not able to launch new nodes to schedule them:
I0606 07:20:49.999865 1 binder.go:869] "Could not get a CSINode object for the node" node="template-node-for-eks-nodegroup-for-applications-graviton-4x-20221005-dec3252a-dbea-805a-139c-5e237de76d6a-5577249879759175710" err="csinode.storage.k8s.io \"template-node-for-eks-nodegroup-for-applications-xxxx-20221005-dec3252a-dbea-805a-139c-5e237de76d6a-5577249879759175710\" not found"
I0606 07:20:49.999913 1 binder.go:889] "PersistentVolume and node mismatch for pod" PV="pvc-cc80c313-29ff-46e8-941d-489ee4d9882f" node="template-node-for-eks-nodegroup-for-applications-xxxx-20221005-dec3252a-dbea-805a-139c-5e237de76d6a-5577249879759175710" pod="xxxx-yyyy-ads-processor/xxxx-yyyy-ads-processor-4" err="no matching NodeSelectorTerms"
Same here. A StatefulSet pod with a PV in an AZ other than the ones the existing nodes are in lies unscheduled. Since the FAQ describes the problem as being specific to StatefulSets and I need just one replica, I'll try to turn it into a Deployment. I contacted Azure Support when the AKS version was 1.26 and they said that it was being fixed and should work in a future version. Now I'm using 1.28.9 and the problem still persists. My StatefulSet uses a PVC created separately, not from the StatefulSet PVC template. But still.
I had this problem in an EKS cluster after upgrading EKS from 1.27 to 1.28 and then the autoscaler Helm chart from 9.29.1 to 9.34.0 (but the versions aren't important here).
I found out that a pod couldn't be scheduled because a gp3 PersistentVolume it needed had the topology.ebs.csi.aws.com/zone in [eu-west-1b] NodeSelector term, and I only had two nodes in my EKS cluster: one in the eu-west-1a Availability Zone and the other in eu-west-1c.
Manually changing the Desired size of the EKS Node group from 2 to 3 (from the AWS web control panel) solved the issue, because another node was spawned, and this new one was in the eu-west-1b AZ. So now the previously unschedulable pods could be scheduled.
After some time, I noticed that the autoscaler downscaled the Desired size back to 2, and the node in eu-west-1c was removed. I received no errors on the monitoring system, so this means that all the pods were migrated away from it automatically. Great!
This is just how I solved the issue in my specific case :) It took me some time (and calls with my teammates) to fully understand it, so I thought it was worth sharing it with you.
It seems like we have the same or a similar issue on Scaleway, with one node pool per zone.
Warning FailedScheduling 12s default-scheduler 0/7 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 Insufficient cpu, 3 node(s) had volume node affinity conflict. preemption: 0/7 nodes are available: 3 Preemption is not helpful for scheduling, 4 No preemption victims found for incoming pod.
Normal NotTriggerScaleUp 4m19s cluster-autoscaler pod didn't trigger scale-up: 2 node(s) had volume node affinity conflict, 1 node(s) didn't match pod anti-affinity rules
The only two nodes which are in the right zone do not have enough CPU. In my understanding the cluster-autoscaler should trigger a scale-up in this case and not decide against it. 😅
Adding the node by hand would properly fix the error, but it's not really a sustainable solution.