
predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=

Open imriss opened this issue 3 years ago • 9 comments

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 9.9.2 (via Helm chart)

What k8s version are you using (kubectl version)?:

kubectl version
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.6-eks-49a6c0", GitCommit:"49a6c0bf091506e7bafcdb1b142351b69363355a", GitTreeState:"clean", BuildDate:"2020-12-23T22:10:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

Cluster: EKS (k8s 1.19)
Original ASG's node type: r5.2xlarge
Other ASGs' node types: m5.2xlarge and m5.xlarge

What did you expect to happen?:

The evicted pod should have been scheduled onto another node, or a new node should have been spun up from one of the other ASGs so that the pod could be scheduled.

What happened instead?:

After a node from an ASG was drained and terminated, the evicted pod stayed in the Pending state indefinitely with the following error:

predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=

Note: The ASG that had hosted the pod on one of its nodes was at its maximum size and could not spin up new nodes, while the other ASGs could. I also tried increasing the number of nodes in the other ASGs, but the pod still could not be assigned to them and failed with the same error. The pod's CPU/RAM requests and limits are smaller than the capacity of every ASG's instance type.
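To see which constraint is failing, it helps to compare the pending pod's scheduling constraints against the labels on the remaining nodes (pod and namespace names below are placeholders):

% kubectl get pod <pending-pod> -n <namespace> -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.affinity.nodeAffinity}{"\n"}'
% kubectl get nodes --show-labels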

How to reproduce it (as minimally and precisely as possible):

A possible scenario to reproduce: cordon all of the nodes in an ASG, then drain one of them and terminate it while simultaneously reducing that ASG's max size by one.
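Roughly, a sketch of those steps with kubectl and the AWS CLI (the node, ASG, and instance identifiers are placeholders):

% kubectl cordon <node-in-asg>        # repeat for every node in the ASG
% kubectl drain <node-in-asg> --ignore-daemonsets
% aws autoscaling update-auto-scaling-group --auto-scaling-group-name <asg-name> --max-size <previous-max-minus-1>
% aws autoscaling terminate-instance-in-auto-scaling-group --instance-id <instance-id> --should-decrement-desired-capacity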

Anything else we need to know?:

imriss avatar Apr 30 '21 22:04 imriss

Seems related to #3802?

tsunamishaun avatar Jul 15 '21 18:07 tsunamishaun

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 14 '21 14:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 13 '22 14:01 k8s-triage-robot

any update?

imriss avatar Jan 13 '22 18:01 imriss

/remove-lifecycle rotten

imriss avatar Jan 13 '22 18:01 imriss

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 13 '22 19:04 k8s-triage-robot

/remove-lifecycle stale

imriss avatar Apr 13 '22 21:04 imriss

I saw this same error with one of our pods. The reason seemed to be that the pod had a nodeSelector for which I didn't have a corresponding tag hint on my ASG. Scheduling this pod was never a problem when an untainted node was already available; it only failed when scaling from zero.
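In case it helps others, a minimal sketch of what I mean by a tag hint, assuming a made-up label key workload-type (both the key and the value are placeholders):

    # nodeSelector on the pod spec
    nodeSelector:
      workload-type: batch

    # Matching tag on the ASG, so the autoscaler can simulate the label on a
    # node it has not created yet (scale from zero):
    #   Key:   k8s.io/cluster-autoscaler/node-template/label/workload-type
    #   Value: batch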

sidewinder12s avatar Jun 09 '22 00:06 sidewinder12s

This error can also occur when a node has already reached the maximum number of pods it can schedule (a limit set by the node's instance type) and you try to deploy a DaemonSet that is supposed to place a pod on that node.

For example, a t3.medium can schedule at most 17 pods.

Use the command below to check how many pods the current nodes can run:

% kubectl get nodes -o=custom-columns=NODE:.metadata.name,MAX_PODS:.status.allocatable.pods,CAPACITY_PODS:.status.capacity.pods,INSTANCE_TYPE:.metadata.labels."node\.kubernetes\.io/instance-type"

NODE                                            MAX_PODS   CAPACITY_PODS   INSTANCE_TYPE
ip-192-168-115-52.eu-west-1.compute.internal    17         17              t3.medium
ip-192-168-143-137.eu-west-1.compute.internal   17         17              t3.medium
ip-192-168-168-152.eu-west-1.compute.internal   17         17              t3.medium

Check pods running on these nodes:

% for node in $(kubectl get nodes | awk '{if (NR!=1) {print $1}}'); do echo ""; echo "Checking ${node}..."; kubectl describe node ${node} | grep "Non-terminated" ; done


Checking ip-192-168-115-52.eu-west-1.compute.internal...
Non-terminated Pods:          (11 in total)

Checking ip-192-168-143-137.eu-west-1.compute.internal...
Non-terminated Pods:          (17 in total)

Checking ip-192-168-168-152.eu-west-1.compute.internal...
Non-terminated Pods:          (15 in total)

The DaemonSet below couldn't schedule a pod on the node ip-192-168-143-137.eu-west-1.compute.internal:

% kubectl --namespace=kube-system get pods -l "app=csi-secrets-store-provider-aws" -o wide                                                                 
NAME                                   READY   STATUS    RESTARTS   AGE   IP                NODE                                            NOMINATED NODE   READINESS GATES
csi-secrets-store-provider-aws-7rmxz   1/1     Running   0          27m   192.168.115.52    ip-192-168-115-52.eu-west-1.compute.internal    <none>           <none>
csi-secrets-store-provider-aws-gstxp   0/1     Pending   0          28m   <none>            <none>                                          <none>           <none>
csi-secrets-store-provider-aws-sthzx   1/1     Running   0          28m   192.168.168.152   ip-192-168-168-152.eu-west-1.compute.internal   <none>           <none>

Since each pod is assigned its own IP address, the number of IP addresses supported by an instance type is a factor in how many pods can run on that instance. See the Amazon EKS recommended maximum pods for each Amazon EC2 instance type.

A way around this is to change the instance type to one that can run more pods.
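If I remember the EKS/VPC CNI math correctly, the default limit is ENIs * (IPv4 addresses per ENI - 1) + 2, and the inputs can be pulled from the AWS CLI (the instance type here is just an example):

% aws ec2 describe-instance-types --instance-types t3.medium --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]' --output text
# t3.medium: 3 ENIs, 6 IPv4 addresses per ENI  ->  3 * (6 - 1) + 2 = 17 pods, matching MAX_PODS above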

berry2012 avatar Jul 22 '22 16:07 berry2012

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 20 '22 16:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 19 '22 17:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Dec 19 '22 17:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 19 '22 17:12 k8s-ci-robot

I'm running into this same issue. Is there a way for the cluster-autoscaler to say exhaustively which things did not match? It's very hard to determine why a match isn't being made from this error message.

malcolmgreaves avatar Dec 22 '22 23:12 malcolmgreaves

Same problem here: the nodeSelector matches, but the event says it doesn't.

mariorossi77 avatar Jan 26 '23 11:01 mariorossi77

I got the same error here. I have tagged the node group with k8s.io/cluster-autoscaler/node-template/label: something, but when using a nodeSelector on the deployment it kept saying node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector. I'm not sure why. I'm using an EKS node group and scaling from 0.

thi-peony avatar Apr 03 '23 04:04 thi-peony

This is exactly what we needed to get cluster-autoscaler to scale a node group with a taint from zero:

  • matched the cluster-autoscaler version to the cluster's Kubernetes version
  • added tags to the node group that propagate down to the ASG and its instances
  • matched nodeSelector, tolerations, and affinity on the Deployment object
Node Group labels                                          Values
k8s.io/cluster-autoscaler/enabled                          true
k8s.io/cluster-autoscaler/<cluster_name>                   owned
k8s.io/cluster-autoscaler/node-template/label/app          highmem
k8s.io/cluster-autoscaler/node-template/taint/dedicated    highmem:NoSchedule
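For anyone applying these by hand, a rough AWS CLI sketch for tagging the backing ASG (the ASG name is a placeholder; PropagateAtLaunch makes new instances inherit the tags):

% aws autoscaling create-or-update-tags --tags \
    "ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/app,Value=highmem,PropagateAtLaunch=true" \
    "ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/dedicated,Value=highmem:NoSchedule,PropagateAtLaunch=true"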

Deployment Object Attributes:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - highmem
      nodeSelector:
        app: highmem
      tolerations:
        - key: dedicated
          operator: Equal
          value: highmem
          effect: NoSchedule

Cluster Autoscaler Auto Discovery Setup
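For completeness, a minimal sketch of the auto-discovery side via the Helm chart, assuming the chart's autoDiscovery.clusterName value and the k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster_name> tags above (cluster name and region are placeholders):

% helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
    --namespace kube-system \
    --set autoDiscovery.clusterName=<cluster_name> \
    --set awsRegion=<aws-region>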

rljohnsn avatar Jul 20 '23 17:07 rljohnsn