
predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=

Open imriss opened this issue 3 years ago • 9 comments

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version: 9.9.2 (via Helm chart)

What k8s version are you using (kubectl version)?:

kubectl version
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.6-eks-49a6c0", GitCommit:"49a6c0bf091506e7bafcdb1b142351b69363355a", GitTreeState:"clean", BuildDate:"2020-12-23T22:10:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

Cluster: EKS (k8s 1.19)
Original ASG's node type: r5.2xlarge
Other ASGs' node types: m5.2xlarge and m5.xlarge

What did you expect to happen?:

The evicted pod should have been scheduled onto another node, or a new node should have been spun up from one of the other ASGs so that the pod could be scheduled.

What happened instead?:

After a node from an ASG was drained and terminated, the evicted pod stayed in the Pending state indefinitely with the following error:

predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=

Note: The ASG that had hosted the pod on one of its nodes was at its maximum size and could not spin up new nodes, while the other ASGs could. I also tried increasing the number of nodes in the other ASGs, but the pod still could not be assigned to them and failed with the same error. The pod's CPU/RAM requests and limits are smaller than the capacity of every ASG's instance type.
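To see which constraint is failing, it helps to compare the pending pod's scheduling constraints against the labels on the remaining nodes (pod and namespace names below are placeholders):

% kubectl get pod <pending-pod> -n <namespace> -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.affinity.nodeAffinity}{"\n"}'
% kubectl get nodes --show-labels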

How to reproduce it (as minimally and precisely as possible):

A possible scenario to reproduce: cordon all of the nodes in an ASG, then drain one of them and terminate it while simultaneously reducing that ASG's max size by one.
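Roughly, a sketch of those steps with kubectl and the AWS CLI (the node, ASG, and instance identifiers are placeholders):

% kubectl cordon <node-in-asg>        # repeat for every node in the ASG
% kubectl drain <node-in-asg> --ignore-daemonsets
% aws autoscaling update-auto-scaling-group --auto-scaling-group-name <asg-name> --max-size <previous-max-minus-1>
% aws autoscaling terminate-instance-in-auto-scaling-group --instance-id <instance-id> --should-decrement-desired-capacity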

Anything else we need to know?:

imriss avatar Apr 30 '21 22:04 imriss

Seems related to #3802?

tsunamishaun avatar Jul 15 '21 18:07 tsunamishaun

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 14 '21 14:12 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jan 13 '22 14:01 k8s-triage-robot

any update?

imriss avatar Jan 13 '22 18:01 imriss

/remove-lifecycle rotten

imriss avatar Jan 13 '22 18:01 imriss

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 13 '22 19:04 k8s-triage-robot

/remove-lifecycle stale

imriss avatar Apr 13 '22 21:04 imriss

I saw this same error with one of our pods. The reason seemed to be that the pod had a nodeSelector for which I didn't have a corresponding tag hint on my ASG. Scheduling this pod was never a problem when an untainted node was already available; it only failed when scaling from zero.
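In case it helps others, a minimal sketch of what I mean by a tag hint, assuming a made-up label key workload-type (both the key and the value are placeholders):

    # nodeSelector on the pod spec
    nodeSelector:
      workload-type: batch

    # Matching tag on the ASG, so the autoscaler can simulate the label on a
    # node it has not created yet (scale from zero):
    #   Key:   k8s.io/cluster-autoscaler/node-template/label/workload-type
    #   Value: batch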

sidewinder12s avatar Jun 09 '22 00:06 sidewinder12s

This error can also occur when a node has already reached the maximum number of pods it can schedule (a limit set by the node's instance type) and you try to deploy a DaemonSet that is supposed to place a pod on that node.

For example, a t3.medium can schedule at most 17 pods.

Use the command below to check how many pods the current nodes can run:

% kubectl get nodes -o=custom-columns=NODE:.metadata.name,MAX_PODS:.status.allocatable.pods,CAPACITY_PODS:.status.capacity.pods,INSTANCE_TYPE:.metadata.labels."node\.kubernetes\.io/instance-type"

NODE                                            MAX_PODS   CAPACITY_PODS   INSTANCE_TYPE
ip-192-168-115-52.eu-west-1.compute.internal    17         17              t3.medium
ip-192-168-143-137.eu-west-1.compute.internal   17         17              t3.medium
ip-192-168-168-152.eu-west-1.compute.internal   17         17              t3.medium

Check pods running on these nodes:

% for node in $(kubectl get nodes | awk '{if (NR!=1) {print $1}}'); do echo ""; echo "Checking ${node}..."; kubectl describe node ${node} | grep "Non-terminated" ; done


Checking ip-192-168-115-52.eu-west-1.compute.internal...
Non-terminated Pods:          (11 in total)

Checking ip-192-168-143-137.eu-west-1.compute.internal...
Non-terminated Pods:          (17 in total)

Checking ip-192-168-168-152.eu-west-1.compute.internal...
Non-terminated Pods:          (15 in total)

The DaemonSet below couldn't schedule a pod on the node ip-192-168-143-137.eu-west-1.compute.internal:

% kubectl --namespace=kube-system get pods -l "app=csi-secrets-store-provider-aws" -o wide                                                                 
NAME                                   READY   STATUS    RESTARTS   AGE   IP                NODE                                            NOMINATED NODE   READINESS GATES
csi-secrets-store-provider-aws-7rmxz   1/1     Running   0          27m   192.168.115.52    ip-192-168-115-52.eu-west-1.compute.internal    <none>           <none>
csi-secrets-store-provider-aws-gstxp   0/1     Pending   0          28m   <none>            <none>                                          <none>           <none>
csi-secrets-store-provider-aws-sthzx   1/1     Running   0          28m   192.168.168.152   ip-192-168-168-152.eu-west-1.compute.internal   <none>           <none>

Since each pod is assigned its own IP address, the number of IP addresses supported by an instance type is a factor in how many pods can run on that instance. See the Amazon EKS recommended maximum pods for each Amazon EC2 instance type.

A way around this is to change the instance type to one that can run more pods.
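If I remember the EKS/VPC CNI math correctly, the default limit is ENIs * (IPv4 addresses per ENI - 1) + 2, and the inputs can be pulled from the AWS CLI (the instance type here is just an example):

% aws ec2 describe-instance-types --instance-types t3.medium --query 'InstanceTypes[0].NetworkInfo.[MaximumNetworkInterfaces,Ipv4AddressesPerInterface]' --output text
# t3.medium: 3 ENIs, 6 IPv4 addresses per ENI  ->  3 * (6 - 1) + 2 = 17 pods, matching MAX_PODS above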

berry2012 avatar Jul 22 '22 16:07 berry2012

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Oct 20 '22 16:10 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Nov 19 '22 17:11 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Dec 19 '22 17:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Dec 19 '22 17:12 k8s-ci-robot

I'm running into this same issue. Is there a way for the cluster-autoscaler to say exhaustively which things did not match? It's very hard to determine why a match isn't being made from this error message.

malcolmgreaves avatar Dec 22 '22 23:12 malcolmgreaves

Same problem here: the nodeSelector matches, but the event says it doesn't.

mariorossi77 avatar Jan 26 '23 11:01 mariorossi77

I got the same error here. I have tagged the node group with k8s.io/cluster-autoscaler/node-template/label: something, but when using a nodeSelector on the deployment it kept saying node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector. I'm not sure why. I'm using an EKS node group and scaling from 0.

thi-peony avatar Apr 03 '23 04:04 thi-peony

This is exactly what we needed to get cluster-autoscaler to scale a node group with a taint from zero:

  • matched the cluster-autoscaler version to the cluster's Kubernetes version
  • added tags to the node group that propagate down to the ASG and its instances
  • matched nodeSelector, tolerations, and affinity on the Deployment object
Node Group labels                                          Values
k8s.io/cluster-autoscaler/enabled                          true
k8s.io/cluster-autoscaler/<cluster_name>                   owned
k8s.io/cluster-autoscaler/node-template/label/app          highmem
k8s.io/cluster-autoscaler/node-template/taint/dedicated    highmem:NoSchedule
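For anyone applying these by hand, a rough AWS CLI sketch for tagging the backing ASG (the ASG name is a placeholder; PropagateAtLaunch makes new instances inherit the tags):

% aws autoscaling create-or-update-tags --tags \
    "ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/app,Value=highmem,PropagateAtLaunch=true" \
    "ResourceId=<asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/taint/dedicated,Value=highmem:NoSchedule,PropagateAtLaunch=true"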

Deployment Object Attributes:

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - highmem
      nodeSelector:
        app: highmem
      tolerations:
        - key: dedicated
          operator: Equal
          value: highmem
          effect: NoSchedule

Cluster Autoscaler Auto Discovery Setup
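For completeness, a minimal sketch of the auto-discovery side via the Helm chart, assuming the chart's autoDiscovery.clusterName value and the k8s.io/cluster-autoscaler/enabled and k8s.io/cluster-autoscaler/<cluster_name> tags above (cluster name and region are placeholders):

% helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
    --namespace kube-system \
    --set autoDiscovery.clusterName=<cluster_name> \
    --set awsRegion=<aws-region>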

rljohnsn avatar Jul 20 '23 17:07 rljohnsn