predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: 9.9.2 (via Helm chart)
What k8s version are you using (kubectl version)?:
kubectl version
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.6-eks-49a6c0", GitCommit:"49a6c0bf091506e7bafcdb1b142351b69363355a", GitTreeState:"clean", BuildDate:"2020-12-23T22:10:21Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
What environment is this in?:
Cluster: EKS (k8s 1.19)
Original ASG's node type: r5.2xlarge
Other ASGs' node types: m5.2xlarge and m5.xlarge
What did you expect to happen?:
The evicted pod should be allocated to another node, or a new node from one of the other ASGs should be added so that the pod can be scheduled.
What happened instead?:
After a node from an ASG was drained and terminated, the evicted pod stayed in the Pending state indefinitely with the following error:
predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=
Note: The ASG that had hosted the pod on one of its nodes was at its maximum and could not spin up new nodes, while the other ASGs could. I also tried increasing the number of nodes in the other ASGs, but the pod still could not be assigned to them and failed with the same error. The pod's CPU/RAM requests and limits are smaller than the allocatable resources of the nodes in all ASGs.
How to reproduce it (as minimally and precisely as possible):
Possible scenario to reproduce: have all the nodes of an ASG cordoned, drain one of them, and then terminate it while reducing the max size of that ASG by one at the same time (a sketch of these steps follows).
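A rough sketch of those steps, assuming an EKS cluster, a configured AWS CLI, and hypothetical node, ASG, and instance names:
# cordon every node in the ASG
kubectl cordon ip-10-0-1-10.eu-west-1.compute.internal
kubectl cordon ip-10-0-1-11.eu-west-1.compute.internal
# drain one of the cordoned nodes
kubectl drain ip-10-0-1-10.eu-west-1.compute.internal --ignore-daemonsets
# lower the ASG's max size by one and terminate the drained node's instance at roughly the same time
aws autoscaling update-auto-scaling-group --auto-scaling-group-name my-asg --max-size 2
aws autoscaling terminate-instance-in-auto-scaling-group --instance-id i-0123456789abcdef0 --should-decrement-desired-capacity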
Anything else we need to know?:
Seems related to #3802?
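To see which constraint actually fails, it can help to dump the pending pod's nodeSelector/affinity and compare it with the labels on the candidate nodes; a minimal sketch, using a hypothetical pod name:
# print the pending pod's scheduling constraints, then list node labels to compare against
kubectl get pod my-pending-pod -o jsonpath='{.spec.nodeSelector}{"\n"}{.spec.affinity.nodeAffinity}{"\n"}'
kubectl get nodes --show-labels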
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/lifecycle rotten
any update?
/remove-lifecycle rotten
/lifecycle stale
/remove-lifecycle stale
I saw this same error with one of our pods. The reason seemed to be that the pod had a nodeSelector for which I didn't have a tag hint on my ASG. I never had scheduling issues with this pod when a node without taints was available; it only failed on scale from zero.
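For scale from zero, the cluster-autoscaler can only infer a node group's labels from the node-template tags on the ASG, so the nodeSelector key has to be mirrored as such a tag; a sketch with a hypothetical ASG name and label:
# tag the ASG so the autoscaler knows new nodes will carry the label workload-type=batch
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-node-group-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/node-template/label/workload-type,Value=batch,PropagateAtLaunch=true"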
This error can also occur when a node has already reached the maximum number of pods it can run for its instance type and you try to deploy a DaemonSet that is supposed to place a pod on that node.
For example, a t3.medium can run at most 17 pods.
The command below shows how many pods the current nodes can run:
% kubectl get nodes -o=custom-columns=NODE:.metadata.name,MAX_PODS:.status.allocatable.pods,CAPACITY_PODS:.status.capacity.pods,INSTANCE_TYPE:.metadata.labels."node\.kubernetes\.io/instance-type"
NODE MAX_PODS CAPACITY_PODS INSTANCE_TYPE
ip-192-168-115-52.eu-west-1.compute.internal 17 17 t3.medium
ip-192-168-143-137.eu-west-1.compute.internal 17 17 t3.medium
ip-192-168-168-152.eu-west-1.compute.internal 17 17 t3.medium
Check pods running on these nodes:
% for node in $(kubectl get nodes | awk '{if (NR!=1) {print $1}}'); do echo ""; echo "Checking ${node}..."; kubectl describe node ${node} | grep "Non-terminated"; done
Checking ip-192-168-115-52.eu-west-1.compute.internal...
Non-terminated Pods: (11 in total)
Checking ip-192-168-143-137.eu-west-1.compute.internal...
Non-terminated Pods: (17 in total)
Checking ip-192-168-168-152.eu-west-1.compute.internal...
Non-terminated Pods: (15 in total)
The DaemonSet below couldn't schedule a pod on node ip-192-168-143-137.eu-west-1.compute.internal:
% kubectl --namespace=kube-system get pods -l "app=csi-secrets-store-provider-aws" -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-secrets-store-provider-aws-7rmxz 1/1 Running 0 27m 192.168.115.52 ip-192-168-115-52.eu-west-1.compute.internal <none> <none>
csi-secrets-store-provider-aws-gstxp 0/1 Pending 0 28m <none> <none> <none> <none>
csi-secrets-store-provider-aws-sthzx 1/1 Running 0 28m 192.168.168.152 ip-192-168-168-152.eu-west-1.compute.internal <none> <none>
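To see the scheduler's stated reason for the Pending pod, describing it and reading its events is usually enough; a sketch using the pod name from the listing above:
% kubectl --namespace=kube-system describe pod csi-secrets-store-provider-aws-gstxp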
Since each pod is assigned its own IP address, the number of IP addresses supported by an instance type is a factor in how many pods can run on the instance. Check out the Amazon EKS recommended maximum pods for each Amazon EC2 instance type.
A way around this is to change the instance type to one that can run more pods.
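For reference, the usual calculation behind those limits (assuming the default AWS VPC CNI without prefix delegation) is:
max pods = (number of ENIs × (IPv4 addresses per ENI − 1)) + 2
For t3.medium that is 3 × (6 − 1) + 2 = 17, which matches the allocatable pod count shown above.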
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
I'm running into this same issue. Is there a way for the cluster-autoscaler to say exhaustively which things did not match? It's very hard to determine why a match isn't being made from this error message.
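One partial workaround: the autoscaler's own logs report the predicate failure for each node group it considered during the scale-up simulation, and raising the log verbosity (for example --v=4) adds more detail; a sketch, assuming a hypothetical deployment name in kube-system:
kubectl -n kube-system logs deploy/cluster-autoscaler | grep -i "predicate checking error"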
Same problem: the nodeSelector matches, but the event says it doesn't.
I got the same error here. I have tagged the node group with k8s.io/cluster-autoscaler/node-template/label: something, but when using the node selector on the deployment it kept saying node(s) didn't match Pod's node affinity/selector; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity/selector;
I'm not sure why. I'm using an EKS node group and scaling from 0.
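When this happens on scale from 0, it is worth double-checking that the tag key includes the full label name after the node-template/label/ prefix and that it exactly matches the deployment's nodeSelector key; also note that tags set on an EKS managed node group are not necessarily propagated to the underlying ASG, so the ASG itself may need the tag. A way to inspect what the autoscaler will actually see, assuming a hypothetical ASG name:
# compare the ASG's node-template tags with the labels on existing nodes
aws autoscaling describe-tags --filters "Name=auto-scaling-group,Values=my-node-group-asg" --query "Tags[].{Key:Key,Value:Value}"
kubectl get nodes --show-labels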
This is exactly what we needed to get the cluster-autoscaler scaling a tainted node group from zero:
- matched the cluster-autoscaler version to the cluster's k8s version
- added tags to the node group that propagate down to the ASG and its instances
- added a matching nodeSelector, tolerations, and affinity on the deployment object
| Node Group labels | Values |
| --- | --- |
| k8s.io/cluster-autoscaler/enabled | true |
| k8s.io/cluster-autoscaler/<cluster_name> | owned |
| k8s.io/cluster-autoscaler/node-template/label/app | highmem |
| k8s.io/cluster-autoscaler/node-template/taint/dedicated | highmem:NoSchedule |
Deployment Object Attributes:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: app
              operator: In
              values:
                - highmem
nodeSelector:
  app: highmem
tolerations:
  - key: dedicated
    operator: Equal
    value: highmem
    effect: NoSchedule
Cluster Autoscaler Auto Discovery Setup
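For completeness, the auto-discovery side of that setup is just pointing the cluster-autoscaler at the tags from the table above; a minimal sketch of the container command, with a placeholder cluster name:
# discover node groups (ASGs) by the enabled/owned tags shown in the table
./cluster-autoscaler \
  --cloud-provider=aws \
  --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster_name>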