karpenter-provider-aws
Kubelet stopped posting node status - NotReady Nodes
Description
Observed Behavior:
- Nodes are running and healthy
- Suddenly the pods of a node go into Terminating state
- Checking the node shows it is in NotReady state
- Description of the node shows the following:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Tue, 17 Sep 2024 16:15:10 -0500 Tue, 17 Sep 2024 16:18:13 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Tue, 17 Sep 2024 16:15:10 -0500 Tue, 17 Sep 2024 16:18:13 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Tue, 17 Sep 2024 16:15:10 -0500 Tue, 17 Sep 2024 16:18:13 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Tue, 17 Sep 2024 16:15:10 -0500 Tue, 17 Sep 2024 16:18:13 -0500 NodeStatusUnknown Kubelet stopped posting node status.
- Events
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Unconsolidatable 60m (x60 over 21h) karpenter Can't remove without creating 2 candidates
Normal DisruptionBlocked 45m (x5 over 6h48m) karpenter Cannot disrupt Node: pdb "kube-system/coredns" prevents pod evictions
Normal DisruptionBlocked 43m (x2 over 47m) karpenter Cannot disrupt Node: pdb "istio-system/istiod" prevents pod evictions
Warning ContainerGCFailed 41m (x7 over 47m) kubelet failed to read podLogsRootDirectory "/var/log/pods": open /var/log/pods: too many open files
Warning ImageGCFailed 40m kubelet get filesystem info: Failed to get the info of the filesystem with mountpoint: cannot find filesystem info for device "/dev/nvme2n1p1"
Normal DisruptionBlocked 40m karpenter Cannot disrupt Node: NodePool "generic" not found
Normal NodeNotReady 39m node-controller Node ip-192-168-148-76.eu-central-1.compute.internal status is now: NodeNotReady
- Another kind of events (from another node suffering the same issue)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Unconsolidatable 27m (x116 over 35h) karpenter SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
Warning ContainerGCFailed 17m kubelet failed to read podLogsRootDirectory "/var/log/pods": open /var/log/pods: too many open files
Normal NodeNotReady 16m (x2 over 2d9h) node-controller Node ip-192-168-145-111.eu-central-1.compute.internal status is now: NodeNotReady
Normal DisruptionBlocked 13m (x3 over 17m) karpenter Cannot disrupt Node: pdb "knative-eventing/eventing-webhook" prevents pod evictions
- It seems the node-controller sets the node to NotReady, blocking (messing with) the disruption logic from Karpenter.
- The node stays in that state forever (it never heals itself)
- Pods on the node are not able to terminate (they stay in Terminating state forever)
- The only way to heal the node is to use the AWS CLI to reboot the EC2 instance
- 6 EKS clusters show the same behavior
- I have seen clusters with 6 out of 12 nodes in that state
- This behavior started over the weekend; last week we did not see it
Expected Behavior:
- Nodes are running and healthy
- Karpenter disrupts a node following node disruption budgets
- Pods are moved into new nodes
- Node is deleted
- Pods are healthy in new nodes
Reproduction Steps (Please include YAML):
- No human intervention is needed to start observing the behavior
- NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "7878014288654516958"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-04-15T17:10:56Z"
  generation: 3
  labels:
    kustomize.toolkit.fluxcd.io/name: karpenter-configs
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: generic
  resourceVersion: "1328500814"
  uid: 1a4e3c12-4abb-4085-9698-8a8f791578f6
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: 1000
  template:
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: generic
      requirements:
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - t3
        - t3a
        - m5
        - m5a
        - m6a
        - m6i
      - key: karpenter.k8s.aws/instance-size
        operator: In
        values:
        - large
        - xlarge
        - 2xlarge
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-central-1a
        - eu-central-1b
        - eu-central-1c
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
- EC2NodeClass
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "10715260476625989988"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
  creationTimestamp: "2024-04-15T17:10:55Z"
  finalizers:
  - karpenter.k8s.aws/termination
  generation: 6
  labels:
    kustomize.toolkit.fluxcd.io/name: karpenter-configs
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: generic
  resourceVersion: "1328580165"
  uid: 1e658352-e2cb-4e33-ac47-85a27acf3104
spec:
  amiSelectorTerms:
  - alias: bottlerocket@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      volumeSize: 4Gi
      volumeType: gp3
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      volumeSize: 60Gi
      volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: KarpenterNodeRole-prod
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: prod
  subnetSelectorTerms:
  - tags:
      Name: '*Private*'
      karpenter.sh/discovery: prod
  tags:
    nodepool: generic
    purpose: prod
- The same behavior is observed with other NodePools/EC2NodeClasses using different instance types, including GPU instances.
Versions:
- Chart Version: 1.0.1
- Kubernetes Version (kubectl version): 1.29
It is observed that managed nodes (we have three of them in each cluster) use Bottlerocket 1.21:
OS Image: Bottlerocket OS 1.21.1 (aws-k8s-1.29)
However, it seems Karpenter nodes are now using Bottlerocket 1.22:
OS Image: Bottlerocket OS 1.22.0 (aws-k8s-1.29)
And sometimes the nodes that are in NotReady state report Unknown:
OS Image: Unknown
- I wonder if this could be a Bottlerocket issue or a mix of Bottlerocket and Karpenter.
A temporary solution is to pin the Bottlerocket version to 1.21.1.
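For anyone who wants to apply that pin, a minimal EC2NodeClass sketch, assuming the versioned-alias form of amiSelectorTerms in Karpenter v1 (double-check the exact version string against the AMI selection docs for your chart version):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: generic
spec:
  amiSelectorTerms:
  # Pin to a specific Bottlerocket release instead of tracking @latest,
  # so new nodes stop landing on newer (possibly regressed) AMIs.
  - alias: bottlerocket@v1.21.1
  # ...role, selectors, and blockDeviceMappings unchanged from the nodeclass above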
Hey, I've got the same issue, but with:
- Karpenter Chart Version: 1.1.0
- Kubernetes Version (kubectl version): 1.30, specifically v1.30.4-eks-a737599
- OS Image: Amazon Linux 2
@andrescaroc Are you still running into this issue? Have any clarity on the cause? I just ran into it across all of my clusters.
Hello. We are facing the exact same issue.
- Karpenter Chart Version: 1.2.1
- EKS: 1.31.4
- OS Image: Amazon Linux 2023
Either I need to manually delete the pods (forcefully) or the nodeclaim/node (forcefully).
I even tried setting terminationGracePeriod: 30s as a workaround, but no results.
Mostly we find this issue on on-demand instances (obviously).
I started encountering this issue after upgrading Karpenter to 1.2.1 (I have been using Karpenter since alpha).
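For reference, the terminationGracePeriod I tried is set on the NodePool template; a minimal sketch assuming the v1 spec.template.spec.terminationGracePeriod field (names are placeholders, and as noted it did not help because kubelet itself was unresponsive):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default   # placeholder name
spec:
  template:
    spec:
      # Upper bound Karpenter waits before force-terminating a drained node.
      # It did not help here: kubelet was already down, so pods never left Terminating.
      terminationGracePeriod: 30s
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default   # placeholder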
Hi, we see the same issue in our clusters. This is seen for both Bottlerocket and AL2 images.
- Karpenter version: 1.0.5
- EKS version: v1.30.8
- OS Image: Amazon Linux 2/ bottlerocket
Same here:
We have seen the problem in the following combos:
| Karpenter | EKS | Bottlerocket |
|---|---|---|
| 1.2.1 | 1.31 | 1.30 |
| 1.2.1 | 1.31 | 1.31 |
| 1.2.1 | 1.32 | 1.32 |
Basically, as long as we stick to EKS 1.31 we are so far able to run stably with Bottlerocket 1.29; however, to be able to bump to EKS 1.32 we will have to upgrade Bottlerocket.
Desperately looking for a solution!
One thing we found out is the following:
failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config
@jammerful Yes, I still have this issue. I see the problem in the following setups:
| Karpenter | EKS | Bottlerocket |
|---|---|---|
| 1.0.1 | 1.29 | 1.22.0 |
| 1.0.1 | 1.31 | 1.27.1 |
Can anyone provide instance IDs and node names for particular nodes? Then we can take a closer look.
> Can anyone provide instance IDs and node names for particular nodes? Then we can take a closer look.
@Sparksssj
id: aws:///eu-central-1b/i-011e10919f8e6d533
name: ip-192-168-179-209.eu-central-1.compute.internal
> Can anyone provide instance IDs and node names for particular nodes? Then we can take a closer look.
@Sparksssj
ids: i-08ee3489048ea9781, i-0e4d1d9ae05437854 + many more
names: ip-172-31-64-156.ap-south-1.compute.internal, ip-172-31-116-107.ap-south-1.compute.internal + many more
ami id: ami-0dce739b024d12140
I have been experiencing the exact same issue with Bottlerocket nodes. Error message: "Kubelet stopped posting node status"; the node stays in 'NotReady' state forever, leaving pods on it in 'Terminating' state.
- Karpenter version - 1.0.3
- EKS version - 1.30
- Bottlerocket OS version - latest
- Node Type - On-demand
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Sun, 16 Feb 2025 01:40:57 -0500 Sun, 16 Feb 2025 01:43:02 -0500 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Sun, 16 Feb 2025 01:40:57 -0500 Sun, 16 Feb 2025 01:43:02 -0500 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Sun, 16 Feb 2025 01:40:57 -0500 Sun, 16 Feb 2025 01:43:02 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Sun, 16 Feb 2025 01:40:57 -0500 Sun, 16 Feb 2025 01:43:02 -0500 NodeStatusUnknown Kubelet stopped posting node status.
Observed the same issue.
Instance id: i-004e3c50711039056 (ip-10-227-233-253.eu-central-1.compute.internal)
AMI id: ami-06c1dd94700c2647e
- Karpenter: 0.37.6
- EKS version: 1.30
- Bottlerocket: 1.32.0
- On-demand nodes
I notice @andrescaroc's NodePool excludes these smaller instance types; however, on our clusters (see my colleague @marcofranssen's comment for specifics, but we are on the latest Karpenter/EKS/Bottlerocket versions) I can reproduce this on arm64 medium (8Gi) instance types. We do not observe this issue at all on nodes with >8Gi memory, but that could be a red herring of course.
All of our workloads have memory requests/limits set; however, when Karpenter tries to schedule a couple of pods with moderate memory requests that would seem to fit comfortably, the node completely falls over shortly after startup:
[ 35.112557] Memory cgroup out of memory: Killed process 2813 (runc:[2:INIT]) total-vm:1539692kB, anon-rss:2704kB, file-rss:1196kB, shmem-rss:5616kB, UID:0 pgtables:140kB oom_score_adj:-997
[ 60.172244] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [systemd:1]
[ 88.172091] watchdog: BUG: soft lockup - CPU#0 stuck for 53s! [systemd:1]
[ 94.122058] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 94.122543] rcu: All QSes seen, last rcu_sched kthread activity 5902 (4294946691-4294940789), jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 94.123471] rcu: rcu_sched kthread starved for 5902 jiffies! g10881 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 94.124245] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 94.124934] rcu: RCU grace-period kthread stack dump:
[ 94.125356] rcu: Stack dump where RCU GP kthread last ran:
[ 120.171915] watchdog: BUG: soft lockup - CPU#0 stuck for 82s! [systemd:1]
[ 148.171762] watchdog: BUG: soft lockup - CPU#0 stuck for 108s! [systemd:1]
[ 176.171609] watchdog: BUG: soft lockup - CPU#0 stuck for 135s! [systemd:1]
[ 204.171455] watchdog: BUG: soft lockup - CPU#0 stuck for 161s! [systemd:1]
[ 232.171302] watchdog: BUG: soft lockup - CPU#0 stuck for 187s! [systemd:1]
[ 260.171148] watchdog: BUG: soft lockup - CPU#0 stuck for 213s! [systemd:1]
[ 271.171089] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 271.171573] rcu: All QSes seen, last rcu_sched kthread activity 23607 (4294964396-4294940789), jiffies_till_next_fqs=1, root ->qsmask 0x0
[ 271.172502] rcu: rcu_sched kthread starved for 23607 jiffies! g10881 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 271.173279] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 271.173966] rcu: RCU grace-period kthread stack dump:
[ 271.174387] rcu: Stack dump where RCU GP kthread last ran:
[ 296.170951] watchdog: BUG: soft lockup - CPU#0 stuck for 246s! [systemd:1]
[ 324.170798] watchdog: BUG: soft lockup - CPU#0 stuck for 272s! [systemd:1]
[ 352.170644] watchdog: BUG: soft lockup - CPU#0 stuck for 298s! [systemd:1]
At this point, due to the CPU lockup, the node is unreachable and of course remains stuck in NotReady status. I've tried and reverted a couple of configuration changes that had no effect:
- Adding systemReserved configuration for kubelet to reserve 100m CPU and 100Mi memory
- Lowering kubeReserved memory from 1465Mi to 768Mi (I thought perhaps this is an excessive reservation for the smaller node type anyway, and extra free memory for workloads would help)
Edit: In our case the affected nodes are single CPU, so minus the reservation for kubelet there is only 840m allocatable. So the node behavior under CPU contention could also be a factor.
It seems the Bottlerocket team claims it is an upstream issue: https://github.com/bottlerocket-os/bottlerocket/issues/4399#issuecomment-2657449645
> In our case the affected nodes are single CPU, so minus the reservation for kubelet there is only 840m allocatable. So the node behavior under CPU contention could also be a factor.
We have removed nodes smaller than 2 CPUs and 4 GB RAM from the node pool for now. This has reduced the number of instances of kubelet not posting updates, but has not eliminated it completely.
Talking to the Bottlerocket team to understand what's the next step for this issue.
Possible Fix: Allocating Reserved Resources for Kubelet and System Processes
Hey everyone,
We were facing a similar issue where nodes (running Amazon Linux 2023) would randomly get stuck in a Terminating state, and Kubelet would stop posting node status updates. This was severely impacting our workloads as pods remained stuck indefinitely.
What Fixed It for Us
After some debugging, we found that Kubelet and system processes were getting starved of resources, leading to failures in node health updates and pod evictions. To mitigate this, we explicitly allocated reserved resources using the following Karpenter configuration:
systemReserved:
  cpu: 200m
  memory: 200Mi
  ephemeral-storage: 1Gi
kubeReserved:
  cpu: 200m
  memory: 400Mi
  ephemeral-storage: 3Gi
Would love to hear if others are facing the same issue and if similar configurations help in their environments. Hope this helps!
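For anyone wondering where that block goes: on Karpenter v1.x the kubelet settings live on the EC2NodeClass (they were moved off the NodePool in the v1 API), so the placement would look roughly like this sketch (worth verifying against the nodeclass docs for your chart version):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: generic   # placeholder name
spec:
  # ...amiSelectorTerms, role, subnet/securityGroup selectors as usual...
  kubelet:
    systemReserved:
      cpu: 200m
      memory: 200Mi
      ephemeral-storage: 1Gi
    kubeReserved:
      cpu: 200m
      memory: 400Mi
      ephemeral-storage: 3Gi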
We actually wanted to increase the reservation only for smaller instances, i.e. < 4 cores (1- or 2-core nodes).
Unfortunately, it seems the EC2NodeClass for Bottlerocket does not support scripting/templating in userData; it expects a valid TOML file.
https://karpenter.sh/docs/concepts/nodeclasses/#bottlerocket-2
Supporting templating like the following, based on the NodeClaim labels, could help us apply the reservation selectively.
{{ if eq .karpenter_k8s_aws_instance_cpu "1" "2" }}
[settings.kubernetes.system-reserved]
memory = "500Mi"
{{ end }}
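Until something like that templating exists, one workaround sketch (all names here are hypothetical) is to split out a dedicated EC2NodeClass/NodePool pair that only matches the small instance sizes and carries the bigger reservation:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: generic-small            # hypothetical
spec:
  amiSelectorTerms:
  - alias: bottlerocket@latest
  role: KarpenterNodeRole-prod
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: prod
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: prod
  kubelet:
    systemReserved:
      memory: 500Mi              # larger reservation only for the small nodes
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: generic-small            # hypothetical
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: generic-small
      requirements:
      # Only 1- and 2-vCPU instance types land in this pool.
      - key: karpenter.k8s.aws/instance-cpu
        operator: Lt
        values:
        - "3"

The main pool would then exclude those sizes (e.g. instance-cpu Gt 2) so the two pools don't overlap.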
@Shreyank031 thanks for the suggestion; as mentioned upthread, we use Bottlerocket nodes exclusively and had previously tried setting the spec.kubelet.systemReserved fields. I set them to match your values, but still encountered NotReady nodes after forcing rescheduling with cordon/drain (we're running EKS 1.32, Karpenter 1.2.x, and Bottlerocket 1.34.0). Two observations still stand out:
- For non-burstable instance types, > 1 vCPU or >= 8Gi seems to be the minimum spec to prevent this condition
- I don't observe this condition with burstable instances (t4g.medium instances and larger), perhaps only because they have 2 or more vCPUs?
@Archandan for Bottlerocket it appears the sum of the system/kube reserved memory from spec.kubelet is subtracted from total memory to set /sys/fs/cgroup/kubepods.slice/memory.max. It's not clear to me that CPU limits are enforced at all, which might explain how this is affecting lower-resource nodes or those experiencing high contention.
@gordonm In our setup for some of these hung nodes we were able to see high memory usage. It is possible that CPU might become a bottleneck in some cases but for us it was mostly RAM.
I see there is RAM and CPU reservation defined in the file below, but I have not checked it on a node to confirm: https://github.com/bottlerocket-os/bottlerocket/blob/develop/sources/api/schnauzer/src/helpers.rs (#L1204, #L1137)
Is there any further update/insight on this issue?
> systemReserved: { cpu: 200m, memory: 200Mi, ephemeral-storage: 1Gi }
> kubeReserved: { cpu: 200m, memory: 400Mi, ephemeral-storage: 3Gi }
> Would love to hear if others are facing the same issue and if similar configurations help in their environments. Hope this helps!
This helped a lot but did not completely fix the issue.
Is there any further update/insight on this issue? We are facing a similar issue with Karpenter. Karpenter version: 1.3.3; AWS nodes (EKS): Graviton and AL2023 machines.
Following up: Our workaround is to follow the same requirements for minimum node size used by EKS Auto Mode. So our NodePool requirements now include:
- key: karpenter.k8s.aws/instance-cpu
  operator: Gt
  values:
  - "1"
- key: karpenter.k8s.aws/instance-memory
  operator: Gt
  values:
  - "4000"
This seems to prevent the resource starvation that leads to the issue reported here and has allowed us to update to the latest EKS and Bottlerocket releases.
FWIW, we were encountering this issue primarily with c5.large instances -- we could not keep them online for longer than 10 minutes after startup.
After upping the NodePool requirements to use only instance types with
- key: karpenter.k8s.aws/instance-memory
  operator: Gt
  values:
  - "4096"
(as c5.large has exactly 4096 MiB, so it is excluded), along with Shreyank031's fix above, we're no longer losing nodes.
We still observe the issue:
| Karpenter | EKS | Bottlerocket |
|---|---|---|
| 1.0.1 | 1.29 | 1.22.0 |
| 1.0.1 | 1.31 | 1.27.1 |
| 1.4.0 | 1.31 | 1.37.0 |
Now all our instances define system-reserved and kube-reserved resources as suggested in previous messages:
...
kubelet:
  systemReserved:
    cpu: 200m
    memory: 200Mi
    ephemeral-storage: 1Gi
  kubeReserved:
    cpu: 200m
    memory: 400Mi
    ephemeral-storage: 3Gi
...
Instance type observed: t3.2xlarge
Do we have updates about the issue or possible solutions?
@andrescaroc we resolved it by only having Karpenter use nodes with 2 or more CPUs; we don't touch the kubelet reserved settings. It is far from ideal, as this also results in nodes that are underutilized in our clusters.
Preferably I wish these kubelet reserved settings would get sane defaults per node size, e.g. more reserved for larger nodes, less reserved on smaller nodes (smaller nodes have fewer pods to coordinate).
It could be achieved with multiple nodeclasses (one per instance size/type) and then multiple pools (one for each nodeclass); however, that requires someone to figure out the right kubelet values, and it will require every end user to define those node classes and defaults.
Maybe this is a candidate for another CRD that can be used to inject kubelet reservations into EC2NodeClasses. I'm thinking about this to simplify the end-user experience of not having to manage many EC2NodeClasses.
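Purely to illustrate the idea (nothing like this exists in Karpenter today; the group, kind, and fields below are all made up), such a CRD could look like:

apiVersion: example.karpenter.sh/v1alpha1   # hypothetical group/version
kind: KubeletReservationPolicy              # hypothetical kind
metadata:
  name: small-nodes
spec:
  # Select the EC2NodeClasses to inject the reservation into.
  nodeClassSelector:
    matchLabels:
      size-class: small
  kubelet:
    systemReserved:
      cpu: 200m
      memory: 500Mi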
> [!IMPORTANT]
> Maintainers: I think this problem should be the number one priority, as the problem described in this issue is the main deal breaker for Karpenter and leads to broken clusters.
Any thoughts on my ideas above?
Could it be this is resolving some of the problems we see here?
https://github.com/aws/karpenter-provider-aws/releases/tag/v1.5.2
@marcofranssen Unfortunately, we saw the same behaviour even with v1.5.2.
I'm not sure if this is actually an OOM issue; in our case I just observed that the pod is not shutting down cleanly. The node already had a deletionTimestamp, but Karpenter's finalizer was not removed correctly.
See the logs: the node is stuck because Karpenter still reports DisruptionBlocked and TerminationGracePeriodExpiring well after the TTL has been reached.
After removing the finalizer manually, the node gets removed correctly by the node-controller.