
Kubelet stopped posting node status - NotReady Nodes

Open andrescaroc opened this issue 1 year ago • 26 comments

Description

Observed Behavior:

  • Nodes are running and healthy
  • Suddenly the pods on a node turn into the Terminating state
  • Inspecting the node shows it is in the NotReady state
  • Description of the node shows the following:
Conditions:
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Tue, 17 Sep 2024 16:15:10 -0500   Tue, 17 Sep 2024 16:18:13 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Tue, 17 Sep 2024 16:15:10 -0500   Tue, 17 Sep 2024 16:18:13 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Tue, 17 Sep 2024 16:15:10 -0500   Tue, 17 Sep 2024 16:18:13 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Tue, 17 Sep 2024 16:15:10 -0500   Tue, 17 Sep 2024 16:18:13 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
  • Events
Events:
  Type     Reason             Age                  From             Message
  ----     ------             ----                 ----             -------
  Normal   Unconsolidatable   60m (x60 over 21h)   karpenter        Can't remove without creating 2 candidates
  Normal   DisruptionBlocked  45m (x5 over 6h48m)  karpenter        Cannot disrupt Node: pdb "kube-system/coredns" prevents pod evictions
  Normal   DisruptionBlocked  43m (x2 over 47m)    karpenter        Cannot disrupt Node: pdb "istio-system/istiod" prevents pod evictions
  Warning  ContainerGCFailed  41m (x7 over 47m)    kubelet          failed to read podLogsRootDirectory "/var/log/pods": open /var/log/pods: too many open files
  Warning  ImageGCFailed      40m                  kubelet          get filesystem info: Failed to get the info of the filesystem with mountpoint: cannot find filesystem info for device "/dev/nvme2n1p1"
  Normal   DisruptionBlocked  40m                  karpenter        Cannot disrupt Node: NodePool "generic" not found
  Normal   NodeNotReady       39m                  node-controller  Node ip-192-168-148-76.eu-central-1.compute.internal status is now: NodeNotReady
  • Another kind of Events (from other node who suffers the same)
Events:
  Type     Reason             Age                  From             Message
  ----     ------             ----                 ----             -------
  Normal   Unconsolidatable   27m (x116 over 35h)  karpenter        SpotToSpotConsolidation is disabled, can't replace a spot node with a spot node
  Warning  ContainerGCFailed  17m                  kubelet          failed to read podLogsRootDirectory "/var/log/pods": open /var/log/pods: too many open files
  Normal   NodeNotReady       16m (x2 over 2d9h)   node-controller  Node ip-192-168-145-111.eu-central-1.compute.internal status is now: NodeNotReady
  Normal   DisruptionBlocked  13m (x3 over 17m)    karpenter        Cannot disrupt Node: pdb "knative-eventing/eventing-webhook" prevents pod evictions
  • It seems the node-controller sets the node to NotReady, which blocks (interferes with) Karpenter's disruption logic.
  • The node stays in that state forever (it never heals itself)
  • Pods on the node are unable to terminate (they remain in Terminating state forever)
  • The only way to heal the node is to reboot the EC2 instance using the AWS CLI (see the sketch after this list)
  • 6 EKS clusters show the same behavior
  • I have seen clusters with 6 out of 12 nodes in that state
  • This behavior started over the weekend; last week we did not see it
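
For anyone triaging, the checks and the reboot referenced in the list above might look roughly like this; the node name and instance ID are placeholders:

# List nodes and filter for those reporting NotReady
kubectl get nodes -o wide | grep NotReady

# Show the pods stuck on an affected node (they remain in Terminating)
kubectl get pods -A --field-selector spec.nodeName=<node-name>

# Last resort used here: reboot the EC2 instance backing the node
aws ec2 reboot-instances --instance-ids <instance-id>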

Expected Behavior:

  • Nodes are running and healthy
  • Karpenter disrupts a node following node disruption budgets
  • Pods are moved into new nodes
  • Node is deleted
  • Pods are healthy in new nodes

Reproduction Steps (Please include YAML):

  • No human intervention is required; the behavior appears on its own

  • NodePool

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    karpenter.sh/nodepool-hash: "7878014288654516958"
    karpenter.sh/nodepool-hash-version: v3
  creationTimestamp: "2024-04-15T17:10:56Z"
  generation: 3
  labels:
    kustomize.toolkit.fluxcd.io/name: karpenter-configs
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: generic
  resourceVersion: "1328500814"
  uid: 1a4e3c12-4abb-4085-9698-8a8f791578f6
spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: 0s
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: 1000
  template:
    spec:
      expireAfter: 720h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: generic
      requirements:
      - key: karpenter.k8s.aws/instance-family
        operator: In
        values:
        - t3
        - t3a
        - m5
        - m5a
        - m6a
        - m6i
      - key: karpenter.k8s.aws/instance-size
        operator: In
        values:
        - large
        - xlarge
        - 2xlarge
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - eu-central-1a
        - eu-central-1b
        - eu-central-1c
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - spot
        - on-demand
  • EC2NodeClass
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    karpenter.k8s.aws/ec2nodeclass-hash: "10715260476625989988"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
  creationTimestamp: "2024-04-15T17:10:55Z"
  finalizers:
  - karpenter.k8s.aws/termination
  generation: 6
  labels:
    kustomize.toolkit.fluxcd.io/name: karpenter-configs
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: generic
  resourceVersion: "1328580165"
  uid: 1e658352-e2cb-4e33-ac47-85a27acf3104
spec:
  amiSelectorTerms:
  - alias: bottlerocket@latest
  blockDeviceMappings:
  - deviceName: /dev/xvda
    ebs:
      deleteOnTermination: true
      volumeSize: 4Gi
      volumeType: gp3
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      volumeSize: 60Gi
      volumeType: gp3
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
  role: KarpenterNodeRole-prod
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: prod
  subnetSelectorTerms:
  - tags:
      Name: '*Private*'
      karpenter.sh/discovery: prod
  tags:
    nodepool: generic
    purpose: prod
  • The same behavior is observed with other NodePools/EC2NodeClasses with different instance types, including GPU instances.

Versions:

  • Chart Version: 1.0.1
  • Kubernetes Version (kubectl version): 1.29

It is observed that managed nodes (we have three of them in each cluster) use Bottlerocket 1.21 (OS Image: Bottlerocket OS 1.21.1 (aws-k8s-1.29)).

However, it seems Karpenter nodes are now using Bottlerocket 1.22 (OS Image: Bottlerocket OS 1.22.0 (aws-k8s-1.29)).

And sometimes the nodes that are in NotReady state report OS Image: Unknown.

  • I wonder if this could be a Bottlerocket issue or a combination of Bottlerocket and Karpenter.
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

andrescaroc avatar Sep 17 '24 22:09 andrescaroc

A temporary solution is to pin the Bottlerocket version to 1.21.1.
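
For reference, the pin goes through the EC2NodeClass amiSelectorTerms alias instead of bottlerocket@latest; a minimal sketch (the exact pinned-alias syntax is an assumption here, verify it against the AMI selection docs for your Karpenter version):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: generic
spec:
  amiSelectorTerms:
  # Pin the AMI version instead of tracking latest; assumed syntax, check the docs
  - alias: bottlerocket@v1.21.1
  # role, subnetSelectorTerms, securityGroupSelectorTerms etc. unchanged from the spec above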

andrescaroc avatar Sep 20 '24 03:09 andrescaroc

Hey, I've got the same issue, but with

  • Karpenter Chart Version : 1.1.0
  • Kubernetes Version (kubectl version): 1.30, specifically v1.30.4-eks-a737599
  • OS Image: Amazon Linux 2

jacoblElementor avatar Dec 08 '24 11:12 jacoblElementor

@andrescaroc Are you still running into this issue? Have any clarity on the cause? I just ran into it across all of my clusters.

jammerful avatar Jan 24 '25 19:01 jammerful

Hello. We are facing the exact same issue.

  • Karpenter Chart Version : 1.2.1
  • EKS: 1.31.4
  • OS Image: Amazon Linux 2023

Either I need to manually delete the pods (forcefully) or the nodeclaim/node (forcefully). I even tried setting terminationGracePeriod: 30s as a workaround, but with no results. We mostly see this issue on on-demand instances (obviously). I started encountering this issue after upgrading Karpenter to 1.2.1 (we have been using Karpenter since alpha).
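
For anyone else trying that workaround: in the v1 API, terminationGracePeriod sits on the NodePool template spec; a minimal excerpt (assuming Karpenter v1.x; names mirror the NodePool earlier in this thread):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: generic
spec:
  template:
    spec:
      # Maximum time Karpenter waits on draining before it force-terminates the node
      terminationGracePeriod: 30s
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: generic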

Shreyank031 avatar Feb 07 '25 07:02 Shreyank031

Hi, we see the same issue in our clusters. This is seen with both Bottlerocket and AL2 images.

  • Karpenter version: 1.0.5
  • EKS version: v1.30.8
  • OS Image: Amazon Linux 2/ bottlerocket

Archandan avatar Feb 11 '25 04:02 Archandan

Same here:

We have seen the problem in the following combos:

Karpenter   EKS    Bottlerocket
---------   ----   ------------
1.2.1       1.31   1.30
1.2.1       1.31   1.31
1.2.1       1.32   1.32

Basically, as long as we stick to EKS 1.31 we have so far been able to run stable with Bottlerocket 1.29; however, to be able to bump to EKS 1.32 we will have to upgrade Bottlerocket.

Desperately looking for a solution!

One thing we found out is the following error:

failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config

marcofranssen avatar Feb 12 '25 09:02 marcofranssen

@jammerful Yes, I still have this issue. I see the problem in the following setups:

Karpenter   EKS    Bottlerocket
---------   ----   ------------
1.0.1       1.29   1.22.0
1.0.1       1.31   1.27.1

andrescaroc avatar Feb 12 '25 14:02 andrescaroc

Can anyone provide instance IDs and node names for particular nodes? Then we can take a closer look.

Sparksssj avatar Feb 12 '25 20:02 Sparksssj

Can anyone provide instance IDs and node names for particular nodes? Then we can take a closer look.

@Sparksssj

id: aws:///eu-central-1b/i-011e10919f8e6d533
name: ip-192-168-179-209.eu-central-1.compute.internal

andrescaroc avatar Feb 13 '25 18:02 andrescaroc

Can anyone provide instance IDs and node names for particular nodes? Then we can take a closer look.

@Sparksssj
ids: i-08ee3489048ea9781, i-0e4d1d9ae05437854 + many more
names: ip-172-31-64-156.ap-south-1.compute.internal, ip-172-31-116-107.ap-south-1.compute.internal + many more
ami id: ami-0dce739b024d12140

Shreyank031 avatar Feb 17 '25 10:02 Shreyank031

I have been experiencing the exact same issue with Bottlerocket nodes. Error message: Kubelet stopped posting node status. The node stays in 'NotReady' state forever, leaving the pods on it in 'Terminating' state.

  • Karpenter version - 1.0.3
  • EKS version - 1.30
  • Bottlerocket OS version - latest
  • Node Type - On-demand
  Type             Status    LastHeartbeatTime                 LastTransitionTime                Reason              Message
  ----             ------    -----------------                 ------------------                ------              -------
  MemoryPressure   Unknown   Sun, 16 Feb 2025 01:40:57 -0500   Sun, 16 Feb 2025 01:43:02 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
  DiskPressure     Unknown   Sun, 16 Feb 2025 01:40:57 -0500   Sun, 16 Feb 2025 01:43:02 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
  PIDPressure      Unknown   Sun, 16 Feb 2025 01:40:57 -0500   Sun, 16 Feb 2025 01:43:02 -0500   NodeStatusUnknown   Kubelet stopped posting node status.
  Ready            Unknown   Sun, 16 Feb 2025 01:40:57 -0500   Sun, 16 Feb 2025 01:43:02 -0500   NodeStatusUnknown   Kubelet stopped posting node status.

dhavaln-able avatar Feb 17 '25 11:02 dhavaln-able

Observed the same issue. Instance id: i-004e3c50711039056 (ip-10-227-233-253.eu-central-1.compute.internal) AMI id: ami-06c1dd94700c2647e

  • Karpenter: 0.37.6
  • EKS version: 1.30
  • Bottlerocket: 1.32.0
  • On-demand nodes

bunicb avatar Feb 17 '25 12:02 bunicb

I notice @andrescaroc's NodePool excludes the smaller instance sizes; however, on our clusters (see my colleague @marcofranssen's comment for specifics, but we are on the latest Karpenter/EKS/Bottlerocket versions) I can reproduce this on arm64 medium (8Gi) instance types. We do not observe this issue at all on nodes with >8Gi memory, but that could be a red herring of course.

All of our workloads have memory requests/limits set; however, when Karpenter tries to schedule a couple of pods with moderate memory requests that would seem to fit comfortably, the node completely falls over shortly after startup:

[   35.112557] Memory cgroup out of memory: Killed process 2813 (runc:[2:INIT]) total-vm:1539692kB, anon-rss:2704kB, file-rss:1196kB, shmem-rss:5616kB, UID:0 pgtables:140kB oom_score_adj:-997
[   60.172244] watchdog: BUG: soft lockup - CPU#0 stuck for 27s! [systemd:1]
[   88.172091] watchdog: BUG: soft lockup - CPU#0 stuck for 53s! [systemd:1]
[   94.122058] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[   94.122543] rcu: All QSes seen, last rcu_sched kthread activity 5902 (4294946691-4294940789), jiffies_till_next_fqs=1, root ->qsmask 0x0
[   94.123471] rcu: rcu_sched kthread starved for 5902 jiffies! g10881 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[   94.124245] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[   94.124934] rcu: RCU grace-period kthread stack dump:
[   94.125356] rcu: Stack dump where RCU GP kthread last ran:
[  120.171915] watchdog: BUG: soft lockup - CPU#0 stuck for 82s! [systemd:1]
[  148.171762] watchdog: BUG: soft lockup - CPU#0 stuck for 108s! [systemd:1]
[  176.171609] watchdog: BUG: soft lockup - CPU#0 stuck for 135s! [systemd:1]
[  204.171455] watchdog: BUG: soft lockup - CPU#0 stuck for 161s! [systemd:1]
[  232.171302] watchdog: BUG: soft lockup - CPU#0 stuck for 187s! [systemd:1]
[  260.171148] watchdog: BUG: soft lockup - CPU#0 stuck for 213s! [systemd:1]
[  271.171089] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  271.171573] rcu: All QSes seen, last rcu_sched kthread activity 23607 (4294964396-4294940789), jiffies_till_next_fqs=1, root ->qsmask 0x0
[  271.172502] rcu: rcu_sched kthread starved for 23607 jiffies! g10881 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[  271.173279] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[  271.173966] rcu: RCU grace-period kthread stack dump:
[  271.174387] rcu: Stack dump where RCU GP kthread last ran:
[  296.170951] watchdog: BUG: soft lockup - CPU#0 stuck for 246s! [systemd:1]
[  324.170798] watchdog: BUG: soft lockup - CPU#0 stuck for 272s! [systemd:1]
[  352.170644] watchdog: BUG: soft lockup - CPU#0 stuck for 298s! [systemd:1]

At this point due to the CPU lockup the node is unreachable and of course the node remains stuck in NotReady status. I've tried and reverted a couple configuration changes that had no effect:

  • Adding systemReserved configuration for kubelet to reserve 100m CPU and 100Mi memory
  • Lowering kubeReserved memory from 1465Mi to 768Mi (I thought perhaps this is an excessive reservation for the smaller node type anyway, and extra free memory for workloads would help)

Edit: In our case the affected nodes are single CPU, so minus the reservation for kubelet there is only 840m allocatable. So the node behavior under CPU contention could also be a factor.

gordonm avatar Feb 17 '25 19:02 gordonm

It seems the Bottlerocket team considers this an upstream issue: https://github.com/bottlerocket-os/bottlerocket/issues/4399#issuecomment-2657449645

marcofranssen avatar Feb 17 '25 21:02 marcofranssen

In our case the affected nodes are single CPU, so minus the reservation for kubelet there is only 840m allocatable. So the node behavior under CPU contention could also be a factor.

We have removed nodes smaller than 2 CPUs and 4 GB RAM from the node pool for now. This has reduced the number of occurrences of the kubelet not posting updates, but has not eliminated them completely.

Archandan avatar Feb 20 '25 11:02 Archandan

Talking to the Bottlerocket team to understand what's the next step for this issue.

saurav-agarwalla avatar Feb 26 '25 17:02 saurav-agarwalla

Possible Fix: Allocating Reserved Resources for Kubelet and System Processes

Hey everyone,

We were facing a similar issue where nodes (running Amazon Linux 2023) would randomly get stuck in a Terminating state, and Kubelet would stop posting node status updates. This was severely impacting our workloads as pods remained stuck indefinitely.

What Fixed It for Us

After some debugging, we found that Kubelet and system processes were getting starved of resources, leading to failures in node health updates and pod evictions. To mitigate this, we explicitly allocated reserved resources using the following Karpenter configuration:

  systemReserved:
    cpu: 200m
    memory: 200Mi
    ephemeral-storage: 1Gi
  kubeReserved:
    cpu: 200m
    memory: 400Mi
    ephemeral-storage: 3Gi
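
For context, in the Karpenter v1 API this kubelet block lives on the EC2NodeClass rather than the NodePool; a minimal sketch of where the fragment above would go (AMI, role, and selector terms omitted, see the EC2NodeClass earlier in the thread):

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: generic
spec:
  kubelet:
    # Reserve headroom for the OS and for kubelet/containerd so workloads cannot starve them
    systemReserved:
      cpu: 200m
      memory: 200Mi
      ephemeral-storage: 1Gi
    kubeReserved:
      cpu: 200m
      memory: 400Mi
      ephemeral-storage: 3Gi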

Would love to hear if others are facing the same issue and if similar configurations help in their environments. Hope this helps!

Shreyank031 avatar Feb 27 '25 10:02 Shreyank031

We actually wanted to increase the reservation only for smaller instances, i.e. < 4 cores (1- or 2-core nodes).

Unfortunately, it seems the EC2NodeClass for Bottlerocket does not support scripting/templating in userData; it expects a valid TOML file. https://karpenter.sh/docs/concepts/nodeclasses/#bottlerocket-2

Supporting templating like the following, based on the nodeclaim labels, could help us selectively apply the reservation.

{{ if eq  .karpenter_k8s_aws_instance_cpu "1" "2" }}
[settings.kubernetes.system-reserved]
memory = "500Mi"
{{ end }}
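
Until something like that exists, one way to approximate it with the current API is to duplicate the NodePool/EC2NodeClass pair per size bracket and give the small-instance nodeclass a larger reservation; a rough sketch with hypothetical names ("generic-small" is not from this thread), assuming the existing pool is also constrained to instance-cpu greater than 2 so the two pools don't overlap:

apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: generic-small            # hypothetical nodeclass just for 1-2 vCPU instances
spec:
  amiSelectorTerms:
  - alias: bottlerocket@latest
  kubelet:
    systemReserved:
      memory: 500Mi              # the larger reservation you would otherwise template in
  # role, subnet/securityGroup selector terms as in the existing "generic" nodeclass
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: generic-small            # hypothetical pool that only provisions the small instances
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: generic-small
      requirements:
      - key: karpenter.k8s.aws/instance-cpu
        operator: In
        values: ["1", "2"]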

Archandan avatar Feb 27 '25 13:02 Archandan

@Shreyank031 thanks for the suggestion; as mentioned upthread, we use Bottlerocket nodes exclusively and had previously tried setting the spec.kubelet.systemReserved fields. I set them to match your values, but still encountered NotReady nodes after forcing rescheduling with cordon/drain (we're running EKS 1.32, Karpenter 1.2.x and Bottlerocket 1.34.0). Two observations still stand out:

  • For non-burstable instance types, > 1 vCPU or >= 8Gi seems to be the minimum spec to prevent this condition
  • I don't observe this condition with burstable instances (t4g.medium instances and larger) perhaps only because they have 2 or more vCPUs?

@Archandan for Bottlerocket it appears the sum of the system/kube reserved memory from spec.kubelet is subtracted from total memory to set /sys/fs/cgroup/kubepods.slice/memory.max. It's not clear to me that CPU limits are enforced at all, which might explain how this is affecting lower-resource nodes or those experiencing high contention.
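
If anyone wants to verify that on a live node, the values can be read straight from the cgroup v2 hierarchy, e.g. from the Bottlerocket admin container or a privileged debug pod (a sketch; the memory.max path is the one mentioned above):

# Effective memory limit for all pods = total memory minus system+kube reserved, per the observation above
cat /sys/fs/cgroup/kubepods.slice/memory.max

# Check whether any CPU quota is applied to the pods slice ("max" means no quota)
cat /sys/fs/cgroup/kubepods.slice/cpu.max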

gordonm avatar Mar 05 '25 07:03 gordonm

@gordonm In our setup for some of these hung nodes we were able to see high memory usage. It is possible that CPU might become a bottleneck in some cases but for us it was mostly RAM.

I see there are RAM and CPU reservations defined in the file below, but I have not checked the same on a node to confirm: https://github.com/bottlerocket-os/bottlerocket/blob/develop/sources/api/schnauzer/src/helpers.rs (#L1204, #L1137)

Archandan avatar Mar 06 '25 05:03 Archandan

Is there any further update/insight on this issue?

Archandan avatar Apr 03 '25 04:04 Archandan

  systemReserved:
    cpu: 200m
    memory: 200Mi
    ephemeral-storage: 1Gi
  kubeReserved:
    cpu: 200m
    memory: 400Mi
    ephemeral-storage: 3Gi

Would love to hear if others are facing the same issue and if similar configurations help in their environments. Hope this helps!

This helped a lot but did not completely fix the issue.

ajhodgson avatar Apr 07 '25 15:04 ajhodgson

Is there any further update/insight on this issue? We are facing a similar issue with Karpenter.
Karpenter version: 1.3.3
AWS nodes (EKS): Graviton with AL2023

cw-madhuripatil avatar Apr 18 '25 12:04 cw-madhuripatil

Following up: Our workaround is to follow the same requirements for minimum node size used by EKS Auto Mode. So our NodePool requirements now include:

- key: karpenter.k8s.aws/instance-cpu
  operator: Gt
  values:
  - "1"
- key: karpenter.k8s.aws/instance-memory
  operator: Gt
  values:
  - "4000"

This seems to prevent the resource starvation that leads to the issue reported here and has allowed us to update to the latest EKS and Bottlerocket releases.

gordonm avatar Apr 21 '25 15:04 gordonm

FWIW, we were encountering this issue primarily with c5.large instances -- we could not keep them online for longer than 10 minutes after startup.

After upping the NodePool requirements to use instance types with

- key: karpenter.k8s.aws/instance-memory
  operator: Gt
  values:
  - "4096"

(as c5.large has exactly 4096 MiB, which the Gt "4096" requirement excludes), along with Shreyank031's fix above, we're no longer losing nodes.

jsumali-felix avatar May 01 '25 19:05 jsumali-felix

We still observe the issue:

Karpenter   EKS    Bottlerocket
---------   ----   ------------
1.0.1       1.29   1.22.0
1.0.1       1.31   1.27.1
1.4.0       1.31   1.37.0

Now all our instances define system-reserved and kube-reserved resources as suggested in previous messages:

...
  kubelet:
    systemReserved:
      cpu: 200m
      memory: 200Mi
      ephemeral-storage: 1Gi
    kubeReserved:
      cpu: 200m
      memory: 400Mi
      ephemeral-storage: 3Gi
...

Instance type observed: t3.2xlarge

Do we have updates about the issue or possible solutions?

andrescaroc avatar May 21 '25 15:05 andrescaroc

@andrescaroc we resolved it by only having Karpenter use nodes with 2 or more CPUs; we don't touch the kubelet reserved settings. It is far from ideal, as this also results in nodes that are underutilized in our clusters.

Preferably, I wish these kubelet reserved settings would get sane defaults per node size, e.g. more reserved for larger nodes, less reserved on smaller nodes (smaller nodes have fewer pods to coordinate).

It could be achieved with multiple nodeclasses (one per instance size/type) and then multiple pools (one for each nodeclass); however, that requires someone to figure out the right kubelet values, and it would require every end user to define those node classes and defaults.

Maybe this is a candidate for another CRD that could be used to inject kubelet reservations into EC2NodeClasses, to simplify the end-user experience of not having to manage many EC2NodeClasses.

Important (maintainers): I think this problem should be the number-one priority, as the problem described in this issue is the main deal breaker for Karpenter and leads to broken clusters.

Any thoughts on my ideas above?

marcofranssen avatar Jul 03 '25 07:07 marcofranssen

Could it be that this release resolves some of the problems we see here?

https://github.com/aws/karpenter-provider-aws/releases/tag/v1.5.2

marcofranssen avatar Jul 04 '25 11:07 marcofranssen

@marcofranssen Unfortunately, we saw the same behaviour even with v1.5.2.

woehrl01 avatar Jul 07 '25 08:07 woehrl01

I'm not sure this is actually an OOM issue; in our case I observed that the pod was not shutting down cleanly. The node already had a deletionTimestamp, but Karpenter's finalizer had not been removed correctly.

See the logs: the node is stuck because Karpenter still reports DisruptionBlocked and TerminationGracePeriodExpiring well after the TTL has been reached.

After removing the finalizer manually, the node gets removed correctly by the node controller.
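
For anyone hitting the same dead end, removing Karpenter's finalizer by hand looks roughly like this (inspect the finalizers array first; the index 0 below is an assumption, adjust it to whichever entry is Karpenter's):

# Show the finalizers on the stuck node
kubectl get node <node-name> -o jsonpath='{.metadata.finalizers}'

# Remove Karpenter's finalizer by its index in the array
kubectl patch node <node-name> --type=json \
  -p='[{"op": "remove", "path": "/metadata/finalizers/0"}]'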

(screenshot of the Karpenter log output omitted)

woehrl01 avatar Jul 08 '25 08:07 woehrl01