Unresponsive/unreachable Bottlerocket EKS nodes
Hey folks,
Coming to you with an odd issue with Bottlerocket EKS nodes becoming unresponsive and unreachable, but still happily running as seen by EC2.
Image I'm using:
ami-089e696e7c541c61b (amazon/bottlerocket-aws-k8s-1.29-aarch64-v1.20.2-536d69d0)
System Info:
Kernel Version: 6.1.90
OS Image: Bottlerocket OS 1.20.2 (aws-k8s-1.29)
Operating System: linux
Architecture: arm64
Container Runtime Version: containerd://1.6.31+bottlerocket
Kubelet Version: v1.29.1-eks-61c0bbb
Kube-Proxy Version: v1.29.1-eks-61c0bbb
We use Karpenter as a node provisioner.
What I expected to happen:
Node runs smoothly.
What actually happened:
A large node (m7g.8xlarge) is running a memory-heavy JVM workload from a StatefulSet that takes up most of the machine (request=115Gi out of 128Gi).
Node runs happily for a while (~ 24h to several days), until it suddenly stops reporting status to Kubernetes:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Tue, 25 Jun 2024 15:18:44 +0200 Tue, 25 Jun 2024 15:20:38 +0200 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Tue, 25 Jun 2024 15:18:44 +0200 Tue, 25 Jun 2024 15:20:38 +0200 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Tue, 25 Jun 2024 15:18:44 +0200 Tue, 25 Jun 2024 15:20:38 +0200 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Tue, 25 Jun 2024 15:18:44 +0200 Tue, 25 Jun 2024 15:20:38 +0200 NodeStatusUnknown Kubelet stopped posting node status.
Kubernetes marks that node as NotReady and taints it with node.kubernetes.io/unreachable, then tries to delete the pod running our large workload. That doesn't work (the node is fully unresponsive), so that pod is stuck in status Terminating.
The node is unreachable via kubectl debug node ... (which times out), or via AWS SSM (which complains of "SSM Agent is not online", "Ping status: Connection lost"). The EC2 Serial Console is empty (all black). We do not have SSH enabled on these machines.
However, the node still appears as running from the EC2 console, reachability metrics are green, and we can see CPU / network metrics flowing in.
I've managed to get system logs by taking an EBS snapshot of the Bottlerocket data volume and restoring it to a separate volume for investigation. This was not helpful, unfortunately: logs appear normal until the time the node dies, then suddenly stop. There is no indication that anything in particular (kubelet, containerd, etc.) crashed and brought the instance down, but also, suspiciously, no logs at all from the moment the node went unresponsive.
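For anyone wanting to do something similar, a sketch of the general approach is below (the volume, snapshot, and instance IDs are placeholders, and the exact device name and journal path may differ on your setup):
# Snapshot the dead node's Bottlerocket data volume, restore it, and attach it to a healthy helper instance
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "debug: dead node data volume"
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone eu-west-1b
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/sdf
# On the helper instance: mount the restored volume and read its journal
mount /dev/nvme1n1p1 /mnt/br-data                             # partition layout may differ
journalctl --directory /mnt/br-data/var/log/journal --no-pager  # journal location on the data volume may differ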
How to reproduce the problem:
No clear way to reproduce, unfortunately; we've seen this happen sporadically on maybe half a dozen instances over the past few weeks, out of several hundred.
This issue is very annoying: I don't mind pods crashing and/or getting evicted sometimes, or even kubelet/containerd crashing, but I'd expect the node to self-heal eventually. Instead, this causes our workloads to get stuck, and we have to manually delete the pod and/or the node to get things back to normal. Even worse, I can't see a way to debug it properly or get to the bottom of it.
Would you have any idea of a better way to debug this?
Thank you!
(Note: I also opened an AWS support ticket to see if there's any AWS-level issue at play here, but this seems to happen only on this specific workload on Bottlerocket nodes, so I suspect something is off here)
Thanks for opening this issue @Pluies. This does look concerning and we would like to get to the bottom of this! I'll look internally for that support ticket to see if we can find out what is happening.
Okay, I've got good news - after 13 days of being in this state, one node came back to life! Well, kind of. We received a NodeClockNotSynchronising alert, and realised the prometheus node exporter started responding to scrapes again (???).
Apart from that, no change. The node is still showing as NotReady and unreachable, not reachable via kubectl debug or AWS SSM, etc.
I've taken a new snapshot of the bottlerocket data volume, and I'm seeing new logs in the journal though; attaching it here. journal-after-rebirth.log
I think this has the meat of it (kernel messages + OOM kills).
There are some more messages later showing kubelet errors; I assume the node is so far gone at this stage that it's unable to refresh its credentials and rejoin the cluster.
Those logs do help. It looks like this could be related to EBS availability, since the hung tasks involve the storage driver:
INFO: task jbd2/nvme2n1-8:6604 blocked for more than 122 seconds.
kernel: fbcon: Taking over console
kernel: Not tainted 6.1.90 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:jbd2/nvme2n1-8 state:D stack:0 pid:6604 ppid:2 flags:0x00000008
kernel: Call trace:
kernel: __switch_to+0xbc/0xf0
kernel: __schedule+0x258/0x6e0
kernel: schedule+0x58/0xf0
kernel: jbd2_journal_wait_updates+0x88/0x130
kernel: jbd2_journal_commit_transaction+0x264/0x1bc0
kernel: kjournald2+0x15c/0x470
kernel: kthread+0xd0/0xe0
Can you confirm how many instances this happened to and what storage type you are using for your volumes? I'll keep digging in as best I can as well!
Sure, storage-wise the machine has 4 EBS volumes attached:
- For Bottlerocket: two standard EBS volumes (50GB for the root volume, 100GB for the data volume)
- For our workload: two large PVCs provisioned by the AWS EBS CSI driver. Both are gp3; one is formatted as ext4 (2000GB), the other as xfs (500GB)
- Additionally, the workload mounts an EFS volume, and it's configured to write a Java heap dump to this volume in case of a Java OOM
To keep everyone in sync, AWS Support seems to think this is due to memory pressure. The large workload has no memory limit, so this is definitely a possibility. We've been working off the assumption that if it does use too much memory, we would still be protected by the OOMKiller and other limits such as kube-reserved and system-reserved memory, but I realise this assumption might be wrong.
The node exporter metrics also point to memory pressure:
# HELP node_pressure_cpu_waiting_seconds_total Total time in seconds that processes have waited for CPU time
# TYPE node_pressure_cpu_waiting_seconds_total counter
node_pressure_cpu_waiting_seconds_total 683567.871622
# HELP node_pressure_io_stalled_seconds_total Total time in seconds no process could make progress due to IO congestion
# TYPE node_pressure_io_stalled_seconds_total counter
node_pressure_io_stalled_seconds_total 3823.9046439999997
# HELP node_pressure_io_waiting_seconds_total Total time in seconds that processes have waited due to IO congestion
# TYPE node_pressure_io_waiting_seconds_total counter
node_pressure_io_waiting_seconds_total 18942.701305
# HELP node_pressure_memory_stalled_seconds_total Total time in seconds no process could make progress due to memory congestion
# TYPE node_pressure_memory_stalled_seconds_total counter
node_pressure_memory_stalled_seconds_total 1.141417592027e+06
# HELP node_pressure_memory_waiting_seconds_total Total time in seconds that processes have waited for memory
# TYPE node_pressure_memory_waiting_seconds_total counter
node_pressure_memory_waiting_seconds_total 1.141678870877e+06
The memory-related stall of 1.141678870877e+06 seconds works out to just over 13 days (1,141,679 s ÷ 86,400 s/day ≈ 13.2 days), which lines up well with how long the node was dead for. This might just be a coincidence, if this is shared by all processes, but it's interesting nonetheless.
On another node (not dead) with the same workload, these metrics come up as:
# HELP node_pressure_cpu_waiting_seconds_total Total time in seconds that processes have waited for CPU time
# TYPE node_pressure_cpu_waiting_seconds_total counter
node_pressure_cpu_waiting_seconds_total 10647.531140000001
# HELP node_pressure_io_stalled_seconds_total Total time in seconds no process could make progress due to IO congestion
# TYPE node_pressure_io_stalled_seconds_total counter
node_pressure_io_stalled_seconds_total 5830.272114
# HELP node_pressure_io_waiting_seconds_total Total time in seconds that processes have waited due to IO congestion
# TYPE node_pressure_io_waiting_seconds_total counter
node_pressure_io_waiting_seconds_total 10842.729424000001
# HELP node_pressure_memory_stalled_seconds_total Total time in seconds no process could make progress due to memory congestion
# TYPE node_pressure_memory_stalled_seconds_total counter
node_pressure_memory_stalled_seconds_total 4560.770407999999
# HELP node_pressure_memory_waiting_seconds_total Total time in seconds that processes have waited for memory
# TYPE node_pressure_memory_waiting_seconds_total counter
node_pressure_memory_waiting_seconds_total 4941.830446999999
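As an aside, these node_pressure_* metrics are node exporter's view of the kernel's PSI (Pressure Stall Information) counters, so the same data can also be read directly on a node:
# Raw pressure stall information, the source of the node_pressure_* metrics above
cat /proc/pressure/memory
cat /proc/pressure/io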
Anyway, I'm now adding a memory limit to the large workload, so if this is memory-related it should fix the issue for us.
If this is the case though, is it still potentially a bottlerocket issue? I'd have expected either a full crash or eventual self-healing, but we ended up in this half-broken state instead. On the other hand, it would also be reasonable for Bottlerocket to not support unbounded workloads.
For Bottlerocket: two standard EBS volumes (50GB for the root volume, 100GB for the data volume)
One thing I want to call out here before we lose it in the rest: I don't think you are benefiting from making the root volume 50GB. By default, Bottlerocket won't expand to use anything more than the default size (typically 2GB) of the root volume. Everything that might need variable storage space should end up on the second volume. I'd like to double check if you could save on some EBS costs by avoiding the additional 48GB of unused space.
AWS Support seems to think this is due to memory pressure.
I do think the memory pressure is causing some of this problem, but I don't think it should result in the node ending up in this state. The fact that the rest of the host is starved, even the storage layer, feels off at first glance. I'd like to see if memory limit helps you avoid this but I still think there is something odd going on here. The OOM killer should be able to recover and protect the rest of the OS from situations like this. Do you have any special configuration for this workload like setting additional privileges or capabilities?
I don't think you are benefiting from making the root volume 50GB.
Thank you for that! I'll bring it down.
Do you have any special configuration for this workload like setting additional privileges or capabilities?
None in particular, here's the podspec:
apiVersion: v1
kind: Pod
metadata:
annotations:
dune.com/tagger: dune:subproject=cluster-1
creationTimestamp: "2024-06-28T14:15:40Z"
generateName: cluster-1-q5mjg-1-workers-
labels:
app.kubernetes.io/created-by: trino-operator
app.kubernetes.io/instance: cluster-1-q5mjg-1
app.kubernetes.io/name: trinocluster
app.kubernetes.io/part-of: trino-operator
app.kubernetes.io/role: worker
apps.kubernetes.io/pod-index: "9"
controller-revision-hash: cluster-1-q5mjg-1-workers-5d877d47f9
dune.com/clusterset: cluster-1
statefulset.kubernetes.io/pod-name: cluster-1-q5mjg-1-workers-9
name: cluster-1-q5mjg-1-workers-9
namespace: query-engine
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: cluster-1-q5mjg-1-workers
uid: b8b5ba7c-4457-4216-9aa7-31613391ab38
resourceVersion: "2297524264"
uid: f5f3ec89-5cdf-4eca-9bd1-d130bd36b1a2
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: dunetech.io/nodegroup
operator: In
values:
- trino
- key: node.kubernetes.io/instance-type
operator: In
values:
- m7g.8xlarge
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
- key: topology.kubernetes.io/zone
operator: In
values:
- eu-west-1b
- matchExpressions:
- key: dunetech.io/nodegroup
operator: In
values:
- trino
- key: node.kubernetes.io/instance-type
operator: In
values:
- m6g.8xlarge
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
- key: topology.kubernetes.io/zone
operator: In
values:
- eu-west-1b
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/created-by: trino-operator
app.kubernetes.io/instance: cluster-1-q5mjg-1
app.kubernetes.io/name: trinocluster
app.kubernetes.io/part-of: trino-operator
app.kubernetes.io/role: worker
topologyKey: kubernetes.io/hostname
containers:
- env:
- name: AWS_STS_REGIONAL_ENDPOINTS
value: regional
- name: AWS_DEFAULT_REGION
value: eu-west-1
- name: AWS_REGION
value: eu-west-1
- name: AWS_ROLE_ARN
value: arn:aws:iam::aws-account-id-redacted:role/prod_eks_trino
- name: AWS_WEB_IDENTITY_TOKEN_FILE
value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
image: aws-account-id-redacted.dkr.ecr.eu-west-1.amazonaws.com/trino:2024-06-28T08-59-33-main-57ed4c8
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 20
httpGet:
path: /v1/info
port: http
scheme: HTTP
initialDelaySeconds: 20
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 60
name: trino
ports:
- containerPort: 8080
name: http
protocol: TCP
- containerPort: 9000
name: metrics
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /v1/info
port: http
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
resources:
requests:
cpu: "30"
memory: 115Gi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/trino
name: config-volume
- mountPath: /etc/trino/catalog
name: catalog-volume
- mountPath: /data/view-overrides
name: view-overrides
readOnly: true
- mountPath: /etc/trino-event-listeners/event-listener.properties
name: dune-event-listener
readOnly: true
subPath: event-listener.properties
- mountPath: /cache
name: cache
- mountPath: /etc/trino-resource-groups/resource-groups.json
name: resource-groups
readOnly: true
subPath: resource-groups.json
- mountPath: /heapdumps/
name: heapdumps
- mountPath: /var/lib/trino/prometheus-exporter
name: prometheus-exporter-config-volume
- mountPath: /var/lib/trino/spill
name: spilltodisk
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-4mxl5
readOnly: true
- mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
name: aws-iam-token
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: cluster-1-q5mjg-1-workers-9
nodeName: ip-10-0-56-162.eu-west-1.compute.internal
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 1000
runAsGroup: 1000
runAsUser: 1000
serviceAccount: trino
serviceAccountName: trino
subdomain: cluster-1-q5mjg-1-workers
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: dunetech.io/nodegroup
operator: Equal
value: trino
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: aws-iam-token
projected:
defaultMode: 420
sources:
- serviceAccountToken:
audience: sts.amazonaws.com
expirationSeconds: 86400
path: token
- name: cache
persistentVolumeClaim:
claimName: cache-cluster-1-q5mjg-1-workers-9
- name: spilltodisk
persistentVolumeClaim:
claimName: spilltodisk-cluster-1-q5mjg-1-workers-9
- configMap:
defaultMode: 420
name: cluster-1-q5mjg-1-workers
name: config-volume
- name: catalog-volume
secret:
defaultMode: 420
secretName: cluster-1-catalogs-ee5f0
- configMap:
defaultMode: 420
name: trino-view-overrides
name: view-overrides
- name: dune-event-listener
secret:
defaultMode: 420
secretName: cluster-1-event-listener-7476a
- configMap:
defaultMode: 420
name: cluster-1-resource-group-a800e
name: resource-groups
- name: heapdumps
persistentVolumeClaim:
claimName: trino-heapdumps
- configMap:
defaultMode: 420
name: cluster-1-prometheus-exporter
name: prometheus-exporter-config-volume
- name: kube-api-access-4mxl5
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
Since we added limits to the podspec above (3 days ago), nodes have been solid. I'll report back here later to confirm whether this fixed everything.
I was wondering whether it could be replicated, so I ran a stress-ng pod with request=115Gi and no limit and made it eat up memory... But no, everything behaves well in this case: the pod is OOMKilled once it reaches the threshold as expected, and the node stays healthy.
Thanks for the detailed response @Pluies. It is curious that stress-ng didn't reproduce this but your workload does. Can you clarify how you set limits so that if others run across this issue, they have a possible workaround to try? I'd like to document this since it seems like the memory limit might be a good way to deal with this.
I'm unsure what could cause this issue, but there are a few things we could try to help hone in on what might be the problem. It would be interesting to see if changing priority: 0 or preemptionPolicy: PreemptLowerPriority changes this behavior. I'm also curious whether cgroups v2 has some nuance here; using an earlier k8s variant (1.25 or earlier) or changing back to cgroups v1 might be a useful experiment. One other experiment we might try is using the AL EKS Optimized AMI to see if it reproduces there as well (since they can share the same kernel but differ in a few minor ways). You have already provided a ton of info, so thank you and no pressure to try these out; I'm mostly documenting these as next steps since we still haven't found out why the nodes hang like they do under this memory pressure.
It is curious that stress-ng didn't reproduce this but your workload does. Can you clarify how you set limits so that if others run across this issue, they have a possible workaround to try?
Absolutely, our fix was to add the limit block to our resource declaration in the podspec, i.e.:
resources:
  requests:
    cpu: "30"
    memory: 115Gi
  limits:
    memory: 118Gi
The value of 118Gi was picked so as to be very close to the full amount of allocatable memory on the node, as retrieved with kubectl describe node. It can also be computed manually by taking the node's total memory and subtracting the kube-reserved memory, the system-reserved memory, and the memory eviction threshold.
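For reference, something like this reads that allocatable value straight off a node (using one of our node names as an example):
# Allocatable memory as reported by the kubelet, i.e. capacity minus
# kube-reserved, system-reserved, and the eviction threshold
kubectl get node ip-10-0-56-162.eu-west-1.compute.internal -o jsonpath='{.status.allocatable.memory}'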
I'm unsure what could cause this issue, but there are a few things we could try to help hone in on what might be the problem...
I'm afraid I'm not going to be able to test these here as the issues happen on our prod environments and would directly impact paying customers, but these are good avenues of investigation for whoever wants to pick this up in the future.
Thank you for your help @yeazelm !
We have noticed this happening in one of our clusters in the last week as well. A few observations:
- We use Karpenter.
- The problem nodes have all been running Bottlerocket OS 1.20.3.
- We originally did not suspect any high cpu/memory pressure on these nodes (workloads are variable, but it's a Dev cluster where most things have very little load). However, now that I'm looking deeper into this, I have noticed that many of the pods where this happens have memory pressure and/or SystemOOM events immediately before they crash, so it does appear to be memory related.
- Most (initially we thought all) of the nodes that I've checked so far have been Graviton, typically latest gen (e.g. c7g). We did see this happen on a c5.xlarge too.
- The node will stay running in a NotReady state indefinitely until someone cleans it up manually. It doesn't seem like kubelet is able to recover (i.e. by restarting).
Example node in bad state:
Name: ip-10-2-101-118.us-east-2.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=arm64
beta.kubernetes.io/instance-type=c7g.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2c
k8s.io/cloud-provider-aws=e1616aad6d45db84259aa7890511291f
karpenter.k8s.aws/instance-category=c
karpenter.k8s.aws/instance-cpu=4
karpenter.k8s.aws/instance-cpu-manufacturer=aws
karpenter.k8s.aws/instance-encryption-in-transit-supported=true
karpenter.k8s.aws/instance-family=c7g
karpenter.k8s.aws/instance-generation=7
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-memory=8192
karpenter.k8s.aws/instance-network-bandwidth=1876
karpenter.k8s.aws/instance-size=xlarge
karpenter.sh/capacity-type=spot
karpenter.sh/initialized=true
karpenter.sh/nodepool=default
karpenter.sh/registered=true
kubernetes.io/arch=arm64
kubernetes.io/hostname=ip-10-2-x-x.us-east-2.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=c7g.xlarge
role=application
topology.ebs.csi.aws.com/zone=us-east-2c
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2c
Annotations: alpha.kubernetes.io/provided-node-ip: 10.2.x.x
csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-redacted","efs.csi.aws.com":"i-redacted"}
karpenter.k8s.aws/ec2nodeclass-hash: 13086204471037975526
karpenter.k8s.aws/ec2nodeclass-hash-version: v2
karpenter.sh/nodepool-hash: 647773607843705194
karpenter.sh/nodepool-hash-version: v2
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 02 Jul 2024 17:46:19 -0700
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-10-2-x-x.us-east-2.compute.internal
AcquireTime: <unset>
RenewTime: Wed, 03 Jul 2024 11:28:47 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Wed, 03 Jul 2024 11:24:16 -0700 Wed, 03 Jul 2024 11:25:45 -0700 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 03 Jul 2024 11:24:16 -0700 Wed, 03 Jul 2024 11:25:45 -0700 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 03 Jul 2024 11:24:16 -0700 Wed, 03 Jul 2024 11:25:45 -0700 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 03 Jul 2024 11:24:16 -0700 Wed, 03 Jul 2024 11:25:45 -0700 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 10.2.x.x
InternalDNS: ip-10-2-x-x.us-east-2.compute.internal
Hostname: ip-10-2-x-x.us-east-2.compute.internal
Capacity:
cpu: 4
ephemeral-storage: 81854Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
hugepages-32Mi: 0
hugepages-64Ki: 0
memory: 7947912Ki
pods: 58
Allocatable:
cpu: 3920m
ephemeral-storage: 76173383962
hugepages-1Gi: 0
hugepages-2Mi: 0
hugepages-32Mi: 0
hugepages-64Ki: 0
memory: 6931080Ki
pods: 58
System Info:
Machine ID: ec2xxxx8bb
System UUID: ecxxxxbb
Boot ID: 92xxxx16
Kernel Version: 6.1.92
OS Image: Bottlerocket OS 1.20.3 (aws-k8s-1.29)
Operating System: linux
Architecture: arm64
Container Runtime Version: containerd://1.6.31+bottlerocket
Kubelet Version: v1.29.1-eks-61c0bbb
Kube-Proxy Version: v1.29.1-eks-61c0bbb
ProviderID: aws:///us-east-2c/i-06xxxx
Non-terminated Pods: (24 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
aws-controllers s3-tc-controller-manager-64bddc98cc-fknbs 15m (0%) 1 (25%) 128Mi (1%) 256Mi (3%) 12h
brave-quilt api-redis-master-0 200m (5%) 500m (12%) 192Mi (2%) 640Mi (9%) 3h59m
brave-quilt authentication-service-redis-master-0 200m (5%) 500m (12%) 192Mi (2%) 512Mi (7%) 112m
brave-quilt securities-service-redis-master-0 200m (5%) 500m (12%) 192Mi (2%) 512Mi (7%) 3h59m
bright-wrench dataplatform-airflow-worker-0 200m (5%) 1 (25%) 1088Mi (16%) 2432Mi (35%) 4h39m
calm-gender securities-service-5cdc94bf8c-2hj2g 700m (17%) 750m (19%) 1216Mi (17%) 1792Mi (26%) 17h
exit-karate integrations-service-mysql-0 100m (2%) 500m (12%) 64Mi (0%) 384Mi (5%) 12h
exit-karate trading-api-redis-master-0 200m (5%) 500m (12%) 128Mi (1%) 448Mi (6%) 12h
infra-test securities-service-6c44fc7fc-qlgmg 700m (17%) 750m (19%) 1216Mi (17%) 1792Mi (26%) 12h
jonesgang integrations-service-mysql-0 100m (2%) 500m (12%) 64Mi (0%) 384Mi (5%) 12h
kube-system aws-node-jhbcv 50m (1%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system ebs-csi-node-w5cn2 30m (0%) 0 (0%) 120Mi (1%) 768Mi (11%) 17h
kube-system efs-csi-node-thtt9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system eks-pod-identity-agent-bz5xp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system kube-proxy-wm5rg 100m (2%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system node-local-dns-z5dnh 25m (0%) 0 (0%) 5Mi (0%) 0 (0%) 17h
legal-horsea integrations-service-mysql-0 100m (2%) 500m (12%) 64Mi (0%) 384Mi (5%) 28m
mlaskowski-cc2 shared-zookeeper-2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12h
monitoring aws-for-fluent-bit-52jf5 50m (1%) 0 (0%) 50Mi (0%) 250Mi (3%) 17h
monitoring kube-prometheus-stack-prometheus-node-exporter-876f5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
quirky-laundry shared-zookeeper-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5h41m
sstrozier securities-service-57f9db5bfb-wwgr2 700m (17%) 750m (19%) 1216Mi (17%) 1792Mi (26%) 17h
worst-druid authentication-service-redis-master-0 200m (5%) 500m (12%) 192Mi (2%) 512Mi (7%) 4h39m
worst-druid shared-zookeeper-1 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3870m (98%) 8250m (210%)
memory 6127Mi (90%) 12858Mi (189%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
hugepages-32Mi 0 (0%) 0 (0%)
hugepages-64Ki 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DisruptionBlocked 31m (x7 over 17h) karpenter Cannot disrupt Node: Nominated for a pending pod
Normal Unconsolidatable 25m karpenter Can't remove without creating 2 candidates
Normal DisruptionBlocked 19m karpenter Cannot disrupt Node: PDB "infra-test/securities-service" prevents pod evictions
Normal DisruptionBlocked 15m (x2 over 12h) karpenter Cannot disrupt Node: PDB "mlaskowski-cc2/shared-zookeeper" prevents pod evictions
Normal DisruptionBlocked 13m (x3 over 28m) karpenter Cannot disrupt Node: PDB "worst-druid/shared-zookeeper" prevents pod evictions
Normal NodeNotReady 10m kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeNotReady
Normal NodeHasNoDiskPressure 10m (x3 over 17h) kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientMemory 10m (x3 over 17h) kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasSufficientPID 10m (x3 over 17h) kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeReady 10m (x2 over 17h) kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeReady
Normal DisruptionBlocked 9m28s (x3 over 17m) karpenter Cannot disrupt Node: PDB "sstrozier/securities-service" prevents pod evictions
Normal NodeNotReady 8m55s (x2 over 14m) node-controller Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeNotReady
Normal DisruptionBlocked 5m18s (x2 over 7m23s) karpenter Cannot disrupt Node: PDB "calm-gender/securities-service" prevents pod evictions
We've also seen these symptoms happening today on our cluster for the first time, also running Bottlerocket OS 1.20.3. Based on the container_memory_working_set_bytes analysis for the node before it went down, it seems that there were major step-ups in container memory usage for the kubernetes-nodes-cadvisor job before the node finally ran out of memory.
@jessebye , and @igorbdl ,
Thank you for bringing this issue to our attention. To ensure we're not dealing with multiple issues stemming from the same underlying problem, could you please provide additional details about your test environment? Specifically, it would be helpful to know the versions of EKS and Bottlerocket where your setup was functioning correctly, as well as the versions where you're encountering issues. Additionally, have you attempted the solution suggested in this discussion thread involving setting a resource limit? If so, did it resolve your issue? Also, does your workload involve a memory-intensive JVM process?
@monirul We saw the issue with EKS 1.29 and Bottlerocket 1.20.3. We've only recently switched to using Bottlerocket, so I don't have a lot of data about other versions.
Regarding the resource limit, yes we found that setting a memory limit for the pod resolved the issue. In our case, we determined it was an Argo CD pod that was causing the issue, so not JVM (Argo is Golang).
We are seeing this as well. We opened up an AWS support case (@bcressey you can ping me privately for the ticket number) and uploaded node logs to that case. [Retracted: Rolling back to Bottlerocket 1.19.5 / 1.29 solved the issue. I want to add that we not only saw the containerd_shim deadlocking, but we had significant performance degradation of our workloads on the nodes.]
(spoke too soon.. rolling back to 1.19.5/1.30 did not fix the issue.. still troubleshooting)
Update: Our issue stemmed not from the Bottlerocket version, but from hammering the local root volume on very old EC2 instances. Our compute management partner mistakenly launched r5b.24xl instances to satisfy a workload, and that workload was writing extremely high amounts of I/O to the local ephemeral storage in our pods.
The interesting thing here is that only these r5b instances exhibited the problems with kubelet becoming unresponsive and containerd_shim reporting a deadlock. Other instances were slow, but they worked and didn't fundamentally crash.
I've been able to reproduce the failure mode shared by both @diranged and @Pluies, and I've written up those reproduction steps in this gist: https://gist.github.com/ginglis13/bf37a32d0f2a70b4ac5d9b9e5a960278
tl;dr: overwhelm a large node (like m7g.8xlarge or r5b.24xlarge) with memory pressure and high disk I/O, and you'll start seeing dmesg logs like:
INFO: task FOO blocked for more than X seconds.
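As a rough illustration only (not the exact steps from the gist), this kind of combined memory and disk pressure can be generated with something like stress-ng; the worker counts and sizes here are arbitrary and would need tuning per instance type:
# Drive memory pressure and heavy disk I/O at the same time
stress-ng --vm 8 --vm-bytes 95% --vm-keep \
          --hdd 8 --hdd-bytes 8G \
          --timeout 30m --metrics-brief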
Important to note, I've run the same reproduction on the latest Amazon Linux 2 EKS 1.30 Optimized AMI (ami-0cfd96d646e5535a8) and observed the same failure mode and node state.
I'm going to start bisecting the problem by running this repro across a few Bottlerocket variants to see if there are changes that landed in a specific Bottlerocket version that are causing this, especially in the kernels / container runtimes.
I've found a few blog posts on this specific error message from the kernel:
- https://vivani.net/2020/02/08/task-blocked-for-more-than-120-seconds/
- https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
These suggest that the settings vm.dirty_ratio and vm.dirty_background_ratio could be tuned to be more aggressive in flushing data to disk in order to avoid this problem. Bottlerocket already sets more aggressive values for these settings than their defaults:
bash-5.1# sysctl -n vm.dirty_background_ratio
10
bash-5.1# sysctl -n vm.dirty_ratio
20
I'm working to run the repro with these settings set to even more aggressive values to see if that could be another workaround, other than the pod resource request limits suggested by @Pluies .
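For anyone experimenting in parallel, the interactive form of that tuning is just the following (example values only; on Bottlerocket these would normally be set persistently via the settings.kernel.sysctl section of user data rather than by hand):
# Start background writeback earlier and throttle writers sooner (example values)
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10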
@jessebye @igorbdl Any more specific logs or details you could share about your respective encounters with this issue would be much appreciated in tracking this down. I'm particularly interested in:
- Do you see "INFO: task FOO blocked for more than X seconds." in dmesg on your nodes too? I want to make sure you're running into the same class of problem (which it sounds like is the case from what you've shared so far, but this log message is the current most common denominator).
- Are your workloads I/O intensive?
- Specifically for @igorbdl, what was the last known version of Bottlerocket that was working for you?
Hey @ginglis13 :wave: Great news that you were able to replicate it!
I've since adjusted job scheduling and memory limits to avoid this problem, so I haven't seen it happening anymore, and I didn't collect node logs at the time, so unfortunately I won't be able to help with more details on our occurrences. I'm sorry about that.
It seems we started having this issue after we upgraded from 1.20.2 to 1.20.3, but that sounds odd since other users have reported this issue with 1.20.2, so I could be wrong on this one.
We probably had node-level system OOMs on our end for quite a while, but the nodes would be replaced as this happened. The symptom that led me to chime in on this issue was that the affected nodes would not be flushed by EKS: they would remain in the cluster in NotReady status indefinitely, and I'd have to manually delete the respective EC2 instance and delete the node from the cluster to clear things up. This issue also only seemed to happen on nodes that had pods with PVCs bound.
The kernel message INFO: task FOO blocked for more than XXX seconds means:
- The task (process) is in state D, also known as uninterruptible. Some types of iowait will put a process into this state.
- Some nominally uninterruptible processes can be killed, as long as the signal is fatal and not intercepted by a user-space signal handler. The hung task detector ignores these processes.
- The task has not had a context switch in the last XXX seconds.
Since something that hangs for two minutes will quite likely hang forever, the kernel can optionally panic and reboot when the hung task detector fires. If this is a desirable outcome, see these sysctl options:
- kernel.hung_task_panic=1: the hung task detector will trigger a panic.
- kernel.panic=1: a panic will trigger a reboot after one second.
If your applications are more robust when a node reboots than when a node hangs with unresponsive storage, this may help.
If one wants additional logging:
- Setting the kernel parameter kernel.hung_task_all_cpu_backtrace=1 will add a backtrace-per-CPU to the logging from the hung task detector.
- Setting kernel.hung_task_panic=1 calls debug_show_all_locks() before the call to panic().
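Put together, the interactive form of those settings looks like this (on Bottlerocket they would normally be set persistently via settings.kernel.sysctl in user data so they survive reboots):
sysctl -w kernel.hung_task_panic=1              # panic when the hung task detector fires
sysctl -w kernel.panic=1                        # reboot one second after a panic
sysctl -w kernel.hung_task_all_cpu_backtrace=1  # optional: per-CPU backtraces in the hung task report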
Hello all:
We're also affected by this issue. We've recently adopted Bottlerocket in a subset of clusters, and have seen at least 5 instances of this failure.
The nodes it has affected have been running:
- Kubelet Version: v1.28.10-eks-890c2ac
- OS Image: Bottlerocket OS 1.21.1 (aws-k8s-1.28)
- Kernel Version: 6.1.102
- Container Runtime Version: containerd://1.7.20+bottlerocket
Since we've only recently adopted Bottlerocket, we don't have data to compare the frequency of this issue between different versions of Bottlerocket. However, I don't yet have reason to believe this is a Bottlerocket-specific problem. We've seen node lockups with similar symptoms for a few years now (at least since 2021), through many versions of k8s and the Kernel, across instance types, across architectures, on AL 2 nodes. I recall that we've seen this affect AL2023 too, but I am less certain of the details for AL2023. For AL2, we previously opened support case 171149566301465, but it did not progress.
For both AL 2 and Bottlerocket, the manifestation has been similar:
- There is memory pressure on the node
- Cache memory is reclaimed
- Cache memory hits some critically low value (which varies)
- EBS disk IO skyrockets, the node locks up, and the control plane begins reporting NotReady
In most cases the node does not recover and must be terminated. In a few cases the node does recover (in the sense that it is marked Ready and becomes responsive again), but when this happens the node is often broken in some variable and hard-to-discover/hard-to-diagnose way - for example, kubelet and the container runtime will lose sync, a subset of network connections will fail, or containerized application processes will be in some unexpected state.
My theory is that the OOMKiller is essentially not kicking in fast enough when there's memory pressure, and that the extreme reclaim of cache memory leads to cache misses that force lots of disk IO to hit EBS. The EBS IOPS limit is hit, and then the system stalls on iowait. Eventually the condition improves (if/when the OOMKiller kicks in), but the extended stall and potentially subsequent system OOMKills break kubelet and other services, leading to node failure.
Here are some metric visualizations of the problem, this one for a Bottlerocket instance. The AL 2 instance visualizations look similar, although they only have a single EBS volume.
Let me know if there are other metrics I can provide or experiments I can run to help get to the bottom of this issue. It'd be such a relief to not have to deal with the consequences of this recurring issue.
Chiming in to say I've also experienced this same thing. Bottlerocket with EKS (1.28) and Karpenter (latest). Node starts to experience memory pressure and then enters a NotReady state and stays there indefinitely until the underlying EC2 is terminated.
I've been able to repro with test pods running stress, but not consistently. Trying to figure out a way to consistently reproduce the issue
I am facing the same issue here. The setup I have is:
EC2 instance type: t3.medium
OS Image: Bottlerocket OS 1.19.1 (aws-k8s-1.29)
Kernel Version: 6.1.72
Container Runtime: containerd://1.6.28+bottlerocket
The problem workload is Prometheus - so this is not an issue with only JVM. The behavior is very similar to what @JacobHenner has described. In my case, if I give "enough" time(many hours) to the node, it comes back up without the need for a reboot.
Hi there, for information and context: have you tried the above suggestion of adjusting the resource specifications on your Kubernetes pods? On a t3.medium you will need to restrict memory to ideally below the instance's 4GB, so something like:
resources:
  requests:
    memory: 3Gi
  limits:
    memory: 3.4Gi
You can play around with different values here. By doing this, you have Kubernetes restrict memory usage, which helps reduce the impact of this issue. We are still trying to look into the exact cause to see what we can do on our side to help address it.
have you tried the above suggestion of adjusting the resource specifications on your Kubernetes pods? [...]
Unfortunately it's not this simple for us. Our clusters host a variety of workloads, and these workloads are managed by a variety of teams. Application owners are told to set their memory requests and limits appropriately, but for a range of reasons this doesn't always happen. In the case where an application is exceeding its memory request and the node encounters memory pressure, we'd expect that application to be OOMKilled or node-pressure evicted because it is misconfigured. We would not expect other properly-configured applications to suffer because memory pressure has caused the entire node to lock up.
So we wonder - why aren't pods being killed quickly enough under node memory pressure conditions to prevent this issue from occurring? How can we rely on the Kubernetes requests/limits model if the potential consequences of exceeding a request aren't limited to the offending container, but can instead affect the entire node the container is running on? Team multitenancy seems untenable under these conditions.
Thanks @jmt-lab for the response. Even though configuring limits reduces the number of crashes, they still happen as our workloads grow in number and capacity. I will keep following this thread for the real fix. Let me know if you need any more context that might help your team with the fix. All the best!
@JacobHenner @geddah @James-Quigley thank you all for the additional data as we work through a root cause for this issue. One commonality between your problematic variants (and those of previous commenters on this issue) is that the version of containerd in the variants reported here has been susceptible to https://github.com/containerd/containerd/issues/10589. The implication of this issue, as I understand it, is that OOM-killed containers may still appear to containerd as running; this seems to be the inverse of our problem since processes in this bug report are getting OOM-killed, but the fact that this is a bug with the OOM killer makes me suspicious nonetheless.
Bottlerocket v1.23.0 bumped containerd to v1.7.22 with a patch for this bug via bottlerocket-core-kit v2.5.0, which includes https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/143. Would you be willing to upgrade your Bottlerocket nodes to v1.23.0+ and report back with any additional findings?
I can confirm that this happened to me this morning on a node running Bottlerocket OS 1.23.0 (aws-k8s-1.28)
Here's a screenshot of the memory metrics for the node/pods at that time.
Same experience here with EC2 instances becoming unresponsive.
config:
- latest bottlerocket 1.24.1
- flavor: ecs-2 x86_64
- only using t*.medium burstable instances (dev environment)
- memory soft limit on ecs task
- ebs gp2 100 IOPS
I never got the issue on my production environment, same config except ec2 types are c/m * xlarge
For now my current test is to move my ebs config to gp3 for the 3000 IOPS. I will see if I continue to see the problem.
Repro
I was able to reproduce the behavior by running a custom program (eatmem) that aggressively consumes all the host memory gradually (400MB per second). The eatmem program - https://gist.github.com/ytsssun/b264c49cdfb8959b6d50b07c9cf4cb32.
I ran a process to monitor the memory usage:
while true; do free -m | grep Mem | while read -r _ total used _; do percent=$((used * 100 / total)); printf "Memory Usage: %s / %s (%d%%)\n" "${used}M" "${total}M" "$percent"; done; sleep 1; done
The node became unresponsive after the memory consumed hits 100% and it never came back. kubectl shows node status as NotReady and stuck in that state.
Attempt at mitigation
I tried deploying the userspace oom-killer as a DaemonSet in my cluster. It simply triggers the kernel oom-killer manually, by writing f to /host_proc/sysrq-trigger, when it detects that available memory is below a certain threshold.
After that, I ran the eatmem program on a healthy node; it froze for about 8 minutes and then came back. I had to adjust the MIN_MEMORY (to, say, 1000MB) and the interval for memory usage detection (to 1s) for my use case. You can see my adjusted oom-killer.
I recommend giving this lightweight oom-killer a try to see if it helps with the issue. The oom-killer will try to kill processes based on the oom_score_adj computed for them; the process with the highest oom_score_adj gets killed first.
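The core of that DaemonSet is essentially a small shell loop like the following (a rough sketch, not the exact script from the gist; it assumes the host's /proc is mounted into the container at /host_proc):
#!/bin/sh
# Trigger the kernel OOM killer whenever available memory drops below a threshold.
MIN_MEMORY_MB=1000   # threshold I ended up using; tune per node size
INTERVAL_SECONDS=1   # detection interval I ended up using
while true; do
  avail_mb=$(awk '/MemAvailable/ {print int($2/1024)}' /host_proc/meminfo)
  if [ "$avail_mb" -lt "$MIN_MEMORY_MB" ]; then
    echo f > /host_proc/sysrq-trigger   # 'f' asks the kernel to run the OOM killer once
  fi
  sleep "$INTERVAL_SECONDS"
done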
oom-killer log analysis
Below are more details from analyzing the dmesg log.
- Sysrq manual trigger invoked at timestamp 239.457726.
[ 239.457726] sysrq: Manual OOM execution
- Initially, the oom-killer killed a process with an oom_score_adj of 1000.
[ 239.458525] Out of memory: Killed process 5773 (mpi-operator) total-vm:1871132kB, anon-rss:11272kB, file-rss:36820kB, shmem-rss:0kB, UID:0 pgtables:288kB oom_score_adj:1000
- Subsequent oom-killing events killed a few other processes that also had high oom_score_adj values.
[ 526.448344] Out of memory: Killed process 6475 (free) total-vm:23112kB, anon-rss:128kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:1000
[ 542.664490] Out of memory: Killed process 6473 (sh) total-vm:4508kB, anon-rss:108kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:1000
[ 542.850447] Out of memory: Killed process 2899 (sh) total-vm:4508kB, anon-rss:104kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:1000
[ 542.970388] Out of memory: Killed process 5611 (coredns) total-vm:1283516kB, anon-rss:9900kB, file-rss:0kB, shmem-rss:0kB, UID:65532 pgtables:216kB oom_score_adj:996
[ 542.971686] Out of memory: Killed process 5644 (coredns) total-vm:1283516kB, anon-rss:9900kB, file-rss:0kB, shmem-rss:0kB, UID:65532 pgtables:216kB oom_score_adj:996
[ 545.663240] Out of memory: Killed process 2357 (amzn-guardduty-) total-vm:929236kB, anon-rss:35652kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:248kB oom_score_adj:984
- The oom-killer found the eatmem process and killed it.
[ 719.286731] Out of memory: Killed process 5978 (eatmem) total-vm:17801320kB, anon-rss:15257072kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:29996kB oom_score_adj:1
- The system shows recovery:
[ 731.291144] IPv6: ADDRCONF(NETDEV_CHANGE): eni7e37f784474: link becomes ready
[ 731.291178] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 734.740979] pci 0000:00:06.0: [1d0f:ec20] type 00 class 0x020000
[ 734.741076] pci 0000:00:06.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 734.741148] pci 0000:00:06.0: reg 0x18: [mem 0x00000000-0x000fffff pref]
[ 734.741323] pci 0000:00:06.0: enabling Extended Tags
[ 734.742062] pci 0000:00:06.0: BAR 2: assigned [mem 0xc1600000-0xc16fffff pref]
[ 734.742085] pci 0000:00:06.0: BAR 0: assigned [mem 0xc1514000-0xc1517fff]
[ 734.742146] ena 0000:00:06.0: enabling device (0000 -> 0002)
[ 734.749973] ena 0000:00:06.0: ENA device version: 0.10
[ 734.749977] ena 0000:00:06.0: ENA controller version: 0.0.1 implementation version 1
[ 734.849623] ena 0000:00:06.0: ENA Large LLQ is disabled
[ 734.861392] ena 0000:00:06.0: Elastic Network Adapter (ENA) found at mem c1514000, mac addr 02:82:87:02:9b:6d
[ 734.926653] ena 0000:00:06.0 eth1: Local page cache is disabled for less than 16 channels
Time between SysRq trigger and initial recovery: 731.291144 - 239.457726 = 491.833418 seconds ≈ 8 minutes 11 seconds
@ytsssun for testing purposes the "lightweight oom-killer" might be OK, but it has some notable defects:
- it will invoke the OOM killer every second if the system is below 1 GiB free memory; this could happen constantly on nodes that are otherwise not seeing a runaway process consuming memory
- it seems like the script itself may be getting OOM-killed; the kill of free is suspicious
I don't think this makes sense as the basis for a production solution. I'd look at running something like nohang or earlyoom in a daemonset instead.
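As a sketch of that alternative (flags per earlyoom's documentation; packaging it into a DaemonSet or host container is left out here):
# Act when available memory drops below 5%, printing a memory report every 60s;
# earlyoom sends SIGTERM (then SIGKILL) to the process with the highest oom_score
earlyoom -m 5 -r 60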
@bcressey is right, this simple oom-killer is for testing purposes. I am evaluating nohang and will provide my updates once done.