Unresponsive/unreachable Bottlerocket EKS nodes
Hey folks,
Coming to you with an odd issue with Bottlerocket EKS nodes becoming unresponsive and unreachable, but still happily running as seen by EC2.
Image I'm using:
ami-089e696e7c541c61b (amazon/bottlerocket-aws-k8s-1.29-aarch64-v1.20.2-536d69d0)
System Info:
Kernel Version: 6.1.90
OS Image: Bottlerocket OS 1.20.2 (aws-k8s-1.29)
Operating System: linux
Architecture: arm64
Container Runtime Version: containerd://1.6.31+bottlerocket
Kubelet Version: v1.29.1-eks-61c0bbb
Kube-Proxy Version: v1.29.1-eks-61c0bbb
We use Karpenter as a node provisioner.
What I expected to happen:
Node runs smoothly.
What actually happened:
A large node (m7g.8xlarge) is running a memory-heavy JVM workload from a StatefulSet that takes up most of the machine (request=115Gi out of 128Gi).
Node runs happily for a while (~ 24h to several days), until it suddenly stops reporting status to Kubernetes:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Tue, 25 Jun 2024 15:18:44 +0200 Tue, 25 Jun 2024 15:20:38 +0200 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Tue, 25 Jun 2024 15:18:44 +0200 Tue, 25 Jun 2024 15:20:38 +0200 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Tue, 25 Jun 2024 15:18:44 +0200 Tue, 25 Jun 2024 15:20:38 +0200 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Tue, 25 Jun 2024 15:18:44 +0200 Tue, 25 Jun 2024 15:20:38 +0200 NodeStatusUnknown Kubelet stopped posting node status.
Kubernetes marks that node as NotReady and taints it with node.kubernetes.io/unreachable, then tries to delete the pod running our large workload. That doesn't work (the node is fully unresponsive), so that pod is stuck in status Terminating.
The node is unreachable via kubectl debug node ... (which times out), or via AWS SSM (which complains of "SSM Agent is not online", "Ping status: Connection lost"). The EC2 Serial Console is empty (all black). We do not have SSH enabled on these machines.
However, the node still appears as running from the EC2 console, reachability metrics are green, and we can see CPU / network metrics flowing in.
I've managed to get system logs by taking an EBS snapshot of the Bottlerocket data volume and restoring it to a separate volume for investigation. This was not helpful, unfortunately: logs appear normal until the time the node dies, then suddenly stop. There is no indication that anything in particular (kubelet, containerd, etc.) crashed and brought the instance down, but also, suspiciously, no logs at all from the moment the node went unresponsive.
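For anyone wanting to do something similar, a sketch of the general approach is below (the volume, snapshot, and instance IDs are placeholders, and the exact device name and journal path may differ on your setup):
# Snapshot the dead node's Bottlerocket data volume, restore it, and attach it to a healthy helper instance
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "debug: dead node data volume"
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone eu-west-1b
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/sdf
# On the helper instance: mount the restored volume and read its journal
mount /dev/nvme1n1p1 /mnt/br-data                             # partition layout may differ
journalctl --directory /mnt/br-data/var/log/journal --no-pager  # journal location on the data volume may differ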
How to reproduce the problem:
No clear way to reproduce, unfortunately; we've seen this happen sporadically on maybe half a dozen instances over the past few weeks, out of several hundred.
This issue is very annoying: I don't mind pods crashing and/or getting evicted sometimes, or even kubelet/containerd crashing, but I'd expect the node to self-heal eventually. Instead, this causes our workloads to get stuck, and we have to manually delete the pod and/or the node to get things back to normal. Even worse, I can't see a way to debug it properly or get to the bottom of it.
Would you have any idea of a better way to debug this?
Thank you!
(Note: I also opened an AWS support ticket to see if there's any AWS-level issue at play here, but this seems to happen only on this specific workload on Bottlerocket nodes, so I suspect something is off here)
Thanks for opening this issue @Pluies. This does look concerning and we would like to get to the bottom of this! I'll look internally for that support ticket to see if we can find out what is happening.
Okay, I've got good news - after 13 days of being in this state, one node came back to life! Well, kind of. We received a NodeClockNotSynchronising alert, and realised the prometheus node exporter started responding to scrapes again (???).
Apart from that, no change. The node is still showing as NotReady and unreachable, not reachable via kubectl debug or AWS SSM, etc.
I've taken a new snapshot of the bottlerocket data volume, and I'm seeing new logs in the journal though; attaching it here. journal-after-rebirth.log
I think this has the meat of it (kernel messages + OOM kills).
There are some more messages later showing kubelet errors; I assume the node is so far gone at this stage that it's unable to refresh its credentials and rejoin the cluster.
Those logs do help. It looks like this could be related to EBS availability, since the hung tasks involve the storage driver:
INFO: task jbd2/nvme2n1-8:6604 blocked for more than 122 seconds.
kernel: fbcon: Taking over console
kernel: Not tainted 6.1.90 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:jbd2/nvme2n1-8 state:D stack:0 pid:6604 ppid:2 flags:0x00000008
kernel: Call trace:
kernel: __switch_to+0xbc/0xf0
kernel: __schedule+0x258/0x6e0
kernel: schedule+0x58/0xf0
kernel: jbd2_journal_wait_updates+0x88/0x130
kernel: jbd2_journal_commit_transaction+0x264/0x1bc0
kernel: kjournald2+0x15c/0x470
kernel: kthread+0xd0/0xe0
Can you confirm how many instances this happened to and what storage type you are using for your volumes? I'll keep digging in as best I can as well!
Sure, storage-wise the machine has 4 EBS volumes attached:
- For Bottlerocket: two standard EBS volumes (50GB for the root volume, 100GB for the data volume)
- For our workload: two large PVCs provisioned by the AWS EBS CSI driver. Both are gp3; one is formatted as ext4 (2000GB), the other as xfs (500GB)
- Additionally, the workload mounts an EFS volume, and it's configured to write a Java heap dump to this volume in case of a Java OOM
To keep everyone in sync, AWS Support seems to think this is due to memory pressure. The large workload has no memory limit, so this is definitely a possibility. We've been working off the assumption that if it does use too much memory, we would still be protected by the OOMKiller and other limits such as kube-reserved and system-reserved memory, but I realise this assumption might be wrong.
The node exporter metrics also point to memory pressure:
# HELP node_pressure_cpu_waiting_seconds_total Total time in seconds that processes have waited for CPU time
# TYPE node_pressure_cpu_waiting_seconds_total counter
node_pressure_cpu_waiting_seconds_total 683567.871622
# HELP node_pressure_io_stalled_seconds_total Total time in seconds no process could make progress due to IO congestion
# TYPE node_pressure_io_stalled_seconds_total counter
node_pressure_io_stalled_seconds_total 3823.9046439999997
# HELP node_pressure_io_waiting_seconds_total Total time in seconds that processes have waited due to IO congestion
# TYPE node_pressure_io_waiting_seconds_total counter
node_pressure_io_waiting_seconds_total 18942.701305
# HELP node_pressure_memory_stalled_seconds_total Total time in seconds no process could make progress due to memory congestion
# TYPE node_pressure_memory_stalled_seconds_total counter
node_pressure_memory_stalled_seconds_total 1.141417592027e+06
# HELP node_pressure_memory_waiting_seconds_total Total time in seconds that processes have waited for memory
# TYPE node_pressure_memory_waiting_seconds_total counter
node_pressure_memory_waiting_seconds_total 1.141678870877e+06
The memory-related stall of 1.141678870877e+06 seconds works out to just over 13 days (1,141,679 s ÷ 86,400 s/day ≈ 13.2 days), which lines up well with how long the node was dead for. This might just be a coincidence, if this is shared by all processes, but it's interesting nonetheless.
On another node (not dead) with the same workload, these metrics come up as:
# HELP node_pressure_cpu_waiting_seconds_total Total time in seconds that processes have waited for CPU time
# TYPE node_pressure_cpu_waiting_seconds_total counter
node_pressure_cpu_waiting_seconds_total 10647.531140000001
# HELP node_pressure_io_stalled_seconds_total Total time in seconds no process could make progress due to IO congestion
# TYPE node_pressure_io_stalled_seconds_total counter
node_pressure_io_stalled_seconds_total 5830.272114
# HELP node_pressure_io_waiting_seconds_total Total time in seconds that processes have waited due to IO congestion
# TYPE node_pressure_io_waiting_seconds_total counter
node_pressure_io_waiting_seconds_total 10842.729424000001
# HELP node_pressure_memory_stalled_seconds_total Total time in seconds no process could make progress due to memory congestion
# TYPE node_pressure_memory_stalled_seconds_total counter
node_pressure_memory_stalled_seconds_total 4560.770407999999
# HELP node_pressure_memory_waiting_seconds_total Total time in seconds that processes have waited for memory
# TYPE node_pressure_memory_waiting_seconds_total counter
node_pressure_memory_waiting_seconds_total 4941.830446999999
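As an aside, these node_pressure_* metrics are node exporter's view of the kernel's PSI (Pressure Stall Information) counters, so the same data can also be read directly on a node:
# Raw pressure stall information, the source of the node_pressure_* metrics above
cat /proc/pressure/memory
cat /proc/pressure/io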
Anyway, I'm now adding a memory limit to the large workload, so if this is memory-related it should fix the issue for us.
If this is the case though, is it still potentially a bottlerocket issue? I'd have expected either a full crash or eventual self-healing, but we ended up in this half-broken state instead. On the other hand, it would also be reasonable for Bottlerocket to not support unbounded workloads.
For Bottlerocket: two standard EBS volumes (50GB for the root volume, 100GB for the data volume)
One thing I want to call out here before we lose it in the rest: I don't think you are benefiting from making the root volume 50GB. By default, Bottlerocket won't expand to use anything more than the default size (typically 2GB) of the root volume. Everything that might need variable storage space should end up on the second volume. I'd like to double check if you could save on some EBS costs by avoiding the additional 48GB of unused space.
AWS Support seems to think this is due to memory pressure.
I do think the memory pressure is causing some of this problem, but I don't think it should result in the node ending up in this state. The fact that the rest of the host is starved, even the storage layer, feels off at first glance. I'd like to see if memory limit helps you avoid this but I still think there is something odd going on here. The OOM killer should be able to recover and protect the rest of the OS from situations like this. Do you have any special configuration for this workload like setting additional privileges or capabilities?
I don't think you are benefiting from making the root volume 50GB.
Thank you for that! I'll bring it down.
Do you have any special configuration for this workload like setting additional privileges or capabilities?
None in particular, here's the podspec:
apiVersion: v1
kind: Pod
metadata:
annotations:
dune.com/tagger: dune:subproject=cluster-1
creationTimestamp: "2024-06-28T14:15:40Z"
generateName: cluster-1-q5mjg-1-workers-
labels:
app.kubernetes.io/created-by: trino-operator
app.kubernetes.io/instance: cluster-1-q5mjg-1
app.kubernetes.io/name: trinocluster
app.kubernetes.io/part-of: trino-operator
app.kubernetes.io/role: worker
apps.kubernetes.io/pod-index: "9"
controller-revision-hash: cluster-1-q5mjg-1-workers-5d877d47f9
dune.com/clusterset: cluster-1
statefulset.kubernetes.io/pod-name: cluster-1-q5mjg-1-workers-9
name: cluster-1-q5mjg-1-workers-9
namespace: query-engine
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: StatefulSet
name: cluster-1-q5mjg-1-workers
uid: b8b5ba7c-4457-4216-9aa7-31613391ab38
resourceVersion: "2297524264"
uid: f5f3ec89-5cdf-4eca-9bd1-d130bd36b1a2
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: dunetech.io/nodegroup
operator: In
values:
- trino
- key: node.kubernetes.io/instance-type
operator: In
values:
- m7g.8xlarge
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
- key: topology.kubernetes.io/zone
operator: In
values:
- eu-west-1b
- matchExpressions:
- key: dunetech.io/nodegroup
operator: In
values:
- trino
- key: node.kubernetes.io/instance-type
operator: In
values:
- m6g.8xlarge
- key: karpenter.sh/capacity-type
operator: In
values:
- on-demand
- key: topology.kubernetes.io/zone
operator: In
values:
- eu-west-1b
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app.kubernetes.io/created-by: trino-operator
app.kubernetes.io/instance: cluster-1-q5mjg-1
app.kubernetes.io/name: trinocluster
app.kubernetes.io/part-of: trino-operator
app.kubernetes.io/role: worker
topologyKey: kubernetes.io/hostname
containers:
- env:
- name: AWS_STS_REGIONAL_ENDPOINTS
value: regional
- name: AWS_DEFAULT_REGION
value: eu-west-1
- name: AWS_REGION
value: eu-west-1
- name: AWS_ROLE_ARN
value: arn:aws:iam::aws-account-id-redacted:role/prod_eks_trino
- name: AWS_WEB_IDENTITY_TOKEN_FILE
value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
image: aws-account-id-redacted.dkr.ecr.eu-west-1.amazonaws.com/trino:2024-06-28T08-59-33-main-57ed4c8
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 20
httpGet:
path: /v1/info
port: http
scheme: HTTP
initialDelaySeconds: 20
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 60
name: trino
ports:
- containerPort: 8080
name: http
protocol: TCP
- containerPort: 9000
name: metrics
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /v1/info
port: http
scheme: HTTP
initialDelaySeconds: 5
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 5
resources:
requests:
cpu: "30"
memory: 115Gi
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/trino
name: config-volume
- mountPath: /etc/trino/catalog
name: catalog-volume
- mountPath: /data/view-overrides
name: view-overrides
readOnly: true
- mountPath: /etc/trino-event-listeners/event-listener.properties
name: dune-event-listener
readOnly: true
subPath: event-listener.properties
- mountPath: /cache
name: cache
- mountPath: /etc/trino-resource-groups/resource-groups.json
name: resource-groups
readOnly: true
subPath: resource-groups.json
- mountPath: /heapdumps/
name: heapdumps
- mountPath: /var/lib/trino/prometheus-exporter
name: prometheus-exporter-config-volume
- mountPath: /var/lib/trino/spill
name: spilltodisk
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: kube-api-access-4mxl5
readOnly: true
- mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
name: aws-iam-token
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostname: cluster-1-q5mjg-1-workers-9
nodeName: ip-10-0-56-162.eu-west-1.compute.internal
preemptionPolicy: PreemptLowerPriority
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 1000
runAsGroup: 1000
runAsUser: 1000
serviceAccount: trino
serviceAccountName: trino
subdomain: cluster-1-q5mjg-1-workers
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoSchedule
key: dunetech.io/nodegroup
operator: Equal
value: trino
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
tolerationSeconds: 300
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
tolerationSeconds: 300
volumes:
- name: aws-iam-token
projected:
defaultMode: 420
sources:
- serviceAccountToken:
audience: sts.amazonaws.com
expirationSeconds: 86400
path: token
- name: cache
persistentVolumeClaim:
claimName: cache-cluster-1-q5mjg-1-workers-9
- name: spilltodisk
persistentVolumeClaim:
claimName: spilltodisk-cluster-1-q5mjg-1-workers-9
- configMap:
defaultMode: 420
name: cluster-1-q5mjg-1-workers
name: config-volume
- name: catalog-volume
secret:
defaultMode: 420
secretName: cluster-1-catalogs-ee5f0
- configMap:
defaultMode: 420
name: trino-view-overrides
name: view-overrides
- name: dune-event-listener
secret:
defaultMode: 420
secretName: cluster-1-event-listener-7476a
- configMap:
defaultMode: 420
name: cluster-1-resource-group-a800e
name: resource-groups
- name: heapdumps
persistentVolumeClaim:
claimName: trino-heapdumps
- configMap:
defaultMode: 420
name: cluster-1-prometheus-exporter
name: prometheus-exporter-config-volume
- name: kube-api-access-4mxl5
projected:
defaultMode: 420
sources:
- serviceAccountToken:
expirationSeconds: 3607
path: token
- configMap:
items:
- key: ca.crt
path: ca.crt
name: kube-root-ca.crt
- downwardAPI:
items:
- fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
path: namespace
Since we added limits to the podspec above (3 days ago), nodes have been solid. I'll report back here later to confirm whether this fixed everything.
I was wondering whether it could be replicated, so I ran a stress-ng pod with request=115Gi and no limit and made it eat up memory... But no, everything behaves well in this case: the pod is OOMKilled once it reaches the threshold as expected, and the node stays healthy.
Thanks for the detailed response @Pluies. It is curious that stress-ng didn't reproduce this but your workload does. Can you clarify how you set limits so that if others run across this issue, they have a possible workaround to try? I'd like to document this since it seems like the memory limit might be a good way to deal with this.
I'm unsure what could cause this issue, but there are a few things we could try to help hone in on what might be the problem. It would be interesting to see if changing priority: 0 or preemptionPolicy: PreemptLowerPriority changes this behavior. I'm also curious whether cgroups v2 has some nuance here; using an earlier k8s variant (1.25 or earlier) or changing back to cgroups v1 might be a useful experiment. One other experiment we might try is using the AL EKS Optimized AMI to see if it reproduces there as well (since they can share the same kernel but differ in a few minor ways). You have already provided a ton of info, so thank you and no pressure to try these out; I'm mostly documenting these as next steps since we still haven't found out why the nodes hang like they do under this memory pressure.
It is curious that stress-ng didn't reproduce this but your workload does. Can you clarify how you set limits so that if others run across this issue, they have a possible workaround to try?
Absolutely, our fix was to add the limit block to our resource declaration in the podspec, i.e.:
resources:
  requests:
    cpu: "30"
    memory: 115Gi
  limits:
    memory: 118Gi
The value of 118Gi was picked so as to be very close to the full amount of allocatable memory on the node, as retrieved with kubectl describe node. It can also be computed manually by taking the node's total memory and subtracting the kube-reserved memory, the system-reserved memory, and the memory eviction threshold.
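For reference, something like this reads that allocatable value straight off a node (using one of our node names as an example):
# Allocatable memory as reported by the kubelet, i.e. capacity minus
# kube-reserved, system-reserved, and the eviction threshold
kubectl get node ip-10-0-56-162.eu-west-1.compute.internal -o jsonpath='{.status.allocatable.memory}'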
I'm unsure what could cause this issue, but there are a few things we could try to help hone in on what might be the problem...
I'm afraid I'm not going to be able to test these here as the issues happen on our prod environments and would directly impact paying customers, but these are good avenues of investigation for whoever wants to pick this up in the future.
Thank you for your help @yeazelm !
We have noticed this happening in one of our clusters in the last week as well. A few observations:
- We use Karpenter.
- The problem nodes have all been running Bottlerocket OS 1.20.3.
- We originally did not suspect any high cpu/memory pressure on these nodes (workloads are variable, but it's a Dev cluster where most things have very little load). However, now that I'm looking deeper into this, I have noticed that many of the pods where this happens have memory pressure and/or SystemOOM events immediately before they crash, so it does appear to be memory related.
- Most (initially we thought all) of the nodes that I've checked so far have been Graviton, typically latest gen (e.g. c7g). We did see this happen on a c5.xlarge too.
- The node will stay running in a NotReady state indefinitely until someone cleans it up manually. It doesn't seem like kubelet is able to recover (i.e. by restarting).
Example node in bad state:
Name: ip-10-2-101-118.us-east-2.compute.internal
Roles: <none>
Labels: beta.kubernetes.io/arch=arm64
beta.kubernetes.io/instance-type=c7g.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2c
k8s.io/cloud-provider-aws=e1616aad6d45db84259aa7890511291f
karpenter.k8s.aws/instance-category=c
karpenter.k8s.aws/instance-cpu=4
karpenter.k8s.aws/instance-cpu-manufacturer=aws
karpenter.k8s.aws/instance-encryption-in-transit-supported=true
karpenter.k8s.aws/instance-family=c7g
karpenter.k8s.aws/instance-generation=7
karpenter.k8s.aws/instance-hypervisor=nitro
karpenter.k8s.aws/instance-memory=8192
karpenter.k8s.aws/instance-network-bandwidth=1876
karpenter.k8s.aws/instance-size=xlarge
karpenter.sh/capacity-type=spot
karpenter.sh/initialized=true
karpenter.sh/nodepool=default
karpenter.sh/registered=true
kubernetes.io/arch=arm64
kubernetes.io/hostname=ip-10-2-x-x.us-east-2.compute.internal
kubernetes.io/os=linux
node.kubernetes.io/instance-type=c7g.xlarge
role=application
topology.ebs.csi.aws.com/zone=us-east-2c
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2c
Annotations: alpha.kubernetes.io/provided-node-ip: 10.2.x.x
csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-redacted","efs.csi.aws.com":"i-redacted"}
karpenter.k8s.aws/ec2nodeclass-hash: 13086204471037975526
karpenter.k8s.aws/ec2nodeclass-hash-version: v2
karpenter.sh/nodepool-hash: 647773607843705194
karpenter.sh/nodepool-hash-version: v2
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 02 Jul 2024 17:46:19 -0700
Taints: node.kubernetes.io/unreachable:NoExecute
node.kubernetes.io/unreachable:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: ip-10-2-x-x.us-east-2.compute.internal
AcquireTime: <unset>
RenewTime: Wed, 03 Jul 2024 11:28:47 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure Unknown Wed, 03 Jul 2024 11:24:16 -0700 Wed, 03 Jul 2024 11:25:45 -0700 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Wed, 03 Jul 2024 11:24:16 -0700 Wed, 03 Jul 2024 11:25:45 -0700 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Wed, 03 Jul 2024 11:24:16 -0700 Wed, 03 Jul 2024 11:25:45 -0700 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Wed, 03 Jul 2024 11:24:16 -0700 Wed, 03 Jul 2024 11:25:45 -0700 NodeStatusUnknown Kubelet stopped posting node status.
Addresses:
InternalIP: 10.2.x.x
InternalDNS: ip-10-2-x-x.us-east-2.compute.internal
Hostname: ip-10-2-x-x.us-east-2.compute.internal
Capacity:
cpu: 4
ephemeral-storage: 81854Mi
hugepages-1Gi: 0
hugepages-2Mi: 0
hugepages-32Mi: 0
hugepages-64Ki: 0
memory: 7947912Ki
pods: 58
Allocatable:
cpu: 3920m
ephemeral-storage: 76173383962
hugepages-1Gi: 0
hugepages-2Mi: 0
hugepages-32Mi: 0
hugepages-64Ki: 0
memory: 6931080Ki
pods: 58
System Info:
Machine ID: ec2xxxx8bb
System UUID: ecxxxxbb
Boot ID: 92xxxx16
Kernel Version: 6.1.92
OS Image: Bottlerocket OS 1.20.3 (aws-k8s-1.29)
Operating System: linux
Architecture: arm64
Container Runtime Version: containerd://1.6.31+bottlerocket
Kubelet Version: v1.29.1-eks-61c0bbb
Kube-Proxy Version: v1.29.1-eks-61c0bbb
ProviderID: aws:///us-east-2c/i-06xxxx
Non-terminated Pods: (24 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
aws-controllers s3-tc-controller-manager-64bddc98cc-fknbs 15m (0%) 1 (25%) 128Mi (1%) 256Mi (3%) 12h
brave-quilt api-redis-master-0 200m (5%) 500m (12%) 192Mi (2%) 640Mi (9%) 3h59m
brave-quilt authentication-service-redis-master-0 200m (5%) 500m (12%) 192Mi (2%) 512Mi (7%) 112m
brave-quilt securities-service-redis-master-0 200m (5%) 500m (12%) 192Mi (2%) 512Mi (7%) 3h59m
bright-wrench dataplatform-airflow-worker-0 200m (5%) 1 (25%) 1088Mi (16%) 2432Mi (35%) 4h39m
calm-gender securities-service-5cdc94bf8c-2hj2g 700m (17%) 750m (19%) 1216Mi (17%) 1792Mi (26%) 17h
exit-karate integrations-service-mysql-0 100m (2%) 500m (12%) 64Mi (0%) 384Mi (5%) 12h
exit-karate trading-api-redis-master-0 200m (5%) 500m (12%) 128Mi (1%) 448Mi (6%) 12h
infra-test securities-service-6c44fc7fc-qlgmg 700m (17%) 750m (19%) 1216Mi (17%) 1792Mi (26%) 12h
jonesgang integrations-service-mysql-0 100m (2%) 500m (12%) 64Mi (0%) 384Mi (5%) 12h
kube-system aws-node-jhbcv 50m (1%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system ebs-csi-node-w5cn2 30m (0%) 0 (0%) 120Mi (1%) 768Mi (11%) 17h
kube-system efs-csi-node-thtt9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system eks-pod-identity-agent-bz5xp 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system kube-proxy-wm5rg 100m (2%) 0 (0%) 0 (0%) 0 (0%) 17h
kube-system node-local-dns-z5dnh 25m (0%) 0 (0%) 5Mi (0%) 0 (0%) 17h
legal-horsea integrations-service-mysql-0 100m (2%) 500m (12%) 64Mi (0%) 384Mi (5%) 28m
mlaskowski-cc2 shared-zookeeper-2 0 (0%) 0 (0%) 0 (0%) 0 (0%) 12h
monitoring aws-for-fluent-bit-52jf5 50m (1%) 0 (0%) 50Mi (0%) 250Mi (3%) 17h
monitoring kube-prometheus-stack-prometheus-node-exporter-876f5 0 (0%) 0 (0%) 0 (0%) 0 (0%) 17h
quirky-laundry shared-zookeeper-0 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5h41m
sstrozier securities-service-57f9db5bfb-wwgr2 700m (17%) 750m (19%) 1216Mi (17%) 1792Mi (26%) 17h
worst-druid authentication-service-redis-master-0 200m (5%) 500m (12%) 192Mi (2%) 512Mi (7%) 4h39m
worst-druid shared-zookeeper-1 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 3870m (98%) 8250m (210%)
memory 6127Mi (90%) 12858Mi (189%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
hugepages-32Mi 0 (0%) 0 (0%)
hugepages-64Ki 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal DisruptionBlocked 31m (x7 over 17h) karpenter Cannot disrupt Node: Nominated for a pending pod
Normal Unconsolidatable 25m karpenter Can't remove without creating 2 candidates
Normal DisruptionBlocked 19m karpenter Cannot disrupt Node: PDB "infra-test/securities-service" prevents pod evictions
Normal DisruptionBlocked 15m (x2 over 12h) karpenter Cannot disrupt Node: PDB "mlaskowski-cc2/shared-zookeeper" prevents pod evictions
Normal DisruptionBlocked 13m (x3 over 28m) karpenter Cannot disrupt Node: PDB "worst-druid/shared-zookeeper" prevents pod evictions
Normal NodeNotReady 10m kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeNotReady
Normal NodeHasNoDiskPressure 10m (x3 over 17h) kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientMemory 10m (x3 over 17h) kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasSufficientPID 10m (x3 over 17h) kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeReady 10m (x2 over 17h) kubelet Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeReady
Normal DisruptionBlocked 9m28s (x3 over 17m) karpenter Cannot disrupt Node: PDB "sstrozier/securities-service" prevents pod evictions
Normal NodeNotReady 8m55s (x2 over 14m) node-controller Node ip-10-2-x-x.us-east-2.compute.internal status is now: NodeNotReady
Normal DisruptionBlocked 5m18s (x2 over 7m23s) karpenter Cannot disrupt Node: PDB "calm-gender/securities-service" prevents pod evictions
We've also seen these symptoms happening today on our cluster for the first time, also running Bottlerocket OS 1.20.3. Based on the container_memory_working_set_bytes analysis for the node before it went down, it seems that there were major step-ups in container memory usage for the kubernetes-nodes-cadvisor job before the node finally ran out of memory.
@jessebye , and @igorbdl ,
Thank you for bringing this issue to our attention. To ensure we're not dealing with multiple issues stemming from the same underlying problem, could you please provide additional details about your test environment? Specifically, it would be helpful to know the versions of EKS and Bottlerocket where your setup was functioning correctly, as well as the versions where you're encountering issues. Additionally, have you attempted the solution suggested in this discussion thread involving setting a resource limit? If so, did it resolve your issue? Also, does your workload involve a memory-intensive JVM process?
@monirul We saw the issue with EKS 1.29 and Bottlerocket 1.20.3. We've only recently switched to using Bottlerocket, so I don't have a lot of data about other versions.
Regarding the resource limit, yes we found that setting a memory limit for the pod resolved the issue. In our case, we determined it was an Argo CD pod that was causing the issue, so not JVM (Argo is Golang).
We are seeing this as well. We opened up an AWS support case (@bcressey you can ping me privately for the ticket number) and uploaded node logs to that case. [Retracted: Rolling back to Bottlerocket 1.19.5 / 1.29 solved the issue. I want to add that we not only saw the containerd_shim deadlocking, but we had significant performance degradation of our workloads on the nodes.]
(spoke too soon.. rolling back to 1.19.5/1.30 did not fix the issue.. still troubleshooting)
Update: Our issue stemmed not from the Bottlerocket version, but from hammering the local root volume on very old EC2 instances. Our compute management partner mistakenly launched r5b.24xl instances to satisfy a workload, and that workload was writing extremely high amounts of I/O to the local ephemeral storage in our pods.
The interesting thing here is that only these r5b instances exhibited the problems with kubelet becoming unresponsive and containerd_shim reporting a deadlock. Other instances were slow, but they worked and didn't fundamentally crash.
I've been able to reproduce the failure mode shared by both @diranged and @Pluies, and I've written up those reproduction steps in this gist: https://gist.github.com/ginglis13/bf37a32d0f2a70b4ac5d9b9e5a960278
tl;dr: overwhelm a large node (like m7g.8xlarge or r5b.24xlarge) with memory pressure and high disk I/O, and you'll start seeing dmesg logs like:
INFO: task FOO blocked for more than X seconds.
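As a rough illustration only (not the exact steps from the gist), this kind of combined memory and disk pressure can be generated with something like stress-ng; the worker counts and sizes here are arbitrary and would need tuning per instance type:
# Drive memory pressure and heavy disk I/O at the same time
stress-ng --vm 8 --vm-bytes 95% --vm-keep \
          --hdd 8 --hdd-bytes 8G \
          --timeout 30m --metrics-brief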
Important to note, I've run the same reproduction on the latest Amazon Linux 2 EKS 1.30 Optimized AMI (ami-0cfd96d646e5535a8) and observed the same failure mode and node state.
I'm going to start bisecting the problem by running this repro across a few Bottlerocket variants to see if there are changes that landed in a specific Bottlerocket version that are causing this, especially in the kernels / container runtimes.
I've found a few blog posts on this specific error message from the kernel:
- https://vivani.net/2020/02/08/task-blocked-for-more-than-120-seconds/
- https://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/
These suggest that the settings vm.dirty_ratio and vm.dirty_background_ratio could be tuned to be more aggressive in flushing data to disk in order to avoid this problem. Bottlerocket already sets more aggressive values for these settings than their defaults:
bash-5.1# sysctl -n vm.dirty_background_ratio
10
bash-5.1# sysctl -n vm.dirty_ratio
20
I'm working to run the repro with these settings set to even more aggressive values to see if that could be another workaround, other than the pod resource request limits suggested by @Pluies .
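For anyone experimenting in parallel, the interactive form of that tuning is just the following (example values only; on Bottlerocket these would normally be set persistently via the settings.kernel.sysctl section of user data rather than by hand):
# Start background writeback earlier and throttle writers sooner (example values)
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10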
@jessebye @igorbdl Any more specific logs or details you could share about your respective encounters with this issue would be much appreciated in tracking this down. I'm particularly interested in:
- Do you see "INFO: task FOO blocked for more than X seconds." in dmesg on your nodes too? I want to make sure you're running into the same class of problem (which it sounds like is the case from what you've shared so far, but this log message is the current most common denominator).
- Are your workloads I/O intensive?
- Specifically for @igorbdl, what was the last known version of Bottlerocket that was working for you?
Hey @ginglis13 :wave: Great news that you were able to replicate it!
I've since adjusted job scheduling and memory limits to avoid this problem, so I haven't seen it happening anymore, and I didn't collect node logs at the time, so unfortunately I won't be able to help with more details on our occurrences. I'm sorry about that.
It seems we started having this issue after we upgraded from 1.20.2 to 1.20.3, but that sounds odd since other users have reported this issue with 1.20.2, so I could be wrong on this one.
We probably had node-level system OOMs on our end for quite a while, but the nodes would be replaced as this happened. The symptom that led me to chime in on this issue was that the affected nodes would not be flushed by EKS: they would remain in the cluster in NotReady status indefinitely, and I'd have to manually delete the respective EC2 instance and delete the node from the cluster to clear things up. This issue also only seemed to happen on nodes that had pods with PVCs bound.
The kernel message INFO: task FOO blocked for more than XXX seconds means:
- The task (process) is in state D, also known as uninterruptible. Some types of iowait will put a process into this state.
- Some nominally uninterruptible processes can be killed, as long as the signal is fatal and not intercepted by a user-space signal handler. The hung task detector ignores these processes.
- The task has not had a context switch in the last XXX seconds.
Since something that hangs for two minutes will quite likely hang forever, the kernel can optionally panic and reboot when the hung task detector fires. If this is a desirable outcome, see these sysctl options:
- kernel.hung_task_panic=1: the hung task detector will trigger a panic.
- kernel.panic=1: a panic will trigger a reboot after one second.
If your applications are more robust when a node reboots than when a node hangs with unresponsive storage, this may help.
If one wants additional logging:
- Setting the kernel parameter kernel.hung_task_all_cpu_backtrace=1 will add a backtrace-per-CPU to the logging from the hung task detector.
- Setting kernel.hung_task_panic=1 calls debug_show_all_locks() before the call to panic().
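Put together, the interactive form of those settings looks like this (on Bottlerocket they would normally be set persistently via settings.kernel.sysctl in user data so they survive reboots):
sysctl -w kernel.hung_task_panic=1              # panic when the hung task detector fires
sysctl -w kernel.panic=1                        # reboot one second after a panic
sysctl -w kernel.hung_task_all_cpu_backtrace=1  # optional: per-CPU backtraces in the hung task report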
Hello all:
We're also affected by this issue. We've recently adopted Bottlerocket in a subset of clusters, and have seen at least 5 instances of this failure.
The nodes it has affected have been running:
- Kubelet Version: v1.28.10-eks-890c2ac
- OS Image: Bottlerocket OS 1.21.1 (aws-k8s-1.28)
- Kernel Version: 6.1.102
- Container Runtime Version: containerd://1.7.20+bottlerocket
Since we've only recently adopted Bottlerocket, we don't have data to compare the frequency of this issue between different versions of Bottlerocket. However, I don't yet have reason to believe this is a Bottlerocket-specific problem. We've seen node lockups with similar symptoms for a few years now (at least since 2021), through many versions of k8s and the Kernel, across instance types, across architectures, on AL 2 nodes. I recall that we've seen this affect AL2023 too, but I am less certain of the details for AL2023. For AL2, we previously opened support case 171149566301465, but it did not progress.
For both AL 2 and Bottlerocket, the manifestation has been similar:
- There is memory pressure on the node
- Cache memory is reclaimed
- Cache memory hits some critically low value (which varies)
- EBS disk IO skyrockets, the node locks up, and the control plane begins reporting NotReady
In most cases the node does not recover and must be terminated. In a few cases the node does recover (in the sense that it is marked Ready and becomes responsive again), but when this happens the node is often broken in some variable and hard-to-discover/hard-to-diagnose way - for example, kubelet and the container runtime will lose sync, a subset of network connections will fail, or containerized application processes will be in some unexpected state.
My theory is that the OOMKiller is essentially not kicking in fast enough when there's memory pressure, and that the extreme reclaim of cache memory leads to cache misses that force lots of disk IO to hit EBS. The EBS IOPS limit is hit, and then the system stalls on iowait. Eventually the condition improves (if/when the OOMKiller kicks in), but the extended stall and potentially subsequent system OOMKills break kubelet and other services, leading to node failure.
Here are some metric visualizations of the problem, this one for a Bottlerocket instance. The AL 2 instance visualizations look similar, although they only have a single EBS volume.
Let me know if there are other metrics I can provide or experiments I can run to help get to the bottom of this issue. It'd be such a relief to not have to deal with the consequences of this recurring issue.
Chiming in to say I've also experienced this same thing. Bottlerocket with EKS (1.28) and Karpenter (latest). Node starts to experience memory pressure and then enters a NotReady state and stays there indefinitely until the underlying EC2 is terminated.
I've been able to repro with test pods running stress, but not consistently. Trying to figure out a way to consistently reproduce the issue
I am facing the same issue here. The setup I have is:
EC2 instance type: t3.medium
OS Image: Bottlerocket OS 1.19.1 (aws-k8s-1.29)
Kernel Version: 6.1.72
Container Runtime: containerd://1.6.28+bottlerocket
The problem workload is Prometheus - so this is not an issue with only JVM. The behavior is very similar to what @JacobHenner has described. In my case, if I give "enough" time(many hours) to the node, it comes back up without the need for a reboot.
Hi there, for information and context: have you tried the above suggestion of adjusting the resource specifications on your Kubernetes pods? On a t3.medium you will need to restrict memory to ideally below the instance's 4GB, so something like:
resources:
  requests:
    memory: 3Gi
  limits:
    memory: 3.4Gi
You can play around with different values here. By doing this, you have Kubernetes restrict memory usage, which helps reduce the impact of this issue. We are still trying to look into the exact cause to see what we can do on our side to help address it.
have you tried the above suggestion of adjusting the resource specifications on your Kubernetes pods? [...]
Unfortunately it's not this simple for us. Our clusters host a variety of workloads, and these workloads are managed by a variety of teams. Application owners are told to set their memory requests and limits appropriately, but for a range of reasons this doesn't always happen. In the case where an application is exceeding its memory request and the node encounters memory pressure, we'd expect that application to be OOMKilled or node-pressure evicted because it is misconfigured. We would not expect other properly-configured applications to suffer because memory pressure has caused the entire node to lock up.
So we wonder - why aren't pods being killed quickly enough under node memory pressure conditions to prevent this issue from occurring? How can we rely on the Kubernetes requests/limits model if the potential consequences of exceeding a request aren't limited to the offending container, but can instead affect the entire node the container is running on? Team multitenancy seems untenable under these conditions.
Thanks @jmt-lab for the response. Even though configuring limits reduces the number of crashes, they still happen as our workloads grow in number and capacity. I will keep following this thread for the real fix. Let me know if you need any more context that might help your team with the fix. All the best!
@JacobHenner @geddah @James-Quigley thank you all for the additional data as we work through a root cause for this issue. One commonality between your problematic variants (and those of previous commenters on this issue) is that the version of containerd in the variants reported here has been susceptible to https://github.com/containerd/containerd/issues/10589. The implication of this issue, as I understand it, is that OOM-killed containers may still appear to containerd as running; this seems to be the inverse of our problem since processes in this bug report are getting OOM-killed, but the fact that this is a bug with the OOM killer makes me suspicious nonetheless.
Bottlerocket v1.23.0 bumped containerd to v1.7.22 with a patch for this bug via bottlerocket-core-kit v2.5.0, which includes https://github.com/bottlerocket-os/bottlerocket-core-kit/pull/143. Would you be willing to upgrade your Bottlerocket nodes to v1.23.0+ and report back with any additional findings?
I can confirm that this happened to me this morning on a node running Bottlerocket OS 1.23.0 (aws-k8s-1.28)
Here's a screenshot of the memory metrics for the node/pods at that time.
Same experience here with EC2 instances becoming unresponsive.
config:
- latest bottlerocket 1.24.1
- flavor: ecs-2 x86_64
- only using t*.medium burstable instances (dev environment)
- memory soft limit on ecs task
- ebs gp2 100 IOPS
I never got the issue on my production environment, same config except ec2 types are c/m * xlarge
For now my current test is to move my ebs config to gp3 for the 3000 IOPS. I will see if I continue to see the problem.
Repro
I was able to reproduce the behavior by running a custom program (eatmem) that aggressively consumes all the host memory gradually (400MB per second). The eatmem program - https://gist.github.com/ytsssun/b264c49cdfb8959b6d50b07c9cf4cb32.
I ran a process to monitor the memory usage:
while true; do free -m | grep Mem | while read -r _ total used _; do percent=$((used * 100 / total)); printf "Memory Usage: %s / %s (%d%%)\n" "${used}M" "${total}M" "$percent"; done; sleep 1; done
The node became unresponsive after the memory consumed hits 100% and it never came back. kubectl shows node status as NotReady and stuck in that state.
Attempt at mitigation
I tried deploying the userspace oom-killer as a DaemonSet in my cluster. It simply triggers the kernel oom-killer manually, by writing f to /host_proc/sysrq-trigger, when it detects that available memory is below a certain threshold.
After that, I ran the eatmem program on a healthy node; it froze for about 8 minutes and then came back. I had to adjust the MIN_MEMORY (to, say, 1000MB) and the interval for memory usage detection (to 1s) for my use case. You can see my adjusted oom-killer.
I recommend giving this lightweight oom-killer a try to see if it helps with the issue. The oom-killer will try to kill processes based on the oom_score_adj computed for them; the process with the highest oom_score_adj gets killed first.
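The core of that DaemonSet is essentially a small shell loop like the following (a rough sketch, not the exact script from the gist; it assumes the host's /proc is mounted into the container at /host_proc):
#!/bin/sh
# Trigger the kernel OOM killer whenever available memory drops below a threshold.
MIN_MEMORY_MB=1000   # threshold I ended up using; tune per node size
INTERVAL_SECONDS=1   # detection interval I ended up using
while true; do
  avail_mb=$(awk '/MemAvailable/ {print int($2/1024)}' /host_proc/meminfo)
  if [ "$avail_mb" -lt "$MIN_MEMORY_MB" ]; then
    echo f > /host_proc/sysrq-trigger   # 'f' asks the kernel to run the OOM killer once
  fi
  sleep "$INTERVAL_SECONDS"
done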
oom-killer log analysis
Below are more details from analyzing the dmesg log.
- Sysrq manual trigger invoked at timestamp 239.457726.
[ 239.457726] sysrq: Manual OOM execution
- Initially, the oom-killer killed a process with an oom_score_adj of 1000.
[ 239.458525] Out of memory: Killed process 5773 (mpi-operator) total-vm:1871132kB, anon-rss:11272kB, file-rss:36820kB, shmem-rss:0kB, UID:0 pgtables:288kB oom_score_adj:1000
- Subsequent oom-killing events killed a few other processes that also had high oom_score_adj values.
[ 526.448344] Out of memory: Killed process 6475 (free) total-vm:23112kB, anon-rss:128kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:76kB oom_score_adj:1000
[ 542.664490] Out of memory: Killed process 6473 (sh) total-vm:4508kB, anon-rss:108kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:1000
[ 542.850447] Out of memory: Killed process 2899 (sh) total-vm:4508kB, anon-rss:104kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:60kB oom_score_adj:1000
[ 542.970388] Out of memory: Killed process 5611 (coredns) total-vm:1283516kB, anon-rss:9900kB, file-rss:0kB, shmem-rss:0kB, UID:65532 pgtables:216kB oom_score_adj:996
[ 542.971686] Out of memory: Killed process 5644 (coredns) total-vm:1283516kB, anon-rss:9900kB, file-rss:0kB, shmem-rss:0kB, UID:65532 pgtables:216kB oom_score_adj:996
[ 545.663240] Out of memory: Killed process 2357 (amzn-guardduty-) total-vm:929236kB, anon-rss:35652kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:248kB oom_score_adj:984
- The oom-killer found the eatmem process and killed it.
[ 719.286731] Out of memory: Killed process 5978 (eatmem) total-vm:17801320kB, anon-rss:15257072kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:29996kB oom_score_adj:1
- The system shows recovery:
[ 731.291144] IPv6: ADDRCONF(NETDEV_CHANGE): eni7e37f784474: link becomes ready
[ 731.291178] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 734.740979] pci 0000:00:06.0: [1d0f:ec20] type 00 class 0x020000
[ 734.741076] pci 0000:00:06.0: reg 0x10: [mem 0x00000000-0x00003fff]
[ 734.741148] pci 0000:00:06.0: reg 0x18: [mem 0x00000000-0x000fffff pref]
[ 734.741323] pci 0000:00:06.0: enabling Extended Tags
[ 734.742062] pci 0000:00:06.0: BAR 2: assigned [mem 0xc1600000-0xc16fffff pref]
[ 734.742085] pci 0000:00:06.0: BAR 0: assigned [mem 0xc1514000-0xc1517fff]
[ 734.742146] ena 0000:00:06.0: enabling device (0000 -> 0002)
[ 734.749973] ena 0000:00:06.0: ENA device version: 0.10
[ 734.749977] ena 0000:00:06.0: ENA controller version: 0.0.1 implementation version 1
[ 734.849623] ena 0000:00:06.0: ENA Large LLQ is disabled
[ 734.861392] ena 0000:00:06.0: Elastic Network Adapter (ENA) found at mem c1514000, mac addr 02:82:87:02:9b:6d
[ 734.926653] ena 0000:00:06.0 eth1: Local page cache is disabled for less than 16 channels
Time between SysRq trigger and initial recovery: 731.291144 - 239.457726 = 491.833418 seconds ≈ 8 minutes 11 seconds
@ytsssun for testing purposes the "lightweight oom-killer" might be OK, but it has some notable defects:
- it will invoke the OOM killer every second if the system is below 1 GiB free memory; this could happen constantly on nodes that are otherwise not seeing a runaway process consuming memory
- it seems like the script itself may be getting OOM-killed; the kill of free is suspicious
I don't think this makes sense as the basis for a production solution. I'd look at running something like nohang or earlyoom in a daemonset instead.
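As a sketch of that alternative (flags per earlyoom's documentation; packaging it into a DaemonSet or host container is left out here):
# Act when available memory drops below 5%, printing a memory report every 60s;
# earlyoom sends SIGTERM (then SIGKILL) to the process with the highest oom_score
earlyoom -m 5 -r 60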
@bcressey is right, this simple oom-killer is for testing purposes. I am evaluating nohang and will provide my updates once done.