amazon-eks-ami
EKS k8s 1.19 - AMI 1.19-v20210414 - (combined from similar events): System OOM encountered, victim process:
What happened: System OOM happens on some of the nodes after upgrading to k8s 1.19 (using AMI v20210414). It seems to happen more on bigger nodes, like r5.4x and r5.8x.
Warning SystemOOM 18m kubelet, ip-10-10-10-10.eu-central-1.compute.internal System OOM encountered, victim process: iptables, pid: 13778
Warning SystemOOM 18m kubelet, ip-10-10-10-10.eu-central-1.compute.internal System OOM encountered, victim process: iptables, pid: 13782
Warning SystemOOM 18m kubelet, ip-10-10-10-10.eu-central-1.compute.internal System OOM encountered, victim process: iptables, pid: 13836
Warning SystemOOM 18m kubelet, ip-10-10-10-10.eu-central-1.compute.internal System OOM encountered, victim process: iptables, pid: 13853
Warning SystemOOM 17m kubelet, ip-10-10-10-10.eu-central-1.compute.internal System OOM encountered, victim process: iptables, pid: 18796
Warning SystemOOM 17m kubelet, ip-10-10-10-10.eu-central-1.compute.internal System OOM encountered, victim process: iptables, pid: 18808
Warning SystemOOM 17m kubelet, ip-10-10-10-10.eu-central-1.compute.internal System OOM encountered, victim process: iptables, pid: 18819
Warning SystemOOM 17m (x538 over 2d6h) kubelet, ip-10-10-10-10.eu-central-1.compute.internal (combined from similar events): System OOM encountered, victim process: iptables, pid: 18883
Warning SystemOOM 17m kubelet, ip-10-10-10-10.eu-central-1.compute.internal System OOM encountered, victim process: iptables, pid: 18854
Warning SystemOOM 17m kubelet, ip-10-10-10-10.eu-central-1.compute.internal System OOM encountered, victim process: iptables, pid: 18880
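For reference, events like the ones above can be listed cluster-wide with a field selector on the event reason (a read-only query using standard kubectl flags):

```bash
# List SystemOOM events reported by kubelets across the cluster, newest last.
kubectl get events --all-namespaces --field-selector reason=SystemOOM --sort-by=.lastTimestamp
```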
What you expected to happen: We did not have System OOM issues in EKS k8s 1.18.
How to reproduce it (as minimally and precisely as possible):
EKS k8s 1.19
Node AMI: amazon-eks-node-1.19-v20210414
Region: Frankfurt (eu-central-1)
Anything else we need to know?:
Environment:
- AWS Region: Frankfurt
- Instance Type(s): r5.4xlarge, r5.8xlarge
- EKS Platform version (use `aws eks describe-cluster --name <name> --query cluster.platformVersion`): eks.4
- Kubernetes version (use `aws eks describe-cluster --name <name> --query cluster.version`): 1.19
- AMI Version: amazon-eks-node-1.19-v20210414
- Kernel (e.g. `uname -a`): Linux ip-10-10-10-10.eu-central-1.compute.internal 5.4.105-48.177.amzn2.x86_64 #1 SMP Tue Mar 16 04:56:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
- Release information (run `cat /etc/eks/release` on a node):
BASE_AMI_ID="ami-0849ada759754b5f5"
BUILD_TIME="Wed Apr 14 20:10:54 UTC 2021"
BUILD_KERNEL="5.4.105-48.177.amzn2.x86_64"
ARCH="x86_64"
Please note that when we first upgraded to k8s 1.19, we used AMI version v20210329 and had a lot of node NotReady issues. Based on this comment (https://github.com/awslabs/amazon-eks-ami/issues/195#issuecomment-821195669), we moved to AMI v20210414. The node NotReady issue does not happen any more; however, we now have System OOM on some of the nodes.
We just completed a 1.16 -> 1.20 upgrade process and we're seeing similar things starting from when we hit 1.19.
We run managed node groups with the EKS-optimized AMI on c4.xl, c5.xl and m5.xl in eu-west-1, and as soon as we hit 1.19.6-20210526 we started seeing an increase in pod memory usage, containers being OOM killed, and System OOM events. Given that 1.19 was just an intermediate step towards 1.20 for us, we continued with the upgrade, but the issue is exactly the same on 1.20.4-20210526.
We'll try an AMI downgrade to 1.20.4-20210519, since it runs a different kernel version, hoping this is just something affecting the 20210526 release, but I'm not really sure; at this point we're running out of ideas.
Hi @roccozanni, any luck with v20210519? Thanks
@imriss we're not there yet. We discovered that in order to downgrade to that AMI (or, in general, to start using a custom AMI instead of the default "latest" one) we need to rebuild the node groups from scratch, so it's taking a bit more time than initially planned.
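For anyone hitting the same rebuild requirement, a rough sketch of recreating a managed node group pinned to a specific AMI release version with the AWS CLI (all names, subnet IDs, the role ARN and the scaling sizes below are placeholders):

```bash
# Sketch only: create a replacement managed node group pinned to a specific
# EKS-optimized AMI release. Every identifier here is a placeholder.
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name workers-pinned \
  --release-version 1.20.4-20210519 \
  --subnets subnet-aaaa1111 subnet-bbbb2222 \
  --node-role arn:aws:iam::111122223333:role/my-eks-node-role \
  --scaling-config minSize=1,maxSize=3,desiredSize=2
```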
We tried the downgrade; the issue is still there, also with the very latest AMI version. We also noticed we were still running the 1.7 branch of the CNI plugin, so we upgraded to 1.8. The issue is still there. We're completely out of ideas at this stage and don't understand why this cluster is plagued by memory issues all over the place.
Hello, it looks like we missed updating this. We're working on root-causing the problem as well. One difference is that the 1.19+ AMIs use the Linux 5.4 kernel; our current hypothesis is that there was a change in memory management and the OOM killer in the 5.4 kernel - https://bugzilla.kernel.org/show_bug.cgi?id=207273#c2
We are investigating this and will provide an update once we have some more details.
@abeer91 thanks for acknowledging this. Do you have any suggestions for a temporary workaround that can be put in place to reduce the likelihood of this happening while a long-term solution is being figured out?
Sorry for the lack of traction here. The AmazonLinux team identified some patches that they expect to resolve this issue, which is likely going to be released sometime in October. We will update here when those are available.
Alternatively, customers have two options to work around this issue until then:
- Downgrade the kernel to 4.14, which is already present in the AMI.
- Upgrade to the 5.10 kernel. We know of one customer who believes this resolved the issue for them. Running `amazon-linux-extras install kernel-5.10` and rebooting the instance should do it (a brief sketch follows below).
That being said, neither of these options has been validated through our release testing, so while both kernels are officially supported by AL2, I'd recommend doing so cautiously and ideally testing first.
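For the 5.10 option, a minimal sketch on an existing AL2 node (run over SSH or SSM; as noted above, this path hasn't been validated through release testing):

```bash
# Sketch only: install the 5.10 kernel from amazon-linux-extras, then reboot
# so the node comes back up on the new kernel.
sudo amazon-linux-extras install -y kernel-5.10
sudo reboot
```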
AmazonLinux has released a new 5.4 kernel that includes some patches that should help with this issue. The latest AMIs, v20211004 or later, will have the patched kernel.
Please let us know whether it fixes (or doesn't fix) this issue!
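A quick way to confirm which kernel each node actually ended up on after rolling out the new AMI:

```bash
# Print each node's name and the kernel version it reports via the kubelet.
kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion
```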
I'm still seeing this issue on some of our larger nodes.
Some more details on this - we are also seeing this affect larger instances more than smaller ones.
One of our clusters is on machines in the size range around m5.2xl and r5.2xl, and is fine. The other is on m5.8xl and r5.12xl sized machines, and that one has far more errors.
It also appears (without doing stats) that the more recent kernels may make it worse. Our most stable cluster had kernels based on 5.4.149-73.259.amzn2.x86_64 and the other, less stable one used 5.4.156-83.273.amzn2.x86_64. Downgrading back to 5.4.149 showed immediate improvement.
This is still ongoing, and after some more research, my previous comments were mostly, if not entirely incorrect. The kernel version doesn't seem to matter too much, and the other cluster I mentioned as not being affected is now affected. If I find any more information, or a solution - I'll reply to this thread
I've recently discovered the `--eviction-hard` flag for kubelet, and once I set that up, the errors appear to have gone away. I only recently applied this change, so it's too soon to be sure it entirely fixed the problem, but I hope this helps someone else.
EDIT: Update - this is only a partial fix; I combined it with some adjustments to my pod specs to avoid large burstable regions of memory, which has been helpful, but it's still not perfect.
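For anyone wanting to try the same thing, one way to get hard eviction thresholds to the kubelet on an EKS node is through the bootstrap script's extra args; the cluster name and the threshold values below are placeholders, not values recommended anywhere in this thread:

```bash
# Sketch only: pass hard eviction thresholds to the kubelet at node bootstrap.
# "my-cluster" and the threshold values are placeholders.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--eviction-hard=memory.available<500Mi,nodefs.available<10%'
```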
We are also impacted by this. Saw mention in this thread that upgrading to 1.20 does not help. Any ideas if upgrading to 1.21 or 1.22 would help?
We have started the rollout to 1.19-v20211206 based on the suggestion from @mmerkes.
It is looking like in our case the "System OOM" events are a false positive. Not sure if this will be the case for others coming to this GH issue.
- The K8s OOM watcher relies on cAdvisor
- K8s version 1.19 uses cAdvisor 0.37.5 (or earlier, depending on K8s build)
- cAdvisor had a bug where it incorrectly parsed OOM events resulting in all OOM events being reported by K8s as "System OOM"
- The bugfix was included in cAdvisor release 0.39.0, which is included included in K8s version v1.21
> I've recently discovered the `--eviction-hard` flag for kubelet, and once I set that up, the errors appear to have gone away. I only recently applied this change, so it's too soon to be sure it entirely fixed the problem, but I hope this helps someone else.
> EDIT: Update - this is only a partial fix; I combined it with some adjustments to my pod specs to avoid large burstable regions of memory, which has been helpful, but it's still not perfect.
@AlexMichaelJonesNC Did this help in your case in the long term?
This should be resolved by the cAdvisor update in k/k; please let us know if you're still seeing this on the latest binaries in s3://amazon-eks.