
EKS k8s 1.19 - AMI 1.19-v20210414 - (combined from similar events): System OOM encountered, victim process:

Open • imriss opened this issue 3 years ago • 17 comments

What happened: System OOM events occur on some nodes after upgrading to k8s 1.19 (using AMI v20210414). This seems to happen more often on larger nodes, such as r5.4xlarge and r5.8xlarge.

  Warning  SystemOOM  18m                   kubelet, ip-10-10-10-10.eu-central-1.compute.internal  System OOM encountered, victim process: iptables, pid: 13778
  Warning  SystemOOM  18m                   kubelet, ip-10-10-10-10.eu-central-1.compute.internal  System OOM encountered, victim process: iptables, pid: 13782
  Warning  SystemOOM  18m                   kubelet, ip-10-10-10-10.eu-central-1.compute.internal  System OOM encountered, victim process: iptables, pid: 13836
  Warning  SystemOOM  18m                   kubelet, ip-10-10-10-10.eu-central-1.compute.internal  System OOM encountered, victim process: iptables, pid: 13853
  Warning  SystemOOM  17m                   kubelet, ip-10-10-10-10.eu-central-1.compute.internal  System OOM encountered, victim process: iptables, pid: 18796
  Warning  SystemOOM  17m                   kubelet, ip-10-10-10-10.eu-central-1.compute.internal  System OOM encountered, victim process: iptables, pid: 18808
  Warning  SystemOOM  17m                   kubelet, ip-10-10-10-10.eu-central-1.compute.internal  System OOM encountered, victim process: iptables, pid: 18819
  Warning  SystemOOM  17m (x538 over 2d6h)  kubelet, ip-10-10-10-10.eu-central-1.compute.internal  (combined from similar events): System OOM encountered, victim process: iptables, pid: 18883
  Warning  SystemOOM  17m                   kubelet, ip-10-10-10-10.eu-central-1.compute.internal  System OOM encountered, victim process: iptables, pid: 18854
  Warning  SystemOOM  17m                   kubelet, ip-10-10-10-10.eu-central-1.compute.internal  System OOM encountered, victim process: iptables, pid: 18880

What you expected to happen: We did not have System OOM issues on EKS k8s 1.18.

How to reproduce it (as minimally and precisely as possible): EKS k8s 1.19; node AMI: amazon-eks-node-1.19-v20210414; region: Frankfurt (eu-central-1)

Anything else we need to know?:

Environment:

  • AWS Region: Frankfurt
  • Instance Type(s): r5.4xlarge, r5.8xlarge
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.4
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.19
  • AMI Version: amazon-eks-node-1.19-v20210414
  • Kernel (e.g. uname -a): Linux ip-10-10-10-10.eu-central-1.compute.internal 5.4.105-48.177.amzn2.x86_64 #1 SMP Tue Mar 16 04:56:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0849ada759754b5f5"
BUILD_TIME="Wed Apr 14 20:10:54 UTC 2021"
BUILD_KERNEL="5.4.105-48.177.amzn2.x86_64"
ARCH="x86_64"

imriss commented May 25 '21 14:05

Please note that when we first upgraded to k8s 1.19, we used AMI version v20210329 and had a lot of node NotReady issues. Based on this comment (https://github.com/awslabs/amazon-eks-ami/issues/195#issuecomment-821195669), we moved to AMI v20210414. The NotReady issue no longer occurs; however, we now see System OOM events on some of the nodes.

imriss commented May 25 '21 14:05

We just completed a 1.16 -> 1.20 upgrade and started seeing similar behavior as soon as we hit 1.19.

We run managed node groups with the EKS-optimized AMI on c4.xl, c5.xl and m5.xl in eu-west-1. As soon as we hit 1.19.6-20210526, we started seeing an increase in pod memory usage, containers being OOM killed, and System OOM events. Since 1.19 was just an intermediate step towards 1.20 for us, we continued with the upgrade, but the issue is exactly the same on 1.20.4-20210526.

We'll try downgrading the AMI to 1.20.4-20210519, since it runs a different kernel version, in the hope that this is just something affecting the 20210526 release. But I'm not really sure; at this point we're running out of ideas.

roccozanni commented Jun 04 '21 16:06

Hi @roccozanni, any luck with v20210519? Thanks!

imriss commented Jun 13 '21 23:06

@imriss we're not there yet. We discovered that in order to downgrade to that AMI (or, in general, to start using a custom AMI instead of the default "latest" one) we need to rebuild the node groups from scratch, so it's taking a bit more time than initially planned.

roccozanni commented Jun 15 '21 07:06

We tried the downgrade; the issue is still there, and it also persists with the very latest AMI version. We also noticed we were still running the 1.7 branch of the CNI plugin, so we upgraded to 1.8. The issue is still there. We're completely out of ideas at this stage and don't understand why this cluster is plagued by memory issues all over the place.

roccozanni commented Jul 20 '21 18:07

Hello, it looks like we missed updating this. We're working on root-causing the problem as well. One difference is that the 1.19+ AMIs use the Linux 5.4 kernel, and our current hypothesis is that there was a change in memory management and the OOM killer behavior in the 5.4 kernel: https://bugzilla.kernel.org/show_bug.cgi?id=207273#c2

We are investigating this and will provide an update once we have some more details.

abeer91 commented Jul 20 '21 19:07

@abeer91 thanks for acknowledging this. Do you have any suggestion for any temporary workaround that can be put in place to reduce the likelihood of this to happen while a long term solution is being figured out?

roccozanni commented Aug 03 '21 19:08

Sorry for the lack of traction here. The AmazonLinux team identified some patches that they expect to resolve this issue, which are likely to be released sometime in October. We will update here when those are available.

Alternatively, customers have two options to work around this issue until then:

  1. Downgrade the kernel to 4.14, which is already present in the AMI.
  2. Upgrade to the 5.10 kernel. We know of one customer who believes this resolved the issue for them. Running amazon-linux-extras install kernel-5.10 and rebooting the instance should do it (a sketch follows below).

That being said, neither of these options has been validated through our release testing, so while both kernels are officially supported by AL2, I'd recommend proceeding cautiously and ideally testing first.
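
In case it helps, a minimal sketch of option 2 (the kernel-5.10 path); it assumes an Amazon Linux 2 node with shell access and, per the caveat above, has not been validated through release testing:

```bash
# Option 2 sketch: move the node to the 5.10 kernel and reboot.
# Not validated through EKS release testing; test before a broad rollout.
sudo amazon-linux-extras install -y kernel-5.10
sudo reboot

# After the instance comes back up, confirm the running kernel:
uname -r   # expect a 5.10.x-*.amzn2.x86_64 version string
```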

mmerkes commented Sep 21 '21 17:09

AmazonLinux has released a new 5.4 kernel that includes some patches that should help with this issue. The latest AMIs, v20211004 or later, will have the patched kernel.

Please let us know whether it fixes (or doesn't fix) this issue!

mmerkes commented Oct 12 '21 18:10

I'm still seeing this issue on some of our larger nodes.

AlexMichaelJonesNC commented Dec 22 '21 21:12

Some more details on this: we are also seeing this affect larger instances more than smaller ones.

One of our clusters runs instances in the m5.2xl/r5.2xl size range and is fine. The other runs m5.8xl and r5.12xl sized machines and has far more errors.

It also appears (without having done any statistics) that the more recent kernels may make it worse. Our most stable cluster ran kernels based on 5.4.149-73.259.amzn2.x86_64, while the less stable one used 5.4.156-83.273.amzn2.x86_64. Downgrading back to 5.4.149 showed immediate improvement.
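
For anyone comparing node groups the same way, a quick way to list the kernel each node reports (plain kubectl reading node status, nothing AMI-specific):

```bash
# List each node's kernel version (and OS image) to compare across node groups
kubectl get nodes -o custom-columns='NODE:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion,OS:.status.nodeInfo.osImage'
```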

AlexMichaelJonesNC commented Dec 23 '21 16:12

This is still ongoing, and after some more research, my previous comments were mostly, if not entirely, incorrect. The kernel version doesn't seem to matter much, and the other cluster I mentioned as not being affected is now affected too. If I find any more information, or a solution, I'll reply to this thread.

AlexMichaelJonesNC commented Jan 07 '22 08:01

I've recently discovered the --eviction-hard flag for kubelet, and once I set that up, the errors appear to have gone away. I only recently applied this change, so it's too soon to be sure it entirely fixed the problem, but I hope this helps someone else.

EDIT: This is only a partial fix. I combined it with some adjustments to my pod specs to avoid large burstable memory regions, which has been helpful, but it's still not perfect.
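
For reference, a rough sketch of how the --eviction-hard flag can be passed to the kubelet on this AMI via bootstrap.sh in the instance user data; the cluster name and threshold values below are placeholders, and the right thresholds depend on the workload:

```bash
#!/bin/bash
# Example EC2 user data for a node running the EKS-optimized AMI.
# "my-cluster" and the eviction threshold values are placeholders, not recommendations.
set -o xtrace
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--eviction-hard=memory.available<500Mi,nodefs.available<10%'
```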

AlexMichaelJonesNC commented Jan 07 '22 23:01

We are also impacted by this. I saw mention in this thread that upgrading to 1.20 does not help. Any idea whether upgrading to 1.21 or 1.22 would help?

maxenglander commented Jan 13 '22 14:01

We have started the rollout to 1.19-v20211206 based on the suggestion from @mmerkes.

imriss commented Jan 13 '22 19:01

It looks like in our case the "System OOM" events are a false positive. Not sure whether this will be the case for others coming to this GH issue.

  1. The K8s OOM watcher relies on cAdvisor
  2. K8s version 1.19 uses cAdvisor 0.37.5 (or earlier, depending on K8s build)
  3. cAdvisor had a bug where it incorrectly parsed OOM events, resulting in all OOM events being reported by K8s as "System OOM"
  4. The bugfix was included in cAdvisor release 0.39.0, which ships in K8s v1.21
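
One way to sanity-check whether a reported System OOM was actually node-wide is to read the raw kernel log on the node; on the 5.x kernels used here, cgroup-scoped kills are tagged CONSTRAINT_MEMCG while true system OOMs show CONSTRAINT_NONE. A generic sketch:

```bash
# On the affected node, check the kernel log for the OOM kill records themselves.
# constraint=CONSTRAINT_MEMCG -> the kill was scoped to a cgroup (a container/pod limit)
# constraint=CONSTRAINT_NONE  -> a genuine node-wide (system) OOM
sudo dmesg -T | grep -i -E 'oom-kill|killed process'
```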

maxenglander commented Jan 25 '22 00:01

> I've recently discovered the --eviction-hard flag for kubelet, and once I set that up, the errors appear to have gone away. I only recently applied this change, so it's too soon to be sure it entirely fixed the problem, but I hope this helps someone else.
>
> EDIT: This is only a partial fix. I combined it with some adjustments to my pod specs to avoid large burstable memory regions, which has been helpful, but it's still not perfect.

@AlexMichaelJonesNC Did this help in your case in the long term?

Satheesh-Balachandran commented Feb 22 '22 17:02

This should be resolved by the cAdvisor update in k/k; please let us know if you're still seeing this on the latest binaries in s3://amazon-eks.

cartermckinnon commented Dec 01 '22 22:12