Memory utilization is higher in some EKS workloads
Image I'm using:
AMI ID: ami-0f3b9574af04c5bf4, Bottlerocket release: bottlerocket-aws-k8s-1.28-x86_64-v1.16.0-d2d9cf87
What I expected to happen:
After switching from the standard Amazon Linux 2 EKS AMI, we rolled the nodes out on Bottlerocket. The workloads should have kept the same performance and memory profile.
What actually happened:
We noticed that some of the same workloads we ran before now use much more memory and hit OOM. After raising the memory limit, usage settled at a level roughly 1-1.5 GiB higher than it was before the migration.
How to reproduce the problem:
Deploy various kinds of workloads on a standard Amazon Linux 2 EKS AMI, then migrate the nodes to Bottlerocket with a rolling update, and measure the workloads' memory utilization before and after.
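For anyone trying to reproduce this, a minimal probe pod along these lines makes the difference easy to see. This is an illustrative sketch, not one of the original workloads; the class name is made up, and the paths are the standard cgroup mount points visible inside a container. It prints the memory limit exposed via cgroup v1 and cgroup v2, plus the max heap the JVM derived from whatever limit it detected:

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class MemLimitProbe {
    public static void main(String[] args) {
        // cgroup v1 limit file (what the AL2 EKS AMI exposes by default)
        Path v1 = Path.of("/sys/fs/cgroup/memory/memory.limit_in_bytes");
        // cgroup v2 limit file (what Bottlerocket exposes by default)
        Path v2 = Path.of("/sys/fs/cgroup/memory.max");
        System.out.println("cgroup v1 limit: " + read(v1));
        System.out.println("cgroup v2 limit: " + read(v2));
        // The heap ceiling the JVM computed from the detected container limit
        System.out.println("JVM max heap:    " + Runtime.getRuntime().maxMemory() + " bytes");
    }

    static String read(Path p) {
        try {
            return Files.readString(p).trim();
        } catch (Exception e) {
            return "not present";
        }
    }
}
```

Run it with the same `resources.limits.memory` on an AL2 node and a Bottlerocket node: on cgroup v2 only `memory.max` exists, and a runtime that can't read it will report a much larger max heap.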
@ElementTech thanks for opening the issue! We are looking into it now and will get back to you soon.
We are trying to reproduce the issue. Can you share how you measure memory utilization so we can reproduce it the same way? In the meantime, any other details or examples you can share would help us narrow this down. Thank you!
@gthao313 I'd suggest taking a look at https://github.com/kubernetes/kubernetes/issues/118916.
@stevehipwell Yes, that was it. I found it earlier and meant to post it here. It turns out Bottlerocket uses cgroup v2 by default, while the vanilla Amazon Linux 2 EKS AMI still uses cgroup v1. It might be worth adding a note about this to the Bottlerocket docs. As a workaround for now, I added this to the Bottlerocket TOML config:
```toml
[settings.boot]
reboot-to-reconcile = true

[settings.boot.init-parameters]
"systemd.unified_cgroup_hierarchy" = ["0"]
```
Closing this as the underlying cause and workaround have been identified.
@webern do you have a link to the fix?
I misunderstood and didn't realize a fix was needed in Bottlerocket for this. Re-opening.
@ElementTech Is that a Java application? If so, you may want to look into the following link: https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/#migrate-to-cgroup-v2
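Worth spelling out from that link: OpenJDK only reads cgroup v2 limits from JDK 15 onwards, with backports to 11.0.16 and 8u372. On older runtimes the JVM silently falls back to sizing the heap from the host's total memory, which on a large node could plausibly account for an extra 1-1.5 GiB like the one reported above. Pinning the heap explicitly (e.g. `-Xmx` or `-XX:MaxRAMPercentage`) sidesteps the detection entirely.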
@webern I think this should be fixed once the Kubernetes patch for runc is backported and Bottlerocket bumps to that runc version, so it may only be fixed for some Kubernetes versions.
Has anyone else experienced similar issues with Java applications running on AWS ECS?