
High memory consumption with v1.25.2

Open smartaquarius10 opened this issue 2 years ago • 149 comments

Team,

Since the day I updated AKS to v1.25.2, I have been seeing huge memory spikes and node memory pressure issues.

Pods are being evicted and the nodes are constantly consuming 135 to 140% of memory. Everything was working fine while I was on v1.24.9.

Just now I saw that portal.azure.com has removed v1.25.2 from the Create new --> Azure Kubernetes cluster section. Does this version of AKS have a known problem? Should we switch to v1.25.4 immediately to resolve the memory issue?

I have also observed that AKS 1.24.x used Ubuntu 18.04 node images, while AKS 1.25.x uses Ubuntu 22.04. Could this be the reason for the high memory consumption?

Kindly suggest.

Regards, Tanul


My AKS configuration: 8 nodes of Standard B2s size, as it is a non-prod environment. Pod structure: below are the pods and their memory consumption, excluding the default Microsoft pods (which take 4705 Mi of memory in total) running inside the cluster:

  • Daemon set of AAD pod identity: taking 191 Mi of memory in total
  • 2 pods of Kong: taking 914 Mi of memory in total
  • Daemon set of the Twistlock vulnerability scanner: taking 1276 Mi of memory in total
  • 10 pods of our .NET microservices: taking 820 Mi of memory in total

smartaquarius10 avatar Jan 31 '23 09:01 smartaquarius10

Hello, we have the same problem with version 1.25.4 in our company's AKS.

We are trying to upgrade an app to OpenJDK 17 to check whether this new LTS Java version mitigates the problem.

Edit: In our case, the .NET apps needed to update the NuGet package for Application Insights.

Greets,

xuanra avatar Feb 01 '23 12:02 xuanra

@xuanra , My major pain point is these 2 of the 9 system pods:

  • ama-logs
  • ama-logs-rs

They always take more than 400 Mi of memory. It is very difficult to accommodate them on B2s nodes.

My other pain point is these 16 pods (8 of each):

  • csi-azuredisk-node
  • csi-azurefile-node

They take 910 Mi of memory. I even raised a support ticket, but customer support was unable to figure out whether we are using them or not, and could not advise on when or why we should keep them.

Still looking for a better solution to handle the non-prod environment...
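In case it helps anyone checking the same thing, here is roughly how usage of the CSI drivers can be verified (a minimal sketch, assuming kubectl access and a working metrics-server for `kubectl top`):

```sh
# If no PersistentVolume is provisioned by disk.csi.azure.com or
# file.csi.azure.com, the per-node CSI pods are sitting idle.
kubectl get pv -o wide

# Check for PersistentVolumeClaims in every namespace.
kubectl get pvc --all-namespaces

# See what the CSI pods themselves are consuming.
kubectl top pod -n kube-system --sort-by=memory | grep csi
```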

smartaquarius10 avatar Feb 01 '23 14:02 smartaquarius10

Hello, we are facing the same problem of memory spikes after moving from v1.23.5 to v1.25.4. We had to increase the memory limits of most of our containers.

lsavini-orienteed avatar Feb 02 '23 08:02 lsavini-orienteed

@miwithro @ritazh @Karishma-Tiwari-MSFT @CocoWang-wql @jackfrancis @mainred

Hello,

Extremely sorry for tagging you, but our whole non-prod environment is down. We haven't upgraded our prod environment yet; however, engineers are unable to work on their applications.

A few days back we approached customer support about the node performance issues but did not get a useful response.

We would be really grateful for help and support on this, as it seems to be a global problem.

smartaquarius10 avatar Feb 02 '23 09:02 smartaquarius10

I need to share one finding. I have just created 2 different AKS clusters, with v1.24.9 and v1.25.4, each with 1 node of Standard B2s.

These are the metrics. In the case of v1.25.4 there is a huge spike after enabling monitoring.

[screenshot: memory metrics for the two clusters]

smartaquarius10 avatar Feb 02 '23 12:02 smartaquarius10

We've got the same problem with memory after upgrading AKS from version 1.24.6 to 1.25.4:

In the memory monitoring for the last month of one of our deployments, we can clearly see the memory usage increase after the update (01/23): [screenshot: memory usage over the last month]

cedricfortin avatar Feb 03 '23 07:02 cedricfortin

Hello, our cluster has D4s_v3 machines. Across all our Java and .NET pods, we still haven't found any pattern separating the apps whose memory demand increased from those that didn't. As an alternative to upgrading Java from 8 to 17, one of our providers suggested upgrading our VMs from D4s_v3 to D4s_v5, and we are studying the impact of this change.

Greets,

xuanra avatar Feb 03 '23 11:02 xuanra

@xuanra , I think in that case B2s nodes are totally out of the picture for this upgrade. The latest they are capable of supporting is AKS 1.24.x.

smartaquarius10 avatar Feb 06 '23 09:02 smartaquarius10


Hi @smartaquarius10 , thanks for the feedback. We have work planned to reduce the ama-logs agent memory footprint, and we will share the exact timelines and additional details of the improvements in early March. cc: @pfrcks

ganga1980 avatar Feb 08 '23 02:02 ganga1980

@ganga1980 @pfrcks

Thank you so much, Ganga. We are heavily impacted by this. Up to AKS 1.24.x we were running 3 environments within our cluster, but after upgrading to 1.25.x we are unable to manage even 1 environment.

Each environment has 11 pods.

Would be grateful for your support on this. I have already disabled the CSI pods, as we are not using any storage. For now, should we disable the ama monitoring pods as well?

If yes, then once your team resolves these issues, should we upgrade our AKS again to some specific version, or will Microsoft fix it from the backend in every version of the AKS infrastructure?

Thank you

Kind Regards, Tanul

smartaquarius10 avatar Feb 13 '23 08:02 smartaquarius10

Hello @ganga1980 @pfrcks ,

Hope you are doing well. By any chance, is it possible to speed up the process a little? Our 2 environments (comprising 22 microservices) are down because of this.

Appreciate your help and support in this matter. Thank you. Have a great day.

Hello @xuanra @cedricfortin @lsavini-orienteed, did you find any workaround for this? Thanks :)

Kind Regards, Tanul

smartaquarius10 avatar Feb 24 '23 07:02 smartaquarius10

Hi @smartaquarius10, we updated the k8s version of AKS to 1.25.5 this week and started suffering from the same issue.

In our case, we identified a problem with the JRE version when dealing with cgroups v2. Here are my findings:

Kubernetes cgroups v2 support reached GA in version 1.25, and with this change AKS moved the node OS from Ubuntu 18.04 to Ubuntu 22.04, which already uses cgroups v2 by default.

The problem with our containerized apps was related to a bug in JRE 11.0.14: that JRE had no support for cgroups v2 container awareness, which means the containers were not able to respect the memory quotas imposed in the deployment descriptor.

Oracle and OpenJDK addressed this issue by supporting cgroups v2 natively in JRE 17 and backporting the fix to JRE 15 and JRE 11.0.16+.

I've updated the base image to use a fixed JRE version (11.0.18) and the memory exhaustion was solved.
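If you want to verify whether a given JRE is container-aware under cgroups v2, a quick check is the JVM's own view of the memory limit (a sketch; `my-java-pod` is a placeholder for one of your Java pods):

```sh
# On JDK 11+ this prints "Operating System Metrics", including the memory
# limit the JVM detected. A cgroups v2-aware JRE reports the container's
# limit; an unaware one falls back to the node's total memory.
kubectl exec my-java-pod -- java -XshowSettings:system -version
```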

Regarding the AMA pods, I've compared the pods running on k8s 1.25.x with the pods running on 1.24.x, and in my opinion they seem stable, as the memory footprint is practically the same.

Hope this helps!

gonpinho avatar Feb 24 '23 09:02 gonpinho

@gonpinho , Thanks a lot for sharing the details. But the problem is that our containerized apps are not taking extra memory; they are still occupying the same amount they were taking before with 1.24.x.

What I realized is that when I create fresh 1.24.x and 1.25.x clusters, the default memory occupancy is approx. 30% higher in 1.25.x.

One of my environments, consisting of 11 pods, takes only 1 GB of memory. With AKS 1.24.x I was running 3 environments in total. The moment I shifted to 1.25.x, I had to disable 2 environments, along with the Microsoft CSI add-ons, just to accommodate the 11 custom pods, because the node memory consumption is already high.

smartaquarius10 avatar Feb 24 '23 10:02 smartaquarius10

@gonpinho , If I could downgrade the OS back to Ubuntu 18.04, that would be my first preference. I know the Ubuntu OS upgrade is what is killing the machines; no idea how to handle this.

smartaquarius10 avatar Feb 24 '23 10:02 smartaquarius10

Hi, we are facing the same problem after upgrading our dev AKS cluster from 1.23.12 to 1.25.5. Our company develops C/C++ and C# services, so we don't suffer from the JRE cgroups v2 issues. We see that memory usage is increasing over time, yet nothing but kube-system pods is running on the cluster. The symptom is that `kubectl top node` shows much more memory consumption than `free` on the host OS (Ubuntu 22.04). If we force the host OS to drop cached memory with `sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'`, the used memory doesn't change but some of the buff/cache memory moves to free, and after that `kubectl top node` shows a memory usage drop on that node. We came to the conclusion that k8s counts buff/cache memory as used memory, which is misleading, because Linux will use free memory to buffer IO and other things; that is completely normal operation.

`kubectl top node` before cache drop: [screenshot]

`free` before / after cache drop: [screenshot]

`kubectl top node` after cache drop: [screenshot]
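For reference, the command sequence behind the screenshots above is roughly this (a sketch; the middle two commands run on the node itself, e.g. over SSH or a `kubectl debug` node session):

```sh
kubectl top node                                 # memory usage as kubelet reports it
sudo sh -c 'echo 1 > /proc/sys/vm/drop_caches'   # on the node: drop the page cache
free -m                                          # "used" stays the same, buff/cache shrinks
kubectl top node                                 # reported usage now drops
```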

pintmil avatar Mar 02 '23 09:03 pintmil

Team, we are seeing the same behaviour after upgrading the cluster from 1.23.12 to 1.25.5. All the microservices running in our clusters are .NET 3.1. On raising a support request, we got to know that the cgroup version has been changed to v2. Does anyone have a similar scenario? How do we identify whether cgroup v1 is being used with .NET 3.1, and could it be the cause of the high memory consumption?

shiva-appani-hash avatar Mar 04 '23 21:03 shiva-appani-hash

Hello @ganga1980, any update on this please? Thank you.

smartaquarius10 avatar Mar 06 '23 11:03 smartaquarius10

@smartaquarius10 , We are working on rolling out our March agent release, which will bring down the usage of the ama-logs daemonset (Linux) by 80 to 100 MB. I don't have your cluster name or cluster resource ID to investigate, and we can't repro the issue you have reported. Please create a support ticket with the clusterResourceId details so that we can investigate. As a workaround, you can try applying the default configmap: `kubectl apply -f https://raw.githubusercontent.com/microsoft/Docker-Provider/ci_prod/kubernetes/container-azm-ms-agentconfig.yaml`

ganga1980 avatar Mar 07 '23 05:03 ganga1980

@ganga1980 , Thank you for the reply. Just a quick question: after raising the support ticket, should I send an email to your Microsoft ID with the support ticket details? Otherwise it will be assigned to L1 support, which will take a lot of time to reach a resolution.

Or, if you allow, I can ping you my cluster details on MS Teams.

Whichever way you prefer 😃

Currently, the ama pods are taking approx. 326 Mi of memory per node.

smartaquarius10 avatar Mar 07 '23 07:03 smartaquarius10

@ganga1980, We already have this configmap applied.

smartaquarius10 avatar Mar 07 '23 07:03 smartaquarius10

@ganga1980 for the CSI driver resource usage: if you don't need the CSI drivers, you can disable them by following https://learn.microsoft.com/en-us/azure/aks/csi-storage-drivers#disable-csi-storage-drivers-on-a-new-or-existing-cluster
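The linked doc boils down to an `az aks update` call (a sketch; the cluster and resource group names are placeholders, and flag availability depends on your az CLI version):

```sh
# Disable the disk and file CSI drivers plus the snapshot controller on an
# existing cluster. Only do this if no workloads mount Azure disk/file volumes.
az aks update --name myAKSCluster --resource-group myResourceGroup \
  --disable-disk-driver \
  --disable-file-driver \
  --disable-snapshot-controller
```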

andyzhangx avatar Mar 07 '23 13:03 andyzhangx

Hi! It seems we are facing the same issue on 1.25.5. We upgraded a few weeks ago (24.02) and the memory usage (container working set memory) jumped from the moment of the upgrade, according to the metrics tab: [screenshot: container working set memory over time]

We are using Standard_B2s VMs, as this is an internal development cluster; CSI drivers are not enabled. Has the issue been identified, or is it still under investigation?

Marchelune avatar Mar 10 '23 14:03 Marchelune

Same issue here after upgrading to 1.25.5. We are using FS2_v2 and we were not able to get the working set memory below 100%, no matter how many nodes we added to the cluster.

Very disappointing that all the memory on the node is used and reserved by Azure pods.

We had to disable Azure Insights in the cluster.

[screenshot: working set memory at 100%]
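For reference, disabling Container insights (the ama-logs agents) means turning off the monitoring addon (a sketch; names are placeholders):

```sh
# Removes the ama-logs / ama-logs-rs agents from the cluster; metrics stop
# flowing to Log Analytics until the addon is re-enabled.
az aks disable-addons --addons monitoring \
  --name myAKSCluster --resource-group myResourceGroup
```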

codigoespagueti avatar Mar 10 '23 17:03 codigoespagueti

@vishiy, @saaror would you be able to assist?


ghost avatar Mar 10 '23 18:03 ghost

@codigoespagueti @Marchelune Yeah, we are also planning to disable Azure Insights (the ama agent pods). However, we performed a few steps to bring at least one more environment back up, since not having at least 2 environments was badly hurting my team members' productivity. For now, 2 out of 3 environments are working:

  • sync; echo 1 > /proc/sys/vm/drop_caches
  • Disabled the CSI drivers
  • Disabled the custom metrics server and the custom operator created for pod autoscaling
  • Disabled 1 environment consisting of 11 pods
  • Execute a job every morning to remove the entries of evicted pods (see the sketch after this list)
  • Rather than performing a rolling update, we delete the pods first and then create the new ones within the CI/CD pipeline, because our memory peak is already at 136 to 140%; being on the edge, it is very difficult to perform parallel rolling-update deployments.
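The daily evicted-pod cleanup mentioned above can be a single command (a minimal sketch; evicted pods are reported with phase Failed, so this assumes there are no other Failed pods you want to keep):

```sh
# Delete all Failed pods (which includes Evicted ones) across all namespaces.
kubectl delete pods --field-selector=status.phase=Failed --all-namespaces
```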

Now we are waiting until the end of March, as @ganga1980's team is working on the ama agent pods. If that works, cool; otherwise we will disable the monitoring pods as well.

Kind Regards, Tanul

smartaquarius10 avatar Mar 10 '23 18:03 smartaquarius10

Same problem here. This is a single pod before and after the update, with the same codebase: [screenshot: pod memory before and after the upgrade]

JonasJes avatar Mar 14 '23 08:03 JonasJes

This might help some of you: Kubernetes 1.25 included an update to use the cgroups v2 API (cgroups is basically how Kubernetes passes resource settings to containers).

When this happened on docker-desktop for me, the memory limits on containers simply stopped having any effect; if you asked the container about its memory, it would basically report the amount of system memory on the host.

My solution was to re-enable the deprecated cgroups v1 API, and it all magically worked again...
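(On a generic Ubuntu machine, switching back to cgroups v1 is a kernel boot parameter; a hedged sketch below, shown only to illustrate the mechanism. Managed AKS nodes don't give you this control, so don't try it there.)

```sh
# Force systemd back to the legacy cgroup v1 hierarchy, then reboot.
sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0 /' /etc/default/grub
sudo update-grub
sudo reboot
```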

As long as you are using a new enough Linux kernel, I believe cgroups v2 should work, but it didn't work for me and I have yet to work out exactly why. I strongly suspect all these issues stem from the cgroups change. It DOESN'T only affect Java, as some people seem to believe; it's a Linux kernel thing.
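To check which cgroup API a machine or container is actually on, the filesystem type mounted at /sys/fs/cgroup is the simplest tell (a sketch; `my-pod` is a placeholder):

```sh
# "cgroup2fs" means cgroups v2; "tmpfs" means the legacy v1 hierarchy.
stat -fc %T /sys/fs/cgroup/

# The same check from inside a running pod:
kubectl exec my-pod -- stat -fc %T /sys/fs/cgroup/
```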

Here is a link about the change : https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/

unluckypixie avatar Mar 16 '23 11:03 unluckypixie

@unluckypixie , Thanks for sharing. How do you enable that in AKS? Could you please share the details? Thank you.

smartaquarius10 avatar Mar 17 '23 05:03 smartaquarius10

Hi team, we are also seeing high memory consumption after the AKS upgrade! Do we have any resolution yet?

NattyPradeep avatar Mar 17 '23 19:03 NattyPradeep

@unluckypixie, Could you please share the process of re-enabling cgroups v1?

smartaquarius10 avatar Mar 18 '23 06:03 smartaquarius10