AKS

ama-metrics-operator-targets consuming more and more cluster memory

Open ChrisJD-VMC opened this issue 6 months ago • 19 comments

Describe the bug

I don't know if this is the correct place for this; if it's not, please advise where to direct this issue. TL;DR: ama-metrics-operator-targets seems to have a memory leak (I assume it's not designed to slowly consume more and more RAM).

This morning I got alerts from both AKS clusters I run that a container in each had been OOM killed. Some investigation revealed that in both clusters the container was ama-metrics-operator-targets (related to Azure Managed Prometheus monitoring, as I understand it).

Looking at the memory usage for those containers in Prometheus, I can see it ramp up over probably a bit more than a week, until the containers are killed at about 2 GB of RAM usage. Memory use then drops back to 60-70 MB and starts climbing again.
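For anyone who wants to chart the same thing outside the portal, here is a minimal Python sketch of pulling the memory series through the standard Prometheus HTTP API `/api/v1/query_range` endpoint and converting it to MiB per pod. The metric name (`container_memory_working_set_bytes`, the usual cAdvisor metric), the `namespace="kube-system"` and `container` label values, and the base URL are assumptions rather than anything confirmed in this report, and `build_range_query_url`/`parse_matrix` are hypothetical helpers.

```python
import json
import urllib.parse

# Assumed PromQL selector: container_memory_working_set_bytes is the usual
# cAdvisor memory metric; the namespace and container label values are guesses.
QUERY = (
    'container_memory_working_set_bytes'
    '{namespace="kube-system", container="ama-metrics-operator-targets"}'
)


def build_range_query_url(base_url: str, start: float, end: float, step: str = "5m") -> str:
    """Build a Prometheus HTTP API /api/v1/query_range URL for QUERY."""
    params = urllib.parse.urlencode({"query": QUERY, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(body: str) -> dict[str, list[tuple[float, float]]]:
    """Convert a query_range JSON response into {pod: [(unix_ts, MiB), ...]}."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    series: dict[str, list[tuple[float, float]]] = {}
    for result in payload["data"]["result"]:
        pod = result["metric"].get("pod", "<unknown>")
        # Each value is a [unix_timestamp, "sample as string"] pair; bytes -> MiB.
        points = [(float(ts), float(v) / (1024 * 1024)) for ts, v in result["values"]]
        series.setdefault(pod, []).extend(points)
    for points in series.values():
        points.sort()  # keep each pod's samples in time order
    return series
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response text to `parse_matrix` gives one list of points per pod. An Azure Monitor workspace query endpoint also requires authentication, which this sketch leaves out.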

This is the first time this has happened, and we've been using Azure Managed Prometheus for about 3 months. Given how fast RAM usage is increasing (at this rate the containers would already have been OOM killed several times if the problem had existed from the start), I assume some new issue, probably introduced in the last couple of weeks, is causing this. We have not changed either cluster's configuration for several months, and we haven't deployed any container changes to one of the clusters for 3 months. Both clusters are configured to auto-update minor cluster versions.
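The growth-rate reasoning can be made concrete with a least-squares slope over memory samples and a linear projection of when the container reaches its limit. This is only a sketch: `slope_mib_per_day` and `days_until_limit` are hypothetical helpers, and the ~65 MiB starting point, ~2048 MiB kill point, and 8-day window in the usage example are rough readings of the approximate figures in this report, not measured values.

```python
def slope_mib_per_day(points: list[tuple[float, float]]) -> float:
    """Least-squares slope of (unix_ts_seconds, MiB) samples, in MiB per day."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in points) / n
    mean_m = sum(m for _, m in points) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in points)
    var = sum((t - mean_t) ** 2 for t, _ in points)
    if var == 0:
        raise ValueError("samples must span more than one timestamp")
    return cov / var * 86400  # seconds -> days


def days_until_limit(current_mib: float, limit_mib: float, rate_mib_per_day: float) -> float:
    """Linear projection of days left before the limit; inf if memory is not growing."""
    if rate_mib_per_day <= 0:
        return float("inf")
    return max(0.0, (limit_mib - current_mib) / rate_mib_per_day)


# Rough figures from the report (assumed): ~65 MiB rising to ~2048 MiB over ~8 days.
rate = slope_mib_per_day([(0.0, 65.0), (8 * 86400.0, 2048.0)])
```

With those rough figures the slope comes out at about 248 MiB per day. A linear fit is a simplification; a real leak may not grow at a constant rate.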

To Reproduce

Steps to reproduce the behavior: I assume just having a cluster configured with Azure Managed Prometheus monitoring is enough.

Expected behavior

The ama-metrics-operator-targets container's RAM usage does not continuously grow over time.
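The expected behavior can be expressed as a simple automated check: split the memory series wherever it drops sharply (an OOM kill and restart, as in the graphs described above) and flag any segment that grows by a large factor before the drop. The 50% drop threshold and 2x growth factor are arbitrary assumptions, and `split_on_resets`/`looks_like_leak` are hypothetical names, not part of any Azure tooling.

```python
def split_on_resets(points, drop_ratio=0.5):
    """Split (timestamp, MiB) samples into segments at sharp drops.

    A sample below drop_ratio * previous sample is treated as a restart
    (for example after an OOM kill); this threshold is an assumption.
    """
    segments = [[]]
    prev = None
    for t, mib in points:
        if prev is not None and mib < prev * drop_ratio:
            segments.append([])  # start a new segment after the drop
        segments[-1].append((t, mib))
        prev = mib
    return [seg for seg in segments if seg]


def looks_like_leak(points, growth_factor=2.0, drop_ratio=0.5):
    """Return True if any restart-to-restart segment grows by growth_factor."""
    for seg in split_on_resets(points, drop_ratio):
        start, end = seg[0][1], seg[-1][1]
        if len(seg) >= 2 and start > 0 and end >= start * growth_factor:
            return True
    return False
```

A series that stays roughly flat passes, while the sawtooth pattern from this report (slow climb, OOM kill, drop back to 60-70 MB) is flagged.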

Screenshots

(Images not preserved: 7 days ago; last night; after the OOM kill occurred; climbing again.)

Environment

- CLI version: 2.62.0
- Kubernetes version: 1.29.7 and 1.30.3
- Browser: Chrome

Additional Info

The clusters are in two different regions and are connected through AMPLS (Azure Monitor Private Link Scope) to the same Azure Monitor Workspace. One Azure Managed Prometheus instance is connected to the workspace. Data still appears to be collected and can be viewed fine in Prometheus.

ChrisJD-VMC · Aug 26 '24 20:08