
ama-metrics-operator-targets consuming more and more cluster memory

ChrisJD-VMC opened this issue 1 year ago · 19 comments

Describe the bug I don't know if this is the correct place for this; if it's not, please advise where to direct this issue. tl;dr: ama-metrics-operator-targets seems to have a memory leak (I assume it's not designed to slowly consume more and more RAM).

I got alerts from both AKS clusters I run this morning that a container in each had been OOM killed. Some investigation revealed that the containers in question were both the ama-metrics-operator-targets (Azure Managed Prometheus monitoring related, to my understanding).

Looking at the memory usage for those containers in Prometheus, I can see a ramp-up in memory usage over the course of probably a bit more than a week, followed by the containers being killed at about 2 GB of RAM usage. The memory use then drops back to 60-70 MB and starts climbing again.
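
For reference, a quick way to spot-check the live per-container usage on the cluster itself (a sketch: it requires metrics-server and uses the rsName pod label that the deployment puts on its pods):

kubectl top pod -n kube-system -l rsName=ama-metrics-operator-targets --containers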

This is the first time this has happened. We've been using Azure Managed Prometheus for about 3 months. Given the rate the RAM usage is increasing at, I assume some kind of new issue is causing this, probably introduced in the last couple of weeks. We have not made any changes to either cluster's configuration for several months, and one of the clusters hasn't had any container changes deployed by us for 3 months. Both are configured to auto-update for minor cluster versions.

To Reproduce Steps to reproduce the behavior: I assume just having a cluster configured with Prometheus monitoring is enough.
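
For completeness, enabling the managed Prometheus add-on on an existing cluster is roughly the following (a sketch; cluster and resource group names are placeholders):

az aks update --enable-azure-monitor-metrics --name <cluster-name> --resource-group <resource-group>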

Expected behavior ama-metrics-operator-targets container RAM usage does not continuously grow over time.

Screenshots: 7 days ago [image], last night [image], after the OOM kill occurred [image], climbing again [image]

Environment (please complete the following information):
CLI version: 2.62.0
Kubernetes version: 1.29.7 and 1.30.3
Browser: Chrome

Additional Info Clusters are in two different regions, connected using AMPLS to the same Azure Monitor Workspace, with one Azure Managed Prometheus instance connected to the workspace. Data still appears to be collected and can be viewed fine in Prometheus.

ChrisJD-VMC · Aug 26 '24 20:08

Started at 24 MB for the config-reader 4 days ago; today it's at 600 MB: [image] [image]

boyko11 · Aug 27 '24 20:08

We have a similar issue. The memory usage keeps growing and then drops suddenly.

akari-m · Aug 28 '24 07:08

Hi - this is a known issue that we will be rolling out a fix for.

vishiy · Aug 28 '24 22:08

Hi @vishiy, do we have any ETA for the fix? Thanks.

JoeyC-Dev · Aug 29 '24 08:08

Hi @vishiy

Is there any way we could free the memory of the ama-metrics-operator-targets pod, such as by manually killing it?

Will killing the pod have any impact on the AKS cluster?

akari-m · Aug 30 '24 02:08

> Hi @vishiy
>
> Is there any way we could free the memory of the ama-metrics-operator-targets pod, such as by manually killing it?
>
> Will killing the pod have any impact on the AKS cluster?

@akari-m, I opened a support ticket for this issue. The support engineer had me delete all the pods whose names start with ama-. The pods were recreated automatically, we reclaimed over 2 GB of memory, and there was no impact to our application pods…
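
For reference, what we ran was roughly the following (a sketch; it deletes every kube-system pod whose name starts with ama-, and their owning Deployments/DaemonSets recreate them immediately):

kubectl -n kube-system get pods -o name | grep '^pod/ama-' | xargs kubectl -n kube-system delete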

boyko11 · Aug 30 '24 04:08

Happened today to our production cluster as well. CC @vishiy.

In general it seems like Azure Monitor for AKS is not in good shape. It's a very convenient one-click deployment, but the stability and quality of the setup are pretty damn low for a paid product.

iakovmarkov · Sep 06 '24 16:09

This started happening to us on the 16th of August 2024 (see chart), both in WestEurope and WestUS clusters, all by itself. We are on AKS version 1.28.9.

The leak causes node reboots for us, and considering this component is running on the system nodepool, it leads to all sorts of bad effects.

@vishiy is there any ETA for the fix you mentioned, or is there anything else we can do to resolve the issue?


EDIT: I also note that the targetallocator container in the ama-metrics-operator-targets deployment has an 8Gi memory limit and 5 CPU cores. Surely these cannot be reasonable numbers?

containers:
  - name: targetallocator
    image: mcr.microsoft.com/azuremonitor/containerinsights/ciprod/prometheus-collector/images:6.9.0-main-07-22-2024-2e3dfb56-targetallocator
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: "5"
        memory: 8Gi
      requests:
        cpu: 10m
        memory: 50Mi

[image: memory usage chart]

ppanyukov · Sep 10 '24 13:09

We have the same issue.

adejongh · Sep 11 '24 08:09

Hello @vishiy. We have the same issue. Please assist.

AkariH · Sep 11 '24 09:09

This issue also causes OOM kills for us.

Our temporary solution was to create a cronjob that restarts the operator every day so that it doesn't consume a lot of memory and cause issues on the nodes.

Please notify here once it is resolved.

twanbeeren · Sep 13 '24 09:09

It has happened again, exactly 7 days after last time. I don't want this to become a weekly event in my job, so I've also created a cronjob to kill the ama-metrics-* pods.

Again, not what I'd expect from a commercial product.

iakovmarkov · Sep 13 '24 12:09

Can someone post a 1-liner kubectl command to create this cron job as a temporary workaround? ;)

deyanp · Sep 13 '24 13:09

Cronjob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: kill-ama-operator-cj
  namespace: kube-system
spec:
  schedule: "0 6 * * *" # Runs every day at 6:00 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: kill-pod
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              # Find the current ama-metrics-operator-targets pod by its rsName label
              POD=$(kubectl get pods -n kube-system -l rsName=ama-metrics-operator-targets -o jsonpath='{.items[0].metadata.name}')
              # Delete it; the Deployment recreates it with a fresh (low) memory footprint
              kubectl delete pod $POD -n kube-system
          restartPolicy: OnFailure

If you want to check the Cronjob deployment: kubectl create job test-job --from=cronjob/kill-ama-operator-cj -n kube-system

If your job fails with a kubectl error, you probably need to add a ServiceAccount.

twanbeeren · Sep 13 '24 15:09

Can confirm this is happening on our clusters as well.

antiphon0 · Sep 14 '24 20:09

> Hi - this is a known issue that we will be rolling out a fix for.

Is this ever going to be fixed, or will it be my fever dream forever?

shiroshiro14 · Sep 19 '24 01:09

Hi, same issue for 3 weeks now. Can we have an estimated date for the fix? @vishiy

martindruart · Sep 20 '24 06:09

@twanbeeren, I took the liberty of extending your version:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: kill-ama-metrics-operator-targets-cj-sa
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: kube-system
  name: kill-ama-metrics-operator-targets-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: kill-ama-metrics-operator-targets-cj-sa-binding
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: kill-ama-metrics-operator-targets-cj-sa
    namespace: kube-system
roleRef:
  kind: Role
  name: kill-ama-metrics-operator-targets-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kill-ama-metrics-operator-targets-cj
  namespace: kube-system
spec:
  schedule: "0 6 * * *" # Runs every day at 6:00 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: kill-ama-metrics-operator-targets-cj-sa
          containers:
          - name: kill-pod
            image: bitnami/kubectl:latest
            command:
            - /bin/sh
            - -c
            - |
              POD=$(kubectl get pods -n kube-system -l rsName=ama-metrics-operator-targets -o jsonpath='{.items[0].metadata.name}')
              kubectl delete pod $POD -n kube-system
          restartPolicy: OnFailure
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
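
To deploy it (the manifest filename is arbitrary): kubectl apply -f kill-ama-metrics-operator-targets-cj.yaml. The first run then happens at the next 06:00.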

deyanp · Sep 20 '24 07:09

The fix for this is rolling out currently. It should roll out to all regions by 09/30.

rashmichandrashekar · Sep 20 '24 18:09

> The fix for this is rolling out currently. It should roll out to all regions by 09/30.

I am still facing this issue.

sivashankaran22 · Oct 01 '24 10:10

> The fix for this is rolling out currently. It should roll out to all regions by 09/30. I am still facing this issue.

@sivashankaran22 - Could you provide me your cluster id?

rashmichandrashekar · Oct 01 '24 15:10

> The fix for this is rolling out currently. It should roll out to all regions by 09/30.

What is the version of the fix that you are rolling out, so we can check that it has been deployed?

We still see very high memory use for all the "ama-" pods, which is really not acceptable.
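
For reference, the image tags currently deployed can be checked with something like this (a sketch, using the deployment name mentioned earlier in the thread):

kubectl -n kube-system get deployment ama-metrics-operator-targets -o jsonpath='{.spec.template.spec.containers[*].image}'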

[image: memory usage of the ama- pods]

adejongh · Oct 04 '24 10:10

Memory usage has been back to normal for a while for us, so I'm going to close this as resolved as far as I'm concerned.

ChrisJD-VMC · Nov 26 '24 23:11