
Memory leak when repeating silences

ngc104 opened this issue 3 years ago • 52 comments (status: Open)

What did you do?

I'm using Kthxbye. When an alert fires and I add a silence with Kthxbye, the memory usage of Alertmanager increases.

You can reproduce this without Kthxbye:

1/ Generate an alert (or use any alert sent by Prometheus), for example PrometheusNotIngestingSamples.

2/ With Alertmanager, generate silences like this:

while true; do
  date # useless; just for tracing...
  amtool --alertmanager.url http://localhost:9093 silence add alertname=PrometheusNotIngestingSamples -a "MemoryLeakLover" -c "Test memory leak in Alertmanager" -d "1m"
  sleep 50
done

Note: the behaviour of Kthxbye is similar, but its default interval is 15 min instead of 1 min. Reproducing with amtool alone shows that Kthxbye itself has nothing to do with this bug.

What did you expect to see?

Nothing interesting (no abnormal memory increase)

What did you see instead? Under which circumstances?

Follow the metric container_memory_working_set_bytes for Alertmanager. After a few hours you can see it slowly growing.

Here is a screenshot of the above test, covering a little more than 12 hours: the test started at 12h20 and finished at 9h the next day.


My Alertmanager is running with the default --data.retention=120h. I guessed that after 5 days the memory would stop increasing. Wrong guess: it only stops increasing at OOM, when the container is automatically killed.

The graph above was made with Kthxbye running. The pod restarts after an OOM (left side) or after a kubectl delete pod (right side).
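If it helps anyone reproducing this outside Kubernetes, here is a rough sketch (assuming the default listen address and the standard Go/process collectors exposed on /metrics) of logging the memory growth while the silence loop runs:

# Poll Alertmanager's own metrics every minute and log resident/heap memory.
while true; do
  ts=$(date +'%Y-%m-%dT%H:%M:%S')
  curl -s http://localhost:9093/metrics \
    | grep -E '^(process_resident_memory_bytes|go_memstats_heap_inuse_bytes) ' \
    | sed "s/^/${ts} /" >> memory.log
  sleep 60
done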

Environment

  • System information:

    Kubernetes (deployed with https://github.com/prometheus-community/helm-charts/tree/main/charts/alertmanager)

  • Alertmanager version:

/alertmanager $ alertmanager --version
alertmanager, version 0.21.0 (branch: HEAD, revision: 4c6c03ebfe21009c546e4d1e9b92c371d67c021d)
  build user:       root@dee35927357f
  build date:       20200617-08:54:02
  go version:       go1.14.4
  • Alertmanager configuration file:
/alertmanager $ cat /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
receivers:
- name: rocketchat
  webhook_configs:
  - send_resolved: true
    url: https://xxxx.rocketchat.xxxx/hooks/xxxxxx/xxxxxxxxx
route:
  group_by:
  - xxxxxxx
  - yyyyyyy
  - alertname
  group_interval: 5m
  group_wait: 30s
  receiver: rocketchat
  repeat_interval: 5m
  routes:
  - continue: true
    receiver: rocketchat
templates:
- /etc/alertmanager/*.tmpl

  • Logs:
➜ k -n monitoring logs caascad-alertmanager-0 
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-07-30T09:09:46.139Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.171Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/alertmanager.yml
level=info ts=2021-07-30T09:09:46.174Z caller=main.go:485 msg=Listening address=:9093
level=warn ts=2021-07-30T12:29:49.530Z caller=notify.go:674 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify attempt failed, will retry later" attempts=1 err="Post \"https://xxxx.rocketchat.xxx/hooks/xxxxxx/xxxxxxxxx\": dial tcp x.x.x.x: connect: connection refused"
level=info ts=2021-07-30T12:32:17.213Z caller=notify.go:685 component=dispatcher receiver=rocketchat integration=webhook[0] msg="Notify success" attempts=13

ngc104 avatar Jul 30 '21 13:07 ngc104

Thanks for your report.

Would you be able to take a profile from http://alertmanager/debug/pprof/heap when the memory is high and provide it to us?

roidelapluie avatar Aug 04 '21 08:08 roidelapluie

Hello, I'm on holiday as of today... Too bad...

However, I'm not alone on this issue. You should get the profile from someone else soon (give us a few hours to reproduce the bug again).

Regards

ngc104 avatar Aug 04 '21 11:08 ngc104

Hi @roidelapluie

I'm on @ngc104's team. During his holidays I will follow this issue. I tried to reproduce the bug yesterday, but unfortunately the pod was restarted before I was able to extract the profile and the heap. I'll provide that information as soon as I can :)

Thx for your help

Regards,

Winael avatar Aug 05 '21 08:08 Winael

Hi again @roidelapluie

I was able to reproduce the issue this morning

(Screenshot from 2021-08-05 11:23.)

Please find the profile here and heap here

Let me know if you need more information

Regards,

Winael avatar Aug 05 '21 12:08 Winael

Hi guys

Any update on this case? Tell me if you need more information.

Regards,

Winael avatar Aug 17 '21 09:08 Winael

Hi @roidelapluie

Can we have an update on this issue? Do you need more information?

Regards

Winael avatar Aug 25 '21 09:08 Winael

Hi everyone

Would it be possible to have an update on this case? It could become critical on our side.

Regards,

Winael avatar Sep 07 '21 08:09 Winael

Hi everyone, @roidelapluie

> Thanks for your report.
>
> Would you be able to take a profile from http://alertmanager/debug/pprof/heap when the memory is high and provide it to us?

I sent you the requested elements at the beginning of August. Could you help us with this case, please?

Regards Winael

Winael avatar Sep 29 '21 15:09 Winael

Hello,

Somehow the heap dump does not seem to help. I see only a few megabytes of memory used here. I will try to reproduce this on my side asap.

roidelapluie avatar Sep 29 '21 23:09 roidelapluie

Hi @roidelapluie

Have you succeeded in reproducing this issue on your side? Do you have any information that could help us, please?

Regards, Winael

Winael avatar Nov 02 '21 09:11 Winael

I have not had the chance yet. Maybe you can come up with a pprof that shows the issue? The one sent previously was not really conclusive.

roidelapluie avatar Nov 02 '21 13:11 roidelapluie

Edited on 13/06/2023 (typo fix and switch to 0.25.0).
Edited on 29/08/2023 (switch to 0.26.0, ran alertmanager --version to show the version, added trace.sh).

Procedure to reproduce the bug

Get alertmanager

(Yes, sorry, this is complex for just curling a tar.gz and uncompressing it.)

ALERTMANAGER_VERSION="0.26.0"
ALERTMANAGER_ARCHIVE="alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
ALERTMANAGER_URL="https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/${ALERTMANAGER_ARCHIVE}"

# Download
[ -e "${ALERTMANAGER_ARCHIVE}" ] || curl -sL "${ALERTMANAGER_URL}" -o "${ALERTMANAGER_ARCHIVE}"
[ -d "alertmanager" ] || (tar xzf "${ALERTMANAGER_ARCHIVE}" ; mv "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64" "alertmanager")

Launch alertmanager

➜ cat <<EOF > alertmanager.yml 
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'mytest'
receivers:
  - name: 'mytest'
EOF
./alertmanager/alertmanager --version > version.txt
./alertmanager/alertmanager

Alertmanager is now running: http://localhost:9093
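(As a quick sanity check, assuming the default flags, the readiness endpoint should answer before starting the loop:)

# Expect a 200 once Alertmanager is ready to serve.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9093/-/ready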

Harass alertmanager with alerts and silences

while true; do 
  date; 
  ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ alert add alertname=foo node=bar
  ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ silence add -a "awesome user" -d 1m -c "awesome comment" alertname=foo
  sleep 60
done

The first amtool call sends an alert to Alertmanager, just as Prometheus would.

The second amtool call sets a silence on that alert, just as Kthxbye would.

The difference from Prometheus/Kthxbye: we harass Alertmanager every minute.
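For reference, a sketch of roughly what those two amtool calls do against the v2 HTTP API (the timestamps here are only illustrative; amtool computes them from -d):

# Roughly equivalent raw API calls (timestamps illustrative).
curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"foo","node":"bar"}}]'

curl -s -XPOST http://localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname","value":"foo","isRegex":false}],
       "startsAt":"2022-01-13T09:00:00Z","endsAt":"2022-01-13T09:01:00Z",
       "createdBy":"awesome user","comment":"awesome comment"}'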

go tool pprof

DATE=$(date +'%Y%m%d_%H%M%S')
FILEPREFIX="profile-${DATE}"

go tool pprof -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs.pdf"

We can also play with go tool pprof -http localhost:12345 http://localhost:9093/debug/pprof/allocs
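If it helps, a sketch of diffing two snapshots taken a few hours apart (assuming the Go toolchain is available), which tends to make slow growth stand out better than a single profile:

# Snapshot the allocs profile twice, a few hours apart, then diff them.
curl -s -o allocs-1.pb.gz http://localhost:9093/debug/pprof/allocs
# ... let the alert/silence loop run for a few hours ...
curl -s -o allocs-2.pb.gz http://localhost:9093/debug/pprof/allocs
go tool pprof -sample_index=inuse_space -top -diff_base=allocs-1.pb.gz allocs-2.pb.gz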

go tool pprof (since 29/08/2023)

➜ cat <<'EOF' > trace.sh
#!/bin/bash
DATE=$(date +'%Y%m%d_%H%M%S')
REPORTDIR="profiles"
mkdir -p "${REPORTDIR}"
FILEPREFIX="${REPORTDIR}/profile-${DATE}"
export PPROF_TMPDIR="$(mktemp -d /tmp/pprof-XXXXXX)"
echo "Traces in '${PPROF_TMPDIR}'"

echo "=======================================" >> "${REPORTDIR}/report.txt"
echo "${DATE}" >> "${REPORTDIR}/report.txt"
./alertmanager/alertmanager --version >> "${REPORTDIR}/report.txt"
go tool pprof -raw http://localhost:9093/debug/pprof/allocs > /dev/null
mv "${PPROF_TMPDIR}/pprof.alertmanager.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz" "${FILEPREFIX}.pb.gz"
go tool pprof -sample_index=alloc_objects -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=alloc_space -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=inuse_objects -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"
go tool pprof -sample_index=inuse_space -text "${FILEPREFIX}.pb.gz" | grep "^Showing" >> "${REPORTDIR}/report.txt"

rmdir "${PPROF_TMPDIR}"
EOF

Then run ./trace.sh from time to time.
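For example (purely illustrative), one trace per hour while the reproduction loop runs:

chmod +x trace.sh
# One profile per hour; report.txt accumulates one "Showing nodes ..." summary per sample type per run.
while true; do ./trace.sh; sleep 3600; done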

ngc104 avatar Jan 13 '22 16:01 ngc104

pprof profiles.pdf

I'm back on this issue. I harassed Alertmanager for a workday.

Here are some PDFs (I don't know if that is the preferred format, sorry).

09h19: profile-20220113_091951-allocs.pdf
10h59: profile-20220113_105909-allocs.pdf
16h55: profile-20220113_165542-allocs.pdf

09h19: 42Mb (nodes accounting for 33245.19kB, 78.29% of 42466.72kB total)
10h59: 75Mb (nodes accounting for 56.61MB, 75.30% of 75.18MB total)
16h55: 173Mb (nodes accounting for 134.34MB, 77.30% of 173.80MB total)

Because I can now reproduce the bug easily, please tell me what kind of pprof traces you need to investigate.

ngc104 avatar Jan 13 '22 16:01 ngc104

Hello @roidelapluie, I'm back on this issue (replacing @Winael). If you (or anybody else) can come back to it too, please tell me which pprof traces to take if you need more information.

Note: I took heap traces too, all day long. On the PDFs I read "Showing nodes accounting for 3609.71kB, 100% of 3609.71kB total" at both the beginning and the end of the day. The heap seems to be stable.

ngc104 avatar Jan 13 '22 16:01 ngc104

I was going through the profiles and noticed that you are using alloc_space, which shows the cumulative memory allocated over the process lifetime. Can you use inuse_space and check whether the memory currently held is going up?
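For example (a minimal sketch, assuming the default pprof endpoints), both views can be read from the same snapshot:

# One snapshot, two views: cumulative allocations vs. memory currently held.
curl -s -o heap.pb.gz http://localhost:9093/debug/pprof/heap
go tool pprof -sample_index=alloc_space -top heap.pb.gz
go tool pprof -sample_index=inuse_space -top heap.pb.gz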

prashbnair avatar Feb 15 '22 10:02 prashbnair

@prashbnair I'm not sure I understand... You want me to run go tool pprof -sample_index=inuse_space instead of just go tool pprof?

I will try that ASAP.

Are PDF files the best output for you?

ngc104 avatar Feb 16 '22 17:02 ngc104

I've started taking the traces. They will be posted later today.

Here is my script to take the traces:

#! /bin/bash

DATE=$(date +'%Y%m%d_%H%M%S')
FILEPREFIX="profile-${DATE}"

go tool pprof -sample_index=inuse_space -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-inuse_space.pdf"

go tool pprof -sample_index=inuse_objects -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-inuse_objects.pdf"

go tool pprof -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-alloc_space.pdf"

go tool pprof -sample_index=alloc_objects -pdf http://localhost:9093/debug/pprof/allocs
mv profile001.pdf "${FILEPREFIX}-allocs-alloc_objects.pdf"

ngc104 avatar Feb 18 '22 08:02 ngc104

pprof more profiles.pdf

I restarted my test (as described in a previous comment above).

I took some traces right after starting Alertmanager and my alert loop, at 09:19. I took more traces after lunch at 14:02, and I ended my test at 17:58.

09:19

14:02

17:58

I tried the 4 sample indexes I found for allocations. I hope this is what you wanted, @prashbnair.

cc @roidelapluie

ngc104 avatar Feb 18 '22 17:02 ngc104

@ngc104 Looking at inuse_space, I notice that it does not go up over time. Are you still able to reproduce the error?

prashbnair avatar Feb 21 '22 07:02 prashbnair

@prashbnair Yes.

In fact, the main problem is Kubernetes killing the Alertmanager container when it goes OOM, because its "memory" grows above the resource limit.

Which "memory" does Kubernetes measure when deciding to kill the container? If it is not alloc_space, I don't know how to measure it with go tool pprof.

ngc104 avatar Feb 21 '22 11:02 ngc104

I restarted my test at 9h22 UTC. Here is an interesting result at 10h03 UTC...

➜ ./alertmanager/amtool --alertmanager.url=http://localhost:9093/ silence query --expired
ID                                    Matchers         Ends At                  Created By    Comment          
2fca38e1-9dc3-4fbf-8f35-291dcac02a34  alertname="foo"  2022-05-02 09:22:11 UTC  awesome user  awesome comment  
8ab24f44-c779-4fb7-8ba5-da9061b635d0  alertname="foo"  2022-05-02 09:23:11 UTC  awesome user  awesome comment  
7347ceff-527c-4106-8dc9-861d8205b374  alertname="foo"  2022-05-02 09:24:11 UTC  awesome user  awesome comment  
26299122-2d9c-44dc-bc46-fa945379f4d7  alertname="foo"  2022-05-02 09:25:11 UTC  awesome user  awesome comment  
7d27178d-59ad-443f-81dd-e9351b6a9701  alertname="foo"  2022-05-02 09:26:11 UTC  awesome user  awesome comment  
0219f67a-507a-4b4a-99e4-bf8020fe6615  alertname="foo"  2022-05-02 09:27:11 UTC  awesome user  awesome comment  
73173814-25d6-4181-8de7-0677b8011fca  alertname="foo"  2022-05-02 09:28:11 UTC  awesome user  awesome comment  
82cf631a-07c3-4f90-8131-e7d4ade49d35  alertname="foo"  2022-05-02 09:29:11 UTC  awesome user  awesome comment  
37978ad4-9c1f-4ad7-a6af-d75e33baf75b  alertname="foo"  2022-05-02 09:30:11 UTC  awesome user  awesome comment  
dc17d617-ac31-460a-949f-4021fd945374  alertname="foo"  2022-05-02 09:31:11 UTC  awesome user  awesome comment  
41f5a1b8-ce3a-4a85-a118-672b0dce2267  alertname="foo"  2022-05-02 09:32:11 UTC  awesome user  awesome comment  
d0e8b085-fe67-49dc-9ee2-96e6161a7d0d  alertname="foo"  2022-05-02 09:33:12 UTC  awesome user  awesome comment  
222bae9c-7b07-4a80-997b-06867411d6c8  alertname="foo"  2022-05-02 09:34:12 UTC  awesome user  awesome comment  
025cd910-b437-4455-b925-2744f7a5e027  alertname="foo"  2022-05-02 09:35:12 UTC  awesome user  awesome comment  
0f6dcfc7-8320-4bfb-aab5-fc79fe661546  alertname="foo"  2022-05-02 09:36:12 UTC  awesome user  awesome comment  
3ca66616-f985-4e7c-b60f-7526fa0e7ae6  alertname="foo"  2022-05-02 09:37:12 UTC  awesome user  awesome comment  
0c2ef5b5-5e82-4823-8453-e223f18289a7  alertname="foo"  2022-05-02 09:38:12 UTC  awesome user  awesome comment  
7abe6854-ab5e-4fc8-bd6e-a7495ef8b366  alertname="foo"  2022-05-02 09:39:12 UTC  awesome user  awesome comment  
41a55f9a-b330-494f-a842-8e97e740fb27  alertname="foo"  2022-05-02 09:40:12 UTC  awesome user  awesome comment  
8c361742-e1c0-44c3-8d7b-530771335458  alertname="foo"  2022-05-02 09:41:12 UTC  awesome user  awesome comment  
e95a41c6-91f8-45e2-9bcb-aa21d32fb60e  alertname="foo"  2022-05-02 09:42:12 UTC  awesome user  awesome comment  
6016bdfa-2638-47eb-81b6-502f2317a29f  alertname="foo"  2022-05-02 09:43:12 UTC  awesome user  awesome comment  
11bb8940-1865-4512-8dd2-248535f02997  alertname="foo"  2022-05-02 09:44:12 UTC  awesome user  awesome comment  
51853d4f-a9cc-450c-bc4c-84d40975be83  alertname="foo"  2022-05-02 09:45:12 UTC  awesome user  awesome comment  
200dcc2e-5b5d-4ed8-80cd-3b87c93d35f6  alertname="foo"  2022-05-02 09:46:12 UTC  awesome user  awesome comment  
fec736cf-ff1d-4a03-95e9-9615b834eba6  alertname="foo"  2022-05-02 09:47:12 UTC  awesome user  awesome comment  
1aac8602-7c30-4eda-a37b-a8e54bcf6252  alertname="foo"  2022-05-02 09:48:12 UTC  awesome user  awesome comment  
4b3abc68-5621-400f-9c56-65b05980c568  alertname="foo"  2022-05-02 09:49:12 UTC  awesome user  awesome comment  
1a3276b4-1f61-4df4-8baf-a98906b972f5  alertname="foo"  2022-05-02 09:50:12 UTC  awesome user  awesome comment  
b06f383d-a4e5-45ae-bf09-0b7cb32a2665  alertname="foo"  2022-05-02 09:51:12 UTC  awesome user  awesome comment  
9e6a8329-e395-4578-909e-111f22c9cb8f  alertname="foo"  2022-05-02 09:52:13 UTC  awesome user  awesome comment  
ebcd1791-cb78-48d6-bd46-db768cb42476  alertname="foo"  2022-05-02 09:53:13 UTC  awesome user  awesome comment  
c0f29585-db7a-4560-90d2-825fc99b819a  alertname="foo"  2022-05-02 09:54:13 UTC  awesome user  awesome comment  
0efb0aca-4471-41d7-8daf-a9be279bac23  alertname="foo"  2022-05-02 09:55:13 UTC  awesome user  awesome comment  
a6d2db55-c5c8-4e6e-9eed-e1ec87da51e5  alertname="foo"  2022-05-02 09:56:13 UTC  awesome user  awesome comment  
90bc593e-0518-479a-8c02-4072bc0c4421  alertname="foo"  2022-05-02 09:57:13 UTC  awesome user  awesome comment  
49c4bc3b-57a3-4d6e-8bd8-8166addb18fa  alertname="foo"  2022-05-02 09:58:13 UTC  awesome user  awesome comment  
971734d2-c8ad-40bd-a27c-95c592d65efc  alertname="foo"  2022-05-02 09:59:13 UTC  awesome user  awesome comment  
c773ba25-af00-4045-9a70-b63e75ecd5a8  alertname="foo"  2022-05-02 10:00:13 UTC  awesome user  awesome comment  
fb8f23e9-32ad-4ab4-b449-cf86e34c5917  alertname="foo"  2022-05-02 10:01:13 UTC  awesome user  awesome comment  
4e95d124-d052-403d-b5e6-86d0acc22c17  alertname="foo"  2022-05-02 10:02:13 UTC  awesome user  awesome comment  

Could the "memory leak" be something other than a memory leak? Could it be expired silences accumulating without being flushed?
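One way to check that hypothesis (a sketch; metric and flag names are as I recall them for 0.2x and may differ by version) is to watch the silence counts by state next to the memory curve:

# Silence counts as Alertmanager tracks them internally, by state.
curl -s http://localhost:9093/metrics | grep '^alertmanager_silences{'
# The same count of expired silences through amtool (IDs only, then counted).
./alertmanager/amtool --alertmanager.url=http://localhost:9093/ silence query --expired -q | wc -l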

ngc104 avatar May 02 '22 10:05 ngc104

I was wrong on this one... I was exploring data.retention in the production environment. We have data.retention unset (i.e. using the default value of 120h / 5 days).

With amtool, I notice that I have a lot of silences, but none older than 5 days. There are no alerts except the 2 or 3 active and suppressed ones (3 alerts won't create a memory leak, right?).

When I restart Alertmanager, I would expect memory consumption to grow continuously for 5 days and then flatten out. But I observe that it keeps growing after 5 days. So my guess was wrong: this is not just expired silences waiting to be flushed.

I'm thinking again about a memory leak. But no idea on how to debug it.

ngc104 avatar May 13 '22 15:05 ngc104

Hello and happy new year!

Alertmanager 0.25.0 is released. Let's give it a try... Same procedure as above (in a previous comment).

12h37: profile-20230106_155103-allocs.pdf
14h06: profile-20230106_140612-allocs.pdf
16h55: profile-20230106_123718-allocs.pdf

12h37: 14Mb (nodes accounting for 14467.52kB, 100% of 14467.52kB total)
14h06: 43Mb (nodes accounting for 27635.94kB, 63.32% of 43644.80kB total)
16h55: 66Mb (nodes accounting for 43.69MB, 65.99% of 66.21MB total)

Nothing seems to have changed in 0.25.0 related to this issue.

ngc104 avatar Jan 06 '23 15:01 ngc104

Hello,

I had not noticed that I was running a standalone Alertmanager in cluster mode. In order to remove any noise from the investigation, I retried with this command line:

./alertmanager/alertmanager --cluster.listen-address=""

Then I took some traces as usual:

10h56: profile-20230113_105632-allocs.pdf
11h34: profile-20230113_113413-allocs.pdf
14h01: profile-20230113_140152-allocs.pdf
14h44: profile-20230113_144421-allocs.pdf
18h58: profile-20230113_185807-allocs.pdf

10h56: 15Mb (nodes accounting for 15216.73kB, 100% of 15216.73kB total)
11h34: 23Mb (nodes accounting for 17373.59kB, 73.87% of 23519.01kB total)
14h01: 47Mb (nodes accounting for 34402.62kB, 71.93% of 47825.89kB total)
14h44: 56Mb (nodes accounting for 40.78MB, 72.40% of 56.33MB total)
18h58: 99Mb (nodes accounting for 72MB, 72.58% of 99.20MB total)

I hope these traces are easier to use without the noise of the data structures needed for the absent cluster.

ngc104 avatar Jan 13 '23 18:01 ngc104

:wave: it would help if you could share the raw pprof files.

simonpasquier avatar Jan 16 '23 08:01 simonpasquier

Hello,

Sorry for the delay... Too much work besides this incident. I'm also not sure that the "raw" pprof files are simply what the -raw option produces instead of -pdf; please give me more info if the following does not fit.

And I forgot to start Alertmanager with --cluster.listen-address="", so this run was a one-node cluster...

Here are the files: profiles.zip

Here is my script to generate them:

#! /bin/bash

DATE=$(date +'%Y%m%d_%H%M%S')
mkdir -p profiles
FILEPREFIX="profiles/profile-${DATE}"

go tool pprof -sample_index=inuse_space -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-inuse_space.raw"
go tool pprof -sample_index=inuse_objects -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-inuse_objects.raw"
go tool pprof -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-alloc_space.raw"
go tool pprof -sample_index=alloc_objects -raw http://localhost:9093/debug/pprof/allocs > "${FILEPREFIX}-allocs-alloc_objects.raw"
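If the binary format is preferred, the endpoint already serves the gzipped protobuf that go tool pprof reads, so it can also be captured directly (a sketch, reusing the profiles/ directory created above):

# Capture the raw binary profile without going through go tool pprof.
curl -s -o "profiles/profile-$(date +'%Y%m%d_%H%M%S')-allocs.pb.gz" http://localhost:9093/debug/pprof/allocs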

Edit: I also uploaded the binary files (I have no idea how they work, but I guess you'd like them too): pprof.zip

cc @simonpasquier

ngc104 avatar Jun 13 '23 15:06 ngc104

I missed the fact that you're using version 0.21.0. Can you try with v0.25.0?

simonpasquier avatar Aug 04 '23 13:08 simonpasquier

All tests were done with the latest version available at the time.

In 2021, when I created the issue, it was tested with 0.21.0. The latest test, in 2023 (June 13th), was with 0.25.0.

Every time a new version was released, I tested again in the hope that the issue was solved. With 0.25 it is still not solved.

ngc104 avatar Aug 04 '23 14:08 ngc104

Hello,

Alertmanager 0.26 is released :partying_face:

The bug is still there :bomb:

I modified my comment above with the procedure to reproduce the bug (I changed the way to run go tool pprof).

And here are new profiles.zip files.

Below is an extract of the report.txt file that I added to the zip file:

alertmanager, version 0.26.0 (branch: HEAD, revision: d7b4f0c7322e7151d6e3b1e31cbc15361e295d8d)
  build user:       root@df8d7debeef4
  build date:       20230824-11:11:58
  go version:       go1.20.7
  platform:         linux/amd64
  tags:             netgo
...
=======================================
20230829_114408
Showing nodes accounting for 155789, 99.48% of 156600 total
Showing nodes accounting for 34175.86kB, 100% of 34175.86kB total
=======================================
20230829_124408
Showing nodes accounting for 293671, 98.51% of 298100 total
Showing nodes accounting for 52.35MB, 100% of 52.35MB total
=======================================
20230829_134408
Showing nodes accounting for 376318, 96.80% of 388746 total
Showing nodes accounting for 63.89MB, 100% of 63.89MB total
=======================================
20230829_144408
Showing nodes accounting for 432682, 95.95% of 450947 total
Showing nodes accounting for 81.63MB, 100% of 81.63MB total
=======================================
20230829_154408
Showing nodes accounting for 492675, 94.88% of 519250 total
Showing nodes accounting for 94.59MB, 100% of 94.59MB total
=======================================
20230829_164408
Showing nodes accounting for 666111, 94.82% of 702485 total
Showing nodes accounting for 98.54MB, 86.75% of 113.58MB total
=======================================
20230829_174409
Showing nodes accounting for 734087, 95.01% of 772671 total
Showing nodes accounting for 119.25MB, 89.80% of 132.79MB total
=======================================
20230829_184409
Showing nodes accounting for 819591, 94.48% of 867496 total
Showing nodes accounting for 136.83MB, 91.30% of 149.87MB total

As you can see, something is growing... (dates are in the format YYYYMMDD_HHMMSS)

cc @simonpasquier

ngc104 avatar Aug 29 '23 17:08 ngc104

Hi! 👋 I attempted to reproduce this on Alertmanager 0.26.0 using the same script but so far I have not been able to. I ran the script for 1h:

#!/bin/sh

while true; do
  ./amtool --alertmanager.url=http://127.0.0.1:9093 alert add foo=bar
  ./amtool --alertmanager.url=http://127.0.0.1:9093 silence add foo=bar -c="" -d 1m
  sleep 5
done

Here are go_memstats_heap_objects and go_memstats_heap_inuse_bytes:

(Screenshots from 2023-11-19 at 1:05 pm and 1:06 pm.)

grobinson-grafana avatar Nov 19 '23 13:11 grobinson-grafana