Suspected Memory Leak in Management service
Describe the problem
After running for a few days (or less) on a busy host, the management service will gobble up all the available RAM. A restart of the Docker container frees the RAM, only for the climb to start again.
To Reproduce
Steps to reproduce the behavior:
- Run a busy netbird self-hosted deployment
- wait
- see that all RAM is used up and restart the container
- repeat
Expected behavior
Memory is freed up as server usage diminishes.
Are you using NetBird Cloud?
self-hosted docker
NetBird version
0.26.6 & 0.26.7
Screenshots
Here's the RAM usage graph of the last 24hrs where I restarted the management service last night when I got worried that it would crash again while I was sleeping.
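A rough way to capture the same per-container memory trend from the command line is a small sampling loop; this is a hedged sketch (the one-minute interval and CSV file name are arbitrary, not part of the original setup):

# Append one timestamped memory sample per container every minute.
while true; do
  docker stats --no-stream --format '{{.Name}},{{.MemUsage}},{{.MemPerc}}' \
    | sed "s/^/$(date -u +%FT%TZ),/" >> netbird-mem.csv
  sleep 60
done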
Hello!
Are you using Zitadel with CockroachDB?
Negative, using Authentik.
Thank you! What does "busy" mean here? How many users/peers do you have?
385 peers
@TSJasonH can you confirm which container is generating the memory consumption?
You can share the output of:
docker stats
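For a point-in-time snapshot that's easy to paste, the non-interactive form works well, for example (--no-stream and --format are standard docker stats options):

docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.PIDs}}'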
Sure, most memory is consumed by management and signal. The difference is that over time the management container will keep gobbling more and more, whereas signal stays fairly well constrained.
The docker stats output is a little misleading right now because it's 7:15 am and my cron job to restart the management container ran at 2 am. The heavy peer load starts around 8 am. There are only about 65 peers connected at the moment.
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
36a69c6aecca artifacts-management-1 0.00% 691.4MiB / 15.61GiB 4.33% 254MB / 13.6GB 0B / 58.3GB 22
190365600caa artifacts-dashboard-1 0.03% 35.42MiB / 15.61GiB 0.22% 8.14MB / 101MB 254MB / 64.6MB 19
1c6964d0aa55 artifacts-signal-1 17.53% 751MiB / 15.61GiB 4.70% 13.3GB / 6GB 0B / 0B 22
9b323ba0d1ae artifacts-coturn-1 0.51% 193.4MiB / 15.61GiB 1.21% 0B / 0B 0B / 0B 99
It's probably more evident from the RAM graph, which shows the automatic 2 am restarts of just the management container.
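For reference, the nightly workaround is just a scheduled restart of that one container, roughly this crontab entry (the container name is taken from the stats above; the log path is illustrative):

# Restart only the management container at 02:00 and keep a small log of it.
0 2 * * * docker restart artifacts-management-1 >> /var/log/netbird-mgmt-restart.log 2>&1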
Thanks for sharing the stats and the graphs again.
Can you run it again around 12 PM or 4 PM? We should see more representative numbers then.
I ended up having to restart the management container already today, so I grabbed a docker stats snapshot before doing so.
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
36a69c6aecca artifacts-management-1 189.88% 1.262GiB / 15.61GiB 8.08% 792MB / 41.9GB 0B / 110GB 24
190365600caa artifacts-dashboard-1 0.04% 35.44MiB / 15.61GiB 0.22% 8.67MB / 111MB 254MB / 64.6MB 19
1c6964d0aa55 artifacts-signal-1 19.21% 751.1MiB / 15.61GiB 4.70% 14.3GB / 6.53GB 0B / 0B 22
9b323ba0d1ae artifacts-coturn-1 0.30% 207.1MiB / 15.61GiB 1.30% 0B / 0B 0B / 0B 99
After the restart:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
36a69c6aecca artifacts-management-1 0.08% 357.8MiB / 15.61GiB 2.24% 2.05MB / 19.8MB 0B / 71.3MB 34
190365600caa artifacts-dashboard-1 0.03% 35.43MiB / 15.61GiB 0.22% 8.68MB / 111MB 254MB / 64.6MB 19
1c6964d0aa55 artifacts-signal-1 25.11% 751.1MiB / 15.61GiB 4.70% 14.3GB / 6.54GB 0B / 0B 22
9b323ba0d1ae artifacts-coturn-1 0.74% 216.4MiB / 15.61GiB 1.35% 0B / 0B 0B / 0B 99
Users were having trouble connecting, the dashboard wasn't finishing loading before trying to auto-refresh, and the logs were filling with messages like these:
"log":"2024-04-09T13:19:04Z WARN management/server/grpcserver.go:376: failed logging in peer Ej8E5w2a2/LkDrz8VK6doA04AuvIsQzAZbP6v4O05Go=\n","stream":"stderr","time":"2024-04-09T13:19:04.035874189Z"}
{"log":"2024-04-09T13:19:08Z WARN management/server/grpcserver.go:376: failed logging in peer 9mDnkjOBvl4LDogzZCg4El5b8divXv7F88/9q131FjY=\n","stream":"stderr","time":"2024-04-09T13:19:08.932876892Z"}
{"log":"2024-04-09T13:19:52Z WARN management/server/grpcserver.go:376: failed logging in peer Z3K6D5JHntkEsb27SZvJZghyK2eQQnptfrPL0FJRTwM=\n","stream":"stderr","time":"2024-04-09T13:19:52.301454042Z"}
{"log":"2024-04-09T13:19:59Z WARN management/server/grpcserver.go:376: failed logging in peer xuKFibw2/OSVhoufYdpA1kqHmWnG4PfmGRZXIejsOxc=\n","stream":"stderr","time":"2024-04-09T13:19:59.699700349Z"}
@mlsmaycon thanks for the troubleshooting chat in Slack.
After your suggestion to revert from SQLite back to the JSON file store, things have been working smoothly. I disabled the nightly management container restart to see what happened overnight, and indeed the memory usage has been holding very steady.
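For anyone else hitting this, the revert itself is a store-engine setting on the management service. A minimal sketch, assuming the reference docker-compose self-hosted setup where the engine is selected via the NETBIRD_STORE_CONFIG_ENGINE variable (verify the exact variable name and accepted values against the self-hosting docs for your version):

# Assumed: set the store engine back to the JSON file store in the management
# container's environment (e.g. setup.env), then recreate the container.
#   NETBIRD_STORE_CONFIG_ENGINE=jsonfile
docker compose up -d --force-recreate management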
Latest docker stats:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
5ba38625f571 artifacts-management-1 122.27% 859MiB / 15.61GiB 5.37% 677MB / 35.6GB 0B / 19.2MB 23
29ec4549d6de artifacts-signal-1 31.19% 196.3MiB / 15.61GiB 1.23% 4.78GB / 2.28GB 0B / 0B 21
a1bdd6223fa2 artifacts-coturn-1 2.66% 205MiB / 15.61GiB 1.28% 0B / 0B 0B / 0B 99
3c3b5bce7164 artifacts-dashboard-1 0.04% 34.49MiB / 15.61GiB 0.22% 3MB / 38MB 0B / 64.6MB 19
Memory Graph:
This was resolved with the changes in 0.28.x.