Suspected Memory Leak in Management service
Describe the problem
After running for a few days (or less) on a busy host, the management service will gobble up all the available RAM. A restart of the Docker container frees the RAM, only for the climb to start again.
To Reproduce
Steps to reproduce the behavior:
- Run a busy netbird self-hosted deployment
- wait
- see that all RAM is used up and restart the container
- repeat
Expected behavior
Memory is freed up as server usage diminishes.
Are you using NetBird Cloud?
self-hosted docker
NetBird version
0.26.6 & 0.26.7
Screenshots
Here's the RAM usage graph of the last 24hrs where I restarted the management service last night when I got worried that it would crash again while I was sleeping.
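A rough way to capture the same per-container memory trend from the command line is a small sampling loop; this is a hedged sketch (the one-minute interval and CSV file name are arbitrary, not part of the original setup):

# Append one timestamped memory sample per container every minute.
while true; do
  docker stats --no-stream --format '{{.Name}},{{.MemUsage}},{{.MemPerc}}' \
    | sed "s/^/$(date -u +%FT%TZ),/" >> netbird-mem.csv
  sleep 60
done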
Hello!
Are you using Zitadel with CockroachDB?
Negative, using Authentik.
Thank you! What does "busy" mean here? How many users/peers do you have?
385 peers
@TSJasonH can you confirm which container is generating the memory consumption?
You can share the output of:
docker stats
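For a point-in-time snapshot that's easy to paste, the non-interactive form works well, for example (--no-stream and --format are standard docker stats options):

docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.PIDs}}'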
Sure, most memory is consumed by management and signal. The difference is that over time the management container will keep gobbling more and more, whereas signal stays fairly well constrained.
The docker stats output is a little misleading right now because it's 7:15 am and my cron job to restart the management container ran at 2 am. The heavy peer load starts around 8 am. There are only about 65 peers connected at the moment.
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
36a69c6aecca artifacts-management-1 0.00% 691.4MiB / 15.61GiB 4.33% 254MB / 13.6GB 0B / 58.3GB 22
190365600caa artifacts-dashboard-1 0.03% 35.42MiB / 15.61GiB 0.22% 8.14MB / 101MB 254MB / 64.6MB 19
1c6964d0aa55 artifacts-signal-1 17.53% 751MiB / 15.61GiB 4.70% 13.3GB / 6GB 0B / 0B 22
9b323ba0d1ae artifacts-coturn-1 0.51% 193.4MiB / 15.61GiB 1.21% 0B / 0B 0B / 0B 99
It's probably more evident from the RAM graph, which shows the automatic 2 am restarts of just the management container.
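For reference, the nightly workaround is just a scheduled restart of that one container, roughly this crontab entry (the container name is taken from the stats above; the log path is illustrative):

# Restart only the management container at 02:00 and keep a small log of it.
0 2 * * * docker restart artifacts-management-1 >> /var/log/netbird-mgmt-restart.log 2>&1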
Thanks for sharing the stats and the graphs again.
Can you run it again around 12 PM or 4 PM? We should see more representative numbers then.
I ended up having to restart the management container already today, so I grabbed a docker stats snapshot before doing so.
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
36a69c6aecca artifacts-management-1 189.88% 1.262GiB / 15.61GiB 8.08% 792MB / 41.9GB 0B / 110GB 24
190365600caa artifacts-dashboard-1 0.04% 35.44MiB / 15.61GiB 0.22% 8.67MB / 111MB 254MB / 64.6MB 19
1c6964d0aa55 artifacts-signal-1 19.21% 751.1MiB / 15.61GiB 4.70% 14.3GB / 6.53GB 0B / 0B 22
9b323ba0d1ae artifacts-coturn-1 0.30% 207.1MiB / 15.61GiB 1.30% 0B / 0B 0B / 0B 99
After the restart:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
36a69c6aecca artifacts-management-1 0.08% 357.8MiB / 15.61GiB 2.24% 2.05MB / 19.8MB 0B / 71.3MB 34
190365600caa artifacts-dashboard-1 0.03% 35.43MiB / 15.61GiB 0.22% 8.68MB / 111MB 254MB / 64.6MB 19
1c6964d0aa55 artifacts-signal-1 25.11% 751.1MiB / 15.61GiB 4.70% 14.3GB / 6.54GB 0B / 0B 22
9b323ba0d1ae artifacts-coturn-1 0.74% 216.4MiB / 15.61GiB 1.35% 0B / 0B 0B / 0B 99
Users were having trouble connecting, the dashboard wasn't finishing loading before trying to auto-refresh, and the logs were filling with messages like these:
"log":"2024-04-09T13:19:04Z WARN management/server/grpcserver.go:376: failed logging in peer Ej8E5w2a2/LkDrz8VK6doA04AuvIsQzAZbP6v4O05Go=\n","stream":"stderr","time":"2024-04-09T13:19:04.035874189Z"}
{"log":"2024-04-09T13:19:08Z WARN management/server/grpcserver.go:376: failed logging in peer 9mDnkjOBvl4LDogzZCg4El5b8divXv7F88/9q131FjY=\n","stream":"stderr","time":"2024-04-09T13:19:08.932876892Z"}
{"log":"2024-04-09T13:19:52Z WARN management/server/grpcserver.go:376: failed logging in peer Z3K6D5JHntkEsb27SZvJZghyK2eQQnptfrPL0FJRTwM=\n","stream":"stderr","time":"2024-04-09T13:19:52.301454042Z"}
{"log":"2024-04-09T13:19:59Z WARN management/server/grpcserver.go:376: failed logging in peer xuKFibw2/OSVhoufYdpA1kqHmWnG4PfmGRZXIejsOxc=\n","stream":"stderr","time":"2024-04-09T13:19:59.699700349Z"}
@mlsmaycon thanks for the troubleshooting chat in Slack.
After your suggestion to revert from SQLite back to the JSON file store, things have been working smoothly. I disabled the nightly management container restart to see what happened overnight, and indeed the memory usage has been holding very steady.
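For anyone else hitting this, the revert itself is a store-engine setting on the management service. A minimal sketch, assuming the reference docker-compose self-hosted setup where the engine is selected via the NETBIRD_STORE_CONFIG_ENGINE variable (verify the exact variable name and accepted values against the self-hosting docs for your version):

# Assumed: set the store engine back to the JSON file store in the management
# container's environment (e.g. setup.env), then recreate the container.
#   NETBIRD_STORE_CONFIG_ENGINE=jsonfile
docker compose up -d --force-recreate management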
Latest docker stats:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
5ba38625f571 artifacts-management-1 122.27% 859MiB / 15.61GiB 5.37% 677MB / 35.6GB 0B / 19.2MB 23
29ec4549d6de artifacts-signal-1 31.19% 196.3MiB / 15.61GiB 1.23% 4.78GB / 2.28GB 0B / 0B 21
a1bdd6223fa2 artifacts-coturn-1 2.66% 205MiB / 15.61GiB 1.28% 0B / 0B 0B / 0B 99
3c3b5bce7164 artifacts-dashboard-1 0.04% 34.49MiB / 15.61GiB 0.22% 3MB / 38MB 0B / 64.6MB 19
Memory Graph:
This was resolved with the changes in 0.28.x.