MeshAgent
Agent memory leak when mesh server unreachable
I manage an installation in which the MeshCentral server is run only when a remote machine needs to be accessed, and it is not uncommon for weeks or even months to pass between such occasions. While the server is down, the agents repeatedly try to contact it. Unfortunately, some of the agents leak memory (presumably) upon each connection attempt, with the result that the agent eventually consumes all memory. I have seen an agent leak hundreds of megabytes in less than a day, and gigabytes within several days. The problem has been afflicting meshagent installed on Ubuntu 20.04.
I was able to reproduce the problem easily using a virgin installation of Xubuntu 20.04 (with all software updates applied) in VirtualBox 6.1.26 with only VirtualBox Guest Additions installed.
% lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal
% uname -a
Linux xubuntu 5.11.0-37-generic #41~20.04.2-Ubuntu SMP Fri Sep 24 09:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
The agent begins leaking memory as soon as the MeshCentral server is stopped, presumably each time it tries to reconnect to the server. In the short time I monitored it closely, I saw leaks ranging in size from 153 to 628 bytes per connection attempt.
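For anyone who wants to measure the growth on their own machine, here is a minimal sketch of one way to do it. It is not the method used for the figures above. It assumes a Linux host, reads the resident set size from `/proc/<pid>/status`, and the helper names (`parse_vmrss_kib`, `growth_bytes_per_hour`, `sample_rss`) are made up for illustration. Note that RSS is reported in KiB, so it is too coarse to resolve individual leaks of a few hundred bytes; it only shows the trend over many reconnect attempts.

```python
import re
import time
from pathlib import Path
from typing import Iterator, Optional, Tuple


def parse_vmrss_kib(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status.

    Returns None when the field is absent (kernel threads have no VmRSS).
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def growth_bytes_per_hour(first_kib: int, last_kib: int, elapsed_seconds: float) -> float:
    """Average RSS growth between two samples, in bytes per hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (last_kib - first_kib) * 1024 * 3600 / elapsed_seconds


def sample_rss(pid: int, interval_seconds: float = 60.0,
               samples: int = 10) -> Iterator[Tuple[float, int]]:
    """Yield (seconds since first sample, RSS in KiB) for a running process.

    Linux only: relies on the /proc filesystem. Stops early if VmRSS
    cannot be read from the status file.
    """
    status_path = Path(f"/proc/{pid}/status")
    start = time.monotonic()
    for i in range(samples):
        rss_kib = parse_vmrss_kib(status_path.read_text())
        if rss_kib is None:
            return
        yield time.monotonic() - start, rss_kib
        if i < samples - 1:
            time.sleep(interval_seconds)
```

To use it, pass the agent's process ID (for example, as reported by `pidof meshagent`, assuming the process is named `meshagent`) to `sample_rss`, collect the samples while the server is stopped, and feed the first and last readings to `growth_bytes_per_hour`.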
MeshCentral version is 0.9.28. Agent information which Bryan has requested in other similar bug reports:
> fdsnapshot
Chain Timeout: 120405 milliseconds
FD[13] (R: 0, W: 0, E: 0) => MeshServer_ControlChannel
FD[8] (R: 0, W: 0, E: 0) => Signal_Listener
FD[10] (R: 0, W: 0, E: 0) => ILibIPAddressMonitor
FD[12] (R: 0, W: 0, E: 0) => fs.watch(/var/run/utmp)
FD[14] (R: 0, W: 0, E: 0) => net.ipcServer
FD[16] (R: 0, W: 0, E: 0) => ILibWebRTC_stun_listener_ipv4
> timerinfo
Timer: 19.9 minutes (0x1198710) [setInterval(), meshcore (InfoUpdate Timer)]
> info
Current Core: Sep 10 2021, 3345885757
Agent Time: 2021-10-07 03:37:08.960-04:00.
User Rights: 0xffffffff.
Platform: linux.
Capabilities: 15.
Server URL: wss://[redacted]/agent.ashx.
OS: Ubuntu 20.04.3 LTS.
Modules: amt-apfclient, amt-lme, amt-manage, amt-mei, linux-dhcp, monitor-border, smbios, sysinfo, util-agentlog, wifi-scanner-windows, wifi-scanner, routeplus.
Server Connection: true, State: 1.
X11 support: true.
I have not been able to reproduce this problem with agents running on Windows or Artix Linux.
There have been other reports of agent memory leaks which have been fixed:
- https://github.com/Ylianst/MeshCentral/issues/2723#issuecomment-890169645
- https://github.com/Ylianst/MeshCentral/issues/2040#issuecomment-738343770
- https://github.com/Ylianst/MeshCentral/issues/2040#issuecomment-766290835
However, the one being reported here can be reproduced so easily with a virgin Ubuntu/Xubuntu installation that it seemed prudent to report it separately rather than mix it with the existing reports in which the conversations may have gone off on different tangents.
A little while ago, I fixed a bug I found where the agent leaked a few bytes every time it retried a connection to the server. It may be the exact issue that was reported here... I'm trying to dig in the logs to see when that fix was added.
Hi Bryan,
In the links cited above, I did see mention of some leak fixes; however, the problem reported here is still present. To verify that the problem is ongoing, I followed the reproduction recipe given above and just now performed a clean Xubuntu 20.04 install as guest in a VirtualBox VM, and installed MeshAgent for Linux in the VM. As soon as I killed off the MeshCentral server, the agent began leaking memory, and continued leaking on each attempt to reestablish the connection to the server. The leak only stopped when I restarted the server and the agent was able to reconnect. (For what it's worth, the leaked memory in the agent remained leaked; it was never released.) This testing was performed with up-to-date MeshCentral and MeshAgent. As noted in the original report, I've only seen this leak on Ubuntu, which may provide a clue (or not).
MeshCentral version information:
> info
{
"meshVersion": "v1.0.33",
"nodeVersion": "v16.15.1",
"runMode": "WAN mode",
"productionMode": true,
"database": "NeDB",
"plugins": [],
"platform": "darwin",
"arch": "x64",
"pid": 26161,
"uptime": 608.131273148,
"cpuUsage": {
"user": 4090704,
"system": 453528
},
"memoryUsage": {
"rss": 133963776,
"heapTotal": 33112064,
"heapUsed": 31359408,
"external": 30218162,
"arrayBuffers": 29179698
},
"warnings": [],
"allDevGroupManagers": []
}
MeshAgent version information:
> info
Current Core: Apr 4 2022, 419748901
Agent Time: 2022-06-08 00:55:25.204-04:00.
User Rights: 0xffffffff.
Platform: linux.
Capabilities: 15.
Server URL: wss://[redacted]/agent.ashx.
OS: Ubuntu 20.04.4 LTS.
Modules: amt-apfclient, amt-lme, amt-manage, amt-mei, linux-dhcp, monitor-border, smbios, sysinfo, util-agentlog, wifi-scanner-windows, wifi-scanner.
Server Connection: true, State: 1.
X11 support: true.
For the record, I also just performed a clean install of Xubuntu 22.04 as guest in a VirtualBox VM and observe the same MeshAgent leak when the MeshCentral server becomes unreachable.
MeshAgent version information:
> info
Current Core: Apr 4 2022, 419748901
Agent Time: 2022-06-08 01:47:47.123-04:00.
User Rights: 0xffffffff.
Platform: linux.
Capabilities: 15.
Server URL: wss://[redacted]/agent.ashx.
OS: Ubuntu 22.04 LTS.
Modules: amt-apfclient, amt-lme, amt-manage, amt-mei, linux-dhcp, monitor-border, smbios, sysinfo, util-agentlog, wifi-scanner-windows, wifi-scanner.
Server Connection: true, State: 1.
X11 support: true.
I'm working on a test script that tests a few things with the control channel, so I will certainly look into this scenario.
Thanks for investigating!
Can confirm this is happening on Ubuntu 18.04 for me. Gradually, swap space is dominated by meshagent.
Are you seeing this when the agent is connected, or disconnected?
It happens on both connected and disconnected agents. It creeps slowly; it takes about 2 weeks to take over 8 GB of virtual memory.
Do you have a reverse proxy? I was helping someone else and found that their HAProxy was set with a 60 second idle timeout, but the mesh server was configured with the default 120 second idle timeout. I found a bug in the agent that was leaking a small amount of memory on reconnect. I fixed that issue, but it will require Ylian to update the agent when he gets back. In the meantime, we were able to get the agent to not leak by configuring his reverse proxy for a 2 minute idle timeout, so that the agent wasn't periodically disconnecting and reconnecting.
Yes! I do!
I run Caddy as a reverse proxy and use Cloudflare as another proxy layer (orange cloud enabled). This might be a similar situation.
Is there anything I can investigate on my setup that would help you guys?
@krayon007
I spun up an AWS EC2 free tier instance on Ubuntu 22.04 and installed meshagent and Telegraf monitoring. This graph shows precisely the memory creep:
Note that it goes on until it takes over the instance memory (1 GB) and the instance crashes. After I rebooted it, it started creeping again.
I see this behavior on multiple bare metal and virtualized Ubuntu Server hosts.
All those instances connect to my reverse-proxied server (Caddy) behind Cloudflare.
The issue is real and occurs on the latest version.