[BUG] Memory Leak in EventPublisher process
Description
We've observed a memory leak in the EventPublisher process on the master. This can be seen by adding a simple engine to a salt-minion that sends lots of auth requests. The leak doesn't appear to be specific to auth events; a large number of any kind of event should trigger it.
Setup
import salt.crypt
import salt.ext.tornado.ioloop
import salt.ext.tornado.gen
import logging

log = logging.getLogger(__name__)


@salt.ext.tornado.gen.coroutine
def do_auth(opts, io_loop):
    # Loop forever, re-authenticating against the master to generate a steady
    # stream of auth events.
    while True:
        auth = salt.crypt.AsyncAuth(opts, io_loop=io_loop)
        log.info("ENGINE DO AUTH")
        yield auth.sign_in()


def start():
    # Engine entry point: run the auth loop on a dedicated IOLoop.
    io_loop = salt.ext.tornado.ioloop.IOLoop()
    __opts__['master_uri'] = 'tcp://127.0.0.1:4506'
    io_loop.spawn_callback(do_auth, __opts__, io_loop)
    io_loop.start()
Versions
v3004
Does the number of open file descriptors appear to increase too? I'm wondering if this is related to #61521
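One rough way to check both at once is sketched below (not part of Salt; it assumes psutil is installed and that the publisher's process title contains "EventPublisher", which appears to be the case when setproctitle is available):

import time

import psutil


def find_event_publisher():
    # Look for the master subprocess whose command line mentions EventPublisher.
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "EventPublisher" in cmdline:
            return proc
    return None


def watch(interval=60):
    proc = find_event_publisher()
    if proc is None:
        print("EventPublisher process not found")
        return
    while True:
        rss_mib = proc.memory_info().rss / (1024 * 1024)
        open_fds = proc.num_fds()  # Unix only
        print(f"pid={proc.pid} rss={rss_mib:.1f} MiB open_fds={open_fds}")
        time.sleep(interval)


if __name__ == "__main__":
    watch()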
Here is a patch that will address the issue for 3004. I've tested this against the master branch and I don't see the memory leak. Looks like the recent transport refactor inadvertently addressed the issue.
0001-Slow-memory-leak-fix.patch.txt
@frebib I don't believe this is the same issue as #61521. I believe the root of #61521 is caused by this commit. We're probably creating multiple instances of events and transports which needs to be addressed. #61468 should be a step in the right direction.
Is there any progress on this?

Salt Version:
    Salt: 3006.4

Python Version:
    Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]

Dependency Versions:
    cffi: 1.14.6
    cherrypy: unknown
    dateutil: 2.8.1
    docker-py: Not Installed
    gitdb: Not Installed
    gitpython: Not Installed
    Jinja2: 3.1.2
    libgit2: Not Installed
    looseversion: 1.0.2
    M2Crypto: Not Installed
    Mako: Not Installed
    msgpack: 1.0.2
    msgpack-pure: Not Installed
    mysql-python: Not Installed
    packaging: 22.0
    pycparser: 2.21
    pycrypto: Not Installed
    pycryptodome: 3.9.8
    pygit2: Not Installed
    python-gnupg: 0.4.8
    PyYAML: 6.0.1
    PyZMQ: 23.2.0
    relenv: Not Installed
    smmap: Not Installed
    timelib: 0.2.4
    Tornado: 4.5.3
    ZMQ: 4.3.4

System Versions:
    dist: alpine 3.14.6
    locale: utf-8
    machine: x86_64
    release: 5.15.0-87-generic
    system: Linux
    version: Alpine Linux 3.14.6
It looks a lot better in 3006.5 when using this https://github.com/saltstack/salt/commit/af12352cba4a4ecd3859addbe21ff7169546fc9c solution from minion.py. Will be observing over the Christmas holidays.
Fantastic news, thanks for the update.
How do you debug a memory leak like this? With tracemalloc or some other tools?
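For what it's worth, the tracemalloc approach mentioned here looks roughly like this (a sketch only; it assumes you can run it inside the leaking process, for example from an engine or a temporarily patched entry point):

import time
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

previous = tracemalloc.take_snapshot()
while True:
    time.sleep(300)
    current = tracemalloc.take_snapshot()
    # Print the ten allocation sites whose total size grew the most since the
    # previous snapshot; a site that keeps growing across diffs is a leak candidate.
    for stat in current.compare_to(previous, "lineno")[:10]:
        print(stat)
    previous = current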
@max-arnold
In the past we've relied on __del__ to clean things like this up. That is generally considered an anti-pattern in Python, and we've been working towards cleaning that practice up. Most recently I've started adding warnings if an object is garbage collected without being properly closed (https://github.com/saltstack/salt/pull/65559). Several places (now fixed) where transport clients were not being closed have been revealed in our test suite. With this code in place it should be easier for users to identify these kinds of issues and report them with useful debug info.
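The general shape of that warning is easy to sketch (illustrative only, not the actual code from that PR): remember whether close() was called, and complain from __del__ when it wasn't.

import warnings


class TransportClient:
    # Toy stand-in for a resource that must be closed explicitly.

    def __init__(self):
        self._closed = False

    def close(self):
        # A real client would tear down sockets/streams here.
        self._closed = True

    def __del__(self):
        if not self._closed:
            warnings.warn(
                f"unclosed {self.__class__.__name__} {self!r}",
                ResourceWarning,
                source=self,
            )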
We've also been working towards better tooling for debugging running Salt processes. We've added debug-symbol packages in 3006.x, and there is a newer tool, relenv-gdb-dbg, to help debug these kinds of issues.
Looks a lot better when I review the EventPublisher process.
@dwoz I needed to flip this row in the minion.py code. If I don't do this, I won't get any answers or scheduled-event data from the deltaproxy minions back to the master. It's just a quick fix, probably not the best one; the issue is that the deltaproxy answer is a list, if I remember correctly.
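Purely to illustrate what "the deltaproxy answer is a list" means (this is not the actual minion.py change, just the general pattern): a handler written for a single dict payload drops data when a deltaproxy hands it a list of payloads, unless the list case is normalized first.

def dispatch_returns(payload, process_event):
    # Hypothetical helper: a regular minion returns a single dict, while a
    # deltaproxy can return a list of dicts, so normalize before processing.
    events = payload if isinstance(payload, list) else [payload]
    for event in events:
        process_event(event)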
@Zpell82 sounds like a bug that should be its own issue?
@Zpell82 Can you open a separate issue for this?
We are running salt 3006.6 and are seeing this memory leak. I examined salt/minion.py and it already has the fix suggested in https://github.com/saltstack/salt/issues/61565#issuecomment-1867502647
Further, after the upgrade from 3006.5 to 3006.6, we are now seeing this problem present itself much faster than before. It used to take about 20 days; now we've noticed it after just 7 days.
Yeah, seeing this in 3006.6 in testing as well.
I believe this is resolved with 3006.7
can folks confirm?
I attempted to install 3006.7, but I got an error while trying to salt-pip install pygit2 to use with a git backend.
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'readelf'
OS is Debian Bullseye
Maybe a new dependency needs to be added to the deb?
I still see 3006.7 leaking in our environment... the EventPublisher process is over 8GB of memory in ~48 hours since upgrading (edited from originally saying "less than 24 hours", I lost track of what day it was /facepalm).
A master still at 3006.4 has its EventPublisher up to 24GB in about 10 days since its last restart.
These masters are both using an external job cache. Masters not using an external job cache don't seem to leak noticeably. Are others seeing the leak in 3006.6/etc. while using an external job cache? Maybe the external job cache is a red herring; I haven't dug into how that is all glued together and whether it is handled by the EventPublisher process...
@lomeroe as a matter of fact, ours is using an external job cache (redis).
Interesting... maybe there is something there with the external job cache and the EventPublisher process leaking (we are not using redis as our external job cache, so it would seem not to be specifically related to a single external job cache type, at least).
@dwoz - thoughts?
This is still a problem in 3006.7 for us, and we have a cron job to restart the Salt master once a day. We are not using an external job cache.
@johje349 @jheiselman - do either of you run orchestrations or other background jobs that utilize batch mode? After restarting a master with the patch I mention in #66249, memory for the EventPublisher process hasn't gone over 400MB in almost 24 hours, which seems considerably better than what we have been seeing; typically it is several GB in a day. Obviously I need more time monitoring to really tell if it has something to do with it, so it could just be coincidental...
We've got quite a few orchestrations/api initiated jobs that use batch mode (all initiated on the masters exhibiting the issue), so if they were hanging up/never fully returning, I suppose it's possible that could cause memory issues in the EventPublisher process somehow. I wouldn't have ever guessed that, but...
We do not have very many, but yes, we do have a few scheduled jobs (started via the salt-api) that utilize batch mode. None of our orchestrations use batch mode.
At this point in time, both of our salt masters were last restarted four days ago. One of them is currently using 3.5 GB of memory and the other less than 1 GB. They share the same workloads/minions; there's no difference between the two. Their salt-apis are behind a load balancer, so one of them may be getting heavier jobs than the other purely by chance, but the load balancer is configured for simple round-robin, so they should be getting the same number of jobs.
Yes we use orchestration states with batch set to 20%.
After 5 days, EventPublisher memory usage is still hovering around 400MB. The only change made was applying the patch mentioned in #66249. It seems fairly likely to me that that issue causes memory usage to grow in the EventPublisher process because jobs run in batch mode never actually end.
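For anyone trying to reproduce this, "batch mode" here means jobs run in chunks across the targeted minions, e.g. --batch-size on the CLI or batch: 20% in an orchestration. A minimal way to generate such jobs from the Python API is sketched below (run on the master; the target and function are just placeholders):

import salt.client

local = salt.client.LocalClient()

# cmd_batch() yields results as each batch of minions returns; here 20% of the
# targeted minions run test.ping at a time, mirroring a batch: 20% orchestration.
for ret in local.cmd_batch("*", "test.ping", batch="20%"):
    print(ret)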
Found the issue: I had master: ["mastername"] in the minion config. Removed the brackets and it started working just fine.
@lomeroe's patch in #66249 worked for us in eliminating the memory leak we were observing in EventPublisher. We don't rely on orchestration, external job caches, or salt-api but do use batches. Thanks for the patch!