[BUG] Memory Leak in EventPublisher process
Description
We've observed a memory leak in the EventPublisher process on the master. This can be seen by adding a simple engine to a salt-minion that sends lots of auth requests. The leak doesn't appear to be specific to auth events; a large number of any kind of event should trigger it.
Setup
import salt.crypt
import salt.ext.tornado.ioloop
import salt.ext.tornado.gen
import logging

log = logging.getLogger(__name__)


@salt.ext.tornado.gen.coroutine
def do_auth(opts, io_loop):
    # Loop forever, re-authenticating against the master to generate a steady
    # stream of auth events.
    while True:
        auth = salt.crypt.AsyncAuth(opts, io_loop=io_loop)
        log.info("ENGINE DO AUTH")
        yield auth.sign_in()


def start():
    # Engine entry point: run the auth loop on a dedicated IOLoop.
    io_loop = salt.ext.tornado.ioloop.IOLoop()
    __opts__['master_uri'] = 'tcp://127.0.0.1:4506'
    io_loop.spawn_callback(do_auth, __opts__, io_loop)
    io_loop.start()
Versions
v3004
Does the number of open file descriptors appear to increase too? I'm wondering if this is related to #61521
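One rough way to check both at once is sketched below (not part of Salt; it assumes psutil is installed and that the publisher's process title contains "EventPublisher", which appears to be the case when setproctitle is available):

import time

import psutil


def find_event_publisher():
    # Look for the master subprocess whose command line mentions EventPublisher.
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "EventPublisher" in cmdline:
            return proc
    return None


def watch(interval=60):
    proc = find_event_publisher()
    if proc is None:
        print("EventPublisher process not found")
        return
    while True:
        rss_mib = proc.memory_info().rss / (1024 * 1024)
        open_fds = proc.num_fds()  # Unix only
        print(f"pid={proc.pid} rss={rss_mib:.1f} MiB open_fds={open_fds}")
        time.sleep(interval)


if __name__ == "__main__":
    watch()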
Here is a patch that will address the issue for 3004. I've tested this against the master branch and I don't see the memory leak. Looks like the recent transport refactor inadvertently addressed the issue.
0001-Slow-memory-leak-fix.patch.txt
@frebib I don't believe this is the same issue as #61521. I believe the root of #61521 is caused by this commit. We're probably creating multiple instances of events and transports which needs to be addressed. #61468 should be a step in the right direction.
Is there any progress on this?

Salt Version:
    Salt: 3006.4

Python Version:
    Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]

Dependency Versions:
    cffi: 1.14.6
    cherrypy: unknown
    dateutil: 2.8.1
    docker-py: Not Installed
    gitdb: Not Installed
    gitpython: Not Installed
    Jinja2: 3.1.2
    libgit2: Not Installed
    looseversion: 1.0.2
    M2Crypto: Not Installed
    Mako: Not Installed
    msgpack: 1.0.2
    msgpack-pure: Not Installed
    mysql-python: Not Installed
    packaging: 22.0
    pycparser: 2.21
    pycrypto: Not Installed
    pycryptodome: 3.9.8
    pygit2: Not Installed
    python-gnupg: 0.4.8
    PyYAML: 6.0.1
    PyZMQ: 23.2.0
    relenv: Not Installed
    smmap: Not Installed
    timelib: 0.2.4
    Tornado: 4.5.3
    ZMQ: 4.3.4

System Versions:
    dist: alpine 3.14.6
    locale: utf-8
    machine: x86_64
    release: 5.15.0-87-generic
    system: Linux
    version: Alpine Linux 3.14.6
It looks a lot better in 3006.5 when using this https://github.com/saltstack/salt/commit/af12352cba4a4ecd3859addbe21ff7169546fc9c solution from minion.py. Will be observing over the Christmas holidays.
Fantastic news, thanks for the update.
How do you debug a memory leak like this? With tracemalloc or some other tools?
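For what it's worth, the tracemalloc approach mentioned here looks roughly like this (a sketch only; it assumes you can run it inside the leaking process, for example from an engine or a temporarily patched entry point):

import time
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation

previous = tracemalloc.take_snapshot()
while True:
    time.sleep(300)
    current = tracemalloc.take_snapshot()
    # Print the ten allocation sites whose total size grew the most since the
    # previous snapshot; a site that keeps growing across diffs is a leak candidate.
    for stat in current.compare_to(previous, "lineno")[:10]:
        print(stat)
    previous = current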
@max-arnold
In the past we've relied on __del__ to clean things like this up. That is generally considered an anti-pattern in Python, and we've been working towards cleaning that practice up. Most recently I've started adding warnings if an object is garbage collected without being properly closed (https://github.com/saltstack/salt/pull/65559). Several places (now fixed) where transport clients were not being closed have been revealed in our test suite. With this code in place it should be easier for users to identify these kinds of issues and report them with useful debug info.
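The general shape of that warning is easy to sketch (illustrative only, not the actual code from that PR): remember whether close() was called, and complain from __del__ when it wasn't.

import warnings


class TransportClient:
    # Toy stand-in for a resource that must be closed explicitly.

    def __init__(self):
        self._closed = False

    def close(self):
        # A real client would tear down sockets/streams here.
        self._closed = True

    def __del__(self):
        if not self._closed:
            warnings.warn(
                f"unclosed {self.__class__.__name__} {self!r}",
                ResourceWarning,
                source=self,
            )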
We've also been working towards better tooling for debugging running Salt processes. We've added debug-symbol packages in 3006.x, and there is a newer tool, relenv-gdb-dbg, to help debug these kinds of issues.
Looks a lot better when I review the EventPublisher process.
@dwoz I needed to flip this row in the minion.py code. If I don't do this, I won't get any answers or scheduled-event data from the deltaproxy minions back to the master. It's just a quick fix, probably not the best one; the issue is that the deltaproxy answer is a list, if I remember correctly.
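Purely to illustrate what "the deltaproxy answer is a list" means (this is not the actual minion.py change, just the general pattern): a handler written for a single dict payload drops data when a deltaproxy hands it a list of payloads, unless the list case is normalized first.

def dispatch_returns(payload, process_event):
    # Hypothetical helper: a regular minion returns a single dict, while a
    # deltaproxy can return a list of dicts, so normalize before processing.
    events = payload if isinstance(payload, list) else [payload]
    for event in events:
        process_event(event)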
@Zpell82 sounds like a bug that should be its own issue?
@Zpell82 Can you open a separate issue for this?
We are running salt 3006.6 and are seeing this memory leak. I examined salt/minion.py and it already has the fix suggested in https://github.com/saltstack/salt/issues/61565#issuecomment-1867502647
Further, after the upgrade from 3006.5 to 3006.6, we are now seeing this problem present itself much faster than before. It used to take about 20 days; now we've noticed it after just 7 days.
Yeah, seeing this in 3006.6 in testing as well.
I believe this is resolved with 3006.7
can folks confirm?
I attempted to install 3006.7, but I got an error while trying to salt-pip install pygit2 to use with a git backend.
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'readelf'
OS is Debian Bullseye
Maybe a new dependency needs to be added to the deb?
I still see 3006.7 leaking in our environment... the EventPublisher process is over 8GB of memory in ~48 hours since upgrading (edited from originally saying "less than 24 hours", I lost track of what day it was /facepalm).
A master still at 3006.4 has its EventPublisher up to 24GB in about 10 days since its last restart.
These masters are both using an external job cache. Masters not using an external job cache don't seem to leak noticeably. Are others seeing the leak in 3006.6/etc. while using an external job cache? Maybe the external job cache is a red herring; I haven't dug into how that is all glued together and whether it is handled by the EventPublisher process...
@lomeroe as a matter of fact, ours is using an external job cache (redis).
Interesting... maybe there is something there with the external job cache and the EventPublisher process leaking (we are not using redis as our external job cache, so it would seem not to be specifically related to a single external job cache type, at least).
@dwoz - thoughts?
This is still a problem in 3006.7 for us, and we have a cron job to restart the Salt master once a day. We are not using an external job cache.
@johje349 @jheiselman - do either of you run orchestrations or other background jobs that utilize batch mode? After restarting a master with the patch I mention in #66249, memory for the EventPublisher process hasn't gone over 400MB in almost 24 hours, which seems considerably better than what we have been seeing; typically it is several GB in a day. Obviously I need more time monitoring to really tell if it has something to do with it, so it could just be coincidental...
We've got quite a few orchestrations/api initiated jobs that use batch mode (all initiated on the masters exhibiting the issue), so if they were hanging up/never fully returning, I suppose it's possible that could cause memory issues in the EventPublisher process somehow. I wouldn't have ever guessed that, but...
We do not have very many, but yes, we do have a few scheduled jobs (started via the salt-api) that utilize batch mode. None of our orchestrations use batch mode.
At this point in time, both of our salt masters were last restarted four days ago. One of them is currently using 3.5 GB of memory and the other less than 1 GB. They share the same workloads/minions; there's no difference between the two. Their salt-apis are behind a load balancer, so one of them may be getting heavier jobs than the other purely by chance, but the load balancer is configured for simple round-robin, so they should be getting the same number of jobs.
Yes we use orchestration states with batch set to 20%.
After 5 days, EventPublisher memory usage is still hovering around 400MB. The only change made was applying the patch mentioned in #66249. It seems fairly likely to me that that issue causes memory usage to grow in the EventPublisher process because jobs run in batch mode never actually end.
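For anyone trying to reproduce this, "batch mode" here means jobs run in chunks across the targeted minions, e.g. --batch-size on the CLI or batch: 20% in an orchestration. A minimal way to generate such jobs from the Python API is sketched below (run on the master; the target and function are just placeholders):

import salt.client

local = salt.client.LocalClient()

# cmd_batch() yields results as each batch of minions returns; here 20% of the
# targeted minions run test.ping at a time, mirroring a batch: 20% orchestration.
for ret in local.cmd_batch("*", "test.ping", batch="20%"):
    print(ret)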
Found the issue: I had master: ["mastername"] in the minion config. Removed the brackets and it started working just fine.
@lomeroe's patch in #66249 worked for us in eliminating the memory leak we were observing in EventPublisher. We don't rely on orchestration, external job caches, or salt-api but do use batches. Thanks for the patch!