[BUG] Big payload cuts master from minion
Description: Upon a big payload (350 KB), a zombie process appears in salt-minion. The salt-master is no longer able to communicate with the salt-minion, but the salt-minion is still able to communicate with the salt-master.
Resolution:
- Either restart the salt-minion
- Or set net.core.wmem_default to something bigger than the default (212992); in our case 26214400 (see the sketch below)
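For reference, here is a minimal sketch of managing that sysctl bump with Salt itself, using the standard sysctl.present state (the state ID is made up, and the value is simply the one from our case above, not an official recommendation):

# Raise the default socket write buffer so large event payloads fit.
raise_wmem_default:
  sysctl.present:
    - name: net.core.wmem_default
    - value: 26214400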
Setup (Please provide relevant configs and/or SLS files (be sure to remove sensitive info). There is no general set-up of Salt.)
Please be as specific as possible and give set-up details.
- [x] on-prem machine
- [x] VM (Virtualbox, KVM, etc. please specify)
- [ ] VM running on a cloud service, please be explicit and add details
- [ ] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [x] classic packaging
- [x] onedir packaging
- [ ] used bootstrap to install
Steps to Reproduce the behavior (Include debug logs if possible and relevant)
Create a state with a big payload, or reduce net.core.wmem_default.
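For illustration, a state along these lines is one way to generate a return payload well above the default buffer size (the state ID and size are arbitrary; any state whose return data exceeds net.core.wmem_default should do):

# cmd.run captures stdout into the job return, so ~400 KB of base64
# output ends up in the event payload sent back over the IPC socket.
big_payload_test:
  cmd.run:
    - name: head -c 300000 /dev/urandom | base64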
While the state is being applied:
PID TT STAT TIME CMD
19048 ? Ss 00:00:00 /usr/bin/python3 /usr/bin/salt-minion
19068 ? Sl 00:00:02 \_ /usr/bin/python3 /usr/bin/salt-minion
28158 ? Sl 00:00:00 \_ /usr/bin/python3 /usr/bin/salt-minion
After the state is applied:
PID TT STAT TIME CMD
19048 ? Ss 00:00:00 /usr/bin/python3 /usr/bin/salt-minion
19068 ? Sl 00:00:02 \_ /usr/bin/python3 /usr/bin/salt-minion
28158 ? Z 00:00:05 \_ [salt-minion] <defunct>
From the log:
Apr 13 08:37:36 salt-master salt-minion[28158]: --- Logging error ---
Apr 13 08:37:36 salt-master salt-minion[28158]: Traceback (most recent call last):
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/logging/handlers.py", line 934, in emit
Apr 13 08:37:36 salt-master salt-minion[28158]: self.socket.send(msg)
Apr 13 08:37:36 salt-master salt-minion[28158]: OSError: [Errno 90] Message too long
Apr 13 08:37:36 salt-master salt-minion[28158]: During handling of the above exception, another exception occurred:
Apr 13 08:37:36 salt-master salt-minion[28158]: Traceback (most recent call last):
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/logging/handlers.py", line 938, in emit
Apr 13 08:37:36 salt-master salt-minion[28158]: self.socket.send(msg)
Apr 13 08:37:36 salt-master salt-minion[28158]: OSError: [Errno 90] Message too long
Apr 13 08:37:36 salt-master salt-minion[28158]: Call stack:
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/bin/salt-minion", line 11, in <module>
Apr 13 08:37:36 salt-master salt-minion[28158]: load_entry_point('salt==3005.1', 'console_scripts', 'salt-minion')()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/scripts.py", line 213, in salt_minion
Apr 13 08:37:36 salt-master salt-minion[28158]: process.start()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/process.py", line 121, in start
Apr 13 08:37:36 salt-master salt-minion[28158]: self._popen = self._Popen(self)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
Apr 13 08:37:36 salt-master salt-minion[28158]: return _default_context.get_context().Process._Popen(process_obj)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
Apr 13 08:37:36 salt-master salt-minion[28158]: return Popen(process_obj)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
Apr 13 08:37:36 salt-master salt-minion[28158]: self._launch(process_obj)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 75, in _launch
Apr 13 08:37:36 salt-master salt-minion[28158]: code = process_obj._bootstrap(parent_sentinel=child_r)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
Apr 13 08:37:36 salt-master salt-minion[28158]: self.run()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
Apr 13 08:37:36 salt-master salt-minion[28158]: self._target(*self._args, **self._kwargs)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/scripts.py", line 142, in minion_process
Apr 13 08:37:36 salt-master salt-minion[28158]: minion.start()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/cli/daemons.py", line 325, in start
Apr 13 08:37:36 salt-master salt-minion[28158]: self._real_start()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/cli/daemons.py", line 337, in _real_start
Apr 13 08:37:36 salt-master salt-minion[28158]: self.minion.tune_in()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/minion.py", line 1182, in tune_in
Apr 13 08:37:36 salt-master salt-minion[28158]: self.io_loop.start()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 834, in start
Apr 13 08:37:36 salt-master salt-minion[28158]: self._run_callback(self._callbacks.popleft())
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
Apr 13 08:37:36 salt-master salt-minion[28158]: ret = callback()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
Apr 13 08:37:36 salt-master salt-minion[28158]: return fn(*args, **kwargs)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 294, in wrapper
Apr 13 08:37:36 salt-master salt-minion[28158]: result = func(*args, **kwargs)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/types.py", line 278, in wrapped
Apr 13 08:37:36 salt-master salt-minion[28158]: coro = func(*args, **kwargs)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/minion.py", line 1059, in handle_event
Apr 13 08:37:36 salt-master salt-minion[28158]: minion.handle_event(package)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 309, in wrapper
Apr 13 08:37:36 salt-master salt-minion[28158]: yielded = next(result)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/minion.py", line 2712, in handle_event
Apr 13 08:37:36 salt-master salt-minion[28158]: _minion.manage_schedule(tag, data)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/minion.py", line 2561, in manage_schedule
Apr 13 08:37:36 salt-master salt-minion[28158]: getattr(self.schedule, alias)(*params)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/utils/schedule.py", line 511, in run_job
Apr 13 08:37:36 salt-master salt-minion[28158]: self._run_job(func, data)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/utils/schedule.py", line 1822, in _run_job
Apr 13 08:37:36 salt-master salt-minion[28158]: proc.start()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/utils/process.py", line 1097, in start
Apr 13 08:37:36 salt-master salt-minion[28158]: super().start()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/process.py", line 121, in start
Apr 13 08:37:36 salt-master salt-minion[28158]: self._popen = self._Popen(self)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
Apr 13 08:37:36 salt-master salt-minion[28158]: return _default_context.get_context().Process._Popen(process_obj)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
Apr 13 08:37:36 salt-master salt-minion[28158]: return Popen(process_obj)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
Apr 13 08:37:36 salt-master salt-minion[28158]: self._launch(process_obj)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/popen_fork.py", line 75, in _launch
Apr 13 08:37:36 salt-master salt-minion[28158]: code = process_obj._bootstrap(parent_sentinel=child_r)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
Apr 13 08:37:36 salt-master salt-minion[28158]: self.run()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/utils/process.py", line 993, in wrapped_run_func
Apr 13 08:37:36 salt-master salt-minion[28158]: return run_func()
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
Apr 13 08:37:36 salt-master salt-minion[28158]: self._target(*self._args, **self._kwargs)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/utils/schedule.py", line 912, in handle_func
Apr 13 08:37:36 salt-master salt-minion[28158]: event.fire_event(load, "__schedule_return")
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/utils/event.py", line 821, in fire_event
Apr 13 08:37:36 salt-master salt-minion[28158]: log.debug("Sending event: tag = %s; data = %s", tag, data)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3.8/logging/__init__.py", line 1434, in debug
Apr 13 08:37:36 salt-master salt-minion[28158]: self._log(DEBUG, msg, args, **kwargs)
Apr 13 08:37:36 salt-master salt-minion[28158]: File "/usr/lib/python3/dist-packages/salt/_logging/impl.py", line 284, in _log
Apr 13 08:37:36 salt-master salt-minion[28158]: LOGGING_LOGGER_CLASS._log(
Apr 13 08:37:36 salt-master salt-minion[28158]: Message: 'Sending event: tag = %s; data = %s'
On another server with onedir packaging, the Linux process table provides more context for the same command:
PID TT STAT TIME CMD
3909307 ? Sl 00:00:01 \_ /opt/saltstack/salt/run/run minion MultiMinionProcessManager MinionProcessManager
3914621 ? Sl 00:00:00 \_ /opt/saltstack/salt/run/run minion MultiMinionProcessManager MinionProcessManager Schedule(name=formula.salt.minion-schedule-highstate, jid=None)
Expected behavior: Documentation on how to set up wmem_default in relation to Salt; a meaningful log message; and prevention of the zombie process and of the salt-master being cut off from the minion.
Versions Report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
# salt --versions
Salt Version:
Salt: 3005.1
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.7.3
docker-py: 4.1.0
gitdb: 2.0.6
gitpython: 3.0.7
Jinja2: 2.10.1
libgit2: 0.28.3
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 0.6.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: 2.6.1
pycryptodome: 3.6.1
pygit2: 1.0.3
Python: 3.8.10 (default, Mar 13 2023, 10:26:41)
python-gnupg: 0.4.5
PyYAML: 5.3.1
PyZMQ: 20.0.0
smmap: 2.0.5
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.2
System Versions:
dist: ubuntu 20.04 focal
locale: utf-8
machine: x86_64
release: 5.4.0-139-generic
system: Linux
version: Ubuntu 20.04 focal
I think we're having the same problem, but on FreeBSD we don't get an exception. I note that I don't manage any of this, I'm just debugging it; my terminology in places may be completely wrong.
The default recv buffer size on FreeBSD is 8KB. It goes to send this massive payload after the child exits:
2023-04-18 17:07:14,575 [salt.transport.ipc :347 ][DEBUG ][8693] IPCClient: Connecting to socket: /var/run/salt/minion/minion_event_1842f8bb1a_pull.ipc
Noting that pid 8693 here is the salt-minion process:
8693 - I 0:01.57 |-- /usr/local/bin/python3.9 /usr/local/bin/salt-minion -c /usr/local/etc/salt --pid-file=/var/run/salt-minion.pid -d
8810 - Z 0:03.04 | `-- <defunct>
The problem is that pid 8693 is also the process responsible for reading /var/run/salt/minion/minion_event_1842f8bb1a_pull.ipc. In the case I'm looking at, we try to write a 29K payload with the tag __master_req_channel_payload to bubble it back up to the master; the write blocks because it fills the recv buffer directly, and the recv buffer never drains because the writing thread is the same one that would have to drain it before we could write more.
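Here is a minimal, self-contained illustration of that deadlock shape in plain Python (not Salt code; it intentionally hangs when run):

import socket

# Both ends of the stream live in the same single-threaded process.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Once the payload exceeds the combined send/recv buffer space, sendall()
# blocks waiting for the peer to drain -- but the only thread that could
# call recv() on the other end is the one stuck in sendall().
a.sendall(b"x" * (1024 * 1024))  # blocks forever on default-sized buffers
b.recv(4096)                     # never reached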
I am seeing this too
We also suffer from this bug, but the following patch fixes it:
diff --git a/sysutils/py-salt/files/patch-salt_ext_tornado_iostream.py b/sysutils/py-salt/files/patch-salt_ext_tornado_iostream.py
new file mode 100644
index 000000000000..56792d9c098e
--- /dev/null
+++ b/sysutils/py-salt/files/patch-salt_ext_tornado_iostream.py
@@ -0,0 +1,10 @@
+--- salt/ext/tornado/iostream.py.orig 2022-11-02 13:25:04 UTC
++++ salt/ext/tornado/iostream.py
+@@ -1116,6 +1116,7 @@ class IOStream(BaseIOStream):
+ future = self._connect_future = TracebackFuture()
+ try:
+ self.socket.connect(address)
++ self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 524288)
+ except socket.error as e:
+ # In non-blocking mode we expect connect() to raise an
+ # exception with EINPROGRESS or EWOULDBLOCK.
That's a little less invasive than raising the sndbuf size system-wide, but that would seem to just raise the bar to requiring a 512K payload to break... is that pretty far out of the realm of realistic?
I agree that it's not a perfect solution.
> I agree that it's not a perfect solution.
IMO we should still add it to sysutils/py-salt for now, maybe with a brief description of how to identify the bug in case some system still hits it even with the much higher limit.
CC @devkits, this looks to be the end result of https://github.com/saltstack/salt/commit/80ae5188807550e7592fa12d8661ee83c4313ec8
I've proposed the above workaround patch for the FreeBSD ports tree here: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=271367
Ideally we'd have another thread reading the other end of the socket, since we can't really enqueue large payloads into <= SO_SNDBUF size chunks and gradually send them to be received.
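As a sketch of that idea in plain Python (not a proposed Salt patch): a dedicated thread draining the receiving end lets an arbitrarily large blocking send complete in buffer-sized chunks:

import socket
import threading

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

def drain(sock):
    # Keep reading until the writer closes its end (recv() returns b"").
    while sock.recv(65536):
        pass

threading.Thread(target=drain, args=(b,), daemon=True).start()
a.sendall(b"x" * (1024 * 1024))  # now completes: the reader keeps the buffers moving
a.close()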
So, right now I am looking into this. We might have to go the increased-buffer route for the point-release fix. I'm thinking of increasing to 1 MB, which is the default max msgpack size.
Separating out the read and write processes is something we want to look at, but it isn't something that should go in a point release.
At some point we should be able to page large payloads.
@whytewolf Is there a schedule for the 3006.2 release? Sorry I am not familiar with the SaltStack release process.
> So, right now I am looking into this. We might have to go the increased-buffer route for the point-release fix. I'm thinking of increasing to 1 MB, which is the default max msgpack size.
That does seem fine... presumably if a payload tries to push past the max msgpack size that'll likely fail somewhere earlier anyways? I can't imagine salt running on systems where an 8KB -> 1MB (or, in the case of OP, 200K -> 1MB) jump in RAM is going to break the bank.
Same problem here after upgrading from 3004 to 3006.1. Fixed by increasing wmem, but I'm trying to find another option that reduces the payload instead of increasing available memory. Any idea is welcome. In my scenario, the problem only occurs when the highstate is executed via the schedule module (schedule.run highstate); if it is run directly (state.apply), there are no issues.
So, looking at this again: this might not be added yet, as the change is not in Salt code but in the vendored version of Tornado, which is being removed in 3007. Once the vendored Tornado is removed, this will break again without Salt being able to fix it the same way.
There is an operating-system-level way of getting around this, which is to raise the system-wide wmem.
Going to look at a more permanent solution instead of raising that value, but that does mean this isn't coming in 3006.3.
We hit this problem a few days ago with macOS "Big Sur" minions. With the previous macOS version ("High Sierra") we had no problem scheduling a highstate at minion startup.
Instead of setting the net.local.stream.sendspace sysctl to a higher value, we solved it by making SaltStack patch itself, inserting these states at the top of our automatically applied state file:
{% if grains['osrelease'] >= '11' %}
# See https://github.com/saltstack/salt/issues/64074#issuecomment-1544283770
BigSur_workaround:
file.patch:
- name: {{ grains['saltpath'] }}/ext/tornado/iostream.py
- source: salt://files/iostream.py.patch
{% endif %}
minion_restart:
cmd.run:
- name: launchctl kickstart -k system/com.saltstack.salt.minion
- bg: True
- onchanges:
- BigSur_workaround
Here is the content of the iostream.py.patch file:
--- iostream.py 2023-10-17 19:53:00.000000000 +0200
+++ iostream.py 2023-10-17 19:53:07.000000000 +0200
@@ -1116,6 +1116,7 @@
future = self._connect_future = TracebackFuture()
try:
self.socket.connect(address)
+ self.socket.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 524288)
except socket.error as e:
# In non-blocking mode we expect connect() to raise an
# exception with EINPROGRESS or EWOULDBLOCK.
It has proven to be a valid solution for us so far but, since I'm pretty new to SaltStack, I'd like to know your opinion about this approach.
We no longer use a TCP socket in our log handlers, so I believe this is no longer a valid issue. If someone is able to reproduce this issue in a current version of Salt (3006.8 or newer), please re-open it.
Sorry @dwoz, I just tested version 3006.8 on macOS (the master is running the same version on FreeBSD), and the issue persists: we still need to patch the file iostream.py to keep our minions working. And unfortunately we are not able to use the 3007 branch on FreeBSD yet, due to this: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278149