Intermittent ZMQ issues
Description
We have two VMs, each running a separate salt-master for a different environment, and recently upgraded to 3007 (onedir).
At random, journalctl -xeu salt-master will show the error below, after which the /usr/bin/salt-master ReqServer ReqServer_ProcessManager RequestServer.zmq_device process uses 100% CPU and no more events are processed by the salt-master (salt-run state.event pretty=True displays nothing).
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: [ERROR ] Exception in request handler
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: Traceback (most recent call last):
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: request = await asyncio.wait_for(self._socket.recv(), 0.3)
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: return fut.result()
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: result = recv(**kwargs)
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: zmq.error.Again: Resource temporarily unavailable
strace on the ReqServer_ProcessManager RequestServer.zmq_device process shows this over and over again:
getpid() = 95422
poll([{fd=44, events=POLLIN}], 1, 0) = 0 (Timeout)
getpid() = 95422
poll([{fd=46, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=44, events=POLLIN}, {fd=46, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid() = 95422
poll([{fd=44, events=POLLIN}], 1, 0) = 0 (Timeout)
getpid() = 95422
poll([{fd=46, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=44, events=POLLIN}, {fd=46, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid() = 95422
poll([{fd=44, events=POLLIN}], 1, 0) = 0 (Timeout)
getpid() = 95422
poll([{fd=46, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=44, events=POLLIN}, {fd=46, events=POLLIN}], 2, 0) = 0 (Timeout)
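For reference, output like the above can be captured with commands along these lines (the exact invocation isn't stated in the report; the pgrep pattern below is an assumption based on the process title quoted above):

PID=$(pgrep -f 'RequestServer.zmq_device')
strace -f -p "$PID"                                # shows the repeating poll() loop
ls -l /proc/"$PID"/fd/44 /proc/"$PID"/fd/46        # inspect the descriptors being polled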
Executing service salt-master stop then shows this:
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: zmq.error.Again: Resource temporarily unavailable
Mar 28 09:36:02 saltmaster1 systemd[1]: Stopping The Salt Master Server...
-- Subject: A stop job for unit salt-master.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit salt-master.service has begun execution.
--
-- The job identifier is 119404.
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: [ERROR ] Future exception was never retrieved
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: future: <Future finished exception=ImportError("cannot import name 'wait' from partially initialized module 'multiprocessing.connection' (most likely due to a circular import) (/opt/saltstack/salt/lib/python3.10/multiprocessing/connection.py)")>
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: Traceback (most recent call last):
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/site-packages/tornado/gen.py", line 234, in wrapper
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: yielded = ctx_run(next, result)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 632, in run
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: time.sleep(10)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1028, in _handle_signals
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: self.destroy(signum)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1089, in destroy
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: self.process_manager.kill_children()
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 704, in kill_children
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: p_map["Process"].join(0)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 149, in join
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: res = self._popen.wait(timeout)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/multiprocessing/popen_fork.py", line 39, in wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: from multiprocessing.connection import wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1028, in _handle_signals
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: self.destroy(signum)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1089, in destroy
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: self.process_manager.kill_children()
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 704, in kill_children
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: p_map["Process"].join(0)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 149, in join
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: res = self._popen.wait(timeout)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: File "/opt/saltstack/salt/lib/python3.10/multiprocessing/popen_fork.py", line 39, in wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: from multiprocessing.connection import wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: ImportError: cannot import name 'wait' from partially initialized module 'multiprocessing.connection' (most likely due to a circular import) (/opt/saltstack/salt/lib/python3.10/multiprocessing/connection.py)
Mar 28 09:36:07 saltmaster1 salt-master[3357204]: [WARNING ] Master received a SIGTERM. Exiting.
Mar 28 09:36:07 saltmaster1 salt-master[3357204]: The salt master is shutdown. Master received a SIGTERM. Exited.
Mar 28 09:36:07 saltmaster1 systemd[1]: salt-master.service: Succeeded.
Looking around the server, nothing stands out as being wrong. The information provided on the Salt troubleshooting page unfortunately did not help resolve the issue.
Setup
- [ ] on-prem machine
- [ ] VM (Virtualbox, KVM, etc. please specify)
- [X] VM running on a cloud service
- Provider: DigitalOcean
- VM1: 2 cores, 2GB RAM
- VM2: 1 core, 2GB RAM
- [ ] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [ ] classic packaging
- [X] onedir packaging
- [ ] used bootstrap to install
Versions Report
salt --versions-report
Salt Version:
Salt: 3007.0
Python Version:
Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]
Dependency Versions:
cffi: 1.14.6
cherrypy: 18.8.0
dateutil: 2.8.2
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: 1.6.4
looseversion: 1.3.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.7
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 23.1
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: 1.12.1
python-gnupg: 0.5.2
PyYAML: 6.0.1
PyZMQ: 25.1.2
relenv: 0.15.1
smmap: Not Installed
timelib: 0.3.0
Tornado: 6.3.3
ZMQ: 4.3.4
Salt Package Information:
Package Type: onedir
System Versions:
dist: ubuntu 20.04.6 focal
locale: utf-8
machine: x86_64
release: 5.4.0-173-generic
system: Linux
version: Ubuntu 20.04.6 focal
Additional context
Tried stopping both the salt-minion and salt-master services and clearing all Python cache files in /opt/saltstack/salt/lib/**/__pycache__/*.pyc before starting the services again.
The exception still keeps appearing in the salt-master journal, although so far we have not observed the 100% CPU problem or minion connectivity issues anymore.
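For reference, the procedure described above corresponds roughly to the following commands (a sketch only; the cache path and service names are taken from this comment, and the find invocation is an assumption):

systemctl stop salt-minion salt-master
find /opt/saltstack/salt/lib -path '*/__pycache__/*.pyc' -delete   # clear compiled Python caches under the onedir install
systemctl start salt-master salt-minion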
Sorry for the me-too, but me too. We've got around 500 minions on this server, and it now stops responding almost daily with this same error in the log. We don't see 100% CPU usage when it's stopped. Restarting the service gets things going again for a while:
[salt.transport.zeromq:572 ][ERROR ][3352122] Exception in request handler
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
request = await asyncio.wait_for(self._socket.recv(), 0.3)
File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
return fut.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
result = recv(**kwargs)
File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable
Infra: VMware Ubuntu 20.04 guest, 16 GB RAM, 2 cores
Versions:
Salt Version:
Salt: 3007.0
Python Version:
Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]
Dependency Versions:
cffi: 1.16.0
cherrypy: unknown
dateutil: 2.8.2
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: Not Installed
looseversion: 1.3.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.7
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 23.1
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: Not Installed
python-gnupg: 0.5.2
PyYAML: 6.0.1
PyZMQ: 25.1.2
relenv: 0.15.1
smmap: Not Installed
timelib: 0.3.0
Tornado: 6.3.3
ZMQ: 4.3.4
Salt Package Information:
Package Type: onedir
System Versions:
dist: ubuntu 20.04.6 focal
locale: utf-8
machine: x86_64
release: 5.4.0-173-generic
system: Linux
version: Ubuntu 20.04.6 focal
Shout if there's any more info that would be helpful.
Happened again today. This time it's the /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorkerQueue process causing problems.
strace shows a loop:
getpid() = 138187
poll([{fd=49, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=47, events=POLLIN}], 1, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=49, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=47, events=POLLIN}], 1, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=49, events=POLLIN}], 1, 0) = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid() = 138187
poll([{fd=47, events=POLLIN}], 1, 0) = 0 (Timeout)
The file descriptors point to Unix stream sockets:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
/opt/salt 138187 root 47u unix 0xffff8c57afd7d980 0t0 3514615 type=STREAM
/opt/salt 138187 root 49u unix 0xffff8c57afd7e640 0t0 3514617 type=STREAM
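One way to see what sits on the other end of those unix stream sockets is to look them up by inode with ss (a suggestion, not something done in this thread; the inode numbers are taken from the lsof output above):

ss -xp | grep -E '3514615|3514617'   # ss shows each socket's inode and its peer's inode; grep the peer inode to find the other end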
I'm having the same issue. Running v3007.0-0 on master and minions. Masters are on Rocky Linux 8, and minions are a mix of RL8 and CentOS 7.
We are also seeing this problem, running v3007.0 on Ubuntu 22.04. The Salt master seems to run fine for a while and then locks up with the same errors as seen in the original issue. Restarting the salt-master service fixes the problem, but it seems to occur again within 24 hours. However, we are not seeing the 100% CPU usage problem; in fact, CPU usage remains very low on our master.
Rolled back our minions and master to 3006.8 and the issue stopped occurring. I suspect this has something to do with the TCP transport changes implemented with ZMQ in 3007.x.
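For anyone wanting to try the same rollback on Ubuntu, it would look roughly like this (a sketch; the exact version strings depend on the configured Salt repository, so confirm them with apt-cache madison first):

apt-cache madison salt-master                       # confirm the exact 3006.8 version string available
apt-get install --allow-downgrades salt-master=3006.8 salt-minion=3006.8
apt-mark hold salt-master salt-minion               # optional: prevent an unattended upgrade back to 3007.x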
I'm having the same problem.
- The minions' version is v3006.5. When the master is upgraded from v3006.5 to v3007.1, it will most likely no longer be able to contact the minions.
  - For example: salt '*' test.ping will return Minion did not return. [Not connected]
- Rolled back our master to 3006.5 and the issue stopped occurring.
Hello,
Same problem on my side.
We are on Rocky Linux 9 with version 3005.4.
I had to restart salt-master every day (or even more often, depending on the day).
@henri9813 Salt 3005.4 is EOL; I recommend upgrading to a currently supported version of Salt, 3006.9 or 3007.1. @dwoz This might interest you regarding the ZeroMQ problems introduced with 3007.x.
Sorry, I misread the dnf info; I'm on version 3007.1.
> We are also seeing this problem, running v3007.0 on Ubuntu 22.04. The Salt master seems to run fine for a while and then locks up with the same errors as seen in the original issue. Restarting the salt-master service fixes the problem, but it seems to occur again within 24 hours. However, we are not seeing the 100% CPU usage problem; in fact, CPU usage remains very low on our master.
Do you have vulnerability scanning going on in your environments? I had the same errors showing up in my logs before my masters would become unresponsive but CPU usage didn't spike. I was able to correlate the logs with when vulnerability scans were occurring and prodding at the master's ports.
More detail in my issue: #66519
@clayoster That's really interesting - yes we do, and we also have the same software you mentioned in your issue (Nessus Tenable vulnerability scanning). This would make a lot of sense, as it always seemed like the Salt master froze around a certain time of day (usually early in the morning). Did you find any way to prevent the issue? I am wondering if blocking connections from the scanning systems to those ports would do it.
@clayoster we don't have anything like that running.
For us it also appeared to happen really randomly, even while actively monitoring the server and its journals.
Currently we have downgraded and have not had any issues since.
In my case, my instance is public, so it's possible it has been scanned by a random bot.
@jsansone-pw - Yes, that did successfully prevent the issue for me. I used a local firewall (nftables) to block connections to TCP 4505-4506 from the scan engine IPs, and that kept my masters stable for over a month.
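For illustration, a minimal nftables rule set along those lines might look like this (the table, chain, and set names and the scanner address 192.0.2.10 are placeholders, not the actual configuration used):

nft add table inet salt_guard
nft add chain inet salt_guard input '{ type filter hook input priority 0; policy accept; }'
nft add set inet salt_guard scanners '{ type ipv4_addr; flags interval; }'
nft add element inet salt_guard scanners '{ 192.0.2.10 }'
nft add rule inet salt_guard input ip saddr @scanners tcp dport 4505-4506 drop   # drop scanner traffic to the Salt master ports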