Intermittent ZMQ issues

Open xorinzor opened this issue 3 months ago • 7 comments

Description

We have two VMs, each running a separate salt-master for a different environment, and we recently upgraded to 3007 (onedir). At random, journalctl -xeu salt-master shows the error below, after which the /usr/bin/salt-master ReqServer ReqServer_ProcessManager RequestServer.zmq_device process uses 100% CPU and the salt-master processes no more events (salt-run state.event pretty=True displays nothing).

Mar 27 20:37:41 saltmaster1 salt-master[3357285]: [ERROR   ] Exception in request handler
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: Traceback (most recent call last):
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:     request = await asyncio.wait_for(self._socket.recv(), 0.3)
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:     return fut.result()
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:     result = recv(**kwargs)
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
Mar 27 20:37:41 saltmaster1 salt-master[3357285]:   File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
Mar 27 20:37:41 saltmaster1 salt-master[3357285]: zmq.error.Again: Resource temporarily unavailable
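
For context, the traceback points at the receive loop in salt/transport/zeromq.py (line 566), where a 0.3 s asyncio.wait_for wraps an async ZMQ recv. The sketch below is not the actual Salt code; it only illustrates that pattern and one way a transient zmq.error.Again could be tolerated instead of being logged as an unhandled exception. Whether that would also avoid the later 100% CPU hang is a separate question.

# Minimal sketch, NOT the actual Salt request_handler: the event loop can
# mark the socket readable, but the underlying non-blocking recv finds no
# message and raises zmq.error.Again, which escapes if only TimeoutError
# is handled.
import asyncio
import zmq
import zmq.asyncio

async def request_handler(socket: zmq.asyncio.Socket) -> None:
    while True:
        try:
            request = await asyncio.wait_for(socket.recv(), 0.3)
        except asyncio.TimeoutError:
            continue  # nothing arrived within 0.3 s; poll again
        except zmq.error.Again:
            continue  # spurious readiness; retry instead of raising
        print(f"received {len(request)} bytes")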

strace on the ReqServer_ProcessManager RequestServer.zmq_device process shows this over and over again:

getpid()                                = 95422
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 95422
poll([{fd=46, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}, {fd=46, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 95422
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 95422
poll([{fd=46, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}, {fd=46, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 95422
poll([{fd=44, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 95422
poll([{fd=46, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=44, events=POLLIN}, {fd=46, events=POLLIN}], 2, 0) = 0 (Timeout)
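
For anyone debugging the same loop: one way to see what the polled descriptors (44 and 46 above) refer to is to read the /proc fd symlinks of the spinning process. This is a small hypothetical helper, equivalent to the lsof output posted later in this thread; the PID and fd numbers are the example values from this strace capture.

# Resolve the file descriptors from the strace loop above (Linux only;
# needs permission to read /proc/<pid>/fd). For sockets this prints
# something like "socket:[3514615]".
import os

pid = 95422  # PID reported by getpid() in the strace output
for fd in (44, 46):
    print(fd, os.readlink(f"/proc/{pid}/fd/{fd}"))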

Executing service salt-master stop then shows this:

Mar 27 20:37:41 saltmaster1 salt-master[3357285]: zmq.error.Again: Resource temporarily unavailable
Mar 28 09:36:02 saltmaster1 systemd[1]: Stopping The Salt Master Server...
-- Subject: A stop job for unit salt-master.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- A stop job for unit salt-master.service has begun execution.
--
-- The job identifier is 119404.
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: [ERROR   ] Future exception was never retrieved
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: future: <Future finished exception=ImportError("cannot import name 'wait' from partially initialized module 'multiprocessing.connection' (most likely due to a circular import) (/opt/saltstack/salt/lib/python3.10/multiprocessing/connection.py)")>
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: Traceback (most recent call last):
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/tornado/gen.py", line 234, in wrapper
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     yielded = ctx_run(next, result)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 632, in run
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     time.sleep(10)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1028, in _handle_signals
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     self.destroy(signum)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1089, in destroy
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     self.process_manager.kill_children()
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 704, in kill_children
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     p_map["Process"].join(0)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 149, in join
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     res = self._popen.wait(timeout)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/multiprocessing/popen_fork.py", line 39, in wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     from multiprocessing.connection import wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1028, in _handle_signals
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     self.destroy(signum)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/master.py", line 1089, in destroy
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     self.process_manager.kill_children()
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 704, in kill_children
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     p_map["Process"].join(0)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 149, in join
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     res = self._popen.wait(timeout)
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:   File "/opt/saltstack/salt/lib/python3.10/multiprocessing/popen_fork.py", line 39, in wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]:     from multiprocessing.connection import wait
Mar 28 09:36:02 saltmaster1 salt-master[3357280]: ImportError: cannot import name 'wait' from partially initialized module 'multiprocessing.connection' (most likely due to a circular import) (/opt/saltstack/salt/lib/python3.10/multiprocessing/connection.py)
Mar 28 09:36:07 saltmaster1 salt-master[3357204]: [WARNING ] Master received a SIGTERM. Exiting.
Mar 28 09:36:07 saltmaster1 salt-master[3357204]: The salt master is shutdown. Master received a SIGTERM. Exited.
Mar 28 09:36:07 saltmaster1 systemd[1]: salt-master.service: Succeeded.

Looking around the server, nothing stands out as being wrong. The information on the Salt troubleshooting page unfortunately did not help resolve the issue.

Setup

  • [ ] on-prem machine
  • [ ] VM (Virtualbox, KVM, etc. please specify)
  • [X] VM running on a cloud service
    • Provider: DigitalOcean
    • VM1: 2 cores, 2GB RAM
    • VM2: 1 core, 2GB RAM
  • [ ] container (Kubernetes, Docker, containerd, etc. please specify)
  • [ ] or a combination, please be explicit
  • [ ] jails if it is FreeBSD
  • [ ] classic packaging
  • [X] onedir packaging
  • [ ] used bootstrap to install

Versions Report

salt --versions-report
Salt Version:
          Salt: 3007.0

Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.8.0
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: 1.6.4
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.12.1
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4

Salt Package Information:
  Package Type: onedir

System Versions:
          dist: ubuntu 20.04.6 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-173-generic
        system: Linux
       version: Ubuntu 20.04.6 focal

xorinzor · Mar 29 '24 07:03

Hi there! Welcome to the Salt Community! Thank you for making your first contribution. We have a lengthy process for issues and PRs. Someone from the Core Team will follow up as soon as possible. In the meantime, here's some information that may help as you continue your Salt journey. Please be sure to review our Code of Conduct, and check out some of our community resources.

There are lots of ways to get involved in our community. Every month, there are around a dozen opportunities to meet with other contributors and the Salt Core team and collaborate in real time. The best way to keep track is by subscribing to the Salt Community Events Calendar. If you have additional questions, email us at [email protected]. We’re glad you’ve joined our community and look forward to doing awesome things with you!

welcome[bot] · Mar 29 '24 07:03

We tried stopping both the salt-minion and salt-master services and clearing all Python cache files (/opt/saltstack/salt/lib/**/__pycache__/*.pyc) before starting the services again.

The exception still keeps appearing in the salt-master journal, although so far we have not observed the 100% CPU problem or the minion connectivity issues again.
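
For completeness, the cleanup described above amounts to roughly the following (a sketch; it assumes both services are stopped first, and the bytecode is regenerated on the next start):

# Remove compiled bytecode matching /opt/saltstack/salt/lib/**/__pycache__/*.pyc
import pathlib

for pyc in pathlib.Path("/opt/saltstack/salt/lib").rglob("__pycache__/*.pyc"):
    pyc.unlink()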

xorinzor · Apr 08 '24 08:04

Sorry for the me-too, but me too. We've got around 500 minions on this server, and it now stops responding almost daily with this same error in the log. We don't see 100% CPU usage when it's stopped. Restarting the service gets things going again for a while:

[salt.transport.zeromq:572 ][ERROR   ][3352122] Exception in request handler
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 566, in request_handler
    request = await asyncio.wait_for(self._socket.recv(), 0.3)
  File "/opt/saltstack/salt/lib/python3.10/asyncio/tasks.py", line 445, in wait_for
    return fut.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/_future.py", line 598, in _handle_recv
    result = recv(**kwargs)
  File "zmq/backend/cython/socket.pyx", line 805, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 841, in zmq.backend.cython.socket.Socket.recv
  File "zmq/backend/cython/socket.pyx", line 199, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/socket.pyx", line 194, in zmq.backend.cython.socket._recv_copy
  File "zmq/backend/cython/checkrc.pxd", line 22, in zmq.backend.cython.checkrc._check_rc
zmq.error.Again: Resource temporarily unavailable

Infra: VMware, Ubuntu 20.04 guest, 16 GB RAM, 2 cores

Versions:

Salt Version:
          Salt: 3007.0
                          
Python Version:
        Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]
  
Dependency Versions:
          cffi: 1.16.0
      cherrypy: unknown
      dateutil: 2.8.2
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.3.0
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.7
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 23.1
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.5.2
        PyYAML: 6.0.1
         PyZMQ: 25.1.2
        relenv: 0.15.1
         smmap: Not Installed
       timelib: 0.3.0
       Tornado: 6.3.3
           ZMQ: 4.3.4
  
Salt Package Information:
  Package Type: onedir
  
System Versions:
          dist: ubuntu 20.04.6 focal
        locale: utf-8
       machine: x86_64
       release: 5.4.0-173-generic
        system: Linux
       version: Ubuntu 20.04.6 focal

Shout if there's any more info that would be helpful.

HG00 · Apr 17 '24 09:04

Happened again today. This time it's the /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorkerQueue process causing problems.

strace shows a loop:

getpid()                                = 138187
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 138187
poll([{fd=47, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 138187
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 138187
poll([{fd=47, events=POLLIN}], 1, 0)    = 0 (Timeout)
getpid()                                = 138187
poll([{fd=49, events=POLLIN}], 1, 0)    = 0 (Timeout)
poll([{fd=47, events=POLLIN}, {fd=49, events=POLLIN}], 2, 0) = 0 (Timeout)
getpid()                                = 138187
poll([{fd=47, events=POLLIN}], 1, 0)    = 0 (Timeout)

The file descriptors point to Unix stream sockets:

COMMAND      PID USER   FD      TYPE             DEVICE SIZE/OFF    NODE NAME
/opt/salt 138187 root   47u     unix 0xffff8c57afd7d980      0t0 3514615 type=STREAM
/opt/salt 138187 root   49u     unix 0xffff8c57afd7e640      0t0 3514617 type=STREAM
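
For what it's worth, a user-space loop that polls with a zero timeout produces exactly this syscall signature: poll() returns immediately, nothing is ready, and the process spins at 100% CPU. The sketch below only reproduces that signature with a plain pyzmq poller; it is not the MWorkerQueue or zmq_device code, and the sockets and endpoints are made up.

# Illustration only (hypothetical sockets/endpoints, not Salt's): polling
# idle sockets with timeout=0 in a loop busy-spins, issuing the repeated
# zero-timeout poll() calls seen in the strace output above.
import zmq

ctx = zmq.Context()
frontend = ctx.socket(zmq.ROUTER)
backend = ctx.socket(zmq.DEALER)
frontend.bind("ipc:///tmp/demo-frontend.ipc")
backend.bind("ipc:///tmp/demo-backend.ipc")

poller = zmq.Poller()
poller.register(frontend, zmq.POLLIN)
poller.register(backend, zmq.POLLIN)

while True:
    events = dict(poller.poll(timeout=0))  # returns immediately; never blocks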

xorinzor · Apr 17 '24 12:04

I'm having the same issue, running v3007.0-0 on master and minions. Masters are on Rocky Linux 8, and minions are a mix of RL8 and CentOS 7.

mshields-frtservices · Apr 18 '24 19:04

We are also seeing this problem, running v3007.0 on Ubuntu 22.04. The Salt master seems to run fine for a while and then locks up with the same errors as in the original issue. Restarting the salt-master service fixes the problem, but it seems to occur again within 24 hours. However, we are not seeing the 100% CPU usage problem; in fact, CPU usage remains very low on our master.

jsansone-pw · May 02 '24 14:05