salt icon indicating copy to clipboard operation
salt copied to clipboard

[BUG] After upgrade to 3006.6 (onedir) connection issues between master and many minions

Open carsten-AEI opened this issue 1 year ago • 3 comments

Description We upgraded minions and master from 3004.2+ds-1 to 3006.6 (onedir) yesterday. Initial tests in dev environment showed no problems, but after prod upgrade, we encountered many issues:

We first hit #66006 and manually applied #66011 on the minions and then hit #64556 on the master and tried the referenced patch there-in. However, minions often are still not able to connect to the master and then start filling their log files with tracebacks a fresh start-up after removing the log file looks like:

2024-02-16 06:35:56,635 [salt.cli.daemons :281 ][INFO    ][2856] Setting up the Salt Minion "a0613"
2024-02-16 06:35:56,739 [salt.cli.daemons :83  ][INFO    ][2856] Starting up the Salt Minion
2024-02-16 06:35:56,740 [salt.utils.event :1132][INFO    ][2856] Starting pull socket on /var/run/salt/minion/minion_event_436c834f4b_pull.ipc
2024-02-16 06:35:57,252 [salt.minion      :1326][INFO    ][2856] Creating minion process manager
2024-02-16 06:36:57,264 [salt.minion      :1128][ERROR   ][2856] Minion unable to successfully connect to a Salt Master.
2024-02-16 06:37:27,664 [salt.minion      :1180][ERROR   ][2856] Error while bringing up minion for multi-master. Is master at saltmaster responding?
2024-02-16 06:39:30,584 [salt.minion      :1180][ERROR   ][2856] Error while bringing up minion for multi-master. Is master at saltmaster responding?
2024-02-16 06:41:30,916 [salt.minion      :1180][ERROR   ][2856] Error while bringing up minion for multi-master. Is master at saltmaster responding?
2024-02-16 06:42:46,054 [tornado.application:640 ][ERROR   ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fc7af80>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fca9570>)
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 603, in _send_recv
    recv = yield self.socket.recv()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
zmq.eventloop.future.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 605, in _send_recv
    future.set_exception(exc)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
    self.set_exc_info(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
    self._set_done()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
    for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
    ret = callback()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
    future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
    future.set_exception(exc)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
    self.set_exc_info(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
    self._set_done()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
    for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,057 [tornado.application:640 ][ERROR   ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fd26560>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd18100>)
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
    yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
    ret = callback()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
    future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
    future.set_exception(exc)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
    self.set_exc_info(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
    self._set_done()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
    for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,057 [salt.minion      :1180][ERROR   ][2856] Error while bringing up minion for multi-master. Is master at saltmine.atlas.local responding?
2024-02-16 06:42:46,058 [tornado.application:640 ][ERROR   ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fd26a70>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd18b80>)
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
    yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
    ret = callback()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
    future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
    future.set_exception(exc)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
    self.set_exc_info(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
    self._set_done()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
    for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,059 [tornado.application:640 ][ERROR   ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fd27250>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd1bf40>)
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
    yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
    ret = callback()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
    future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
    future.set_exception(exc)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
    self.set_exc_info(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
    self._set_done()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
    for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,059 [tornado.application:640 ][ERROR   ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fd27a30>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd3c910>)
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
    yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
    ret = callback()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
    future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
    future.set_exception(exc)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
    self.set_exc_info(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
    self._set_done()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
    for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,060 [tornado.application:640 ][ERROR   ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb37e70c280>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd3d2a0>)
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
    yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
    ret = callback()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
    future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
    yielded = self.gen.send(value)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
    future.set_exception(exc)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
    self.set_exc_info(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
    self._set_done()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
    for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:43:36,108 [salt.minion      :1200][CRITICAL][2856] Unexpected error while connecting to saltmine.atlas.local
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1174, in _connect_minion
    yield minion.connect_master(failed=failed)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 309, in wrapper
    yielded = next(result)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1401, in connect_master
    self.pub_channel.on_recv(None)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 467, in on_recv
    return self.transport.on_recv(None)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 284, in on_recv
    return self.stream.on_recv(callback)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 214, in on_recv
    self._check_closed()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 628, in _check_closed
    raise OSError("Stream is closed")

The last traceback is then repeated at a rate of about 1000Hz exhausting the available space on /var/log within minutes.

On the master, I only find this at info level:

2024-02-16 06:35:57,504 [salt.channel.server:298 ][INFO    ][4015692] Authentication request from a0613
2024-02-16 06:35:57,505 [salt.channel.server:574 ][INFO    ][4015692] Authentication accepted from a0613

Minions only have a single master defined, i.e. I'm not sure about the "multi-master" error message.

Setup Please be as specific as possible and give set-up details.

  • [X] on-prem machine
  • [X] onedir packaging

master:

salt-run --versions-report
Salt Version:
          Salt: 3006.6

Python Version:
        Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: 1.7.1
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.13.2
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.14.2
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.10.0-21-amd64
        system: Linux
       version: Debian GNU/Linux 11 bullseye

most minions:

 salt-call --versions-report
Salt Version:
          Salt: 3006.6

Python Version:
        Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.14.2
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 10 buster
        locale: utf-8
       machine: x86_64
       release: 4.19.0-26-amd64
        system: Linux
       version: Debian GNU/Linux 10 buster

Steps to Reproduce the behavior A few minions does not seem to be a problem, adding many seems to trigger this (our goal is going back up to 3000 as before, right now, the limit seems to be around 200

Expected behavior All minions should be able to connect as before and saltmaster should be able to configure all minions.

Versions Report master:

salt-run --versions-report
Salt Version:
          Salt: 3006.6

Python Version:
        Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.16.0
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: 1.7.1
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: 1.13.2
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.14.2
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.10.0-21-amd64
        system: Linux
       version: Debian GNU/Linux 11 bullseye

most minions:

Salt Version:
          Salt: 3006.6

Python Version:
        Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: 18.6.1
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.3
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.14.2
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 10 buster
        locale: utf-8
       machine: x86_64
       release: 4.19.0-26-amd64
        system: Linux
       version: Debian GNU/Linux 10 buster

carsten-AEI avatar Feb 16 '24 07:02 carsten-AEI

I'm hitting the same issue on FreeBSD-14.0

willjp avatar Feb 18 '24 23:02 willjp

@carsten-AEI , @willjp

Please try increasing request_channel_timeout in your minion config to see if it helps the problems you are seeing.

May be related to: #66061

dwoz avatar Feb 19 '24 01:02 dwoz

Hi @dwoz - thanks for the pointer! First tests seem to confirm this (well, first tests after :coffee: as I initially tried to apply this setting to master) fixed the immediate issue, I'm seeing!

(Possibly, this problem hit us as we use an external sqlite3 pillar with a custom patch to sql_base.py to return data based on minion's nodegroups. It seems calling is not cached and quite slow requiring a long enough timeout here - maybe a ToDo item for us to check to use a more efficient method.)

Edit: "culprit" seemed to be from #64884 where I don't under stand why both REQUEST_CHANNEL_TIMEOUT = 60 and "request_channel_timeout": 30, are added while - at least for me - contradicting each other. But then, I may not understand all aspects of the PR.

But setting the channel timeouts on our minions large enough solves the immediate issue seen, feel free to close this one!

carsten-AEI avatar Feb 19 '24 09:02 carsten-AEI

That worked for me as well! Thank you @dwoz !

willjp avatar Feb 19 '24 18:02 willjp

Thanks for confirming this worked for you. The default will get set back to 60 seconds in 3006.7.

dwoz avatar Feb 20 '24 06:02 dwoz