salt
salt copied to clipboard
[BUG] After upgrade to 3006.6 (onedir) connection issues between master and many minions
Description We upgraded minions and master from 3004.2+ds-1 to 3006.6 (onedir) yesterday. Initial tests in dev environment showed no problems, but after prod upgrade, we encountered many issues:
We first hit #66006 and manually applied #66011 on the minions and then hit #64556 on the master and tried the referenced patch there-in. However, minions often are still not able to connect to the master and then start filling their log files with tracebacks a fresh start-up after removing the log file looks like:
2024-02-16 06:35:56,635 [salt.cli.daemons :281 ][INFO ][2856] Setting up the Salt Minion "a0613"
2024-02-16 06:35:56,739 [salt.cli.daemons :83 ][INFO ][2856] Starting up the Salt Minion
2024-02-16 06:35:56,740 [salt.utils.event :1132][INFO ][2856] Starting pull socket on /var/run/salt/minion/minion_event_436c834f4b_pull.ipc
2024-02-16 06:35:57,252 [salt.minion :1326][INFO ][2856] Creating minion process manager
2024-02-16 06:36:57,264 [salt.minion :1128][ERROR ][2856] Minion unable to successfully connect to a Salt Master.
2024-02-16 06:37:27,664 [salt.minion :1180][ERROR ][2856] Error while bringing up minion for multi-master. Is master at saltmaster responding?
2024-02-16 06:39:30,584 [salt.minion :1180][ERROR ][2856] Error while bringing up minion for multi-master. Is master at saltmaster responding?
2024-02-16 06:41:30,916 [salt.minion :1180][ERROR ][2856] Error while bringing up minion for multi-master. Is master at saltmaster responding?
2024-02-16 06:42:46,054 [tornado.application:640 ][ERROR ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fc7af80>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fca9570>)
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 603, in _send_recv
recv = yield self.socket.recv()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
zmq.eventloop.future.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 605, in _send_recv
future.set_exception(exc)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
self.set_exc_info(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
self._set_done()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
ret = callback()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
future.set_exception(exc)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
self.set_exc_info(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
self._set_done()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,057 [tornado.application:640 ][ERROR ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fd26560>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd18100>)
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
ret = callback()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
yielded = self.gen.send(value)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
future.set_exception(exc)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
self.set_exc_info(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
self._set_done()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,057 [salt.minion :1180][ERROR ][2856] Error while bringing up minion for multi-master. Is master at saltmine.atlas.local responding?
2024-02-16 06:42:46,058 [tornado.application:640 ][ERROR ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fd26a70>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd18b80>)
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
ret = callback()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
yielded = self.gen.send(value)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
future.set_exception(exc)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
self.set_exc_info(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
self._set_done()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,059 [tornado.application:640 ][ERROR ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fd27250>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd1bf40>)
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
ret = callback()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
yielded = self.gen.send(value)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
future.set_exception(exc)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
self.set_exc_info(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
self._set_done()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,059 [tornado.application:640 ][ERROR ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb38fd27a30>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd3c910>)
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
ret = callback()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
yielded = self.gen.send(value)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
future.set_exception(exc)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
self.set_exc_info(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
self._set_done()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:42:46,060 [tornado.application:640 ][ERROR ][2856] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb37e70c280>, <salt.ext.tornado.concurrent.Future object at 0x7fb38fd3d2a0>)
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 601, in _send_recv
yield self.socket.send(message)
AttributeError: 'NoneType' object has no attribute 'send'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
ret = callback()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
yielded = self.gen.send(value)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 612, in _send_recv
future.set_exception(exc)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 294, in set_exception
self.set_exc_info(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 319, in set_exc_info
self._set_done()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 333, in _set_done
for cb in self._callbacks:
TypeError: 'NoneType' object is not iterable
2024-02-16 06:43:36,108 [salt.minion :1200][CRITICAL][2856] Unexpected error while connecting to saltmine.atlas.local
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1174, in _connect_minion
yield minion.connect_master(failed=failed)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 309, in wrapper
yielded = next(result)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1401, in connect_master
self.pub_channel.on_recv(None)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 467, in on_recv
return self.transport.on_recv(None)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 284, in on_recv
return self.stream.on_recv(callback)
File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 214, in on_recv
self._check_closed()
File "/opt/saltstack/salt/lib/python3.10/site-packages/zmq/eventloop/zmqstream.py", line 628, in _check_closed
raise OSError("Stream is closed")
The last traceback is then repeated at a rate of about 1000Hz exhausting the available space on /var/log
within minutes.
On the master, I only find this at info
level:
2024-02-16 06:35:57,504 [salt.channel.server:298 ][INFO ][4015692] Authentication request from a0613
2024-02-16 06:35:57,505 [salt.channel.server:574 ][INFO ][4015692] Authentication accepted from a0613
Minions only have a single master defined, i.e. I'm not sure about the "multi-master" error message.
Setup Please be as specific as possible and give set-up details.
- [X] on-prem machine
- [X] onedir packaging
master:
salt-run --versions-report
Salt Version:
Salt: 3006.6
Python Version:
Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]
Dependency Versions:
cffi: 1.16.0
cherrypy: 18.6.1
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: 1.7.1
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: 1.13.2
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: 0.14.2
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: debian 11 bullseye
locale: utf-8
machine: x86_64
release: 5.10.0-21-amd64
system: Linux
version: Debian GNU/Linux 11 bullseye
most minions:
salt-call --versions-report
Salt Version:
Salt: 3006.6
Python Version:
Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]
Dependency Versions:
cffi: 1.14.6
cherrypy: 18.6.1
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: 0.14.2
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: debian 10 buster
locale: utf-8
machine: x86_64
release: 4.19.0-26-amd64
system: Linux
version: Debian GNU/Linux 10 buster
Steps to Reproduce the behavior A few minions does not seem to be a problem, adding many seems to trigger this (our goal is going back up to 3000 as before, right now, the limit seems to be around 200
Expected behavior All minions should be able to connect as before and saltmaster should be able to configure all minions.
Versions Report master:
salt-run --versions-report
Salt Version:
Salt: 3006.6
Python Version:
Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]
Dependency Versions:
cffi: 1.16.0
cherrypy: 18.6.1
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: 1.7.1
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: 1.13.2
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: 0.14.2
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: debian 11 bullseye
locale: utf-8
machine: x86_64
release: 5.10.0-21-amd64
system: Linux
version: Debian GNU/Linux 11 bullseye
most minions:
Salt Version:
Salt: 3006.6
Python Version:
Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]
Dependency Versions:
cffi: 1.14.6
cherrypy: 18.6.1
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: 0.14.2
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: debian 10 buster
locale: utf-8
machine: x86_64
release: 4.19.0-26-amd64
system: Linux
version: Debian GNU/Linux 10 buster
I'm hitting the same issue on FreeBSD-14.0
@carsten-AEI , @willjp
Please try increasing request_channel_timeout
in your minion config to see if it helps the problems you are seeing.
May be related to: #66061
Hi @dwoz - thanks for the pointer! First tests seem to confirm this (well, first tests after :coffee: as I initially tried to apply this setting to master) fixed the immediate issue, I'm seeing!
(Possibly, this problem hit us as we use an external sqlite3 pillar with a custom patch to sql_base.py to return data based on minion's nodegroups. It seems calling is not cached and quite slow requiring a long enough timeout here - maybe a ToDo item for us to check to use a more efficient method.)
Edit: "culprit" seemed to be from #64884 where I don't under stand why both REQUEST_CHANNEL_TIMEOUT = 60
and "request_channel_timeout": 30,
are added while - at least for me - contradicting each other. But then, I may not understand all aspects of the PR.
But setting the channel timeouts on our minions large enough solves the immediate issue seen, feel free to close this one!
That worked for me as well! Thank you @dwoz !
Thanks for confirming this worked for you. The default will get set back to 60 seconds in 3006.7
.