[BUG] 3005 broke syndication
Description: 3005 broke syndication.
Setup: Upgrade an existing 3004.2 environment to 3005. Watch as the syndicated masters no longer connect to the syndication master.
Expected behavior: Syndication to actually be tested before releases.
@ebusto
Can you please include your configs and some logs from the master and syndics?
Servers:
- 2 MoM (mom1,mom2)
- 2 Syndic (syndic1,syndic2)
Upgraded from 3004.2 to 3005 (Debian 11 packages from Salt's repositories).
Each MoM runs a master and a minion, sharing /etc/salt/pki/master:
master.conf: order_masters: True
minion.conf: master: ['mom1','mom2']
Each Syndic runs a master, a minion and a syndic, sharing /etc/salt/pki/master:
master.conf: syndic_master: ['mom1','mom2'] and syndic_forward_all_events: True
minion.conf: master: ['syndic1','syndic2']
(A consolidated sketch of these configs follows below.)
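A minimal YAML sketch of the layout described above; the hostnames and options come from this report, but the drop-in file locations are only illustrative:

# mom1 / mom2 -- master config (e.g. a drop-in under /etc/salt/master.d/)
order_masters: True

# mom1 / mom2 -- minion config
master:
  - mom1
  - mom2

# syndic1 / syndic2 -- master config
syndic_master:
  - mom1
  - mom2
syndic_forward_all_events: True

# syndic1 / syndic2 -- minion config
master:
  - syndic1
  - syndic2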
The syndic connects to the MoM, but it also connects to its local master and floods it with start events like:
salt/syndic/syndic2/start {
"_stamp": "2022-08-30T21:52:39.188723",
"cmd": "_minion_event",
"data": "Syndic syndic2 started at Tue Aug 30 23:52:39 2022",
"id": "syndic2",
"pretag": null,
"tag": "salt/syndic/syndic2/start"
}
[..]
syndic/syndic2/salt/syndic/syndic2/start {
"_stamp": "2022-08-30T21:52:39.384410",
"cmd": "_minion_event",
"data": "Syndic syndic2 started at Tue Aug 30 23:52:39 2022",
"id": "syndic2",
"pretag": null,
"tag": "salt/syndic/syndic2/start"
}
[..]
syndic/syndic2/syndic/syndic2/syndic/syndic2/syndic/syndic2/syndic/syndic2/salt/syndic/syndic2/start {
"_stamp": "2022-08-30T21:52:41.948653",
"cmd": "_minion_event",
"data": "Syndic syndic2 started at Tue Aug 30 23:52:39 2022",
"id": "syndic2",
"pretag": null,
"tag": "salt/syndic/syndic2/start"
}
On the MoM event bus, on syndic startup, I have:
salt/auth {
"_stamp": "2022-08-30T21:52:38.550539",
"act": "accept",
"id": "syndic2",
"pub": "-----BEGIN PUBLIC KEY-----XXXXXXXXXXXX\n-----END PUBLIC KEY-----\n",
"result": true
}
minion/refresh/syndic2 {
"Minion data cache refresh": "syndic2",
"_stamp": "2022-08-30T21:52:38.870116"
}
Nothing else
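(For anyone trying to reproduce this: the start-event flood above is visible live on either master's event bus. The standard way to watch it is the state.event runner, run on the master in question:)

salt-run state.event pretty=True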
I am not sure if this is related to the syndic setup. However, after upgrading, the "syndic" server stops working correctly after a couple of minutes, with the errors shown below in the log. When the error occurs, it is not possible to contact minions from the master using the "salt" command; running "salt-call" on the minions works fine. In contrast, the "mom" server keeps running fine and can contact its own directly connected minions.
What might be the reason for this problem?
I am running Debian 10 with packages from the official Salt repository.
2022-08-30 17:04:44,564 [tornado.general :435 ][ERROR ][7840] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 673, in monitor_callback
evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
2022-08-30 17:04:44,565 [tornado.general :462 ][ERROR ][7840] Uncaught exception in zmqstream callback
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 673, in monitor_callback
evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
2022-08-30 17:04:44,565 [tornado.application:640 ][ERROR ][7840] Exception in callback (<zmq.sugar.socket.Socket object at 0x7f7fe0183b40>, <function wrap.<locals>.null_wrapper at 0x7f7fdb323b70>)
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 890, in start
handler_func(fd_obj, events)
File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
return fn(*args, **kwargs)
File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 673, in monitor_callback
evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
Since Debian 10 has an old version of python3-zmq this might be similar to #62550 and https://github.com/saltstack/salt/issues/62550#issuecomment-1230621341 might be a workaround.
@d--j Tested pyzmq for hang issue on Debian 10 and saw no problem with OS provided python3-zmq, see https://github.com/saltstack/salt/issues/62550#issuecomment-1233293961
At least @tkaehn is running Debian 10, not Debian 11. Debian 10 has python3-zmq version 17.1.2. See https://packages.debian.org/buster/python3-zmq
Downgrading the Salt packages to 3004.2+ds-1 on the "syndic" server solved the issue. The "mom" system is still running on 3005+ds-1. Both systems on Debian 10.
After upgrading both servers from Debian 10 to 11 and both to Salt 3005 everything is running stable again. So the problem is solved for me. However it seems that 3005 introduced an incompatibility with Debian 10 that should be fixed.
I have a problem with Syndic on Salt 3005 and Rocky Linux 8.6. It goes into an infinite loop of "minion start" events, which causes the Salt master to stop behaving correctly within a few seconds of starting.
The event bus gets spammed with these events, and the master stops responding right after syndic startup, once the bus fills up with entries like:
syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/salt/minion/my-salt-master/start {
"_stamp": "2022-09-10T19:58:12.007623",
"cmd": "_minion_event",
"data": "Minion my-salt-master started at Sat Sep 10 19:57:45 2022",
"id": "my-salt-master",
"pretag": null,
"tag": "salt/minion/my-salt-master/start"
}
syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/salt/minion/my-salt-master/start {
"_stamp": "2022-09-10T19:58:12.013383",
"cmd": "_minion_event",
"data": "Minion my-salt-master started at Sat Sep 10 19:57:45 2022",
"id": "my-salt-master",
"pretag": null,
"tag": "salt/minion/my-salt-master/start"
}
syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/salt/minion/my-salt-master/start {
"_stamp": "2022-09-10T19:58:12.014113",
"cmd": "_minion_event",
"data": "Minion my-salt-master started at Sat Sep 10 19:57:45 2022",
"id": "my-salt-master",
"pretag": null,
"tag": "salt/minion/my-salt-master/start"
}
syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/syndic/my-salt-master/salt/minion/my-salt-master/start {
"_stamp": "2022-09-10T19:58:12.014525",
"cmd": "_minion_event",
"data": "Minion my-salt-master started at Sat Sep 10 19:57:45 2022",
"id": "my-salt-master",
"pretag": null,
"tag": "salt/minion/my-salt-master/start"
}
@ebusto Would it be possible to upgrade pyzmq on the Debian 10 system that is exhibiting the syndic problems to a version >= 20.0.0? The following example command should work:
python3 -m pip install -U "pyzmq>=20.0.0"
Be sure to restart salt-master and salt-minion after installing the updated pyzmq, just to ensure the update is picked up.
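For example, something like the following (assuming the standard systemd unit names shipped with the Debian packages):

# restart the Salt daemons so the newly installed pyzmq is picked up
sudo systemctl restart salt-master salt-minion
# on a syndic node, restart the syndic process as well
sudo systemctl restart salt-syndic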
On the Debian 10 system where I tested the command, pyzmq v23.2.1 was installed. I am checking against a fix to pyzmq thread handling that landed in pyzmq v20.0.0, which has since resolved some other issues seen with the pyzmq used on Ubuntu 20.04.
Bump ^ Can anyone confirm whether the issue is related to the version of pyzmq, as @dmurphy18 outlined?
Just checking in again. Can anyone confirm the pyzmq upgrade @dmurphy18 outlined solves this issue?
This did not help me, as I'm on PyZMQ 24.0.1 and still seeing issues with comms from the syndic node to the master.
@satellite-no can you post your entire versions report: salt --versions-report
and @tkaehn can you post your versions report for your working instance?
I'm guessing this is a dependency issue since it's working on an OS that has different dependency versions.
Please see the version reports below. As I've said before, everything is working again (on the surface). However, the log files contain tons of messages like the following, which did not happen before 3005:
2022-09-29 10:43:30,061 [salt.loaded.int.returner.local_cache:147 ][ERROR ][1567775] An extra return was detected from minion [...]
When targeting a lot of minions (~100) at the same time, it happens occasionally that some of them do not return. This happened before 3005 as well. Is it possible that increasing worker_threads (currently 32) could solve this problem?
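(For reference, worker_threads is a master config option; a minimal sketch of raising it, purely as a tuning experiment and not a confirmed fix for the missing returns:)

# /etc/salt/master -- number of request-handling worker processes (this environment currently runs 32)
worker_threads: 48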
mom# salt --versions-report
Salt Version:
Salt: 3005
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.8.1
docker-py: Not Installed
gitdb: 4.0.5
gitpython: 3.1.14
Jinja2: 2.11.3
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.0
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: Not Installed
pycryptodome: 3.9.7
pygit2: Not Installed
Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
python-gnupg: Not Installed
PyYAML: 5.3.1
PyZMQ: 20.0.0
smmap: 4.0.0
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: debian 11 bullseye
locale: utf-8
machine: x86_64
release: 5.10.0-18-amd64
system: Linux
version: Debian GNU/Linux 11 bullseye
syndic# salt --versions-report
Salt Version:
Salt: 3005
Dependency Versions:
cffi: Not Installed
cherrypy: Not Installed
dateutil: 2.8.1
docker-py: Not Installed
gitdb: 4.0.5
gitpython: 3.1.14
Jinja2: 2.11.3
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.0
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: Not Installed
pycrypto: Not Installed
pycryptodome: 3.9.7
pygit2: Not Installed
Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
python-gnupg: Not Installed
PyYAML: 5.3.1
PyZMQ: 20.0.0
smmap: 4.0.0
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: debian 11 bullseye
locale: utf-8
machine: x86_64
release: 5.10.0-18-amd64
system: Linux
version: Debian GNU/Linux 11 bullseye
Sure thing!
Master Versions Report
[root@salt deepwatch]# salt --versions-report
Salt Version:
Salt: 3005
Dependency Versions:
cffi: 1.15.1
cherrypy: Not Installed
dateutil: Not Installed
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 2.10
libgit2: 1.5.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 0.5.6
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.6.1
pygit2: 1.10.0
Python: 3.7.10 (default, Jun 3 2021, 00:02:01)
python-gnupg: Not Installed
PyYAML: 4.2
PyZMQ: 24.0.1
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: amzn 2
locale: UTF-8
machine: x86_64
release: 5.10.135-122.509.amzn2.x86_64
system: Linux
version: Amazon Linux 2
Syndic Versions Report
[root@syndic ec2-user]# salt --versions-report
Salt Version:
Salt: 3005
Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.0
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: Not Installed
Python: 3.9.13 (main, Aug 23 2022, 18:33:16)
python-gnupg: 0.4.8
PyYAML: 5.4.1
PyZMQ: 23.2.0
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: amzn 2
locale: utf-8
machine: x86_64
release: 5.10.135-122.509.amzn2.x86_64
system: Linux
version: Amazon Linux 2
This is the error I see on the master:
syndic:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20220930141808737696
Thanks. The only thing that stands out is that there are different pyzmq versions on your syndic and master. @satellite-no, would you mind upgrading the pyzmq version on your syndic, restarting the services, and seeing if that makes a difference?
@Ch3LL So, silly question: I'm not sure how to do that, LOL. According to my versions report, Salt is using Python 3.9, but I'm not sure where Salt installed it. The OS Python is 3.7.10 and its pyzmq is 20.0.0.
This is a yum install of Salt, so if you know where Salt installs its Python, let me know (I attempted a find, but to no avail).
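One way to check which interpreter (and therefore which site-packages) a Salt install is actually using is to ask Salt itself via its grains; a quick check, assuming salt-call is on the PATH:

salt-call --local grains.item pythonexecutable pythonversion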
Did you install the onedir packages or classic packages?
I used the onedir package.
Okay, so your master is on classic packages and your syndic is on onedir?
If you want to upgrade a package in a onedir install, you can use the salt-pip CLI tool to upgrade it.
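For example, something along these lines should upgrade pyzmq inside the onedir environment (the version pin is just an illustration):

salt-pip install --upgrade "pyzmq>=20.0.0"
# then restart salt-master / salt-syndic / salt-minion so the new version is loaded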
This is not good. Why is syndic still broken in 3005.1? Is there no movement on fixing this bug?
I've tried upgrading to the onedir packages to see if this makes any difference for me. However, the log files are still flooded with messages like this, potentially hiding real errors:
salt-master[275557]: [ERROR ] An extra return was detected from minion ...
And also like this:
salt-minion[2539]: [ERROR ] Future <salt.ext.tornado.concurrent.Future object at 0x7f561c07f370> exception was never retrieved: Traceback (most recent
call last):
[...]
salt-minion[2539]: salt.exceptions.SaltReqTimeoutError: Message timed out
I am very disappointed that a new release breaks so much and that there seems to be no quick solution.
What is your advice? Downgrading to 3004?
From the documentation:
Running commands on remote systems is the core function of Salt.
Salt can execute multiple commands across thousands of systems in
seconds with a single execution.
How can this work if targeting around 100 minions out of a few hundred results in occasional timeouts? Running the command again, all minions do respond. This happened with 3004 as well.
I'm unsure if this is the same issue we're facing, but after upgrading to 3005 our system load average goes haywire if the master's local minion connects to itself while the syndic connects to a different master.
If the minion connects to the same master as the syndic, there are no issues. This happens on CentOS 7.
Salt Version:
Salt: 3005.1
Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.0
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: Not Installed
Python: 3.9.15 (main, Nov 8 2022, 03:44:36)
python-gnupg: 0.4.8
PyYAML: 5.4.1
PyZMQ: 23.2.0
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: centos 7 Core
locale: utf-8
machine: x86_64
release: 3.10.0-1160.76.1.el7.x86_64
system: Linux
version: CentOS Linux 7 Core
We are experiencing all of the issues reported here, and on a hunch from @Jookadin's comment we moved the minion on the syndic to connect to the MoM, and the problems have gone away.
I don't know if this helps, but on our syndic when salt-minion is connected to the master on the syndic it goes into a loop with the following error:
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/minion.py", line 2700, in handle_event
tries=_minion.opts["return_retry_tries"],
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 295, in send
ret = yield self._crypted_transfer(load, timeout=timeout, raw=raw)
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 252, in _crypted_transfer
ret = yield _do_transfer()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/channel/client.py", line 235, in _do_transfer
timeout=timeout,
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 914, in send
ret = yield self.message_client.send(load, timeout=timeout)
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1064, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python3.6/site-packages/salt/transport/zeromq.py", line 624, in send
recv = yield future
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/gen.py", line 1056, in run
value = future.result()
File "/usr/lib/python3.6/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out
During this period the salt-syndic process is throwing the following error constantly:
[ERROR ] Unable to connect pusher: Stream is closed
Worth noting, we are also seeing the behavior reported above, with the logs littered with messages about an extra return from minions:
[ERROR ] An extra return was detected from minion
For those interested/able: in salt/minion.py, in _send_req_async, try changing
with salt.utils.event.get_event(
"minion", opts=self.opts, listen=False
) as event:
ret = yield event.fire_event_async(
load, "__master_req_channel_payload", timeout=timeout
)
to
with salt.transport.client.AsyncReqChannel.factory(self.opts) as channel:
log.warning('Sending this load %r', load)
ret = yield channel.send(
load, timeout=timeout, tries=self.opts["return_retry_tries"]
)
It's probably not the right fix, but in my limited testing it appears to work, at least if the problem is just the syndic not returning to the MoM.
You can verify that's your problem by checking on one of your minions off the syndic master:
ls /tmp/test.txt
Assuming there's nothing there, from your MoM, given a minion of bob:
salt bob cmd.run 'touch /tmp/test.txt'
Then back on bob, run ls /tmp/test.txt again. If the file is there, your syndic simply isn't returning to the MoM. If it's not, then your MoM might not be connecting to the syndic -- we've seen some wonkiness happen there; it looks like it may have been the KeyError: 4096 issue above. We haven't fully tracked it down, but our current theory is that at some point the connection to the minion(s) crashes and never gets brought back up :thinking:. Simply restarting your master should fix that, though.
If you do try my above change, let me know how it works!
I don't know if this helps, but on our syndic when salt-minion is connected to the master on the syndic it goes into a loop with the following error:
Seeing exactly this problem on Ubuntu 22.04, running 3005.1 installed from repo.saltproject.io, with master, syndic and minion running on a single device. This worked fine on 3004.2 on Ubuntu 20.04.
Salt version report
Salt Version:
Salt: 3005.1
Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.0
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: Not Installed
Python: 3.9.15 (main, Nov 8 2022, 03:42:58)
python-gnupg: 0.4.8
PyYAML: 5.4.1
PyZMQ: 23.2.0
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: ubuntu 22.04 jammy
locale: utf-8
machine: x86_64
release: 5.15.0-53-generic
system: Linux
version: Ubuntu 22.04 jammy
#63257 should sort this out :+1:
Fixed by https://github.com/saltstack/salt/pull/63382