[BUG] Unable to connect minion to master - strange timeout
Hi everyone,
I've been dealing with a strange problem for a few weeks now: a communication problem between a minion and the master. The connection times out, while a similar server on the same network gets through fine! Let me explain:
- My Salt master is at 192.168.1.100, running Debian 11 and Salt Master 3005.
- minion1 is on a different network, 192.168.10.0/24, running Debian 10.11 and Salt Minion 3005. minion1 has no problem communicating with the master; everything is OK.
- minion2 is on the same network as minion1, running Debian 10.12 and Salt Minion 3005 as well. And it cannot communicate with the master at all...
What I've done so far:
- I don't think the network is the problem: I ran pings, traceroutes and nc between the master and minion2 at various times with no failures (see the connectivity checks after this list). Moreover, minion1 is on the same network as minion2 and has no problem.
- On the master side I deleted the key of the problematic minion, and on minion2 I removed everything related to Salt (configuration, packages, etc.). I reinstalled everything and redid the configuration exactly as for minion1; it changed nothing.
- First strange thing: minion2 communicates with the master just fine to register; I receive its key on the master. But after that, nothing; no further communication is possible.
- Second strange thing: I get exactly the same behavior if I deploy another server from scratch: communication timeout.
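For reference, these are the kinds of connectivity checks I ran (192.168.1.100 is my master from the description above; 4505/4506 are the standard Salt ports):

```
# Basic reachability from minion2 to the master
ping -c 3 192.168.1.100
traceroute 192.168.1.100

# Verify that both Salt ports answer over TCP
# (4505 = publisher, 4506 = request server)
nc -vz 192.168.1.100 4505
nc -vz 192.168.1.100 4506
```

All of these succeed from minion2, just as they do from minion1.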
From minion2:
$ salt-call pillar.items
Unable to sign_in to master: Attempt to authenticate with the salt master failed with timeout error
From /var/log/salt/minion on minion2:
2022-09-09 16:09:25,573 [salt.minion :1143][ERROR ][877] Error while bringing up minion for multi-master. Is master at salt.xxx.org responding?
Again from minion2:
[DEBUG ] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_1a58a185c3_pub.ipc
[DEBUG ] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_1a58a185c3_pull.ipc
[DEBUG ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG ] Closing AsyncReqChannel instance
[ERROR ] Error while bringing up minion for multi-master. Is master at salt.ozerim.fr responding?
In the master log: nothing.
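If it helps anyone reproduce, the debug output above comes from running both daemons with the standard `-l` flag (the master run in the foreground so I could watch its side live):

```
# On minion2: a one-off call with debug logging
salt-call -l debug pillar.items

# On the master: stop the service, then run in the foreground with debug logging
systemctl stop salt-master
salt-master -l debug
```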
I don't know where to look next to fix this :(
If anyone has an idea, I'm interested!
Thanks!
We have about 20 minions in a test setup. While Salt was running pretty stably at the beginning, we now see more and more minions becoming unreachable from the master, and they report the same error mentioned by @GeraldHOST1 when running salt-call on the minion itself.
Master is running saltstack v3005 and minions are running v3004.2 and v3005.
When (re-)starting the salt-master, we see these errors related to zeromq:
2022-10-17 19:53:32,948 [salt.utils.parsers:1060][WARNING ][9507] Master received a SIGTERM. Exiting.
2022-10-17 19:53:35,529 [salt.transport.zeromq:328 ][ERROR ][10380] ReqServer clients tcp://0.0.0.0:4506
2022-10-17 19:53:35,543 [salt.transport.zeromq:330 ][ERROR ][10380] ReqServer workers ipc:///var/run/salt/master/workers.ipc
2022-10-17 19:53:36,716 [tornado.general :435 ][ERROR ][10374] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 673, in monitor_callback
    evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
2022-10-17 19:53:36,718 [tornado.general :462 ][ERROR ][10374] Uncaught exception in zmqstream callback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 673, in monitor_callback
    evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
2022-10-17 19:53:36,719 [tornado.application:640 ][ERROR ][10374] Exception in callback (<zmq.sugar.socket.Socket object at 0x7f8e806d9b40>, <function wrap.<locals>.null_wrapper at 0x7f8e806ffe18>)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 890, in start
    handler_func(fd_obj, events)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 673, in monitor_callback
    evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
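For what it's worth, the 4096 that the event_map lookup trips over is the value of ZeroMQ's HANDSHAKE_SUCCEEDED monitor event (it shows up by name and value in the debug lines quoted later in this thread), and recent pyzmq releases expose it as a constant:

```
$ python3 -c 'import zmq; print(zmq.EVENT_HANDSHAKE_SUCCEEDED)'
4096
```

So the monitor callback in salt/transport/zeromq.py apparently predates that event type and crashes when it sees it.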
While double-checking one last time whether my master and minion were up to date, I spotted the update to Salt v3005.1. I quickly gave it a try, but the issue persists.
We're seeing the exact same problem on 3005.1, even with a master configuration file as minimal as the following (and starting the salt-master with an empty /home/salt/sockets):
interface: "::"
ipv6: True
sock_dir: /home/salt/sockets
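If anyone wants to reproduce with exactly this config, a scratch setup like the following is enough (the /tmp path is just for illustration):

```
# Hypothetical scratch config directory for a minimal repro
mkdir -p /tmp/salt-test /home/salt/sockets
cat > /tmp/salt-test/master <<'EOF'
interface: "::"
ipv6: True
sock_dir: /home/salt/sockets
EOF

# -c selects the config directory; -l debug surfaces the monitor callback errors
salt-master -c /tmp/salt-test -l debug
```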
I'm seeing what I believe is the same issue; I just set up dual masters and a few clients in a test lab.
salt --version
salt 3005.1
I happen to be on Debian 11.6.
Seeing a very similar issue with Salt 3005.1 on RHEL 9.1.
The same happens with Salt 3005.1 on Ubuntu 18.x.
Workaround: downgrade to 3004.2.
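For the Debian/Ubuntu folks, a sketch of that downgrade (assuming your configured repo still serves the 3004.2 packages; the exact version string will differ, so check first):

```
# List the package versions the configured repos actually offer
apt-cache madison salt-minion

# Install a specific version (the "3004.2" string here is illustrative;
# use the full version reported by the command above)
apt-get install --allow-downgrades salt-minion=3004.2
apt-mark hold salt-minion   # keep apt from upgrading it again
```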
Same problem with 3005.1 (classic packages) on Debian 11.
It seems I've stumbled upon the same issue with Salt 3007.0:
Unable to sign_in to master: Attempt to authenticate with the salt master failed with timeout error
The minion was successfully authenticated with the master and its key was accepted. When the minion runs, it contacts the master, and I see this in the master log:
2024-03-20 22:16:55,137 [salt.channel.server:298 ][INFO ][3875451] Authentication request from gj-dev-minimal
2024-03-20 22:16:55,137 [salt.channel.server:574 ][INFO ][3875451] Authentication accepted from gj-dev-minimal
When I set log_level to debug, I can see some communication going on:
2024-03-20 22:41:27,257 [salt.transport.zeromq:661 ][DEBUG ][3881092] ZeroMQ event: {'event': <Event.ACCEPTED: 32>, 'value': <Event.ACCEPTED|BIND_FAILED|LISTENING: 56>, 'endpoint': b'tcp://10.0.10.1:4505', 'description': 'EVENT_ACCEPTED'}
2024-03-20 22:41:27,296 [salt.transport.zeromq:661 ][DEBUG ][3881092] ZeroMQ event: {'event': <Event.HANDSHAKE_SUCCEEDED: 4096>, 'value': <Event.0: 0>, 'endpoint': b'tcp://10.0.10.1:4505', 'description': 'EVENT_HANDSHAKE_SUCCEEDED'}
2024-03-20 22:41:27,368 [salt.transport.zeromq:661 ][DEBUG ][3881092] ZeroMQ event: {'event': <Event.ACCEPTED: 32>, 'value': <Event.ACCEPTED|BIND_FAILED|LISTENING|CONNECTED: 57>, 'endpoint': b'tcp://10.0.10.1:4505', 'description': 'EVENT_ACCEPTED'}
2024-03-20 22:41:27,369 [salt.transport.zeromq:661 ][DEBUG ][3881092] ZeroMQ event: {'event': <Event.HANDSHAKE_SUCCEEDED: 4096>, 'value': <Event.0: 0>, 'endpoint': b'tcp://10.0.10.1:4505', 'description': 'EVENT_HANDSHAKE_SUCCEEDED'}
2024-03-20 22:41:27,572 [salt.transport.zeromq:661 ][DEBUG ][3881092] ZeroMQ event: {'event': <Event.ACCEPTED: 32>, 'value': <Event.ACCEPTED|BIND_FAILED|LISTENING|CONNECT_DELAYED: 58>, 'endpoint': b'tcp://10.0.10.1:4505', 'description': 'EVENT_ACCEPTED'}
2024-03-20 22:41:27,572 [salt.transport.zeromq:661 ][DEBUG ][3881092] ZeroMQ event: {'event': <Event.HANDSHAKE_SUCCEEDED: 4096>, 'value': <Event.0: 0>, 'endpoint': b'tcp://10.0.10.1:4505', 'description': 'EVENT_HANDSHAKE_SUCCEEDED'}
2024-03-20 22:41:28,270 [salt.transport.zeromq:661 ][DEBUG ][3881092] ZeroMQ event: {'event': <Event.DISCONNECTED: 512>, 'value': <Event.ACCEPTED|BIND_FAILED|LISTENING: 56>, 'endpoint': b'tcp://10.0.10.1:4505', 'description': 'EVENT_DISCONNECTED'}
And I can see that the pillars are processed on the master:
2024-03-20 22:41:28,561 [salt.template :23 ][PROFILE ][3881110] Time (in seconds) to render '/srv/pillar/role/dev.sls' using 'jinja' renderer: 0.0007266998291015625
2024-03-20 22:41:28,561 [salt.template :116 ][DEBUG ][3881110] Rendered data from file: /srv/pillar/role/dev.sls:
Sending event: tag = minion/refresh/gj-dev-minimal; data = {'Minion data cache refresh': 'gj-dev-minimal', '_stamp': '2024-03-20T21:41:28.609413'}
But then I also see lots of these, repeatedly:
2024-03-20 22:44:39,330 [salt.channel.server:298 ][INFO ][3881110] Authentication request from gj-dev-minimal
2024-03-20 22:44:39,331 [salt.channel.server:574 ][INFO ][3881110] Authentication accepted from gj-dev-minimal
2024-03-20 22:44:39,331 [salt.crypt :216 ][DEBUG ][3881110] salt.crypt.get_rsa_pub_key: Loading public key
2024-03-20 22:44:39,347 [salt.utils.event :823 ][DEBUG ][3881110] Sending event: tag = salt/auth; data = {'result': True, 'act': 'accept', 'id': 'gj-dev-minimal', 'pub': '-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsl+boByshPWLWA/G29kg\nWi3D63Aj2KQMBJ8N1bGPsuadiGvpCIjJwXTAveIVOGVFZybSWWubqdgJbnQ9eZSM\nyTejzX0hlts6iIZ1vxFX3hHvTaAwqSPCpiLM9meD/hWoejx4A89Nk0peK68cMJz4\nVgfh9cQlJt9OXHQZB+hgWKJ/58thLcSLxiuZQXoh6UBxzjlBxMCusTSJl8L5cg7T\niwJ4SPzsNF7v3y+EhmCER+2NShsKAhKVyK22iml6Fr/vQq3+17ygD3brCJOd3R5B\nHrmzg2eiwksZaBlhWRqvIwOrMKRyGKrH8MmmMjvsYxt2eFkwVDyJTilrrjPXZ0v2\nBwIDAQAB\n-----END PUBLIC KEY-----', '_stamp': '2024-03-20T21:44:39.347736'}
2024-03-20 22:44:39,348 [salt.crypt :208 ][DEBUG ][3881110] salt.crypt.get_rsa_key: Loading private key
2024-03-20 22:44:39,348 [salt.crypt :239 ][DEBUG ][3881110] salt.crypt.sign_message: Signing message.
2024-03-20 22:44:39,348 [salt.utils.reactor:85 ][DEBUG ][3881096] Gathering reactors for tag salt/auth
2024-03-20 22:44:44,339 [salt.channel.server:298 ][INFO ][3881101] Authentication request from gj-dev-minimal
2024-03-20 22:44:44,340 [salt.channel.server:574 ][INFO ][3881101] Authentication accepted from gj-dev-minimal
2024-03-20 22:44:44,340 [salt.crypt :216 ][DEBUG ][3881101] salt.crypt.get_rsa_pub_key: Loading public key
2024-03-20 22:44:44,357 [salt.utils.event :823 ][DEBUG ][3881101] Sending event: tag = salt/auth; data = {'result': True, 'act': 'accept', 'id': 'gj-dev-minimal', 'pub': '-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsl+boByshPWLWA/G29kg\nWi3D63Aj2KQMBJ8N1bGPsuadiGvpCIjJwXTAveIVOGVFZybSWWubqdgJbnQ9eZSM\nyTejzX0hlts6iIZ1vxFX3hHvTaAwqSPCpiLM9meD/hWoejx4A89Nk0peK68cMJz4\nVgfh9cQlJt9OXHQZB+hgWKJ/58thLcSLxiuZQXoh6UBxzjlBxMCusTSJl8L5cg7T\niwJ4SPzsNF7v3y+EhmCER+2NShsKAhKVyK22iml6Fr/vQq3+17ygD3brCJOd3R5B\nHrmzg2eiwksZaBlhWRqvIwOrMKRyGKrH8MmmMjvsYxt2eFkwVDyJTilrrjPXZ0v2\nBwIDAQAB\n-----END PUBLIC KEY-----', '_stamp': '2024-03-20T21:44:44.357507'}
2024-03-20 22:44:44,358 [salt.crypt :208 ][DEBUG ][3881101] salt.crypt.get_rsa_key: Loading private key
2024-03-20 22:44:44,358 [salt.crypt :239 ][DEBUG ][3881101] salt.crypt.sign_message: Signing message.
2024-03-20 22:44:44,358 [salt.utils.reactor:85 ][DEBUG ][3881096] Gathering reactors for tag salt/auth
2024-03-20 22:44:49,354 [salt.channel.server:298 ][INFO ][3881109] Authentication request from gj-dev-minimal
2024-03-20 22:44:49,354 [salt.channel.server:574 ][INFO ][3881109] Authentication accepted from gj-dev-minimal
2024-03-20 22:44:49,355 [salt.crypt :216 ][DEBUG ][3881109] salt.crypt.get_rsa_pub_key: Loading public key
2024-03-20 22:44:49,371 [salt.utils.event :823 ][DEBUG ][3881109] Sending event: tag = salt/auth; data = {'result': True, 'act': 'accept', 'id': 'gj-dev-minimal', 'pub': '-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsl+boByshPWLWA/G29kg\nWi3D63Aj2KQMBJ8N1bGPsuadiGvpCIjJwXTAveIVOGVFZybSWWubqdgJbnQ9eZSM\nyTejzX0hlts6iIZ1vxFX3hHvTaAwqSPCpiLM9meD/hWoejx4A89Nk0peK68cMJz4\nVgfh9cQlJt9OXHQZB+hgWKJ/58thLcSLxiuZQXoh6UBxzjlBxMCusTSJl8L5cg7T\niwJ4SPzsNF7v3y+EhmCER+2NShsKAhKVyK22iml6Fr/vQq3+17ygD3brCJOd3R5B\nHrmzg2eiwksZaBlhWRqvIwOrMKRyGKrH8MmmMjvsYxt2eFkwVDyJTilrrjPXZ0v2\nBwIDAQAB\n-----END PUBLIC KEY-----', '_stamp': '2024-03-20T21:44:49.371759'}
2024-03-20 22:44:49,372 [salt.crypt :208 ][DEBUG ][3881109] salt.crypt.get_rsa_key: Loading private key
2024-03-20 22:44:49,372 [salt.crypt :239 ][DEBUG ][3881109] salt.crypt.sign_message: Signing message.
2024-03-20 22:44:49,372 [salt.utils.reactor:85 ][DEBUG ][3881096] Gathering reactors for tag salt/auth
And then nothing else is logged on the master, and the minion times out.
Found a cause and a solution: we use OpenVPN to connect the servers running minions to the server running the master. It worked just fine with our old server but stopped working after a recent upgrade of the master server to the latest Debian. For some reason the OpenVPN connection over UDP caused the Salt connection to lose packets. After switching OpenVPN to TCP, it works.
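For reference, the switch is a one-line change in the OpenVPN config (the path below is the common default; yours may differ):

```
# Check which transport the tunnel currently uses
grep '^proto' /etc/openvpn/server.conf

# Switch from UDP to TCP, then restart the tunnel
sed -i 's/^proto udp/proto tcp/' /etc/openvpn/server.conf
systemctl restart openvpn@server
```

The client configs need the matching proto change as well.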
Interesting... UDP vs. TCP is understandable, but then the question is: shouldn't even UDP have a reasonable chance of success? It all depends on where the issue originates, I guess.
I'm having the same issue. Can it be related to MTU settings?
@nagylzs That was one of the possibilities we investigated and tested; in our case it was not the cause.
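For anyone who wants to rule MTU in or out, a quick path-MTU probe with standard Linux ping flags (1472 bytes of payload plus 28 bytes of IP/ICMP headers equals a full 1500-byte packet; substitute your master's address):

```
# -M do sets the Don't Fragment bit; -s sets the ICMP payload size.
# If this fails while smaller payloads get through, something on the
# path has an MTU below 1500.
ping -M do -s 1472 -c 3 192.168.1.100
```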
Same issue with Salt 3005.5 on Debian 10.13.
I see that when I start jobs from the master (e.g. salt '*' test.ping), a jid gets assigned, but the job never reaches the minion.
When I start anything on the minion (e.g. salt-call state.apply), the minion successfully connects to the master and works as expected.
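One way to see whether the publish even leaves the master is to watch the master's event bus while firing a job (standard salt-run command):

```
# Terminal 1 on the master: follow the event bus
salt-run state.event pretty=True

# Terminal 2 on the master: fire a job; the new job event and its jid
# should appear in terminal 1 even if no minion ever answers
salt '*' test.ping
```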
I'm getting similar error logs:
2024-05-08 02:01:07,082 [salt.cli.daemons :83 ][INFO ][12021] Setting up the Salt Master
2024-05-08 02:01:07,365 [salt.daemons.masterapi:220 ][INFO ][12021] Preparing the root key for local communication
2024-05-08 02:01:07,373 [salt.cli.daemons :83 ][INFO ][12021] Starting up the Salt Master
2024-05-08 02:01:07,400 [salt.master :677 ][INFO ][12021] salt-master is starting as user 'root'
2024-05-08 02:01:07,400 [salt.master :547 ][INFO ][12021] Current values for max open files soft/hard setting: 100000/100000
2024-05-08 02:01:07,403 [salt.master :704 ][INFO ][12021] Creating master process manager
2024-05-08 02:01:07,403 [salt.master :708 ][INFO ][12021] Creating master publisher process
2024-05-08 02:01:08,095 [salt.master :714 ][INFO ][12021] Creating master event publisher process
2024-05-08 02:01:08,116 [salt.transport.zeromq:746 ][INFO ][12090] Starting the Salt Publisher on tcp://0.0.0.0:4505
2024-05-08 02:01:08,117 [salt.transport.zeromq:749 ][INFO ][12090] Starting the Salt Puller on ipc:///var/run/salt/master/publish_pull.ipc
2024-05-08 02:01:08,151 [salt.master :738 ][INFO ][12021] Creating master maintenance process
2024-05-08 02:01:08,154 [salt.master :777 ][INFO ][12021] Creating master request server process
2024-05-08 02:01:08,257 [salt.transport.zeromq:327 ][INFO ][12097] Setting up the master communication server
2024-05-08 02:01:08,257 [salt.transport.zeromq:328 ][ERROR ][12097] ReqServer clients tcp://0.0.0.0:4506
2024-05-08 02:01:08,258 [salt.transport.zeromq:330 ][ERROR ][12097] ReqServer workers ipc:///var/run/salt/master/workers.ipc
2024-05-08 02:01:08,597 [tornado.general :435 ][ERROR ][12090] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 677, in monitor_callback
    evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
2024-05-08 02:01:08,601 [tornado.general :462 ][ERROR ][12090] Uncaught exception in zmqstream callback
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 677, in monitor_callback
    evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
2024-05-08 02:01:08,601 [tornado.application:640 ][ERROR ][12090] Exception in callback (<zmq.sugar.socket.Socket object at 0x7ffabfdf6b40>, <function wrap.<locals>.null_wrapper at 0x7ffabaf1e
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 890, in start
    handler_func(fd_obj, events)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/lib/python3/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/transport/zeromq.py", line 677, in monitor_callback
    evt["description"] = self.event_map[evt["event"]]
KeyError: 4096
2024-05-08 02:01:14,281 [salt.transport.zeromq:416 ][INFO ][12101] Worker binding to socket ipc:///var/run/salt/master/workers.ipc
2024-05-08 02:01:14,377 [salt.transport.zeromq:416 ][INFO ][12099] Worker binding to socket ipc:///var/run/salt/master/workers.ipc
2024-05-08 02:01:14,399 [salt.transport.zeromq:416 ][INFO ][12108] Worker binding to socket ipc:///var/run/salt/master/workers.ipc
2024-05-08 02:01:14,414 [salt.transport.zeromq:416 ][INFO ][12098] Worker binding to socket ipc:///var/run/salt/master/workers.ipc
2024-05-08 02:01:14,492 [salt.transport.zeromq:416 ][INFO ][12109] Worker binding to socket ipc:///var/run/salt/master/workers.ipc
Version report:
Salt Version:
  Salt: 3005.5
Dependency Versions:
  cffi: Not Installed
  cherrypy: Not Installed
  dateutil: 2.7.3
  docker-py: Not Installed
  gitdb: 2.0.5
  gitpython: 2.1.11
  Jinja2: 2.10
  libgit2: Not Installed
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 0.5.6
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  pycparser: Not Installed
  pycrypto: 2.6.1
  pycryptodome: 3.6.1
  pygit2: Not Installed
  Python: 3.7.3 (default, Mar 23 2024, 16:12:05)
  python-gnupg: Not Installed
  PyYAML: 3.13
  PyZMQ: 17.1.2
  smmap: 2.0.5
  timelib: Not Installed
  Tornado: 4.5.3
  ZMQ: 4.3.1
System Versions:
  dist: debian 10 buster
  locale: UTF-8
  machine: x86_64
  release: 4.19.0-21-amd64
  system: Linux
  version: Debian GNU/Linux 10 buster