[BUG] 3007.4 salt.utils.process timeout
Description
Since upgrading to 3007.4, the majority of my Salt minions have disconnected from the master.
Setup
(Please provide relevant configs and/or SLS files (be sure to remove sensitive info). There is no general set-up of Salt.)
Please be as specific as possible and give set-up details.
- [ ] on-prem machine
- [x] VM (Virtualbox, KVM, etc. please specify)
- [ ] VM running on a cloud service, please be explicit and add details
- [ ] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [ ] classic packaging
- [x] onedir packaging
- [ ] used bootstrap to install
Steps to Reproduce the behavior
Upgrade to the latest Salt version.
Expected behavior
Minions should stay connected to the master.
Versions Report
Salt Version:
  Salt: 3007.4
Python Version:
  Python: 3.10.17 (main, Jun 9 2025, 20:41:48) [GCC 11.2.0]
Additional context
Minion Log:
2025-06-16 11:43:43,324 [salt.utils.process:1004][ERROR ][722938] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250616094243035566)' was caught:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 999, in wrapped_run_func
return run_func()
File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1920, in _target
run_func(minion_instance, opts, data)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1917, in run_func
return Minion._thread_return(minion_instance, opts, data)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2153, in _thread_return
minion_instance._return_pub(ret)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2384, in _return_pub
ret_val = self._send_req_sync(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1647, in _send_req_sync
raise TimeoutError("Request timed out")
TimeoutError: Request timed out
I can second this; it happens with a 3006.12 master and a 3006.12 minion.
2025-06-15 10:04:03,124 [salt.utils.process:1004][ERROR ][4028823] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250615020302931331)' was caught:
Traceback (most recent call last):
After that, the master is not able to test.ping the minion.
It happens randomly; I will keep monitoring. By the way, I haven't tested other minion versions yet.
Also seeing this across hundreds of minions, all since upgrading minions to 3006.12 and connected to 3006.12 master.
Minion log
2025-06-16 08:37:28,539 [salt.minion :1078][ERROR ][3864977] Error dispatching event. message authentication failed
2025-06-16 08:37:49,187 [salt.minion :1078][ERROR ][3864977] Error dispatching event. message authentication failed
2025-06-16 08:38:27,262 [salt.utils.process:1004][ERROR ][1202995] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250616073727015796)' was caught:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 999, in wrapped_run_func
return run_func()
File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1927, in _target
run_func(minion_instance, opts, data)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1921, in run_func
return Minion._thread_return(minion_instance, opts, data)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2157, in _thread_return
minion_instance._return_pub(ret)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2385, in _return_pub
ret_val = self._send_req_sync(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1650, in _send_req_sync
raise TimeoutError("Request timed out")
TimeoutError: Request timed out
Master logs successful authentication request:
2025-06-16 08:37:28,190 [salt.channel.server:420 ][INFO ][3228353] Authentication request from minion.host
2025-06-16 08:37:28,190 [salt.channel.server:705 ][INFO ][3228353] Authentication accepted from minion.host
All of my 3006.12 minions connected to a 3006.12 master exhibit the same issue. Usually one 3006.12 minion starts throwing a salt.exceptions.AuthenticationError, then the part of the minion fleet running 3006.12 falls over within at most 10 minutes, usually within one minute.
However, the minions running an older Salt release stay connected without issue.
There are two workarounds:
- Restart all affected minions by manually logging in via SSH
- Restart the master; the minions will reconnect
Thanks, restarting the master worked.
But, an hour later I've seen a few minions disconnect again.
I had the same thing happening. I have now cleared the cache on my master, and so far the problem seems to be solved.
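For anyone wanting to try the same workaround, this is roughly what clearing the master cache can look like (a sketch assuming the default cachedir /var/cache/salt/master and a systemd-managed service; adjust paths and service names to your install):

systemctl stop salt-master
rm -rf /var/cache/salt/master/jobs      # job cache only
# or clear the whole master cache (accepted keys live in /etc/salt/pki and are not touched):
# rm -rf /var/cache/salt/master/*
systemctl start salt-master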
For those seeing this issue, did you ensure the master was upgraded before the minions? The latest release has a backward-incompatible minion change that requires the master to be upgraded first.
Minions were restarted after the master was upgraded (but most minions were upgraded before the master).
It's pretty telling that the issues are with the new 3006 and 3007 minions. I started to roll out 3006.12 on the master and a few minions before encountering heavy issues and pausing the upgrades. Trying salt -G 'saltversion:3006.12' test.ping had 1 out of 5 minions responding, whereas with salt -C 'not G@saltversion:3006.12' test.ping everything was still connected correctly. Restarting the master did restore the connection to all 3006.12 minions, but it probably won't last.
@SwimGeek can you please update the bug title to include 3006.12 as well to get additional visibility?
For me, the preceding error is in /opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py line 1617 (for Saltstack 3006.12). That's the function decrypt(). My guess is that the variable data has insufficient size, maybe even size zero.
Instrumentation of the affected line:
raise AuthenticationError(f"message authentication failed: len(mac_bytes)={len(mac_bytes)}, len(sig)={len(sig)}, len(data)={len(data)}")
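For context, the check this line instruments has roughly the following shape (a paraphrased sketch, not the exact upstream Crypticle.decrypt; check_mac and the plain ValueError are stand-ins for the real function and AuthenticationError):

import hashlib
import hmac

SIG_SIZE = hashlib.sha256().digest_size  # 32 bytes

def check_mac(payload: bytes, hmac_key: bytes) -> bytes:
    # The trailing SIG_SIZE bytes are the HMAC-SHA256 signature, the rest is
    # IV + ciphertext. A payload shorter than 32 bytes is consumed entirely by
    # the signature slice, leaving data empty, so authentication always fails.
    sig = payload[-SIG_SIZE:]
    data = payload[:-SIG_SIZE]
    mac_bytes = hmac.new(hmac_key, data, hashlib.sha256).digest()
    if not hmac.compare_digest(mac_bytes, sig):
        raise ValueError(
            f"message authentication failed: len(mac_bytes)={len(mac_bytes)}, "
            f"len(sig)={len(sig)}, len(data)={len(data)}"
        )
    return data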
I'm having trouble reproducing this behavior. Can someone please try the following change?
https://github.com/saltstack/salt/pull/68085/commits/460bce3a18fbf2e779fbb220e7b8bf5e2de82211
I patched one of my minions; I should see tomorrow whether the error reappears.
I applied the patch to some of the 3006.12 minions; it seems to do the job. I will observe for about 24 hours.
Looks like it works. Tested in our test environment (salt-minion-3007.4-0.x86_64).
In our test environment we have 5 virtual machines. Two of them are master and client (salt-minion), and the other three are just salt-minions.
Before the patch, when running the following
salt '*' test.ping
salt '*' cmd.run "uname -a"
followed by a new test.ping
salt '*' test.ping
it raises the timeout error.
After the patch, running the following multiple times:
salt '*' test.ping
salt '*' cmd.run "uname -a"
followed by a new test.ping
salt '*' test.ping
there has been no timeout error yet.
After some minions disconnected, I deleted the job cache on the master with a restart.
This worked for a while... but the next morning only 4 minions are connected (of 20+). Something is very broken.
After a while, it happened again:
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1649, in _send_req_sync
raise TimeoutError("Request timed out")
TimeoutError: Request timed out
Same here.
Here are my logs from a minion in debug mode when it started happening (3007.4 minion).
After these "Failed to authenticate message" entries start appearing, cmd.run or test.ping calls to the minion will fail.
2025-06-17 10:39:30,892 [salt.utils.schedule:1762][DEBUG ][2977289] schedule: Job __mine_interval was scheduled with jid_include, adding to cache (jid_include defaults to True)
2025-06-17 10:39:30,893 [salt.utils.schedule:1768][DEBUG ][2977289] schedule: Job __mine_interval was scheduled with a max number of 2
2025-06-17 10:39:30,893 [salt.utils.schedule:1792][INFO ][2977289] Running scheduled job: __mine_interval with jid 20250617093930893503
2025-06-17 10:39:30,897 [salt.utils.process:1167][DEBUG ][2977289] Subprocess Schedule(name=__mine_interval, jid=20250617093930893503) added
2025-06-17 10:39:30,949 [salt.loader.lazy :996 ][DEBUG ][2223591] The functions from module 'mine' are being loaded by dir() on the loaded module
2025-06-17 10:39:30,949 [salt.utils.lazy :98 ][DEBUG ][2223591] LazyLoaded mine.update
2025-06-17 10:39:30,950 [salt.utils.schedule:821 ][DEBUG ][2223591] schedule.handle_func: adding this job to the jobcache with data {'id': 'influx-grafana-dev.', 'fun': 'mine.update', 'fun_args': [], 'schedule': '__mine_interval', 'jid': '20250617093930893503', 'pid': 2223591}
2025-06-17 10:39:30,951 [salt.loader.lazy :996 ][DEBUG ][2223591] The functions from module 'config' are being loaded by dir() on the loaded module
2025-06-17 10:39:30,952 [salt.utils.lazy :98 ][DEBUG ][2223591] LazyLoaded config.merge
2025-06-17 10:39:30,952 [salt.loader.lazy :996 ][DEBUG ][2223591] The functions from module 'pillar' are being loaded by dir() on the loaded module
2025-06-17 10:39:30,953 [salt.utils.lazy :98 ][DEBUG ][2223591] LazyLoaded pillar.get
2025-06-17 10:39:30,954 [asyncio :54 ][DEBUG ][2223591] Using selector: EpollSelector
2025-06-17 10:39:30,956 [asyncio :54 ][DEBUG ][2223591] Using selector: EpollSelector
2025-06-17 10:39:30,958 [salt.transport.tcp:1623][DEBUG ][2223591] Publisher connecting to /var/run/salt/minion/minion_event_1cb681de76_pull.ipc
2025-06-17 10:39:30,961 [salt.minion :2822][DEBUG ][2977289] Minion of 'saltdirector.' is handling event tag '_minion_mine'
2025-06-17 10:39:30,962 [asyncio :54 ][DEBUG ][2977289] Using selector: EpollSelector
2025-06-17 10:39:30,962 [salt.crypt :677 ][DEBUG ][2977289] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'influx-grafana-dev.', 'tcp://IPADDRREMOVED:4506')
2025-06-17 10:39:30,966 [salt.minion :896 ][DEBUG ][2977289] Minion return retry timer set to 7 seconds (randomized)
2025-06-17 10:39:30,966 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:30,967 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:30,974 [salt.crypt :1693][DEBUG ][2977289] Failed to authenticate message
2025-06-17 10:39:31,011 [salt.crypt :1174][DEBUG ][2977289] Decrypting the current master AES key
2025-06-17 10:39:31,011 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,011 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,015 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,015 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,017 [salt.channel.client:371 ][DEBUG ][2977289] Closing AsyncReqChannel instance
2025-06-17 10:39:31,018 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,018 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,023 [salt.crypt :1693][DEBUG ][2977289] Failed to authenticate message
2025-06-17 10:39:31,023 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,024 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,028 [salt.crypt :1693][DEBUG ][2977289] Failed to authenticate message
2025-06-17 10:39:31,039 [salt.crypt :1174][DEBUG ][2977289] Decrypting the current master AES key
2025-06-17 10:39:31,039 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,040 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,042 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,042 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,043 [salt.channel.client:371 ][DEBUG ][2977289] Closing AsyncReqChannel instance
2025-06-17 10:39:31,044 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,044 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,050 [salt.crypt :1693][DEBUG ][2977289] Failed to authenticate message
2025-06-17 10:39:31,050 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,050 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,055 [salt.crypt :1693][DEBUG ][2977289] Failed to authenticate message
2025-06-17 10:39:31,066 [salt.crypt :1174][DEBUG ][2977289] Decrypting the current master AES key
2025-06-17 10:39:31,066 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,066 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,069 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,069 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,070 [salt.channel.client:371 ][DEBUG ][2977289] Closing AsyncReqChannel instance
2025-06-17 10:39:31,071 [salt.crypt :364 ][DEBUG ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,071 [salt.crypt :1100][DEBUG ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,076 [salt.crypt :1693][DEBUG ][2977289] Failed to authenticate message
2025-06-17 10:39:31,077 [salt.channel.client:371 ][DEBUG ][2977289] Closing AsyncReqChannel instance
2025-06-17 10:39:31,078 [salt.minion :1076][ERROR ][2977289] Error dispatching event. message authentication failed
2025-06-17 10:39:31,461 [salt.transport.tcp:1669][DEBUG ][2223591] Closing _TCPPubServerPublisher instance
At this point, messages start failing with "Failed to authenticate message".
Output logs from a failed minion connection
2025-06-17 13:26:05,451 [salt.minion :1795][INFO ][884562] User root Executing command test.ping with jid 20250617112605444030
2025-06-17 13:26:05,518 [salt.minion :2002][INFO ][906287] Starting a new job 20250617112605444030 with PID 906287
2025-06-17 13:26:05,523 [salt.minion :2300][INFO ][906287] Returning information for job: 20250617112605444030
2025-06-17 13:26:05,531 [salt.minion :2859][ERROR ][884562] Skipping req for other master: cmd=_return master=salt-master1.domain.local id=ed54dccd-5139-437d-8951-8f541c9c3a5f
2025-06-17 13:26:05,605 [salt.minion :1078][ERROR ][884562] Error dispatching event. message authentication failed
2025-06-17 13:26:07,465 [salt.minion :1795][INFO ][884562] User root Executing command saltutil.find_job with jid 20250617112607458336
2025-06-17 13:26:07,533 [salt.minion :2002][INFO ][906301] Starting a new job 20250617112607458336 with PID 906301
2025-06-17 13:26:07,539 [salt.minion :2300][INFO ][906301] Returning information for job: 20250617112607458336
2025-06-17 13:26:07,547 [salt.minion :2859][ERROR ][884562] Skipping req for other master: cmd=_return master=salt-master1.domain.local id=5f28d1e2-8a1a-40ce-97d3-ce7d2d861329
2025-06-17 13:26:07,628 [salt.minion :1078][ERROR ][884562] Error dispatching event. message authentication failed
2025-06-17 13:26:35,562 [salt.utils.process:1004][ERROR ][906287] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250617112605444030)' was caught:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 999, in wrapped_run_func
return run_func()
File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1922, in _target
run_func(minion_instance, opts, data)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1919, in run_func
return Minion._thread_return(minion_instance, opts, data)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2155, in _thread_return
minion_instance._return_pub(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2390, in _return_pub
ret_val = self._send_req_sync(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1649, in _send_req_sync
raise TimeoutError("Request timed out")
TimeoutError: Request timed out
2025-06-17 13:26:37,576 [salt.utils.process:1004][ERROR ][906301] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250617112607458336)' was caught:
Traceback (most recent call last):
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 999, in wrapped_run_func
return run_func()
File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1922, in _target
run_func(minion_instance, opts, data)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1919, in run_func
return Minion._thread_return(minion_instance, opts, data)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2155, in _thread_return
minion_instance._return_pub(
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2390, in _return_pub
ret_val = self._send_req_sync(load, timeout=timeout)
File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1649, in _send_req_sync
raise TimeoutError("Request timed out")
TimeoutError: Request timed out
With additional instrumentation, I get this on a 3006.12 minion:
Jun 17 22:09:11 minion37 salt-minion[32101]: [ERROR ] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0xffff900f69e0>, <salt.ext.tornado.concurrent.Future object at 0xffff903e22f0>)
Jun 17 22:09:11 minion37 salt-minion[32101]: Traceback (most recent call last):
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 306, in _crypted_transfer
Jun 17 22:09:11 minion37 salt-minion[32101]: ret = yield _do_transfer()
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]: raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: yielded = self.gen.send(value)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 295, in _do_transfer
Jun 17 22:09:11 minion37 salt-minion[32101]: data = self.auth.session_crypticle.loads(data, raw, nonce=nonce)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py", line 1645, in loads
Jun 17 22:09:11 minion37 salt-minion[32101]: data = self.decrypt(data)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py", line 1617, in decrypt
Jun 17 22:09:11 minion37 salt-minion[32101]: raise AuthenticationError(f"message authentication failed: len(mac_bytes)={len(mac_bytes)}, len(sig)={len(sig)}, len(data)={len(data)}")
Jun 17 22:09:11 minion37 salt-minion[32101]: salt.exceptions.AuthenticationError: message authentication failed: len(mac_bytes)=32, len(sig)=8, len(data)=0
Jun 17 22:09:11 minion37 salt-minion[32101]: During handling of the above exception, another exception occurred:
Jun 17 22:09:11 minion37 salt-minion[32101]: Traceback (most recent call last):
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
Jun 17 22:09:11 minion37 salt-minion[32101]: ret = callback()
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
Jun 17 22:09:11 minion37 salt-minion[32101]: return fn(*args, **kwargs)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
Jun 17 22:09:11 minion37 salt-minion[32101]: future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]: raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: yielded = self.gen.throw(*exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1757, in _fire_master_main
Jun 17 22:09:11 minion37 salt-minion[32101]: yield self._send_req_async_main(load, timeout)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]: raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: yielded = self.gen.throw(*exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1698, in _send_req_async_main
Jun 17 22:09:11 minion37 salt-minion[32101]: ret = yield self.req_channel.send(
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]: raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: yielded = self.gen.throw(*exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 353, in send
Jun 17 22:09:11 minion37 salt-minion[32101]: ret = yield self._crypted_transfer(load, timeout=timeout, raw=raw)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]: raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: yielded = self.gen.throw(*exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 310, in _crypted_transfer
Jun 17 22:09:11 minion37 salt-minion[32101]: ret = yield _do_transfer()
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]: raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
Jun 17 22:09:11 minion37 salt-minion[32101]: yielded = self.gen.send(value)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 295, in _do_transfer
Jun 17 22:09:11 minion37 salt-minion[32101]: data = self.auth.session_crypticle.loads(data, raw, nonce=nonce)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py", line 1645, in loads
Jun 17 22:09:11 minion37 salt-minion[32101]: data = self.decrypt(data)
Jun 17 22:09:11 minion37 salt-minion[32101]: File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py", line 1617, in decrypt
Jun 17 22:09:11 minion37 salt-minion[32101]: raise AuthenticationError(f"message authentication failed: len(mac_bytes)={len(mac_bytes)}, len(sig)={len(sig)}, len(data)={len(data)}")
Jun 17 22:09:11 minion37 salt-minion[32101]: salt.exceptions.AuthenticationError: message authentication failed: len(mac_bytes)=32, len(sig)=8, len(data)=0
Jun 17 22:09:12 minion37 salt-minion[32101]: [ERROR ] Error dispatching event. message authentication failed: len(mac_bytes)=32, len(sig)=8, len(data)=0
The new question is: why does the master send a message which is only 8 bytes long? Is this a leftover from the old protocol which gets sent without HMAC by accident?
My gut tells me this is the root of these problems: https://github.com/saltstack/salt/pull/68085/commits/7b4e6a7091680539ac57df216c33453b28dc41f5
I'm still not able to reproduce this issue, but if the above change resolves it, that will give me enough to go on.
Please confirm, thanks.
@hailfinger Can you figure out what request is resulting in the reply from your last logs?
Would a hexdump of the request be sufficient? It strikes me as odd that the request is only 8 bytes long in total, but AFAICS the IV alone should have been 16 bytes (AES_BLOCK_SIZE). That also means that manual decryption will fail due to the AES block size.
Something like this should help
diff --git a/salt/channel/client.py b/salt/channel/client.py
index 61cc79a255a..c1f3fdae01f 100644
--- a/salt/channel/client.py
+++ b/salt/channel/client.py
@@ -283,18 +283,21 @@ class AsyncReqChannel:
         def _do_transfer():
             # Yield control to the caller. When send() completes, resume by populating data with the Future.result
             nonce = uuid.uuid4().hex
-            data = yield self.transport.send(
-                self._package_load(load, nonce),
-                timeout=timeout,
-            )
-            # we may not have always data
-            # as for example for saltcall ret submission, this is a blind
-            # communication, we do not subscribe to return events, we just
-            # upload the results to the master
-            if data:
-                data = self.auth.session_crypticle.loads(data, raw, nonce=nonce)
-                if not raw or self.ttype == "tcp":  # XXX Why is this needed for tcp
-                    data = salt.transport.frame.decode_embedded_strs(data)
+            try:
+                data = yield self.transport.send(
+                    self._package_load(load, nonce),
+                    timeout=timeout,
+                )
+                # we may not have always data
+                # as for example for saltcall ret submission, this is a blind
+                # communication, we do not subscribe to return events, we just
+                # upload the results to the master
+                if data:
+                    data = self.auth.session_crypticle.loads(data, raw, nonce=nonce)
+                    if not raw or self.ttype == "tcp":  # XXX Why is this needed for tcp
+                        data = salt.transport.frame.decode_embedded_strs(data)
+            except Exception as exc:
+                log.warning("Request failed request=%r reply=%r", load, data)
             raise salt.ext.tornado.gen.Return(data)
Something like this should help
diff --git a/salt/channel/client.py b/salt/channel/client.py index 61cc79a255a..c1f3fdae01f 100644 --- a/salt/channel/client.py +++ b/salt/channel/client.py
Should I add this patch on the master or the minion or both?
We noticed this issue in a test environment with Salt version 3007.1. The problem seems to have gotten worse for us in 3006.12 though.
I would like to add that the minions are still connected to the master. They can still apply state and receive data from the master, but somehow get a timeout when sending data back.
For example, it is possible to restart the minions from the master with:
salt '*' minion.restart
Edit: our Salt minions on 3006.10 and below do not seem to show this behavior.
@dwoz - you can reproduce this pretty quickly by setting publish_session to a really low value. I set it to 60 seconds, then ran test.ping in a loop against a minion, and triggered it after less than a minute. I should note that I would see the minion recover, but that may be down to the low value of publish_session; if folks waited another 24 hours in the real world, they may see the same.
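For reference, the reproduction setup looks roughly like this (test-minion is a placeholder minion ID, and 60 is an intentionally aggressive test value; the default publish_session is 86400 seconds):

# /etc/salt/master - test value only, forces frequent AES session key rotation
publish_session: 60

# then hammer a single minion until the rotation happens
while true; do salt 'test-minion' test.ping; sleep 2; done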
I applied the patch to salt/channel/client.py; this is the output:
2025-06-18 15:15:01,027 [salt.channel.client:300 ][WARNING ][5339] Request failed request={'cmd': '_return', 'id': 'salt-dev1', 'success': True, 'return': True, 'retcode': 0, 'jid': '20250618151500945365', 'fun': 'test.ping', 'fun_args': [], 'user': 'sudo_barney', '_stamp': '2025-06-18T15:15:01.017423', 'nonce': 'df8840d36a004fbc8608e4abb223f2ab', 'ts': 1750259701, 'tok': b'\x1c\xfd@\x0e>_\xadvq\xc9\xc3\xcf\xd6\x0e\xe205U\x95R4\x80M`)\x85,\xc7\xfc\xdc\x9e\\}6\x1fD\x10\xa5\xfcaj\x15D\xe9\xad\xecE\\/\xbc\x0e;H\xd1oS/@\xdb\x82=\xc5\x04~\xab\xa5\xbb\xa5\xf1\xdf\x80\x9a\xc9\x18\xc9\xd7\xc1\xdb\xccn\x87\x07\x0e\xe7F\xe4\xee\x06H\xc9\nG\xea\x16\x83xA\xf7\x1c\x0e\xe0U\xde\r\t\xb9\xfev\xdb\x16\xf0\x1d&pK\x87\x18/\xfa\xf3\xc1ux,<\x9c\\\xddf?\x9c\xf8\x86\x05\xcb5\x96"F\x00\t\x1e\xf7\xbfS? -\x19\xa4\x89d\xb3\xc4\x99&\xf98Fn\x1f\xe4\x14$\xcf6\x8bo$$\x19z\xe7\x8c\xeab\x8e\xe5_o*\xf0j\xa2h\xa2\xfd\x9c\xaf\xd0\xbe\xc0\xe4\x93K^j\t!\t\x1e\xf5\xb2\x90\x85\x14\x01\x1e\xf87\xdf\x7f\x81\xc3\x9a\x1fG\x81-\xf9\x8d\xa1\x1c\xec\xf0jP\xda\x96\xa7\x15\xa7$\xda\x9e\x91\x07\xf7\xd3j\r\xe6\x18y\xc1\tO\xddp\x19\xb0\xf1\xb0\x0e\xc6\x1b'} reply='bad load'
Note that the reply 'bad load' is exactly 8 bytes, which matches the len(mac_bytes)=32, len(sig)=8, len(data)=0 seen earlier: the master appears to reply with a plain error string rather than an encrypted payload. I also tried the patch that comments out the use of self.sessions (https://github.com/saltstack/salt/commit/7b4e6a7091680539ac57df216c33453b28dc41f5?diff=unified), but that made no difference.