salt icon indicating copy to clipboard operation
salt copied to clipboard

[BUG] 3007.4 salt.utils.process timeout

Open SwimGeek opened this issue 6 months ago • 10 comments

Description Since upgrade to 3007.4 the majority of my salt minions have disconnected from the master.

Setup (Please provide relevant configs and/or SLS files (be sure to remove sensitive info. There is no general set-up of Salt.)

Please be as specific as possible and give set-up details.

  • [ ] on-prem machine
  • [x] VM (Virtualbox, KVM, etc. please specify)
  • [ ] VM running on a cloud service, please be explicit and add details
  • [ ] container (Kubernetes, Docker, containerd, etc. please specify)
  • [ ] or a combination, please be explicit
  • [ ] jails if it is FreeBSD
  • [ ] classic packaging
  • [x] onedir packaging
  • [ ] used bootstrap to install

Steps to Reproduce the behavior Upgrade to latest salt verion

Expected behavior Minions should stay connected to the master.

Versions Report

Salt Version: Salt: 3007.4

Python Version: Python: 3.10.17 (main, Jun 9 2025, 20:41:48) [GCC 11.2.0]

Additional context

Minion Log:

2025-06-16 11:43:43,324 [salt.utils.process:1004][ERROR ][722938] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250616094243035566)' was caught: Traceback (most recent call last): File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 999, in wrapped_run_func return run_func() File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1920, in _target run_func(minion_instance, opts, data) File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1917, in run_func return Minion._thread_return(minion_instance, opts, data) File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2153, in _thread_return minion_instance._return_pub(ret) File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2384, in _return_pub ret_val = self._send_req_sync(load, timeout=timeout) File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1647, in _send_req_sync raise TimeoutError("Request timed out") TimeoutError: Request timed out

SwimGeek avatar Jun 16 '25 09:06 SwimGeek

I'm second to this , happen on 3006.12 (master) and minion (3006.12)

2025-06-15 10:04:03,124 [salt.utils.process:1004][ERROR   ][4028823] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250615020302931331)' was caught:
Traceback (most recent call last):

then from master not able to test.ping to minion

it's happen randomly , will keep monitor , btw I haven't test on the other minion version yet.

broxio avatar Jun 16 '25 10:06 broxio

Also seeing this across hundreds of minions, all since upgrading minions to 3006.12 and connected to 3006.12 master.

Minion log

2025-06-16 08:37:28,539 [salt.minion      :1078][ERROR   ][3864977] Error dispatching event. message authentication failed
2025-06-16 08:37:49,187 [salt.minion      :1078][ERROR   ][3864977] Error dispatching event. message authentication failed
2025-06-16 08:38:27,262 [salt.utils.process:1004][ERROR   ][1202995] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250616073727015796)' was caught:
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 999, in wrapped_run_func
    return run_func()
  File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1927, in _target
    run_func(minion_instance, opts, data)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1921, in run_func
    return Minion._thread_return(minion_instance, opts, data)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2157, in _thread_return
    minion_instance._return_pub(ret)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2385, in _return_pub
    ret_val = self._send_req_sync(load, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1650, in _send_req_sync
    raise TimeoutError("Request timed out")
TimeoutError: Request timed out

Master logs successful authentication request:

2025-06-16 08:37:28,190 [salt.channel.server:420 ][INFO    ][3228353] Authentication request from minion.host
2025-06-16 08:37:28,190 [salt.channel.server:705 ][INFO    ][3228353] Authentication accepted from minion.host

ukdaf avatar Jun 16 '25 11:06 ukdaf

All my 3006.12 minions normally connected to a 3006.12 master exhibit the same issue. Usually one 3006.12 minion starts throwing an salt.exceptions.AuthenticationError, then the part of the minion fleet running 3006.12 falls over within at most 10 minutes, usually within one minute.

However, the minions running an older Salt release stay connected without issue.

hailfinger avatar Jun 16 '25 11:06 hailfinger

There are two workarounds:

  • Restart all affeced minions by manually logging in via SSH
  • Restart the master, minions will reconnect

hailfinger avatar Jun 16 '25 11:06 hailfinger

Thanks, restarting the master worked.

SwimGeek avatar Jun 16 '25 11:06 SwimGeek

But, an hour later I've seen a few minions disconnect again.

SwimGeek avatar Jun 16 '25 12:06 SwimGeek

But, an hour later I've seen a few minions disconnect again.

I had the same thing happening, now I cleared the cache on my master and so far the problem seems to be solved.

nalkema avatar Jun 16 '25 13:06 nalkema

for those seeing this issue, did you ensure the master was upgraded before minion? the latest release has a backward-incompatible minion change that requires master to be first upgraded.

mattp- avatar Jun 16 '25 14:06 mattp-

for those seeing this issue, did you ensure the master was upgraded before minion? the latest release has a backward-incompatible minion change that requires master to be first upgraded.

Minions were restarted after master was upgraded (but most minions were upgraded before master).

ukdaf avatar Jun 16 '25 14:06 ukdaf

It's pretty telling that the issues are with the new 3006 and 3007 minions - I started to roll out the new 3006.12 on the master and a few minions before encountering heavy issues and pausing the upgrades- trying a salt -G 'saltversion:3006.12' test.ping had 1 out of 5 minions responding, versus a salt -C 'not G@saltversion:3006.12' test.ping where everything was still connected correctly. Restarting the master did restore the connection to all 3006.12 minions, but it probably won't last

benfiedler avatar Jun 16 '25 15:06 benfiedler

@SwimGeek can you please update the bug title to include 3006.12 as well to get additional visibility?

hailfinger avatar Jun 16 '25 21:06 hailfinger

For me, the preceding error is in /opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py line 1617 (for Saltstack 3006.12). That's the function decrypt(). My guess is that the variable data has insufficient size, maybe even size zero.

hailfinger avatar Jun 16 '25 21:06 hailfinger

Instrumentation of the affected line:

            raise AuthenticationError(f"message authentication failed: len(mac_bytes)={len(mac_bytes)}, len(sig)={len(sig)}, len(data)={len(data)}")

hailfinger avatar Jun 16 '25 21:06 hailfinger

I'm having troubles re-producing this behavior. Can someone please try the following change?

https://github.com/saltstack/salt/pull/68085/commits/460bce3a18fbf2e779fbb220e7b8bf5e2de82211

dwoz avatar Jun 17 '25 00:06 dwoz

I patched one of my minion, I should see tomorrow if the error reappears.

ldidry avatar Jun 17 '25 06:06 ldidry

I do patched on some of 3006.12 ,seems does the job. , will observe for about 24 hours.

broxio avatar Jun 17 '25 07:06 broxio

I'm having troubles re-producing this behavior. Can someone please try the following change?

460bce3

Looks like it work. Tested on our test environment. salt-minion-3007.4-0.x86_64

On our test env, we have 5 virtual machines. Two of them are master and client (salt-minion) and the three other are just salt-minion.

Before the patch, when running the following

salt '*' test.ping
salt '*' cmd.run "uname -a"

followed by a new test.ping

salt '*' test.ping

it raise the timeout error.

After the patch, running the following multiple time:

salt '*' test.ping
salt '*' cmd.run "uname -a"

followed by a new test.ping

salt '*' test.ping

no more timeout error yet.

iMikeG6 avatar Jun 17 '25 07:06 iMikeG6

But, an hour later I've seen a few minions disconnect again.

After some minions disconnected, I deleted the job cache on the master with a restart.

This worked for a while... but the next morning only 4 minions are connected (of 20+). Something is very broken.

SwimGeek avatar Jun 17 '25 08:06 SwimGeek

I'm having troubles re-producing this behavior. Can someone please try the following change? 460bce3

Looks like it work. Tested on our test environment. salt-minion-3007.4-0.x86_64

On our test env, we have 5 virtual machines. Two of them are master and client (salt-minion) and the three other are just salt-minion.

Before the patch, when running the following

salt '' test.ping salt '' cmd.run "uname -a"

followed by a new test.ping

salt '*' test.ping

it raise the timeout error.

After the patch, running the following multiple time:

salt '' test.ping salt '' cmd.run "uname -a"

followed by a new test.ping

salt '*' test.ping

no more timeout error yet.

after awhile, it happened again

  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1649, in _send_req_sync
    raise TimeoutError("Request timed out")
TimeoutError: Request timed out

iMikeG6 avatar Jun 17 '25 08:06 iMikeG6

after awhile, it happened again

Same here.

ldidry avatar Jun 17 '25 10:06 ldidry

heres my logs from a minion in debug when it started happening. (3007.4 minion) after these Failed to authenticate message start appearing the cmd.run's or test.ping's will fail to the minion

2025-06-17 10:39:30,892 [salt.utils.schedule:1762][DEBUG   ][2977289] schedule: Job __mine_interval was scheduled with jid_include, adding to cache (jid_include defaults to True)
2025-06-17 10:39:30,893 [salt.utils.schedule:1768][DEBUG   ][2977289] schedule: Job __mine_interval was scheduled with a max number of 2
2025-06-17 10:39:30,893 [salt.utils.schedule:1792][INFO    ][2977289] Running scheduled job: __mine_interval with jid 20250617093930893503
2025-06-17 10:39:30,897 [salt.utils.process:1167][DEBUG   ][2977289] Subprocess Schedule(name=__mine_interval, jid=20250617093930893503) added
2025-06-17 10:39:30,949 [salt.loader.lazy :996 ][DEBUG   ][2223591] The functions from module 'mine' are being loaded by dir() on the loaded module
2025-06-17 10:39:30,949 [salt.utils.lazy  :98  ][DEBUG   ][2223591] LazyLoaded mine.update
2025-06-17 10:39:30,950 [salt.utils.schedule:821 ][DEBUG   ][2223591] schedule.handle_func: adding this job to the jobcache with data {'id': 'influx-grafana-dev.', 'fun': 'mine.update', 'fun_args': [], 'schedule': '__mine_interval', 'jid': '20250617093930893503', 'pid': 2223591}
2025-06-17 10:39:30,951 [salt.loader.lazy :996 ][DEBUG   ][2223591] The functions from module 'config' are being loaded by dir() on the loaded module
2025-06-17 10:39:30,952 [salt.utils.lazy  :98  ][DEBUG   ][2223591] LazyLoaded config.merge
2025-06-17 10:39:30,952 [salt.loader.lazy :996 ][DEBUG   ][2223591] The functions from module 'pillar' are being loaded by dir() on the loaded module
2025-06-17 10:39:30,953 [salt.utils.lazy  :98  ][DEBUG   ][2223591] LazyLoaded pillar.get
2025-06-17 10:39:30,954 [asyncio          :54  ][DEBUG   ][2223591] Using selector: EpollSelector
2025-06-17 10:39:30,956 [asyncio          :54  ][DEBUG   ][2223591] Using selector: EpollSelector
2025-06-17 10:39:30,958 [salt.transport.tcp:1623][DEBUG   ][2223591] Publisher connecting to /var/run/salt/minion/minion_event_1cb681de76_pull.ipc
2025-06-17 10:39:30,961 [salt.minion      :2822][DEBUG   ][2977289] Minion of 'saltdirector.' is handling event tag '_minion_mine'
2025-06-17 10:39:30,962 [asyncio          :54  ][DEBUG   ][2977289] Using selector: EpollSelector
2025-06-17 10:39:30,962 [salt.crypt       :677 ][DEBUG   ][2977289] Initializing new AsyncAuth for ('/etc/salt/pki/minion', 'influx-grafana-dev.', 'tcp://IPADDRREMOVED:4506')
2025-06-17 10:39:30,966 [salt.minion      :896 ][DEBUG   ][2977289] Minion return retry timer set to 7 seconds (randomized)
2025-06-17 10:39:30,966 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:30,967 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:30,974 [salt.crypt       :1693][DEBUG   ][2977289] Failed to authenticate message
2025-06-17 10:39:31,011 [salt.crypt       :1174][DEBUG   ][2977289] Decrypting the current master AES key
2025-06-17 10:39:31,011 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,011 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,015 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,015 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,017 [salt.channel.client:371 ][DEBUG   ][2977289] Closing AsyncReqChannel instance
2025-06-17 10:39:31,018 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,018 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,023 [salt.crypt       :1693][DEBUG   ][2977289] Failed to authenticate message
2025-06-17 10:39:31,023 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,024 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,028 [salt.crypt       :1693][DEBUG   ][2977289] Failed to authenticate message
2025-06-17 10:39:31,039 [salt.crypt       :1174][DEBUG   ][2977289] Decrypting the current master AES key
2025-06-17 10:39:31,039 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,040 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,042 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,042 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,043 [salt.channel.client:371 ][DEBUG   ][2977289] Closing AsyncReqChannel instance
2025-06-17 10:39:31,044 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,044 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,050 [salt.crypt       :1693][DEBUG   ][2977289] Failed to authenticate message
2025-06-17 10:39:31,050 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,050 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,055 [salt.crypt       :1693][DEBUG   ][2977289] Failed to authenticate message
2025-06-17 10:39:31,066 [salt.crypt       :1174][DEBUG   ][2977289] Decrypting the current master AES key
2025-06-17 10:39:31,066 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,066 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,069 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,069 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,070 [salt.channel.client:371 ][DEBUG   ][2977289] Closing AsyncReqChannel instance
2025-06-17 10:39:31,071 [salt.crypt       :364 ][DEBUG   ][2977289] salt.crypt.get_rsa_key: Loading private key
2025-06-17 10:39:31,071 [salt.crypt       :1100][DEBUG   ][2977289] Loaded minion key: /etc/salt/pki/minion/minion.pem
2025-06-17 10:39:31,076 [salt.crypt       :1693][DEBUG   ][2977289] Failed to authenticate message
2025-06-17 10:39:31,077 [salt.channel.client:371 ][DEBUG   ][2977289] Closing AsyncReqChannel instance
2025-06-17 10:39:31,078 [salt.minion      :1076][ERROR   ][2977289] Error dispatching event. message authentication failed
2025-06-17 10:39:31,461 [salt.transport.tcp:1669][DEBUG   ][2223591] Closing _TCPPubServerPublisher instance

messages start getting Failed to authenticate message on them

ITJamie avatar Jun 17 '25 11:06 ITJamie

Output logs from a failed minion connection

2025-06-17 13:26:05,451 [salt.minion      :1795][INFO    ][884562] User root Executing command test.ping with jid 20250617112605444030
2025-06-17 13:26:05,518 [salt.minion      :2002][INFO    ][906287] Starting a new job 20250617112605444030 with PID 906287
2025-06-17 13:26:05,523 [salt.minion      :2300][INFO    ][906287] Returning information for job: 20250617112605444030
2025-06-17 13:26:05,531 [salt.minion      :2859][ERROR   ][884562] Skipping req for other master: cmd=_return master=salt-master1.domain.local id=ed54dccd-5139-437d-8951-8f541c9c3a5f
2025-06-17 13:26:05,605 [salt.minion      :1078][ERROR   ][884562] Error dispatching event. message authentication failed
2025-06-17 13:26:07,465 [salt.minion      :1795][INFO    ][884562] User root Executing command saltutil.find_job with jid 20250617112607458336
2025-06-17 13:26:07,533 [salt.minion      :2002][INFO    ][906301] Starting a new job 20250617112607458336 with PID 906301
2025-06-17 13:26:07,539 [salt.minion      :2300][INFO    ][906301] Returning information for job: 20250617112607458336
2025-06-17 13:26:07,547 [salt.minion      :2859][ERROR   ][884562] Skipping req for other master: cmd=_return master=salt-master1.domain.local id=5f28d1e2-8a1a-40ce-97d3-ce7d2d861329
2025-06-17 13:26:07,628 [salt.minion      :1078][ERROR   ][884562] Error dispatching event. message authentication failed
2025-06-17 13:26:35,562 [salt.utils.process:1004][ERROR   ][906287] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250617112605444030)' was caught:
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 999, in wrapped_run_func
    return run_func()
  File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1922, in _target
    run_func(minion_instance, opts, data)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1919, in run_func
    return Minion._thread_return(minion_instance, opts, data)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2155, in _thread_return
    minion_instance._return_pub(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2390, in _return_pub
    ret_val = self._send_req_sync(load, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1649, in _send_req_sync
    raise TimeoutError("Request timed out")
TimeoutError: Request timed out
2025-06-17 13:26:37,576 [salt.utils.process:1004][ERROR   ][906301] An un-handled exception from the multiprocessing process 'ProcessPayload(jid=20250617112607458336)' was caught:
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/utils/process.py", line 999, in wrapped_run_func
    return run_func()
  File "/opt/saltstack/salt/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1922, in _target
    run_func(minion_instance, opts, data)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1919, in run_func
    return Minion._thread_return(minion_instance, opts, data)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2155, in _thread_return
    minion_instance._return_pub(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 2390, in _return_pub
    ret_val = self._send_req_sync(load, timeout=timeout)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1649, in _send_req_sync
    raise TimeoutError("Request timed out")
TimeoutError: Request timed out

iMikeG6 avatar Jun 17 '25 11:06 iMikeG6

With additional instrumentation, I get this on a 3006.12 minion:

Jun 17 22:09:11 minion37 salt-minion[32101]: [ERROR   ] Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0xffff900f69e0>, <salt.ext.tornado.concurrent.Future object at 0xffff903e22f0>)
Jun 17 22:09:11 minion37 salt-minion[32101]: Traceback (most recent call last):
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 306, in _crypted_transfer
Jun 17 22:09:11 minion37 salt-minion[32101]:     ret = yield _do_transfer()
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]:     raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     yielded = self.gen.send(value)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 295, in _do_transfer
Jun 17 22:09:11 minion37 salt-minion[32101]:     data = self.auth.session_crypticle.loads(data, raw, nonce=nonce)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py", line 1645, in loads
Jun 17 22:09:11 minion37 salt-minion[32101]:     data = self.decrypt(data)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py", line 1617, in decrypt
Jun 17 22:09:11 minion37 salt-minion[32101]:     raise AuthenticationError(f"message authentication failed: len(mac_bytes)={len(mac_bytes)}, len(sig)={len(sig)}, len(data)={len(data)}")
Jun 17 22:09:11 minion37 salt-minion[32101]: salt.exceptions.AuthenticationError: message authentication failed: len(mac_bytes)=32, len(sig)=8, len(data)=0
Jun 17 22:09:11 minion37 salt-minion[32101]: During handling of the above exception, another exception occurred:
Jun 17 22:09:11 minion37 salt-minion[32101]: Traceback (most recent call last):
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 606, in _run_callback
Jun 17 22:09:11 minion37 salt-minion[32101]:     ret = callback()
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/stack_context.py", line 278, in null_wrapper
Jun 17 22:09:11 minion37 salt-minion[32101]:     return fn(*args, **kwargs)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 628, in _discard_future_result
Jun 17 22:09:11 minion37 salt-minion[32101]:     future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]:     raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     yielded = self.gen.throw(*exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1757, in _fire_master_main
Jun 17 22:09:11 minion37 salt-minion[32101]:     yield self._send_req_async_main(load, timeout)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]:     raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     yielded = self.gen.throw(*exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 1698, in _send_req_async_main
Jun 17 22:09:11 minion37 salt-minion[32101]:     ret = yield self.req_channel.send(
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]:     raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     yielded = self.gen.throw(*exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 353, in send
Jun 17 22:09:11 minion37 salt-minion[32101]:     ret = yield self._crypted_transfer(load, timeout=timeout, raw=raw)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]:     raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1064, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     yielded = self.gen.throw(*exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 310, in _crypted_transfer
Jun 17 22:09:11 minion37 salt-minion[32101]:     ret = yield _do_transfer()
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1056, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     value = future.result()
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
Jun 17 22:09:11 minion37 salt-minion[32101]:     raise_exc_info(self._exc_info)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "<string>", line 4, in raise_exc_info
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/gen.py", line 1070, in run
Jun 17 22:09:11 minion37 salt-minion[32101]:     yielded = self.gen.send(value)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 295, in _do_transfer
Jun 17 22:09:11 minion37 salt-minion[32101]:     data = self.auth.session_crypticle.loads(data, raw, nonce=nonce)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py", line 1645, in loads
Jun 17 22:09:11 minion37 salt-minion[32101]:     data = self.decrypt(data)
Jun 17 22:09:11 minion37 salt-minion[32101]:   File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/crypt.py", line 1617, in decrypt
Jun 17 22:09:11 minion37 salt-minion[32101]:     raise AuthenticationError(f"message authentication failed: len(mac_bytes)={len(mac_bytes)}, len(sig)={len(sig)}, len(data)={len(data)}")
Jun 17 22:09:11 minion37 salt-minion[32101]: salt.exceptions.AuthenticationError: message authentication failed: len(mac_bytes)=32, len(sig)=8, len(data)=0
Jun 17 22:09:12 minion37 salt-minion[32101]: [ERROR   ] Error dispatching event. message authentication failed: len(mac_bytes)=32, len(sig)=8, len(data)=0

The new question is: why does the master send a message which is only 8 bytes long? Is this a leftover from the old protocol which gets sent without HMAC by accident?

hailfinger avatar Jun 17 '25 22:06 hailfinger

My gut tells me this is the root of these problems https://github.com/saltstack/salt/pull/68085/commits/7b4e6a7091680539ac57df216c33453b28dc41f5

I'm still not able to reproduce this issue but if the above change resolves the issue it'll give me enough to go on.

Please confirm, thanks.

dwoz avatar Jun 17 '25 22:06 dwoz

@hailfinger Can you figure out what request is resulting in the reply from your last logs?

dwoz avatar Jun 17 '25 22:06 dwoz

Can you figure out what request is resulting in the reply from your last logs?

Would a hexdump of the request be sufficient? It strikes me as odd that the request is only 8 bytes long in total, but AFAICS the IV alone should have been 16 bytes (AES_BLOCK_SIZE). That also means that manual decrpytion will fail due to AES block size.

hailfinger avatar Jun 17 '25 22:06 hailfinger

Can you figure out what request is resulting in the reply from your last logs?

Would a hexdump of the request be sufficient? It strikes me as odd that the request is only 8 bytes long in total, but AFAICS the IV alone should have been 16 bytes (AES_BLOCK_SIZE). That also means that manual decrpytion will fail due to AES block size.

Something like this should help

diff --git a/salt/channel/client.py b/salt/channel/client.py
index 61cc79a255a..c1f3fdae01f 100644
--- a/salt/channel/client.py
+++ b/salt/channel/client.py
@@ -283,18 +283,21 @@ class AsyncReqChannel:
         def _do_transfer():
             # Yield control to the caller. When send() completes, resume by populating data with the Future.result
             nonce = uuid.uuid4().hex
-            data = yield self.transport.send(
-                self._package_load(load, nonce),
-                timeout=timeout,
-            )
-            # we may not have always data
-            # as for example for saltcall ret submission, this is a blind
-            # communication, we do not subscribe to return events, we just
-            # upload the results to the master
-            if data:
-                data = self.auth.session_crypticle.loads(data, raw, nonce=nonce)
-            if not raw or self.ttype == "tcp":  # XXX Why is this needed for tcp
-                data = salt.transport.frame.decode_embedded_strs(data)
+            try:
+                data = yield self.transport.send(
+                    self._package_load(load, nonce),
+                    timeout=timeout,
+                )
+                # we may not have always data
+                # as for example for saltcall ret submission, this is a blind
+                # communication, we do not subscribe to return events, we just
+                # upload the results to the master
+                if data:
+                    data = self.auth.session_crypticle.loads(data, raw, nonce=nonce)
+                if not raw or self.ttype == "tcp":  # XXX Why is this needed for tcp
+                    data = salt.transport.frame.decode_embedded_strs(data)
+            except Exception as exc:
+                log.warning("Request failed request=%r reply=%r", load, data)
 
             raise salt.ext.tornado.gen.Return(data)

dwoz avatar Jun 18 '25 02:06 dwoz

Something like this should help

diff --git a/salt/channel/client.py b/salt/channel/client.py
index 61cc79a255a..c1f3fdae01f 100644
--- a/salt/channel/client.py
+++ b/salt/channel/client.py

Should I add this patch on the master or the minion or both?

hailfinger avatar Jun 18 '25 11:06 hailfinger

We noticed this issue in a test environment with Salt version 3007.1. The problem seems to have gotten worse for us in 3006.12 though.

I would like to add that the minions are still connected to the master. They can still apply state and receive data from the master, but somehow get a timeout when sending data back.

For example, it is possible to restart the minions from the master with:

salt '*' minion.restart

edit: Our salt minions 3006.10 and below do not seem to show this behavior.

avanheuvelen avatar Jun 18 '25 11:06 avanheuvelen

@dwoz - you can reproduce this pretty quickly by setting publish_session to a really low value. I set it to 60s then ran test.ping in a loop against a minion and triggered it after less than a minute.. I should note that I would see the minion recover, but that may be down to the low value of publish_session - if folks waited another 24hrs in the real world, they may see the same.

Applied the patch to salt/channel/client.py, this is the output

2025-06-18 15:15:01,027 [salt.channel.client:300 ][WARNING ][5339] Request failed request={'cmd': '_return', 'id': 'salt-dev1', 'success': True, 'return': True, 'retcode': 0, 'jid': '20250618151500945365', 'fun': 'test.ping', 'fun_args': [], 'user': 'sudo_barney', '_stamp': '2025-06-18T15:15:01.017423', 'nonce': 'df8840d36a004fbc8608e4abb223f2ab', 'ts': 1750259701, 'tok': b'\x1c\xfd@\x0e>_\xadvq\xc9\xc3\xcf\xd6\x0e\xe205U\x95R4\x80M`)\x85,\xc7\xfc\xdc\x9e\\}6\x1fD\x10\xa5\xfcaj\x15D\xe9\xad\xecE\\/\xbc\x0e;H\xd1oS/@\xdb\x82=\xc5\x04~\xab\xa5\xbb\xa5\xf1\xdf\x80\x9a\xc9\x18\xc9\xd7\xc1\xdb\xccn\x87\x07\x0e\xe7F\xe4\xee\x06H\xc9\nG\xea\x16\x83xA\xf7\x1c\x0e\xe0U\xde\r\t\xb9\xfev\xdb\x16\xf0\x1d&pK\x87\x18/\xfa\xf3\xc1ux,<\x9c\\\xddf?\x9c\xf8\x86\x05\xcb5\x96"F\x00\t\x1e\xf7\xbfS? -\x19\xa4\x89d\xb3\xc4\x99&\xf98Fn\x1f\xe4\x14$\xcf6\x8bo$$\x19z\xe7\x8c\xeab\x8e\xe5_o*\xf0j\xa2h\xa2\xfd\x9c\xaf\xd0\xbe\xc0\xe4\x93K^j\t!\t\x1e\xf5\xb2\x90\x85\x14\x01\x1e\xf87\xdf\x7f\x81\xc3\x9a\x1fG\x81-\xf9\x8d\xa1\x1c\xec\xf0jP\xda\x96\xa7\x15\xa7$\xda\x9e\x91\x07\xf7\xd3j\r\xe6\x18y\xc1\tO\xddp\x19\xb0\xf1\xb0\x0e\xc6\x1b'} reply='bad load'

Also tried the patch to comment out the use of self.sessions (https://github.com/saltstack/salt/commit/7b4e6a7091680539ac57df216c33453b28dc41f5?diff=unified) but that made no difference.

barneysowood avatar Jun 18 '25 15:06 barneysowood