[BUG] State cmd.script with parallel set to true hangs in 3006.5
Description
After upgrading from 3006.3 to 3006.5 we started seeing hanging minion processes consuming 100% CPU. Narrowed it down to cmd.script states with parallel: True.
Setup SLS:
test:
  cmd.script:
    - source: salt://scripts/test.sh
    - name: test.sh
    - shell: /bin/bash
    - parallel: True
Script:
#!/bin/bash
echo "test"
exit 0
Please be as specific as possible and give set-up details.
- [ ] on-prem machine
- [ ] VM (Virtualbox, KVM, etc. please specify)
- [x] VM running on a cloud service, please be explicit and add details
- [ ] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [ ] classic packaging
- [x] onedir packaging
- [ ] used bootstrap to install
Steps to Reproduce the behavior
salt 'asx-test' state.apply test
Relevant minion log on asx-test:
2023-12-15 15:57:17,948 [salt.fileclient :1363][INFO ][270247] Fetching file from saltenv 'base', ** done ** 'test/init.sls'
2023-12-15 15:57:17,948 [salt.template :53 ][DEBUG ][270247] compile template: /var/cache/salt/minion/files/base/test/init.sls
2023-12-15 15:57:17,949 [salt.utils.jinja :79 ][DEBUG ][270247] Jinja search path: ['/var/cache/salt/minion/files/base']
2023-12-15 15:57:17,952 [salt.template :23 ][PROFILE ][270247] Time (in seconds) to render '/var/cache/salt/minion/files/base/test/init.sls' using 'jinja' renderer: 0.003893613815307617
2023-12-15 15:57:17,953 [salt.template :116 ][DEBUG ][270247] Rendered data from file: /var/cache/salt/minion/files/base/test/init.sls:
test:
  cmd.script:
    - source: salt://scripts/test.sh
    - name: test.sh
    - shell: /bin/bash
    - parallel: True
2023-12-15 15:57:17,954 [salt.template :23 ][PROFILE ][270247] Time (in seconds) to render '/var/cache/salt/minion/files/base/test/init.sls' using 'yaml' renderer: 0.0012822151184082031
2023-12-15 15:57:17,955 [salt.loader.lazy :977 ][DEBUG ][270247] The functions from module 'config' are being loaded by dir() on the loaded module
2023-12-15 15:57:17,955 [salt.utils.lazy :99 ][DEBUG ][270247] LazyLoaded config.option
2023-12-15 15:57:17,956 [salt.loader.lazy :977 ][DEBUG ][270247] The functions from module 'cmd' are being loaded by dir() on the loaded module
2023-12-15 15:57:17,956 [salt.utils.lazy :99 ][DEBUG ][270247] LazyLoaded cmd.script
2023-12-15 15:57:17,956 [salt.state :2295][INFO ][270247] Running state [test.sh] at time 15:57:17.956921
2023-12-15 15:57:17,957 [salt.state :2326][INFO ][270247] Executing state cmd.script for [test.sh]
2023-12-15 15:57:17,961 [salt.state :321 ][INFO ][270247] Started in a separate process
2023-12-15 15:57:17,962 [salt.state :2493][INFO ][270247] Completed state [test.sh] at time 15:57:17.962736 (duration_in_ms=5.81)
2023-12-15 15:57:17,968 [salt.loaded.int.log_handlers.fluent_mod:32 ][TRACE ][270261] The required configuration section, 'fluent_handler', was not found the in the configuration. Not loading the fluent logging handlers module.
2023-12-15 15:57:17,971 [salt.loaded.int.log_handlers.logstash_mod:32 ][TRACE ][270261] None of the required configuration sections, 'logstash_udp_handler' and 'logstash_zmq_handler', were found in the configuration. Not loading the Logstash logging handlers module.
2023-12-15 15:57:17,972 [salt.loader.lazy :32 ][TRACE ][270261] Error loading log_handlers.sentry_mod: Cannot find 'raven' python library, 'sentry_handler' config is empty or not defined
2023-12-15 15:57:17,975 [salt.loader.lazy :32 ][TRACE ][270261] Loaded cmdmod as virtual cmd
2023-12-15 15:57:17,976 [salt.loader.lazy :977 ][DEBUG ][270261] The functions from module 'cmd' are being loaded by dir() on the loaded module
2023-12-15 15:57:17,976 [salt.utils.lazy :99 ][DEBUG ][270261] LazyLoaded cmd.script
2023-12-15 15:57:17,977 [salt.loader.lazy :977 ][DEBUG ][270261] The functions from module 'cp' are being loaded by dir() on the loaded module
2023-12-15 15:57:17,978 [salt.utils.lazy :99 ][DEBUG ][270261] LazyLoaded cp.cache_file
2023-12-15 15:57:17,979 [salt.channel.client:32 ][TRACE ][270261] ReqChannel send crypt load={'path': 'scripts/test.sh', 'saltenv': 'base', 'cmd': '_file_hash'}
The log is silent for process 270261 after that.
top output for process 270261:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
270261 root 20 0 627832 58188 6368 S 100.0 0.0 4:29.45 /opt/saltstack/
ps -ef output for the same process:
root 270261 270247 99 15:57 ? 00:05:22 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-minion MultiMinionProcessManager MinionProcessManager ProcessPayload(jid=20231215145716488579) Minion._thread_return ParallelState(test.sh)
Expected behavior
Expect state.apply to return.
Versions Report
Salt Version:
Salt: 3006.5
Python Version:
Python: 3.10.13 (main, Nov 15 2023, 04:34:27) [GCC 11.2.0]
Dependency Versions:
cffi: 1.14.6
cherrypy: unknown
dateutil: 2.8.1
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.2
libgit2: Not Installed
looseversion: 1.0.2
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.2
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 22.0
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.9.8
pygit2: Not Installed
python-gnupg: 0.4.8
PyYAML: 6.0.1
PyZMQ: 23.2.0
relenv: 0.14.2
smmap: Not Installed
timelib: 0.2.4
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: ubuntu 16.04.3 jammy
locale: utf-8
machine: x86_64
release: 5.15.0-88-generic
system: Linux
version: Ubuntu 16.04.3 jammy
I can reproduce this, and am having the same problem with file.managed states. Oracle Linux 8.9.
Can reproduce in 3006.7. Oracle Linux 8.9.
@johje349 This is on Ubuntu 16.04. Can you confirm this is an issue on an OS which is currently supported and not itself EOL, or is there a typo here: 16.04 and jammy?? 16.04 is Xenial, not Jammy; something weird is going on there, and it makes one wonder what other strangeness could be occurring on an OS so far past its EOL.
Tried this with the latest Salt 3007.0 on CentOS 8 and was able to reproduce it when issuing the command from a salt-master; however, the same command using salt-call succeeded before a salt-master was installed:
- salt-call --local state.apply test
After running from a salt-master, even the local salt-call now fails (after a Ctrl-C of the salt command on the salt-master):
[DEBUG ] The functions from module 'config' are being loaded by dir() on the loaded module
[DEBUG ] LazyLoaded config.option
[DEBUG ] The functions from module 'saltutil' are being loaded by dir() on the loaded module
[DEBUG ] LazyLoaded saltutil.is_running
[DEBUG ] The functions from module 'highstate' are being loaded by dir() on the loaded module
[DEBUG ] LazyLoaded highstate.output
[TRACE ] data = {'local': ['The function "state.apply" is running as PID 14127 and was started at 2024, Mar 20 16:56:15.055554 with jid 20240320165615055554']}
local:
Data failed to compile:
----------
The function "state.apply" is running as PID 14127 and was started at 2024, Mar 20 16:56:15.055554 with jid 20240320165615055554
Even after a lengthy systemctl restart salt-minion, the above command still fails the same way.
Interestingly, this may be part of the problem: systemctl stop salt-minion still leaves processes running, including PID 14127, which is still alive even after the 'stop'.
Once all salt-minion processes were killed, the salt-call --local command worked correctly:
[DEBUG ] File /var/cache/salt/minion/accumulator/140237358224416 does not exist, no need to cleanup
[DEBUG ] The functions from module 'state' are being loaded by dir() on the loaded module
[DEBUG ] LazyLoaded state.check_result
[DEBUG ] The functions from module 'highstate' are being loaded by dir() on the loaded module
[DEBUG ] LazyLoaded highstate.output
[DEBUG ] The functions from module 'nested' are being loaded by dir() on the loaded module
[DEBUG ] LazyLoaded nested.output
[TRACE ] data = {'local': {'cmd_|-test_|-test.sh_|-script': {'name': 'test.sh', 'result': True, 'changes': {'pid': 20610, 'retcode': 0, 'stdout': 'test', 'stderr': ''}, 'comment': "Command 'test.sh' run", '__sls__': 'test', '__run_num__': 0, 'start_time': '11:14:11.568821', 'duration': '24.261 ms', '__id__': 'test', '__parallel__': True}}}
local:
----------
ID: test
Function: cmd.script
Name: test.sh
Result: True
Comment: Command 'test.sh' run
Started: 11:14:11.568821
Duration: 24.261 ms
Changes:
----------
pid:
20610
retcode:
0
stderr:
stdout:
test
Summary for local
------------
Succeeded: 1 (changed=1)
Failed: 0
------------
Total states run: 1
Total run time: 24.261 ms
[root@Unknown salt]#
Looks like we are hitting the race condition in the code that kills child processes (circa line 700) and not handling it nicely.
The problematic code (at the moment) appears to be here: https://github.com/saltstack/salt/blob/66bb993cbd2103f00cbe40c13a89a9894c6ea1ed/salt/state.py#L2257-L2288
where execution does not return after the proc.start(); that is, the code does not progress.
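For context, here is a minimal sketch of the pattern that code implements (names here are illustrative, not Salt's actual internals): the parent starts a worker process for the parallel state, returns a placeholder immediately, and is supposed to collect the real result later. If anything blocks inside the worker, the placeholder is all that ever exists and the run hangs, which matches the symptom above.

```python
# Illustrative sketch only -- simplified from the linked salt/state.py logic;
# function and variable names here are hypothetical.
import multiprocessing
import os
import pickle
import tempfile

RESULT_DIR = tempfile.mkdtemp()

def _parallel_target(tag, func, args):
    # Worker process: run the state function and drop its result on disk
    # so the parent can pick it up when reconciling parallel states.
    ret = func(*args)
    with open(os.path.join(RESULT_DIR, tag), "wb") as fp:
        pickle.dump(ret, fp)

def call_parallel(tag, func, args):
    proc = multiprocessing.Process(target=_parallel_target, args=(tag, func, args))
    proc.start()
    # The parent returns a stub right after proc.start(). If the worker
    # blocks (here: inside cp.cache_file), the on-disk result never
    # appears and the state run waits forever.
    return {"tag": tag, "comment": "Started in a separate process", "proc": proc}
```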
> @johje349 This is on Ubuntu 16.04. Can you confirm this is an issue on an OS which is currently supported and not itself EOL, or is there a typo here: 16.04 and jammy?? 16.04 is Xenial, not Jammy; something weird is going on there, and it makes one wonder what other strangeness could be occurring on an OS so far past its EOL.
Actually this is on Ubuntu 22:
cat /etc/issue
Ubuntu 22.04.3 LTS \n \l
salt --versions-report reports Ubuntu 16 for some reason:
salt --versions-report | grep dist
dist: ubuntu 16.04.3 jammy
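If anyone wants to chase the bogus dist string, a quick check (assuming the onedir Python ships the distro package, which Salt relies on for platform detection) is to compare what it reports against /etc/os-release:

```python
# Hedged diagnostic: run with the onedir interpreter, e.g.
#   /opt/saltstack/salt/bin/python3.10 check_dist.py
# If this prints 22.04/jammy while --versions-report says 16.04.3, the
# stale value is coming from somewhere inside Salt, not from the OS.
import distro

print(distro.id(), distro.version(best=True), distro.codename())

with open("/etc/os-release") as fp:
    print(fp.read())
```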
The problem appears to be due to a request for cp.cache_file to the master getting lost: the parallel task has a different pid etc., and the reply gets lost in the woodwork :( . See the instrumented log:
2024-04-03 14:05:46,852 [salt.state :2163][WARNING ][43416] DGM _call_parallel_target pre-instance states call, cdata '{'full': 'cmd.script', 'args': ['test.sh'], 'kwargs': {'success_stderr': None, 'success_stdout': None, 'success_retcodes': None, 'context': None, 'defaults': None, 'hide_output': False, 'output_loglevel': 'debug', 'use_vt': False, 'timeout': None, 'stateful': False, 'env': None, 'shell': '/bin/bash', 'password': None, 'runas': None, 'cwd': None, 'template': None, 'source': 'salt://scripts/test.sh'}}', tag 'cmd_|-test_|-test.sh_|-script'
2024-04-03 14:05:46,855 [salt.loader.lazy :32 ][TRACE ][43416] Loaded cmdmod as virtual cmd
2024-04-03 14:05:46,855 [salt.loader.lazy :960 ][DEBUG ][43416] The functions from module 'cmd' are being loaded by dir() on the loaded module
2024-04-03 14:05:46,858 [salt.utils.lazy :99 ][DEBUG ][43416] LazyLoaded cmd.script
2024-04-03 14:05:46,859 [salt.loaded.int.module.cmdmod:2856][WARNING ][43416] DGM cmd script entry, source 'salt://scripts/test.sh', args 'None', cwd 'None', stdin 'None', runas 'None', group 'None', shell '/bin/bash', python_shell 'True', env 'None', template 'None', umask 'None', output_encoding 'None', output_loglevel 'debug', log_callback 'None', hide_output 'False', timeout 'None', reset_system_locale 'True', saltenv 'base', use_vt 'False', bg 'False', password 'None', success_retcodes 'None', success_stdout 'None', success_stderr 'None', kwargs '{'context': {}}'
2024-04-03 14:05:46,859 [salt.loaded.int.module.cmdmod:2880][WARNING ][43416] DGM cmd script 1
2024-04-03 14:05:46,859 [salt.loaded.int.module.cmdmod:2890][WARNING ][43416] DGM cmd script 2
2024-04-03 14:05:46,860 [salt.loaded.int.module.cmdmod:2896][WARNING ][43416] DGM cmd script 3
2024-04-03 14:05:46,860 [salt.loaded.int.module.cmdmod:2916][WARNING ][43416] DGM cmd script - no template, source 'salt://scripts/test.sh', saltenv 'base'
2024-04-03 14:05:46,862 [salt.loader.lazy :960 ][DEBUG ][43416] The functions from module 'cp' are being loaded by dir() on the loaded module
2024-04-03 14:05:46,863 [salt.utils.lazy :99 ][DEBUG ][43416] LazyLoaded cp.cache_file
2024-04-03 14:05:46,865 [salt.channel.client:32 ][TRACE ][43416] ReqChannel send crypt load={'path': 'scripts/test.sh', 'saltenv': 'base', 'cmd': '_file_hash'}
2024-04-03 14:05:47,456 [salt.utils.schedule:32 ][TRACE ][42691] ==== evaluating schedule now None =====
2024-04-03 14:05:48,445 [salt.utils.schedule:32 ][TRACE ][42691] ==== evaluating schedule now None =====
2024-04-03 14:05:49,448 [salt.utils.schedule:32 ][TRACE ][42691] ==== evaluating schedule now None =====
2024-04-03 14:05:50,451 [salt.utils.schedule:32 ][TRACE ][42691] ==== evaluating schedule now None =====
2024-04-03 14:05:50,525 [salt.channel.client:32 ][TRACE ][42691] Decoding payload: {'enc': 'aes', 'load': b'n`~d\xee\xa1\xees\xcc\xeft\xc7\x0b\x06\x1ert\xb9Q%>\xc1i\xa9\xe1\xf1a\x93\x14\xae\x06\x17\xf3\xdd\xe5/\xa8^\xf2^\xb2\xf6\x02\x0e\xe4\xaaHR\xe2\x9c\x8aA|d\xbf\x91\xc8\x0eS\x0f\x17\xfd3\xaa G}M\xf6\xc1\xc7\x9f\xe9\x98\xcf\x80\xc3\xf0Q\x1a8\x13\xb53\x185\x02\x0e\x9d--\x04 \xd8\x97\xf9\x8eC\rw\xad\xba{\xcf@\xdb\xc4\x89\x9b\x99WB\xeb\x03e\x82\xc5\xeeoN&2\xe6Y\x85\xbf\xf5\xb9\xbc2\xb8`\xc6=\xd1Y\x10\x16\xdb\xb4E\xe4\xf3\x00J\xd6~\xe7 \xcf\x92\xc4\x16\xa5\xfe\x0f \x98\x9e\x13o\x16\x98\x98-\x17s\xf5\xc7zo*\xea\xd8tw\xf8\xda\xb8<.\x9d\x87C\x02a\xa0\x8d\xc1\x03\xc6\xe0q)', 'sig': b'a\xe1\xb3d\x8b\xd0\xe7\xa3\xba\x92\x96\xe3*!\xd1\xfa\x8c\xca\x17pW\xc1~\xbb<\xb3\xa6\xf4\xb1\x89\xdcn$g\xe2\xa5\x9b.\xaf\x8f\xbeL\x13\xbed\xe0\xad\xfd\xae\xcd\x0c\x9dof\xfa\xf1\xf5s*\x19GL\x1a\xecs\x1d\xa3\x91\x91\xaf\x905\xd4\x18*~\xc1+dQ\x9a\x12\x0fY\xa6\x84\xa8\r\xd0\xfa\x15t\xd4O\x8bk\xa6&\xf3ok\xac\xf9?U\x95\x84\xc2\xe3A\xf8o_\x1d\xd9\xb6r\xf7\x94\x05\xcd\xda\x02\xbfR\xec\xc0\x02\xbb\xcd\xb4:y\xfa\xa9\x08}\x8c;\x00\xc5\xaa4\x9e\xb5\xd9\xeen,\x05\x18\x8f\x0f\xaa%[}\x1a\xa3/M\xdf\x11\xf1\\\xa9\xf1P\x04\r\x9e,\x8d\x01\x87rQSM\\^\xceU\xcf\xc4\xd0.\x1d6\xaa\xc1\x18n\xa6\xb3\xce\xd0\x14\x95{\'\x0c\xac&\xea\xf7\xf1c\xce=\xbb>\xbb\x10:.:a\x89x\xd3\xd3\xe4"\xaa\xc6J\x0b^{=\x9bc`cH[\xccN\xb4N\xe6\x8b\xc8\x12\xc9\xbb\x1b\xcf\x98k\xa2\xcf-\xdd#'}
2024-04-03 14:05:50,526 [salt.channel.client:483 ][DEBUG ][42691] PubChannel received: {'enc': 'aes', 'load': {'fun': 'saltutil.find_job', 'arg': ['20240403200545157394'], 'tgt': ['tc8'], 'jid': '20240403200550254403', 'ret': '', 'tgt_type': 'list', 'user': 'sudo_david'}, 'sig': b'a\xe1\xb3d\x8b\xd0\xe7\xa3\xba\x92\x96\xe3*!\xd1\xfa\x8c\xca\x17pW\xc1~\xbb<\xb3\xa6\xf4\xb1\x89\xdcn$g\xe2\xa5\x9b.\xaf\x8f\xbeL\x13\xbed\xe0\xad\xfd\xae\xcd\x0c\x9dof\xfa\xf1\xf5s*\x19GL\x1a\xecs\x1d\xa3\x91\x91\xaf\x905\xd4\x18*~\xc1+dQ\x9a\x12\x0fY\xa6\x84\xa8\r\xd0\xfa\x15t\xd4O\x8bk\xa6&\xf3ok\xac\xf9?U\x95\x84\xc2\xe3A\xf8o_\x1d\xd9\xb6r\xf7\x94\x05\xcd\xda\x02\xbfR\xec\xc0\x02\xbb\xcd\xb4:y\xfa\xa9\x08}\x8c;\x00\xc5\xaa4\x9e\xb5\xd9\xeen,\x05\x18\x8f\x0f\xaa%[}\x1a\xa3/M\xdf\x11\xf1\\\xa9\xf1P\x04\r\x9e,\x8d\x01\x87rQSM\\^\xceU\xcf\xc4\xd0.\x1d6\xaa\xc1\x18n\xa6\xb3\xce\xd0\x14\x95{\'\x0c\xac&\xea\xf7\xf1c\xce=\xbb>\xbb\x10:.:a\x89x\xd3\xd3\xe4"\xaa\xc6J\x0b^{=\x9bc`cH[\xccN\xb4N\xe6\x8b\xc8\x12\xc9\xbb\x1b\xcf\x98k\xa2\xcf-\xdd#'} <bound method Minion._handle_payload of <salt.minion.Minion object at 0x7fd642e2f2e0>>
2024-04-03 14:05:50,526 [salt.channel.client:485 ][DEBUG ][42691] DGM PubChannel received decoded: {'enc': 'aes', 'load': {'fun': 'saltutil.find_job', 'arg': ['20240403200545157394'], 'tgt': ['tc8'], 'jid': '20240403200550254403', 'ret': '', 'tgt_type': 'list', 'user': 'sudo_david'}, 'sig': b'a\xe1\xb3d\x8b\xd0\xe7\xa3\xba\x92\x96\xe3*!\xd1\xfa\x8c\xca\x17pW\xc1~\xbb<\xb3\xa6\xf4\xb1\x89\xdcn$g\xe2\xa5\x9b.\xaf\x8f\xbeL\x13\xbed\xe0\xad\xfd\xae\xcd\x0c\x9dof\xfa\xf1\xf5s*\x19GL\x1a\xecs\x1d\xa3\x91\x91\xaf\x905\xd4\x18*~\xc1+dQ\x9a\x12\x0fY\xa6\x84\xa8\r\xd0\xfa\x15t\xd4O\x8bk\xa6&\xf3ok\xac\xf9?U\x95\x84\xc2\xe3A\xf8o_\x1d\xd9\xb6r\xf7\x94\x05\xcd\xda\x02\xbfR\xec\xc0\x02\xbb\xcd\xb4:y\xfa\xa9\x08}\x8c;\x00\xc5\xaa4\x9e\xb5\xd9\xeen,\x05\x18\x8f\x0f\xaa%[}\x1a\xa3/M\xdf\x11\xf1\\\xa9\xf1P\x04\r\x9e,\x8d\x01\x87rQSM\\^\xceU\xcf\xc4\xd0.\x1d6\xaa\xc1\x18n\xa6\xb3\xce\xd0\x14\x95{\'\x0c\xac&\xea\xf7\xf1c\xce=\xbb>\xbb\x10:.:a\x89x\xd3\xd3\xe4"\xaa\xc6J\x0b^{=\x9bc`cH[\xccN\xb4N\xe6\x8b\xc8\x12\xc9\xbb\x1b\xcf\x98k\xa2\xcf-\xdd#'}
2024-04-03 14:05:50,526 [salt.channel.client:486 ][DEBUG ][42691] DGM PubChannel received callback: <bound method Minion._handle_payload of <salt.minion.Minion object at 0x7fd642e2f2e0>>
2024-04-03 14:05:50,526 [salt.channel.client:489 ][WARNING ][42691] DGM PubChannel received callback and decode, pre await callback-decode
2024-04-03 14:05:51,539 [salt.loader.lazy :960 ][DEBUG ][42691] The functions from module 'list_match' are being loaded by dir() on the loaded module
2024-04-03 14:05:51,539 [salt.utils.lazy :99 ][DEBUG ][42691] LazyLoaded list_match.match
2024-04-03 14:05:51,539 [salt.minion :1746][WARNING ][42691] DGM _handle_decoded_payload entry, data '{'fun': 'saltutil.find_job', 'arg': ['20240403200545157394'], 'tgt': ['tc8'], 'jid': '20240403200550254403', 'ret': '', 'tgt_type': 'list', 'user': 'sudo_david'}'
2024-04-03 14:05:51,539 [salt.minion :1750][INFO ][42691] User sudo_david Executing command saltutil.find_job with jid 20240403200550254403
2024-04-03 14:05:51,540 [salt.minion :1758][DEBUG ][42691] Command details {'fun': 'saltutil.find_job', 'arg': ['20240403200545157394'], 'tgt': ['tc8'], 'jid': '20240403200550254403', 'ret': '', 'tgt_type': 'list', 'user': 'sudo_david'}
2024-04-03 14:05:51,540 [salt.minion :32 ][TRACE ][42691] Started JIDs: ['20240403200545157394']
2024-04-03 14:05:51,540 [salt.minion :1810][WARNING ][42691] DGM _handle_decoded_payload, data '{'fun': 'saltutil.find_job', 'arg': ['20240403200545157394'], 'tgt': ['tc8'], 'jid': '20240403200550254403', 'ret': '', 'tgt_type': 'list', 'user': 'sudo_david'}', name 'ProcessPayload(jid=20240403200550254403)', multiprocessing_enabled 'True'
2024-04-03 14:05:51,543 [salt.minion :1841][WARNING ][42691] DGM _handle_decoded_payload, multiprocessing_enabled 'True', pre-add subprocess_list process pid '43644', name 'ProcessPayload(jid=20240403200550254403)'
2024-04-03 14:05:51,544 [salt.utils.process:1146][DEBUG ][42691] Subprocess ProcessPayload(jid=20240403200550254403) added
2024-04-03 14:05:51,545 [salt.minion :1843][WARNING ][42691] DGM _handle_decoded_payload, multiprocessing_enabled 'True', post-add subprocess_list process pid '43644', name 'ProcessPayload(jid=20240403200550254403)'
2024-04-03 14:05:51,545 [salt.channel.client:495 ][WARNING ][42691] DGM got a typeerror exception 'object NoneType can't be used in 'await' expression'
2024-04-03 14:05:51,545 [salt.transport.zeromq:396 ][ERROR ][42691] Exception while running callback
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/transport/zeromq.py", line 394, in consume
    await callback(msg)
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/channel/client.py", line 491, in wrap_callback
    await callback(decoded)
TypeError: object NoneType can't be used in 'await' expression
2024-04-03 14:05:51,547 [salt.utils.schedule:32 ][TRACE ][42691] ==== evaluating schedule now None =====
See how the entry at timestamp 2024-04-03 14:05:46,865 is handled, noting that all of this works fine when no master is involved (salt-call --local state.apply test).
I have a feeling that since the jid is not different, the mapping gets replies delivered to the wrong pid; that is, the parallel pid is using the originating jid and thus sowing confusion. At least that is an avenue of investigation.
The line https://github.com/saltstack/salt/blob/1998ba3424a56c1fc00a5845f447ca6ea9f8a514/salt/modules/cmdmod.py#L2893 is executed but the following line is never reached; that is, there is no return from the __salt__["cp.cache_file"](source, saltenv) call.
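To confirm it is specifically that call that wedges (rather than something later), one option is a throwaway probe that runs the suspect request with a timeout. This is just a test-harness sketch, not Salt API; the commented-out line marks where the real call would go inside a minion execution module:

```python
# Probe sketch: run the suspect call in a child process and give up after
# a timeout instead of hanging the whole minion. Standalone-runnable.
import multiprocessing

def _probe(queue):
    # In a real execution module this would be:
    #   queue.put(__salt__["cp.cache_file"]("salt://scripts/test.sh", "base"))
    queue.put("placeholder so the sketch runs standalone")

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_probe, args=(queue,))
    proc.start()
    proc.join(timeout=30)
    if proc.is_alive():
        proc.terminate()
        print("cp.cache_file did not return within 30s -- reply lost?")
    else:
        print("got:", queue.get())
```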
This appears to be an issue with the master only recognizing the originating pid, e.g. pid 26942, and not the parallel pid 26958; all the master talks about is 26942, for example:
2024-04-04 13:55:51,966 [salt.utils.job :86 ][INFO ][25886] Got return from tc8 for job 20240404195551601792
2024-04-04 13:55:51,967 [salt.utils.event :823 ][DEBUG ][25886] Sending event: tag = salt/job/20240404195551601792/ret/tc8; data = {'cmd': '_return', 'id': 'tc8', 'success': True, 'return': {'pid': 26942, 'fun': 'state.apply', 'arg': ['test'], 'tgt': 'tc8', 'jid': '20240404195546199368', 'ret': '', 'tgt_type': 'glob', 'user': 'sudo_david'}, 'retcode': 0, 'jid': '20240404195551601792', 'fun': 'saltutil.find_job', 'fun_args': ['20240404195546199368'], 'user': 'sudo_david', '_stamp': '2024-04-04T19:55:51.967017'}
2024-04-04 13:55:51,976 [salt.loaded.int.returner.local_cache:307 ][DEBUG ][25886] Reading minion list from /var/cache/salt/master/jobs/bf/47776492ffffb096e5bf47c4369f137e838a8ae5fc45e27d2d1da7778d7053/.minions.p
2024-04-04 13:55:51,976 [salt.master :32 ][TRACE ][25886] Master function call _return took 0.012319564819335938 seconds
2024-04-04 13:55:51,978 [salt.utils.event :32 ][TRACE ][26924] get_event() caching unwanted event = {'data': {'cmd': '_return', 'id': 'tc8', 'success': True, 'return': {'pid': 26942, 'fun': 'state.apply', 'arg': ['test'], 'tgt': 'tc8', 'jid': '20240404195546199368', 'ret': '', 'tgt_type': 'glob', 'user': 'sudo_david'}, 'retcode': 0, 'jid': '20240404195551601792', 'fun': 'saltutil.find_job', 'fun_args': ['20240404195546199368'], 'user': 'sudo_david', '_stamp': '2024-04-04T19:55:51.967017'}, 'tag': 'salt/job/20240404195551601792/ret/tc8'}
2024-04-04 13:55:51,978 [salt.utils.event :32 ][TRACE ][26924] _get_event() waited 0 seconds and received nothing
2024-04-04 13:55:51,979 [salt.utils.event :32 ][TRACE ][26924] Get event. tag: salt/job/20240404195551601792
2024-04-04 13:55:51,979 [salt.utils.event :32 ][TRACE ][26924] get_event() returning cached event = {'data': {'cmd': '_return', 'id': 'tc8', 'success': True, 'return': {'pid': 26942, 'fun': 'state.apply', 'arg': ['test'], 'tgt': 'tc8', 'jid': '20240404195546199368', 'ret': '', 'tgt_type': 'glob', 'user': 'sudo_david'}, 'retcode': 0, 'jid': '20240404195551601792', 'fun': 'saltutil.find_job', 'fun_args': ['20240404195546199368'], 'user': 'sudo_david', '_stamp': '2024-04-04T19:55:51.967017'}, 'tag': 'salt/job/20240404195551601792/ret/tc8'}
Wondering about that 'caching unwanted event': could it be due to something from pid 26958?
The pid the master is dealing with comes from the class SaltEvent.connect_pub setting self.subscriber, so I am betting it is using the connection it has with pid 26942, which it knows about, since pid 26958 never went and made any connections (betting; caching ideas before going on vacation).
Not wanting to have to do a whole new connect_pub etc., perhaps the solution is to have the original pid 26942 check replies to see if it has children that need to receive whatever packet it gets.
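Something along these lines is what I have in mind (purely illustrative; EventFanout and these method names are not Salt's API): the original pid keeps a channel per parallel child, and forwards any reply whose jid belongs to a child instead of consuming it itself.

```python
# Illustrative-only sketch of the proposed fan-out in the parent (pid 26942):
# parallel children register the jid they are waiting on, and the parent
# forwards matching replies down a pipe instead of handling them locally.
import multiprocessing

class EventFanout:
    def __init__(self):
        self._children = {}  # jid -> Connection to the waiting child

    def register(self, jid, conn):
        self._children[jid] = conn

    def dispatch(self, event):
        child = self._children.get(event.get("jid"))
        if child is not None:
            # A parallel child owns this jid: hand the reply over.
            child.send(event)
        else:
            self._handle_locally(event)

    def _handle_locally(self, event):
        print("parent handling", event)

if __name__ == "__main__":
    parent_end, child_end = multiprocessing.Pipe()
    fanout = EventFanout()
    fanout.register("20240403200545157394", parent_end)
    fanout.dispatch({"jid": "20240403200545157394", "fun": "cp.cache_file"})
    print("child got:", child_end.recv())
```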
Also, systemctl stop salt-minion is not working: it kills only the initial pid and none of the children. See ProcessManager.kill_children; all we get in the logs is log.trace("Waiting to kill process manager children").
The process manager's kill_children has an empty self._process_map when called, but clearly it should not. Oddly, even if you remove the parallel part of the state file, re-run the command, and do the systemctl stop salt-minion, it too calls kill_children with an empty self._process_map (see the sketch after the process listing below).
However, in that case it does stop all salt-minion processes, which were:
[root@localhost salt]# ps -ef | grep salt-minion
root 14832 1 0 13:32 ? 00:00:00 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-minion
root 14839 14832 2 13:32 ? 00:00:01 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-minion MultiMinionProcessManager MinionProcessManager
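For what it's worth, here is a minimal illustration (not Salt code; class and method names are hypothetical) of one way kill_children can see an empty map: children get registered on one ProcessManager instance while the shutdown path tears down a different, freshly constructed instance whose _process_map never had anything in it.

```python
# Minimal repro of the "empty _process_map" failure mode described above.
# Hypothetical sketch, not salt.utils.process itself.
import multiprocessing
import time

def _sleeper():
    time.sleep(60)

class ProcessManagerSketch:
    def __init__(self):
        self._process_map = {}

    def add_process(self, target):
        proc = multiprocessing.Process(target=target)
        proc.start()
        self._process_map[proc.pid] = proc
        return proc

    def kill_children(self):
        if not self._process_map:
            # This is the only trace we see in the minion logs.
            print("Waiting to kill process manager children")
            return
        for proc in self._process_map.values():
            proc.terminate()

if __name__ == "__main__":
    manager = ProcessManagerSketch()
    child = manager.add_process(_sleeper)
    # Shutdown path uses a different instance: its map is empty, so the
    # child survives, just like the lingering salt-minion processes above.
    ProcessManagerSketch().kill_children()
    print("child still alive:", child.is_alive())
    child.terminate()  # cleanup for the sketch
```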