[BUG] Intermittent connection between master and minion
Description
I am seeing a weird connection issue in my Salt setup. There are ~30 minions registered with the master. After a while, the master could no longer reach a few of them: `salt '*' test.ping` failed with the following error message:

```
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20230920213139507242
```
Here are a few observations:

- Restarting the salt-minion service helped, but the same minion lost its connection again after a while.
- `salt-call test.ping` works fine on the minion side. Other commands like `salt-call state.apply` also work fine. This indicates minion-to-master communication is fine but master-to-minion communication is not.
- Below is the error message I found in the minion log. I tried bumping up the timeout parameter, e.g. `salt '*' -t 600 test.ping`, but it doesn't help:

```
2023-09-20 13:00:08,121 [salt.minion :2733][ERROR ][821760] Timeout encountered while sending {'cmd': '_return', 'id': 'minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20230920195941006337', 'fun': 'test.ping', 'fun_args': [], 'user': 'root', '_stamp': '2023-09-20T19:59:41.114944', 'nonce': '3b23a38761fc4e98a694448d36ac7f97'} request
```

Does anyone have any idea what's wrong here and how to debug this issue?
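The symptom (minion-to-master `salt-call` fine, master-to-minion publish timing out) points at the publish channel. Below is a hedged debugging sketch, assuming the default ZeroMQ transport on the default ports 4505/4506 (adjust if your master config overrides `publish_port`/`ret_port`); the commands are guarded so they are no-ops where Salt is not on `PATH`:

```shell
# Assumption: default ZeroMQ transport and default ports.
PUB_PORT=4505   # publish channel (master -> minion jobs)
RET_PORT=4506   # return channel (minion -> master results)

# On an affected minion: both TCP connections to the master should be
# ESTABLISHED. A missing or half-open :4505 connection matches
# "minion->master fine, master->minion broken".
if command -v ss >/dev/null 2>&1; then
  ss -tn state established "( dport = :$PUB_PORT or dport = :$RET_PORT )"
fi

# On the master: which minions does the master currently consider reachable?
if command -v salt-run >/dev/null 2>&1; then
  salt-run manage.status
fi

# Re-run the failing ping with debug logging to watch the publish attempt.
if command -v salt >/dev/null 2>&1; then
  salt -l debug '*' test.ping
fi
```

If the :4505 connection on a broken minion turns out to be missing or stuck, an idle-timeout on a NAT or firewall silently dropping the long-lived publish connection is a common culprit; the `tcp_keepalive*` settings in the minion config are often worth checking in that case.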
Setup
- The minion was installed with `sudo ./bootstrap-salt.sh -A <master-ip-address> -i $(hostname) stable 3006.3`. No custom config on the minion.
- The master runs inside a container using the image `saltstack/salt:3006.3`. Master config:
```yaml
nodegroups:
  prod-early-adopter: L@minion-hostname-1
  prod-general-population: L@minion-hostname-2
  release: L@minion-hostname-3
  custom: L@minion-hostname-4

file_roots:
  base:
    - <path/to/custom/state/file>
```
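As a sanity check that each nodegroup above resolves to its intended minion, the groups can be pinged individually from the master. A sketch (group names copied from the config above; guarded so it is a no-op where `salt` is not installed):

```shell
# Ping each nodegroup defined in the master config, one group at a time.
NODEGROUPS="prod-early-adopter prod-general-population release custom"

if command -v salt >/dev/null 2>&1; then
  for ng in $NODEGROUPS; do
    echo "== $ng =="
    salt -N "$ng" test.ping   # -N targets a nodegroup by name
  done
fi
```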
State file:

```yaml
pull_state_job:
  schedule.present:
    - function: state.apply
    - maxrunning: 1
    - when: 8:00pm

deploy:
  cmd.run:
    - name: '<custom-command-here>'
    - runas: ubuntu
```
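Since the affected minions also run this scheduled nightly `state.apply`, it may help to confirm what the minion-side scheduler actually loaded. A sketch using the `schedule` execution module (run on a minion; job name taken from the state above; guarded so it is a no-op where `salt-call` is absent):

```shell
# Inspect the minion-side scheduler for the job defined in the state above.
JOB_NAME="pull_state_job"

if command -v salt-call >/dev/null 2>&1; then
  salt-call schedule.list                           # full schedule as the minion sees it
  salt-call schedule.show_next_fire_time "$JOB_NAME"
fi
```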
Please be as specific as possible and give set-up details.
- [x] on-prem machine
- [ ] VM (Virtualbox, KVM, etc. please specify)
- [ ] VM running on a cloud service, please be explicit and add details
- [x] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [ ] classic packaging
- [ ] onedir packaging
- [x] used bootstrap to install
Steps to Reproduce the behavior

No deterministic reproduction steps found: the connection drops on its own after a while (see Description).
Expected behavior

`salt '*' test.ping` returns `True` from all ~30 minions.
Versions Report

Provided by running `salt --versions-report`. Master and minions both run Salt 3006.3 (bootstrap `stable 3006.3` on the minions, image `saltstack/salt:3006.3` on the master), so there is no master/minion version difference.

```
Salt Version:
  Salt: 3006.3

Python Version:
  Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]

Dependency Versions:
  cffi: 1.14.6
  cherrypy: unknown
  dateutil: 2.8.1
  docker-py: Not Installed
  gitdb: Not Installed
  gitpython: Not Installed
  Jinja2: 3.1.2
  libgit2: Not Installed
  looseversion: 1.0.2
  M2Crypto: Not Installed
  Mako: Not Installed
  msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
  packaging: 22.0
  pycparser: 2.21
  pycrypto: Not Installed
  pycryptodome: 3.9.8
  pygit2: Not Installed
  python-gnupg: 0.4.8
  PyYAML: 6.0.1
  PyZMQ: 23.2.0
  relenv: Not Installed
  smmap: Not Installed
  timelib: 0.2.4
  Tornado: 4.5.3
  ZMQ: 4.3.4

System Versions:
  dist: alpine 3.14.6
  locale: utf-8
  machine: x86_64
  release: 5.11.0-1022-aws
  system: Linux
  version: Alpine Linux 3.14.6
```