[BUG] Master failback does not work if none of the masters can be resolved
Description If a minion is set up in Multi-Master mode, each master is given as a domain name, and none of the domain names can be resolved, the minion keeps retrying only the last master in the list and never tries the first one again, even if the master_failback parameter is set.
Setup minion config:
master:
- examplehostname
- examplehostname.local
master_type: failover
master_failback: True
retry_dns: 0
Please be as specific as possible and give set-up details.
- [x] on-prem machine
- [ ] VM (Virtualbox, KVM, etc. please specify)
- [ ] VM running on a cloud service, please be explicit and add details
- [ ] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [ ] classic packaging
- [ ] onedir packaging
- [x] used bootstrap to install
Steps to Reproduce the behavior
- Do not set up a master on the network, to simulate the master being down or disconnected.
- Set up the minion with the configuration file above and run salt-minion.
Expected behavior The minion fails back to trying the first master if it cannot resolve the last one (because the first master might be up by now).
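Before starting the minion, it can help to confirm that the reproduction precondition holds, namely that neither configured name resolves on the test network. The snippet below is only a sketch using Python's standard `socket` module; the helper name `resolves` is hypothetical, and Salt's own resolution logic may differ in details such as timeouts and address-family handling.

```python
import socket


def resolves(host):
    """Return True if the system resolver can resolve `host`, else False."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised on lookup failure.
        return False


if __name__ == "__main__":
    # Names taken from the minion config above; both should print False
    # for the bug to reproduce as described.
    for name in ("examplehostname", "examplehostname.local"):
        print(name, resolves(name))
```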
Versions Report
salt --versions-report
No difference in salt versions between master and minion.
Salt Version:
Salt: 3007.0
Python Version:
Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]
Dependency Versions:
cffi: 1.16.0
cherrypy: 18.8.0
dateutil: 2.8.2
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: Not Installed
looseversion: 1.3.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.7
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 23.1
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: Not Installed
python-gnupg: 0.5.2
PyYAML: 6.0.1
PyZMQ: 25.1.2
relenv: 0.15.1
smmap: Not Installed
timelib: 0.3.0
Tornado: 6.3.3
ZMQ: 4.3.4
Salt Package Information:
Package Type: onedir
System Versions:
dist: ubuntu 22.04.4 jammy
locale: utf-8
machine: x86_64
release: 6.5.0-28-generic
system: Linux
version: Ubuntu 22.04.4 jammy
Additional context I help to manage 50 laptops we use for various events. The setup has to be flexible and work on different networks, so we try to use multicast DNS names for resolution. Some networks don't support mDNS but do resolve the plain hostname. Therefore, we've had decent success by including both the salt-master's hostname and its hostname.local. Unfortunately, neither of these name resolution techniques is very reliable, so it would be useful for the salt minions to keep trying both rather than just the last one.
Potentially why this is occurring Without diving too deep into the code base, here is what I've observed:
- At line 687 in minion.py, `opts["master"]`, which was originally a list, is set to just one of the masters: `opts["master"] = master`
- Since none of the master names can be resolved, the error on line 702 is raised: `raise SaltClientError(msg)`
- The coroutine waits according to the `acceptance_wait_time` parameter in the minion config
- The loop repeats and `eval_master` is called. Since `opts["master"]` is now a string, the conditional on line 600, `elif isinstance(opts["master"], str) and ("master_list" not in opts):`, is taken instead of the failed conditional on line 611, `elif failed:`, which would set `opts["master"]` back to the list.
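The sequence above can be illustrated with a toy model. This is not Salt's code: `eval_master_model`, `resolve`, the local `SaltClientError` stand-in, and the master names are all hypothetical, and the model assumes (as the observations suggest) that `failed` is true on every retry and that `master_list` is never added to `opts` because no master ever resolves.

```python
class SaltClientError(Exception):
    """Local stand-in for the exception the report says is raised."""


def resolve(host):
    """Hypothetical resolver: in this scenario no name ever resolves."""
    raise SaltClientError(f"cannot resolve {host}")


def eval_master_model(opts, failed=False):
    """Toy version of the branch order described in the report."""
    if isinstance(opts["master"], list):
        candidates = list(opts["master"])
    elif isinstance(opts["master"], str) and ("master_list" not in opts):
        # Taken on every retry: the lone string is treated as the only master.
        candidates = [opts["master"]]
    elif failed:
        # Where the full list would be restored; never reached in this model.
        candidates = list(opts["master_list"])
    else:
        candidates = [opts["master"]]

    attempted = []
    for master in candidates:
        opts["master"] = master  # the configured list is overwritten by one name
        attempted.append(master)
        try:
            resolve(master)
            break
        except SaltClientError:
            continue
    return attempted


opts = {"master": ["master-a", "master-b"]}
# First call is the initial start; later calls model the retry loop.
history = [eval_master_model(opts, failed=(i > 0)) for i in range(3)]
print(history)
```

In this model the first round tries both names, and every later round tries only `master-b`, which matches the behavior reported.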
Potential Solution
I don't plan on opening a pull request, since I am not familiar enough with Salt to know whether this would break anything else, but changing line 600 in minion.py to `elif isinstance(opts["master"], str) and ("master_list" not in opts) and not failed:` seemed to fix the issue.
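To make the effect of the extra `and not failed` concrete, here is a small sketch that only reports which branch of the conditional chain would be selected, with and without the change. The function `chosen_branch`, its `patched` flag, and the branch labels are hypothetical; the sketch says nothing about what Salt does inside each branch or whether the change is safe elsewhere.

```python
def chosen_branch(opts, failed, patched):
    """Return a label for the branch the described conditional chain selects."""
    if isinstance(opts["master"], list):
        return "list"

    # Condition reported at line 600.
    str_cond = isinstance(opts["master"], str) and ("master_list" not in opts)
    if patched:
        # Proposed change: skip this branch when a previous attempt failed.
        str_cond = str_cond and not failed

    if str_cond:
        return "single-string"
    if failed:
        # Branch reported at line 611.
        return "failed"
    return "other"


# State after a failed round: opts["master"] holds one name, no master_list.
opts = {"master": "master-b"}
print(chosen_branch(opts, failed=True, patched=False))
print(chosen_branch(opts, failed=True, patched=True))
```

With the original condition the single-string branch wins even though `failed` is true; with the proposed condition control falls through to the `failed` branch instead.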
Temporary Workaround Adding an IP address such as 127.0.0.1 to the list of masters fixes this issue.