[BUG] 3007 master: 4505 publish port fails to respond to salt commands unless the master is restarted
Description
The 3007 master does not respond to salt commands unless the master is restarted, but it does answer salt-call run from the minion. The timeout issue leaves nothing in the /var/log/salt/master file with the log file level set to debug; currently trying the trace level:
log_level_logfile: trace
Setup (Please provide relevant configs and/or SLS files; be sure to remove sensitive info. There is no general set-up of Salt.)
Please be as specific as possible and give set-up details.
- [x] on-prem machine
- [x] VM (Virtualbox, KVM, etc. please specify)
- [ ] VM running on a cloud service, please be explicit and add details
- [ ] container (Kubernetes, Docker, containerd, etc. please specify)
- [ ] or a combination, please be explicit
- [ ] jails if it is FreeBSD
- [ ] classic packaging
- [x] onedir packaging
- [x] used bootstrap to install
Steps to Reproduce the behavior: $ sudo salt minion-on-saltmaster test.ping
Expected behavior: salt-master should answer the test.ping command quickly. Instead, it times out.
Screenshots
- cmd line
[me@salt01 ~]$ sudo salt salt01 test.ping
salt01:
Minion did not return. [No response]
The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
salt-run jobs.lookup_jid 20240328074306181473
-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 1
# of minions returned: 0
# of minions that did not return: 1
# of minions with errors: 0
-------------------------------------------
ERROR: Minions returned with non-zero exit code
[me@salt01 ~]$
- and this is the trace log in /var/log/salt/master when a minion is not responding.
2024-03-28 02:36:29,940 [salt.loader.lazy ][DEBUG ] The functions from module 'no_return' are being loaded by dir() on the loaded module
2024-03-28 02:36:29,940 [salt.utils.lazy ][DEBUG ] LazyLoaded no_return.output
2024-03-28 02:36:29,941 [salt.output ][TRACE ] data = {'salt01': 'Minion did not return. [No response]\nThe minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:\n\nsalt-run jobs.lookup_jid 20240328073609814061'}
- salt-master status
Mar 27 13:09:15 salt01 salt-master[2162920]: [ERROR ] Publish server binding pub to /var/run/salt/master/master_event_pub.ipc ssl=None
Mar 27 14:00:04 salt01 salt-master[2163029]: [ERROR ] Requested method not exposed: _ext_nodes
Mar 27 15:00:04 salt01 salt-master[2163274]: [ERROR ] Requested method not exposed: _ext_nodes
<snipped>
Mar 28 03:00:04 salt01 salt-master[2163147]: [ERROR ] Requested method not exposed: _ext_nodes
Mar 28 04:00:07 salt01 salt-master[2163282]: [ERROR ] Requested method not exposed: _ext_nodes
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
Salt: 3007.0
Python Version:
Python: 3.10.13 (main, Feb 19 2024, 03:31:20) [GCC 11.2.0]
Dependency Versions:
cffi: 1.16.0
cherrypy: unknown
dateutil: 2.8.2
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.1.3
libgit2: Not Installed
looseversion: 1.3.0
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.7
msgpack-pure: Not Installed
mysql-python: Not Installed
packaging: 23.1
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.19.1
pygit2: Not Installed
python-gnupg: 0.5.2
PyYAML: 6.0.1
PyZMQ: 25.1.2
relenv: 0.15.1
smmap: Not Installed
timelib: 0.3.0
Tornado: 6.3.3
ZMQ: 4.3.4
Salt Package Information:
Package Type: onedir
System Versions:
dist: rocky 8.9 Green Obsidian
locale: utf-8
machine: x86_64
release: 4.18.0-513.18.1.el8_9.x86_64
system: Linux
version: Rocky Linux 8.9 Green Obsidian
Additional context
Add any other context about the problem here.
@KiyoIchikawa, do you still have this issue?
- So pushing out commands doesn't work unless the master is rebooted, while answering a minion's call works fine. Following are two logs showing the master vs. client behaviour.
/var/log/salt/master log with trace level enabled: after running the sudo salt salt01 test.ping command on the master, it just times out.
2024-03-28 08:33:40,794 [salt.utils.event ][TRACE ] _get_event() waited 0 seconds and received nothing
2024-03-28 08:33:40,804 [salt.utils.event ][TRACE ] Get event. tag: (salt/job|syndic/.*)/20240328133320851221
<snipped>
2024-03-28 08:33:40,886 [salt.utils.event ][TRACE ] Get event. tag: salt/job/20240328133325868511
2024-03-28 08:33:40,886 [salt.utils.event ][TRACE ] _get_event() waited 0 seconds and received nothing
2024-03-28 08:33:40,887 [salt.client ][DEBUG ] return event: {'salt01': {'failed': True}}
2024-03-28 08:33:40,892 [salt.loader.lazy ][DEBUG ] The functions from module 'localfs' are being loaded by dir() on the loaded module
2024-03-28 08:33:40,893 [salt.utils.lazy ][DEBUG ] LazyLoaded localfs.init_kwargs
2024-03-28 08:33:40,980 [salt.loader.lazy ][DEBUG ] The functions from module 'localfs' are being loaded by dir() on the loaded module
2024-03-28 08:33:40,980 [salt.utils.lazy ][DEBUG ] LazyLoaded localfs.init_kwargs
2024-03-28 08:33:40,982 [salt.loader.lazy ][DEBUG ] The functions from module 'no_return' are being loaded by dir() on the loaded module
2024-03-28 08:33:40,982 [salt.utils.lazy ][DEBUG ] LazyLoaded no_return.output
2024-03-28 08:33:40,982 [salt.output ][TRACE ] data = {'salt01': 'Minion did not return. [No response]\nThe minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:\n\nsalt-run jobs.lookup_jid 20240328133320851221'}
2024-03-28 08:33:40,982 [salt.transport.tcp][DEBUG ] tcp stream to closed, unable to recv
2024-03-28 08:33:41,973 [salt.utils.process][TRACE ] Process manager iteration
2024-03-28 08:33:43,754 [salt.utils.process][TRACE ] Process manager iteration
2024-03-28 08:33:51,983 [salt.utils.process][TRACE ] Process manager iteration
2024-03-28 08:33:53,765 [salt.utils.process][TRACE ] Process manager iteration
/var/log/salt/master log with trace level enabled, after running the sudo salt-call test.ping command: the salt-master is answering the query.
```
2024-03-28 08:43:33,533 [salt.crypt ][DEBUG ] salt.crypt.get_rsa_key: Loading private key
2024-03-28 08:43:33,533 [salt.crypt ][DEBUG ] salt.crypt.sign_message: Signing message.
2024-03-28 08:43:33,533 [salt.transport.tcp][TRACE ] TCP PubServer sending payload: topic_list=None 'salt/auth\n\n��resultãact�accept�id�salt01�pub�\x01�-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9xxxxx-masked--Z7HIgb+x+dyYNDe\nb3RFpYx7QFxzHmShDutLMyjfAR26npS8fsN1+nHcoXTkHC6kE2OjC0yc/rp5Crr6\nIQIDAQAB\n-----END PUBLIC KEY-----�_stamp�2024-03-28T13:43:33.532394'
2024-03-28 08:43:33,533 [salt.transport.tcp][TRACE ] TCP PubServer finished publishing payload
2024-03-28 08:43:33,533 [salt.master ][TRACE ] Ignore tag salt/auth
2024-03-28 08:43:33,581 [salt.master ][TRACE ] AES payload received with command _pillar
2024-03-28 08:43:33,581 [salt.pillar ][DEBUG ] Determining pillar cache
2024-03-28 08:43:33,600 [salt.loader.lazy ][DEBUG ] The functions from module 'roots' are being loaded by dir() on the loaded module
2024-03-28 08:43:33,600 [salt.utils.lazy ][DEBUG ] LazyLoaded roots.envs
2024-03-28 08:43:33,602 [salt.loader.lazy ][DEBUG ] The functions from module 's3fs' are being loaded by dir() on the loaded module
2024-03-28 08:43:33,603 [salt.utils.lazy ][DEBUG ] Could not LazyLoad roots.init: 'roots.init' is not available.
2024-03-28 08:43:33,623 [salt.loader.lazy ][DEBUG ] The functions from module 'jinja' are being loaded by dir() on the loaded module
2024-03-28 08:43:33,623 [salt.utils.lazy ][DEBUG ] LazyLoaded jinja.render
2024-03-28 08:43:33,624 [salt.loader.lazy ][DEBUG ] The functions from module 'yaml' are being loaded by dir() on the loaded module
2024-03-28 08:43:33,624 [salt.utils.lazy ][DEBUG ] LazyLoaded yaml.render
2024-03-28 08:43:33,631 [salt.master ][TRACE ] Master function call _pillar took 0.05068349838256836 seconds
2024-03-28 08:43:33,632 [salt.transport.tcp][TRACE ] TCP PubServer sending payload: topic_list=None 'minion/refresh/salt01\n\n��Minion data cache refresh�salt01�_stamp�2024-03-28T13:43:33.630948'
2024-03-28 08:43:33,632 [salt.transport.tcp][TRACE ] TCP PubServer finished publishing payload
2024-03-28 08:43:33,632 [salt.master ][TRACE ] Ignore tag minion/refresh/salt01
2024-03-28 08:43:33,785 [salt.master ][TRACE ] AES payload received with command _return
2024-03-28 08:43:33,787 [salt.loaded.int.returner.local_cache][DEBUG ] Adding minions for job 20240328134333786284: ['salt01']
2024-03-28 08:43:33,788 [salt.utils.job ][INFO ] Got return from salt01 for job 20240328134333786284
2024-03-28 08:43:33,789 [salt.transport.tcp][TRACE ] TCP PubServer sending payload: topic_list=None 'salt/job/20240328134333786284/ret/salt01\n\n��cmd�_return�id�salt01�jid�20240328134333786284�returnçretcode\x00�fun�test.ping�fun_args��arg��tgt_type�glob�tgt�salt01�_stamp�2024-03-28T13:43:33.788313'
2024-03-28 08:43:33,789 [salt.loaded.int.returner.local_cache][DEBUG ] Reading minion list from /var/cache/salt/master/jobs/1b/c46d64efe0c85f68fb92d7c9137b70db8237f991f39268f4018c423a3aa009/.minions.p
2024-03-28 08:43:33,789 [salt.transport.tcp][TRACE ] TCP PubServer finished publishing payload
2024-03-28 08:43:33,789 [salt.master ][TRACE ] Ignore tag salt/job/20240328134333786284/ret/salt01
2024-03-28 08:43:33,789 [salt.master ][TRACE ] Master function call _return took 0.003786802291870117 seconds
2024-03-28 08:43:34,247 [salt.utils.process][TRACE ] Process manager iteration
```
We tested on Windows minions and Linux minions. The Windows minions seem to be working better, but both seem to be getting the following error(s) while applying a state.test.
Linux Minion (OEL 8)
2024-03-28 13:32:25,967 [salt.utils.extmods:155 ][ERROR ][1375064] Failed to sync modules module: File client timed out after 60 seconds
Windows (Windows Server 2016 Datacenter)
2024-03-28 12:50:34,269 [salt.loaded.int.beacons.adb:23 ][ERROR ][14288] Unable to load adb beacon: adb is missing.
2024-03-28 12:50:34,285 [salt.loaded.int.beacons.aix_account:23 ][ERROR ][14288] Unable to load aix_account beacon: Only available on AIX systems.
2024-03-28 12:50:34,332 [salt.loaded.int.beacons.avahi_announce:62 ][ERROR ][14288] Unable to load avahi_announce beacon: The 'python-avahi' dependency is missing.
2024-03-28 12:50:34,347 [salt.loaded.int.beacons.bonjour_announce:31 ][ERROR ][14288] Unable to load bonjour_announce beacon: pybonjour library is missing.
2024-03-28 12:50:34,347 [salt.loaded.int.beacons.btmp:134 ][ERROR ][14288] Unable to load btmp beacon: /var/log/btmp does not exist.
2024-03-28 12:50:34,519 [salt.loaded.int.beacons.glxinfo:23 ][ERROR ][14288] Unable to load glxinfo beacon: glxinfo is missing.
2024-03-28 12:50:37,862 [salt.loaded.int.beacons.haproxy:24 ][ERROR ][14288] Unable to load haproxy beacon: haproxy.get_sessions is missing.
2024-03-28 12:50:37,877 [salt.loaded.int.beacons.inotify:48 ][ERROR ][14288] Unable to load inotify beacon: pyinotify library is missing.
2024-03-28 12:50:37,904 [salt.loaded.int.beacons.journald:26 ][ERROR ][14288] Unable to load journald beacon: systemd library is missing.
2024-03-28 12:50:37,921 [salt.loaded.int.beacons.load:20 ][ERROR ][14288] Unable to load load beacon: Not available for Windows systems.
2024-03-28 12:50:37,931 [salt.loaded.int.beacons.log_beacon:37 ][ERROR ][14288] Unable to load log beacon: Not available for Windows systems or when regex library is missing.
2024-03-28 12:50:38,016 [salt.loaded.int.beacons.network_settings:86 ][ERROR ][14288] Unable to load network_settings beacon: pyroute2 library is missing
2024-03-28 12:50:38,079 [salt.loaded.int.beacons.sensehat:26 ][ERROR ][14288] Unable to load sensehat beacon: sensehat.get_pressure is missing.
2024-03-28 12:50:38,094 [salt.loaded.int.beacons.sh:24 ][ERROR ][14288] Unable to load sh beacon: strace is missing.
2024-03-28 12:50:38,141 [salt.loaded.int.beacons.smartos_imgadm:42 ][ERROR ][14288] Unable to load imgadm beacon: Only available on SmartOS compute nodes.
2024-03-28 12:50:38,173 [salt.loaded.int.beacons.smartos_vmadm:42 ][ERROR ][14288] Unable to load vmadm beacon: Only available on SmartOS compute nodes.
2024-03-28 12:50:38,204 [salt.loaded.int.beacons.telegram_bot_msg:30 ][ERROR ][14288] Unable to load telegram_bot_msg beacon: telegram library is missing.
2024-03-28 12:50:38,235 [salt.loaded.int.beacons.twilio_txt_msg:31 ][ERROR ][14288] Unable to load twilio_txt_msg beacon: twilio library is missing.
2024-03-28 12:50:38,235 [salt.loaded.int.beacons.watchdog:77 ][ERROR ][14288] Unable to load watchdog beacon: watchdog library is missing.
2024-03-28 12:50:38,251 [salt.loaded.int.beacons.wtmp:163 ][ERROR ][14288] Unable to load wtmp beacon: /var/log/wtmp does not exist.
2024-03-28 12:51:37,367 [salt.utils.extmods:155 ][ERROR ][19124] Failed to sync states module: File client timed out after 60 seconds
2024-03-28 12:53:55,975 [salt.utils.extmods:155 ][ERROR ][19124] Failed to sync output module: message authentication failed
2024-03-28 12:55:05,277 [salt.utils.extmods:155 ][ERROR ][19124] Failed to sync proxy module: File client timed out after 60 seconds
2024-03-28 12:56:14,469 [salt.utils.extmods:155 ][ERROR ][19124] Failed to sync serializers module: File client timed out after 60 seconds
The Linux minion keeps getting timeouts; I have not increased the log level on the master yet.
@KiyoIchikawa, thanks for the update. Your case above doesn't fit the issue I am facing here. When the salt-master needs a restart, it can't even issue a command to itself (sudo salt saltmaster01 test.ping), and yet saltmaster01$ sudo salt-call test.ping works.
Here is more info when salt commands can't be sent out.
- Port 4505 fails the simple ncat port check (a probe-loop sketch follows the output below).
[me@salt01 ~]$ date ;nc -zv salt01 4505;nc -zv salt01 4506
Fri Apr 5 11:09:23 CDT 2024
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: TIMEOUT.
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 100.65.184.160:4506.
Ncat: 0 bytes sent, 0 bytes received in 0.02 seconds.
[me@salt01 ~]$
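A minimal probe loop (not from the original report) that one could leave running to timestamp exactly when 4505 stops answering while 4506 still does; the hostname and log path below are placeholders:
```
#!/usr/bin/env bash
# Probe the publish (4505) and request (4506) ports once a minute and log the
# result, so the exact time 4505 stops answering can be pinned down.
# "salt01" and the log path are placeholders for this sketch.
HOST=salt01
LOG=/var/tmp/salt-port-probe.log

while true; do
    for port in 4505 4506; do
        if nc -z -w 5 "$HOST" "$port"; then
            status=open
        else
            status=TIMEOUT
        fi
        echo "$(date '+%F %T') port $port $status" >> "$LOG"
    done
    sleep 60
done
```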
- Result of netstat -antulp | egrep 4505 on salt01 (salt-master); a state-summary sketch follows the output.
# ip address masked
tcp 1001 0 0.0.0.0:4505 0.0.0.0:* LISTEN 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.4.43.33:40224 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.4.116.28:38806 ESTABLISHED 2443253/salt-master
<snipped>
tcp 11 0 1.6.1.0:4505 1.4.81.5:60514 CLOSE_WAIT -
tcp 11 0 1.6.1.0:4505 1.4.81.5:40492 CLOSE_WAIT -
tcp 0 0 1.6.1.0:4505 1.4.44.3:52474 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.5.176.67:43538 ESTABLISHED 2443253/salt-master
tcp 11 0 1.6.1.0:4505 1.4.81.5:49436 CLOSE_WAIT -
tcp 0 0 1.6.1.0:4505 1.4.226.19:36712 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.5.178.42:35596 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.4.11.76:48254 ESTABLISHED 2443253/salt-master
tcp 0 0 1.6.1.0:4505 1.4.29.8:40660 ESTABLISHED 2443253/salt-master
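Not part of the original netstat capture, but a small summary one could run on the master to watch the two symptoms visible above (the listener's non-zero Recv-Q and the accumulating CLOSE_WAIT sockets); it assumes the same netstat column layout as the output above:
```
# Count connection states on the 4505 publish port and show the listener's
# Recv-Q. A non-zero Recv-Q on the LISTEN socket plus piling-up CLOSE_WAIT
# entries match the symptoms in the netstat output above.
netstat -antp 2>/dev/null | awk '$4 ~ /:4505$/ {print $6}' | sort | uniq -c
netstat -antp 2>/dev/null | awk '$4 ~ /:4505$/ && $6 == "LISTEN" {print "listener Recv-Q:", $2}'
```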
Small comment: we're facing the exact same issue since upgrading from 3006.7. At some point the master stops working and cannot send anything to the minions.
Restarting it doesn't fix anything as it cannot bind to its port:
[salt.transport.tcp:1405][ERROR ][1609] Publish server binding pub to /var/run/salt/master/master_event_pub.ipc ssl=None
EDIT:
- OS: Ubuntu 22.04
- Machine type: baremetal
- Bootstrap used: initially yes, but upgraded from 3006.x to 3007
@compi-tom,
- I was in your current state, running on Rocky Linux 8, for the past few weeks. For now I would suggest restarting salt-master when the ZeroMQ pub port 4505 resources are depleted, or putting a cron job in place to restart it daily (a rough sketch follows after this list).
- I was on a VMware VM with the luxury of taking snapshots for rolling back.
- I tried many things and ended up with an instance that couldn't even send out a test.ping from the salt-master to itself after a fresh reboot.
- In the end, I had to clone another running 3007.0 master (a two-node multi-master setup) that is more stable and only has this issue occasionally.
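A rough sketch of the conditional-restart workaround mentioned above; the script path, the hourly schedule, and the use of systemd are assumptions, not from the original comment:
```
#!/usr/bin/env bash
# Sketch of the workaround: probe the publish port and restart salt-master
# only when 4505 has stopped answering. Intended to be dropped into an hourly
# cron job (e.g. /etc/cron.hourly/); the path and schedule are placeholders.
if ! nc -z -w 5 localhost 4505; then
    logger -t salt-pub-check "port 4505 not answering, restarting salt-master"
    systemctl restart salt-master
fi
```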
We're already running a 3-node multi-master setup. Restarting the master service doesn't solve the issue, as it cannot bind to the 4505 port, so a full reboot is required.
I'm considering downgrading to 3006.7
I think I tried downgrading to 3006.7 once and it didn't help this issue, so I decided to use 3007.0 again. If the downgrade-to-an-older-version approach is adopted, the minions need to be downgraded as well. The interesting thing is that the good 3007 salt-master (the one that got cloned) now has the same port 4505 malfunction; a salt-master stop/start brings port 4505 back up. There are many settings in the /etc/salt/master config file; it is time for me to study and understand the meaning of the ZeroMQ settings.
I am following this PR: https://github.com/saltstack/salt/pull/66335. I don't really know the details, but this issue looks like a publish-port issue that 66335 is trying to fix.
- The EventPublisher process uses 1.6% of 16 GB; this is the point at which port 4505 failed.
[me@salt01 srv]$ egrep '286129|MEM' salt-2024-04-24-06\:56.txt
PID STARTED TIME %MEM RSS SZ VSZ SIZE
286129 01:00:02 00:00:01 1.6 273528 185245 740980 273156
286129 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
[me@salt01 srv]$
- It is so bad that port 4505 doesn't even respond to a port probe with ncat.
[me@salt01 srv]$ nc -zv ilclsalt01 4505
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: TIMEOUT.
[me@salt01 srv]$
- This is my effort to automate tracking of the EventPublisher memory usage when it stops working, using a Makefile: i.e., putting "make memory" into an hourly cron job to record the memory usage of the salt-master sub-processes (a plain-shell equivalent follows the snippet below).
[me@salt01 srv]$ egrep 'DATEH|memory|pgrep' Makefile
DATEH = $(shell date +"%Y-%m-%d-%R")
memory:
ps -o pid,start,time,%mem,rss,sz,vsz,size --sort -rss -p $$(pgrep -f salt-master) | head > salt-$(DATEH).txt
pgrep -af salt-master |grep -v grep >> salt-$(DATEH).txt
[me@salt01 srv]$
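For reference, a plain-shell equivalent of the "make memory" target quoted above, as a sketch only; the /srv output directory and file naming simply mirror the Makefile snippet:
```
#!/usr/bin/env bash
# Snapshot per-process memory for every salt-master process into a
# timestamped file, like the "make memory" target above. Run hourly from cron.
out="/srv/salt-$(date +%Y-%m-%d-%R).txt"
ps -o pid,start,time,%mem,rss,sz,vsz,size --sort -rss -p "$(pgrep -d, -f salt-master)" | head > "$out"
pgrep -af salt-master >> "$out"
```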
- Sample of memory usage on a 16 GB machine; the EventPublisher stopped working when it reached 1.6% of memory.
[me@salt01 srv]$ egrep '1\.6|MEM' salt-2024-04-25-06\:35.txt
PID STARTED TIME %MEM RSS SZ VSZ SIZE
896442 01:00:02 00:00:01 1.6 263128 182305 729220 261396 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
[me@salt01 srv]$
- These are the memory footprints two hours after salt-master was restarted.
[me@salt01 srv]$ cat salt-2024-04-25-09\:06.txt
PID STARTED TIME %MEM RSS SZ VSZ SIZE
1083486 06:39:33 00:00:01 0.8 131836 277801 1111204 213052
1083479 06:39:33 00:00:02 0.7 118252 165078 660312 128200
1083489 06:39:33 00:00:17 0.5 82504 212527 850108 124024
1083480 06:39:33 00:00:01 0.5 81732 137755 551020 83196
1083501 06:39:33 00:00:19 0.5 81552 212266 849064 122900
1083490 06:39:33 00:00:17 0.5 81484 212239 848956 122908
1083503 06:39:33 00:00:19 0.5 81184 211934 847736 121652
1083505 06:39:33 00:00:18 0.5 81148 212192 848768 122640
1083492 06:39:33 00:00:18 0.5 80784 211915 847660 121340
1083287 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master MainProcess
1083479 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master PubServerChannel._publish_daemon
1083480 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventPublisher
1083483 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master EventMonitor
1083485 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer ReqServer_ProcessManager
1083486 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorkerQueue
1083488 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-0
1083489 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-1
1083490 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-2
1083492 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-3
1083501 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-4
1083503 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-5
1083504 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-6
1083505 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master ReqServer MWorker-7
1125202 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master FileserverUpdate
1125292 /opt/saltstack/salt/bin/python3.10 /usr/bin/salt-master Maintenance
[me@salt01 srv]$
My 2nd salt-master failed to test.ping itself, and restarting salt-master doesn't work anymore; I had to reboot the RL8 OS. This definitely smells like kernel network resource depletion.
- Output after the OS restart below.
[me@salt02 ~]$ sudo salt salt02 test.ping
salt02:
True
-------------------------------------------
Summary
-------------------------------------------
# of minions targeted: 1
# of minions returned: 1
# of minions that did not return: 0
# of minions with errors: 0
-------------------------------------------
[me@salt02 ~]$ uptime
06:11:19 up 4 min, 2 users, load average: 0.66, 0.29, 0.12
[me@salt02 ~]$
- Current ZeroMQ settings in /etc/salt/master; I need to read https://libzmq.readthedocs.io/en/latest/ (a drop-in sketch follows the output below).
[me@salt02 ~]$ sudo egrep "^pub_hwm|^ipc_write_buffer|^salt_event_pub_hwm|^event_publisher_pub_hwm" /etc/salt/master
pub_hwm: 10000
ipc_write_buffer: 'dynamic'
salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000
[me@salt02 ~]$
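Not from the original comment: one way to experiment with these high-water-mark settings is a drop-in file under /etc/salt/master.d (Salt's default include path), so the changes are easy to revert. The values below are simply the ones from the egrep output above, not recommendations, and this sketch assumes the same keys are first removed from /etc/salt/master itself:
```
#!/usr/bin/env bash
# Sketch only: keep ZeroMQ high-water-mark experiments in a drop-in file.
# The filename is a placeholder; the values are copied from the output above.
sudo tee /etc/salt/master.d/zmq-hwm.conf >/dev/null <<'EOF'
pub_hwm: 10000
ipc_write_buffer: 'dynamic'
salt_event_pub_hwm: 128000
event_publisher_pub_hwm: 64000
EOF
sudo systemctl restart salt-master
```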
After downgrading my salt-master servers from 3007 to 3006.8, the 4505 port is now stable and able to send out salt commands.
I've done the same, as this error was preventing our app from automatically adding VMs to cope with the load. Also, I didn't want to set up a daily restart to "bypass" the issue.
@compi-tom, thanks for the confirmation. Hopefully, when 3007 turns into LTS from STS, this issue can be acknowledged by the core team and resolved.
Any update on this issue? I had two servers that were working as multi-master nodes through multiple 3006 versions; I just upgraded to 3007.1 and I get this problem on only one of the servers.
[DEBUG ] PubClient connecting to <salt.transport.tcp.PublishClient object at 0x7f721b129130> '/local/data/saltwin/salt-master/3007.1/var/run/salt/master/master_event_pub.ipc'
[DEBUG ] PubClient conencted to <salt.transport.tcp.PublishClient object at 0x7f721b129130> '/local/data/saltwin/salt-master/3007.1/var/run/salt/master/master_event_pub.ipc'
[DEBUG ] Subscriber at connected
[DEBUG ] tcp stream to closed, unable to recv
We also had to downgrade our master to 3006.8 to resolve this issue. 3007.0 and 3007.1 did not work, and we had to restart salt-master multiple times every day.
No logs are written at log level WARNING. The master just refuses to connect to any minion (even itself). salt-call works without issues.
Hello,
Today I updated from the latest 3005 to 3007.1 and I got a lot of messages like these in the master log file:
```
2024-08-25 10:10:10,192 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv
2024-08-25 10:10:10,192 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected
2024-08-25 10:10:10,193 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv
2024-08-25 10:10:10,193 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected
2024-08-25 10:10:15,199 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv
2024-08-25 10:10:15,199 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected
2024-08-25 10:10:15,200 [salt.transport.tcp:1082][DEBUG ][81648] tcp stream to closed, unable to recv
2024-08-25 10:10:15,201 [salt.transport.tcp:1102][DEBUG ][81648] Subscriber at connected
```
It seems 3007.1 is not stable, and we cannot use it because communication with the minions does not work (we have about 2500 minions connected to the master). After rolling back to the latest 3006, everything works okay.
Just for information: the problem still exists.
Any fix around?
See also #66288 and potentially #66715. There are real issues on 3007+ with minion communications, and very little in terms of response to all of these.