
[BUG] intermittent connection between master and minion

Open • qianguih opened this issue 9 months ago • 38 comments

Description

I am seeing an intermittent connection issue in my Salt setup. There are ~30 minions registered with the master. After a while, the master can no longer reach a few of them. salt '*' test.ping fails for those minions with the following error message:

    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:
    
    salt-run jobs.lookup_jid 20230920213139507242
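When correlating this job with minion-side logs, it may help that a default-format Salt jid is simply a timestamp (year through microseconds, 20 digits). The sketch below decodes the jid from the message above. The helper name `jid_to_datetime` is made up for illustration, and the result is a naive datetime: it does not say which timezone the master used when it generated the jid.

```python
from datetime import datetime


def jid_to_datetime(jid: str) -> datetime:
    """Decode a default-format Salt jid (20 digits) into a naive datetime."""
    # Assumption: only the first 20 characters carry the timestamp.
    return datetime.strptime(jid[:20], "%Y%m%d%H%M%S%f")


print(jid_to_datetime("20230920213139507242"))  # 2023-09-20 21:31:39.507242
```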

Here are a few observations:

  • Restarting the salt-minion service helps, but the same minion loses its connection again after a while.
  • salt-call test.ping works fine on the minion side. Other commands like salt-call state.apply also work fine. This suggests that minion-to-master communication is fine but master-to-minion communication is not.
  • Below is the error message I found in the minion log. I tried raising the timeout, e.g. salt '*' -t 600 test.ping, but it doesn't help.

    2023-09-20 13:00:08,121 [salt.minion      :2733][ERROR   ][821760] Timeout encountered while sending {'cmd': '_return', 'id': 'minion', 'success': True, 'return': True, 'retcode': 0, 'jid': '20230920195941006337', 'fun': 'test.ping', 'fun_args': [], 'user': 'root', '_stamp': '2023-09-20T19:59:41.114944', 'nonce': '3b23a38761fc4e98a694448d36ac7f97'} request

Does anyone have any idea what's wrong here and how to debug this issue?
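To see which jobs an affected minion failed to return, the timeout entries in the minion log can be pulled apart programmatically. The following is only a sketch based on the single log line quoted above: the regular expression and the helper name `parse_timeout_line` are assumptions, not part of Salt, and payloads carrying more complex return data may not parse this way.

```python
import ast
import re
from datetime import datetime

# Matches entries shaped like the one quoted above:
# "<date> <time>,<ms> [...] Timeout encountered while sending {...} request"
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} .*"
    r"Timeout encountered while sending (?P<payload>\{.*\}) request"
)


def parse_timeout_line(line: str):
    """Return (log_time, payload_dict) for a timeout entry, else None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    log_time = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
    # The payload is printed as a Python dict literal, so literal_eval can
    # read it; this raises ValueError/SyntaxError if it is anything else.
    payload = ast.literal_eval(match.group("payload"))
    return log_time, payload


line = (
    "2023-09-20 13:00:08,121 [salt.minion      :2733][ERROR   ][821760] "
    "Timeout encountered while sending {'cmd': '_return', 'id': 'minion', "
    "'success': True, 'return': True, 'retcode': 0, "
    "'jid': '20230920195941006337', 'fun': 'test.ping', 'fun_args': [], "
    "'user': 'root', '_stamp': '2023-09-20T19:59:41.114944', "
    "'nonce': '3b23a38761fc4e98a694448d36ac7f97'} request"
)
log_time, payload = parse_timeout_line(line)
print(log_time, payload["jid"], payload["fun"], payload["_stamp"])
```

Note that the log timestamp (13:00:08) and the payload's `_stamp` (19:59:41) appear to be in different timezones, so any delay computed between them needs to account for that offset.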

Setup (Please provide relevant configs and/or SLS files; be sure to remove sensitive info. There is no general set-up of Salt.)

  • The minion was installed with sudo ./bootstrap-salt.sh -A <master-ip-address> -i $(hostname) stable 3006.3 (no custom config on the minion).
  • The master runs inside a container using the image saltstack/salt:3006.3. Master config:
nodegroups:
  prod-early-adopter: L@minion-hostname-1
  prod-general-population: L@minion-hostname-2
  release: L@minion-hostname-3
  custom: L@minion-hostname-4

file_roots:
  base:
    - <path/to/custom/state/file>
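With the default ZeroMQ transport, minions open the connections to the master themselves, by default on TCP ports 4505 (publish) and 4506 (return). Since the master here runs in a container, one thing that may be worth ruling out is whether both ports are reachable from an affected minion. The sketch below is a plain TCP check under those assumptions; the helper names are hypothetical, and a successful connect only shows that the network path is open, not that Salt itself is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_master_ports(host: str, ports=(4505, 4506), timeout: float = 3.0) -> dict:
    """Map each port to whether it accepted a TCP connection."""
    return {port: port_open(host, port, timeout) for port in ports}


# Example, run on an affected minion (replace the placeholder first):
# print(check_master_ports("<master-ip-address>"))
```

If the ports are non-default in a given setup (publish_port and ret_port in the master config), pass them explicitly via the `ports` argument.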

state file:

pull_state_job:
  schedule.present:
    - function: state.apply
    - maxrunning: 1
    - when: 8:00pm

deploy:
  cmd.run:
    - name: '<custom-command-here>'
    - runas: ubuntu

Please be as specific as possible and give set-up details.

  • [x] on-prem machine
  • [ ] VM (Virtualbox, KVM, etc. please specify)
  • [ ] VM running on a cloud service, please be explicit and add details
  • [x] container (Kubernetes, Docker, containerd, etc. please specify)
  • [ ] or a combination, please be explicit
  • [ ] jails if it is FreeBSD
  • [ ] classic packaging
  • [ ] onedir packaging
  • [x] used bootstrap to install

Steps to Reproduce the behavior (Include debug logs if possible and relevant)

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
Salt Version:
          Salt: 3006.3
 
Python Version:
        Python: 3.10.4 (main, Apr 20 2022, 01:21:48) [GCC 10.3.1 20210424]
 
Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.2
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.9.8
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: Not Installed
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4
 
System Versions:
          dist: alpine 3.14.6 
        locale: utf-8
       machine: x86_64
       release: 5.11.0-1022-aws
        system: Linux
       version: Alpine Linux 3.14.6 

Additional context Add any other context about the problem here.

qianguih, Sep 21 '23 22:09