[BUG] Master failed to authenticate message from minion, minion does not re-connect after master being offline

Open amalaguti opened this issue 1 year ago • 2 comments

Description The minion service does not reconnect to the master after the master has been offline for some time. The behavior is erratic: it does not always happen the same way, and sometimes the minion reconnects without issues.

Run locally on the minion with salt-call, commands like test.ping return results, but the same command sent from the master to the minion gets no response.

On the event bus, events for returns originating on the minion side are still received. The minion only reconnects properly after it is restarted.

The master log shows a "Failed to authenticate message" error. The auth request appears to be accepted, but the connection does not recover on its own, and the master keeps logging the same message.

[DEBUG   ] salt.crypt.sign_message: Signing message.
[DEBUG   ] Failed to authenticate message
[DEBUG   ] Minion failed to auth to master. Since the payload is encrypted, it is not known which minion failed to authenticate. It is likely that this is a transient failure due to the master rotating its public key.
[DEBUG   ] Failed to authenticate message
[DEBUG   ] Minion failed to auth to master. Since the payload is encrypted, it is not known which minion failed to authenticate. It is likely that this is a transient failure due to the master rotating its public key.
[DEBUG   ] Failed to authenticate message
[DEBUG   ] Minion failed to auth to master. Since the payload is encrypted, it is not known which minion failed to authenticate. It is likely that this is a transient failure due to the master rotating its public key.
[INFO    ] Authentication request from vesselsim-win-ems-1
[INFO    ] Authentication accepted from vesselsim-win-ems-1
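
To make the pattern in the log above easier to spot in a longer master log, here is a minimal Python sketch that counts "Failed to authenticate message" lines and lists the minions whose authentication was accepted. It is not part of Salt; the function name `summarize_master_log` and the regular expression are hypothetical, and the sketch assumes the `[LEVEL   ] message` line shape shown in this report (the long "Minion failed to auth to master" lines are left out of the sample for brevity).

```python
import re
from collections import Counter

# Sample lines copied from the master log in this report.
# The long "Minion failed to auth to master..." lines are omitted here.
MASTER_LOG = """\
[DEBUG   ] salt.crypt.sign_message: Signing message.
[DEBUG   ] Failed to authenticate message
[DEBUG   ] Failed to authenticate message
[DEBUG   ] Failed to authenticate message
[INFO    ] Authentication request from vesselsim-win-ems-1
[INFO    ] Authentication accepted from vesselsim-win-ems-1
"""

# Assumed line shape: "[LEVEL   ] message" (taken from the excerpt, not from Salt docs).
LINE_RE = re.compile(r"^\[(?P<level>[A-Z]+)\s*\]\s*(?P<msg>.*)$")


def summarize_master_log(log_text):
    """Return (failure_count, {minion_id: accepted_count}) for a log excerpt."""
    failures = 0
    accepted = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not look like log records
        msg = match.group("msg")
        if msg == "Failed to authenticate message":
            failures += 1
        elif msg.startswith("Authentication accepted from "):
            # The minion ID is the last whitespace-separated token.
            accepted[msg.rsplit(" ", 1)[-1]] += 1
    return failures, dict(accepted)


failures, accepted = summarize_master_log(MASTER_LOG)
print(f"auth failures: {failures}, accepted: {accepted}")
```

If failures keep accumulating after a minion shows up as accepted, that matches the behavior described in this issue.
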
PS C:\Users\adrian> salt-call status.ping_master 172.24.0.4
local:
    True

PS C:\Users\adrian> salt-call status.master master=172.24.0.4
local:
    True

PS C:\Users\adrian> salt-call status.ping_master 172.24.0.4
local:
    True

PS C:\Users\adrian> salt-call test.ping
local:
    True

[root@vesselsim ~]# salt vesselsim-win-ems-1 test.ping
vesselsim-win-ems-1:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240416170654648177
ERROR: Minions returned with non-zero exit code
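
The mismatch between the two sides can also be checked mechanically. The following Python sketch parses the plain-text output of test.ping as it appears above and reports which IDs answered. The helper `parse_ping_output` is hypothetical, not a Salt function, and it assumes the output shape shown in this report: an unindented `<id>:` line containing no whitespace, followed by an indented result line.

```python
import re

# Output copied from the master and minion sessions in this report.
MASTER_OUTPUT = """\
vesselsim-win-ems-1:
    Minion did not return. [No response]
    The minions may not have all finished running and any remaining minions will return upon completion. To look up the return data for this job later, run the following command:

    salt-run jobs.lookup_jid 20240416170654648177
ERROR: Minions returned with non-zero exit code
"""

MINION_OUTPUT = """\
local:
    True
"""

# Assumption: an ID header is an unindented token with no whitespace, ending in ":".
HEADER_RE = re.compile(r"^(\S+):$")


def parse_ping_output(text):
    """Map each ID to True if its first result line is 'True', else False."""
    results = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        header = HEADER_RE.match(line.rstrip())
        if header:
            current = header.group(1)
        elif current is not None and current not in results:
            # Only the first line after the header is treated as the result.
            results[current] = line.strip() == "True"
    return results


print(parse_ping_output(MINION_OUTPUT))
print(parse_ping_output(MASTER_OUTPUT))
```

Here the local salt-call succeeds while the master-side call reports no response, which is the symptom this issue describes.
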

Setup 3006.1, but I've seen this in multiple versions, both newer and older.

Please be as specific as possible and give set-up details.

  • [ ] on-prem machine
  • [ ] VM (Virtualbox, KVM, etc. please specify)
  • [ ] VM running on a cloud service, please be explicit and add details
  • [ ] container (Kubernetes, Docker, containerd, etc. please specify)
  • [ ] or a combination, please be explicit
  • [ ] jails if it is FreeBSD
  • [ ] classic packaging
  • [ ] onedir packaging
  • [ ] used bootstrap to install

amalaguti, Apr 16 '24 17:04

@amalaguti Are you able to test this against 3006.8?

dwoz, May 01 '24 22:05

@dwoz It seems a bit better in 3006.8; it feels like the minion reconnects more reliably than before.

But in the process of testing this I found the following issue: #66497

And this one, #66375, is still present too.

amalaguti, May 10 '24 03:05