salt icon indicating copy to clipboard operation
salt copied to clipboard

[BUG] syndic event forwarding broken with 3006.10

Open hurzhurz opened this issue 8 months ago • 12 comments

Description Since 3006.10, the forwarding of events from Syndic to the higher level Master doesn't work properly anymore. In such cases, it even raises an exception: TypeError: Minion._fire_master() got an unexpected keyword argument 'sync'

The cause seems to be this change: https://github.com/saltstack/salt/commit/f6aa6ff8073c7aaf1e8b0ffaef06c8e51f531223

Removing this line seems to fix the issue: https://github.com/saltstack/salt/blob/21e99f6e577a59d8e965e3471d81ea442817e940/salt/minion.py#L3871

Setup Ubuntu 24.04.2 LTS with Salt 3006.10 from official repository

  • Master-of-Masters (test-salt-mom): with salt-master and salt-minion
  • Syndic (test-salt-syndic): with salt-master, salt-syndic and salt-minion

Each minion is connected to its local master. And syndic of course to the Master-of-Masters. No special config in this test setup, just the bare minimum to connect them with another.

  • [x] VM (Virtualbox, KVM, etc. please specify)
  • [x] container (LXC)
  • [x] onedir packaging

Steps to Reproduce the behavior

Let syndic-minion fire event:

root@test-salt-mom:~# salt test-salt-syndic event.fire_master '{"data":"my event data"}' 'tag'

test-salt-syndic:
    True

Syndic-Log with exception:

root@test-salt-syndic:~# tail -f /var/log/salt/syndic

2025-04-02 09:11:29,219 [tornado.application:640 ][ERROR   ][24153] Exception in callback <bound method SyndicManager._forward_events of <salt.minion.SyndicManager object at 0x7ff971b9d3c0>>
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 1028, in _run
    return self.callback()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 3865, in _forward_events
    self._call_syndic(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 3693, in _call_syndic
    getattr(syndic_future.result(), func)(*args, **kwargs)
TypeError: Minion._fire_master() got an unexpected keyword argument 'sync'

Few events received on MoM, actually fired event is missing:

root@test-salt-mom:~# salt-run state.event pretty=True

20250402091128759007    {
    "_stamp": "2025-04-02T09:11:28.759404",
    "minions": [
        "test-salt-syndic"
    ]
}
salt/job/20250402091128759007/new       {
    "_stamp": "2025-04-02T09:11:28.759956",
    "arg": [
        {
            "data": "my event data"
        },
        "tag"
    ],
    "fun": "event.fire_master",
    "jid": "20250402091128759007",
    "minions": [
        "test-salt-syndic"
    ],
    "missing": [],
    "tgt": "test-salt-syndic",
    "tgt_type": "glob",
    "user": "root"
}
salt/job/20250402091128759007/ret/test-salt-syndic      {
    "_stamp": "2025-04-02T09:11:29.726911",
    "fun": "event.fire_master",
    "fun_args": [
        {
            "data": "my event data"
        },
        "tag"
    ],
    "id": "test-salt-syndic",
    "jid": "20250402091128759007",
    "retcode": 0,
    "return": true,
    "success": true
}

Expected behavior

With interims fix (mentioned line removed):

root@test-salt-mom:~# salt test-salt-syndic event.fire_master '{"data":"my event data"}' 'tag'

test-salt-syndic:
    True

All events received on MoM:

root@test-salt-mom:~# salt-run state.event pretty=True

20250402090836106050    {
    "_stamp": "2025-04-02T09:08:36.106334",
    "minions": [
        "test-salt-syndic"
    ]
}
salt/job/20250402090836106050/new       {
    "_stamp": "2025-04-02T09:08:36.107028",
    "arg": [],
    "fun": "test.ping",
    "jid": "20250402090836106050",
    "minions": [
        "test-salt-syndic"
    ],
    "missing": [],
    "tgt": "test-salt-syndic",
    "tgt_type": "glob",
    "user": "root"
}
20250402090836106050    {
    "_stamp": "2025-04-02T09:08:36.200337",
    "minions": [
        "test-salt-syndic"
    ]
}
syndic/test-salt-syndic/20250402090836106050    {
    "_stamp": "2025-04-02T09:08:36.201327",
    "minions": [
        "test-salt-syndic"
    ]
}
salt/job/20250402090836106050/new       {
    "_stamp": "2025-04-02T09:08:36.201811",
    "arg": [],
    "fun": "test.ping",
    "jid": "20250402090836106050",
    "minions": [
        "test-salt-syndic"
    ],
    "missing": [],
    "tgt": "test-salt-syndic",
    "tgt_type": "glob",
    "user": "root"
}
syndic/test-salt-syndic/salt/job/20250402090836106050/new       {
    "_stamp": "2025-04-02T09:08:36.202655",
    "arg": [],
    "fun": "test.ping",
    "jid": "20250402090836106050",
    "minions": [
        "test-salt-syndic"
    ],
    "missing": [],
    "tgt": "test-salt-syndic",
    "tgt_type": "glob",
    "user": "root"
}
salt/job/20250402090836106050/ret/test-salt-syndic      {
    "_stamp": "2025-04-02T09:08:36.207895",
    "fun": "test.ping",
    "fun_args": [],
    "id": "test-salt-syndic",
    "jid": "20250402090836106050",
    "retcode": 0,
    "return": true,
    "success": true
}

Versions Report

salt --versions-report
Salt Version:
          Salt: 3006.10

Python Version:
        Python: 3.10.16 (main, Mar  6 2025, 02:23:15) [GCC 11.2.0]

Dependency Versions:
          cffi: 1.14.6
      cherrypy: unknown
  cryptography: 42.0.5
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 3.1.6
       libgit2: Not Installed
  looseversion: 1.0.2
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     packaging: 22.0
     pycparser: 2.21
      pycrypto: Not Installed
  pycryptodome: 3.19.1
        pygit2: Not Installed
  python-gnupg: 0.4.8
        PyYAML: 6.0.1
         PyZMQ: 23.2.0
        relenv: 0.18.1
         smmap: Not Installed
       timelib: 0.2.4
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: ubuntu 24.04.2 noble
        locale: utf-8
       machine: x86_64
       release: 5.15.167.4-microsoft-standard-WSL2
        system: Linux
       version: Ubuntu 24.04.2 noble

hurzhurz avatar Apr 02 '25 09:04 hurzhurz

I had to downgrade to 3006.9 to be able to use Salt again. Removing "sync": True didn't seem to work. test.ping was running without any problem. However state.apply always failed with 3006.10.

tkaehn avatar Apr 04 '25 13:04 tkaehn

@tkaehn I don't think your problem is related to this event-forwarding-issue. Even without removing that line, a state.apply should work and show a return, too. Just some events about the execution would not be forwarded from a syndic to a upstream master.

I think you probably need to open a new issue for your specific problem, with logs etc. ...

hurzhurz avatar Apr 07 '25 07:04 hurzhurz

Thanks for your suggestion. I re-tried by upgrading to 3006.10 and I could see the same error messages in the syndic log:

[...]
TypeError: Minion._fire_master() got an unexpected keyword argument 'sync'

All directly connected minions returned from a state.apply. In contrast to this there was no return from the minions connected to the syndic server. Running state.apply directly on the syndic showed that the minions were connected.

Strangely test.ping and pillar.get are showing returns for all minions.

After removing "sync": True the issue seems to be solved. Maybe I was a little bit too impatient last week.

However, I still receive the following log messages (probably unrelated as it happens on the syndic, too):

2025-04-07 11:06:18,036 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28370] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:18,811 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28365] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:18,814 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28392] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:18,869 [salt.master      :1180][ERROR   ][28351] Requested method not exposed: _ext_nodes
2025-04-07 11:06:18,926 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28349] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:19,068 [salt.master      :1180][ERROR   ][28366] Requested method not exposed: _ext_nodes
2025-04-07 11:06:20,529 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28394] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:20,537 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28391] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:20,545 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28363] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:20,551 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28394] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:20,556 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28364] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:20,595 [salt.master      :1180][ERROR   ][28364] Requested method not exposed: _ext_nodes
2025-04-07 11:06:20,649 [salt.master      :1180][ERROR   ][28351] Requested method not exposed: _ext_nodes
2025-04-07 11:06:27,197 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28371] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,199 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28349] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,200 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28367] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,202 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28350] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,207 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28347] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,212 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28368] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,214 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28370] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,218 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28371] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,221 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28369] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,227 [salt.loaded.int.returner.local_cache:148 ][ERROR   ][28373] An extra return was detected from minion minion####, please verify the minion, this could be a replay attack
2025-04-07 11:06:27,251 [salt.master      :1180][ERROR   ][28349] Requested method not exposed: _ext_nodes
2025-04-07 11:06:27,259 [salt.master      :1180][ERROR   ][28352] Requested method not exposed: _ext_nodes

Those messages appear since a couple of versions and only if a lot of minions are targeted. I've checked that there are no duplicate keys. So the messages are false positives.

tkaehn avatar Apr 07 '25 09:04 tkaehn

Hmm... maybe your problems came from a bad connection state of the syndic: I got the impression that sometimes the syndic doesn't properly reconnect if e.g. the (mom-)master was restarted. It then can look like the connection is unidirectional: the syndic receives requests/commands, but the master doesn't receive returns. Sometimes maybe also vice versa.

So perhaps while removing "sync": True is the fix for TypeError: Minion._fire_master() got an unexpected keyword argument 'sync', what actually helped you was restarting the syndic-service afterwards?

And about the other log messages:

Requested method not exposed: _ext_nodes: I did a quick look at the code and git history for _ext_nodes. Looks like this is a very old deprecated function of the master. Do you have any minions with very old salt versions (older than 3003)?

An extra return was detected from minion minion####, please verify the minion, this could be a replay attack: I see that message too from time to time. I think there could be multiple causes. But probably a minion or syndic repeats the returns because it doesn't get the response that it was received within a certain time. Maybe it is a connection issue like mentioned above. Or maybe the master is overloaded and can't process the returns in time (not enough worker threads or bad IO performance). But that are just guesses.

hurzhurz avatar Apr 07 '25 10:04 hurzhurz

Thanks for you further ideas. Reverting to the original file and restarting the syndic process leads to the same problem as before. state.apply only returns directly connected minions and e.g. test.ping returns for all minions. Even waiting a couple of minutes doesn't help.

The exception looks like this:

2025-04-07 14:02:53,407 [tornado.application:640 ][ERROR   ][50622] Exception in callback <bound method SyndicManager._forward_events of <salt.minion.SyndicManager object at 0x7fca509ec850>>
Traceback (most recent call last):
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/ext/tornado/ioloop.py", line 1028, in _run
    return self.callback()
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 3865, in _forward_events
    self._call_syndic(
  File "/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py", line 3693, in _call_syndic
    getattr(syndic_future.result(), func)(*args, **kwargs)
TypeError: Minion._fire_master() got an unexpected keyword argument 'sync'

Generally my impression is that it is best to restart the processes in order master, syndic, minion.

We have indeed older minions which could explain the _ext_nodes message.

Regarding the duplicate returns I tried to increase return_retry_timer and return_retry_timer_max and also tried to run batches of 50 instead of 100. Both seemed to help a little bit and reduce the amount of messages. I guess I need to figure out a way to increase the performance of the master. However, it's already using 24 worker processes.

tkaehn avatar Apr 07 '25 13:04 tkaehn

Can confirm this issue. Commands from master-of-master would also take unusually long to complete before commenting out the line in question. Output would be returned quickly, but it seemed the salt command was waiting for something more from the syndic, before eventually timing out.

sverre avatar Apr 25 '25 15:04 sverre

@dwoz can you have a look at this issue? As the mentioned change is from you, I guess you can tell if removing the one line is acceptable and sufficient as solution.

And I would consider making a PR myself, but I'm not sure how to write an adequate test for this issue (and how to run it to try it out).

hurzhurz avatar Apr 28 '25 11:04 hurzhurz

I also can confirm this issue. Running orchestration to minion connected through the syndic resulted in lots of random errors and was no usable. Commenting out '"sync": True,' and restarting the salt-syndic and salt-minion on the syndic fixed the issue.

PeterLannigan avatar May 14 '25 00:05 PeterLannigan

Unfortunately this hasn't been fixed with 3006.11.

tkaehn avatar Jun 05 '25 11:06 tkaehn

We are having this problem as well. Frustrating how this is not getting fixed even with a PR out there to do it.

xamindar avatar Jun 16 '25 18:06 xamindar

Unfortunately this hasn't been fixed with 3006.12 either.

xamindar avatar Jun 16 '25 19:06 xamindar

Unfortunately this hasn't been fixed with 3006.12 either.

3006.12 was a CVE-fix only release. I hope my PR will be included in the next regular release (maybe 3006.13).

But if you want to automate the needed change as a workaround, you could do something like this:

/opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py:
  file.comment:
    - regex: '^\s.*"sync": True,  # Sync needs to be true unless being called from a coroutine.*'

restart-syndic:
  service.running:
    - name: salt-syndic
    - watch:
      - file: /opt/saltstack/salt/lib/python3.10/site-packages/salt/minion.py

hurzhurz avatar Jun 16 '25 19:06 hurzhurz

It seems to be even worse. With 3006.12 the comunication between syndic and master stops completely after some hours. There are no helpful messages in the logs. Restarting the syndic resolves the problem. However afterwards there is still a unusually long delay at the end when running jobs. Restarting the master of masters solves the problem completely.

Directly connected minions have no problem reaching the masters.

Log message of the syndic process:

2025-06-17 08:34:43,205 [salt.minion      :3726][ERROR   ][2937636] Unable to call _return_pub_multi on master-of-masters, trying another...
2025-06-17 08:34:43,798 [salt.minion      :2460][WARNING ][2937636] The minion failed to return the job information for job 20250617040213611080. This is often due to the master being shut down or overloaded. If the master is running, consider increasing the worker_threads value.

Apart from upgrading from 3006.11 to 3006.12 there were no changes on the masters.

Really annoying that almost every new version introduces new issues.

tkaehn avatar Jun 17 '25 07:06 tkaehn

fixed in 3006.13 / 3007.5

hurzhurz avatar Jun 30 '25 14:06 hurzhurz