salt icon indicating copy to clipboard operation
salt copied to clipboard

Sporadic KeyError: 'runner' in reactor

Open jbg-uwaterloo opened this issue 4 years ago • 7 comments

Description 3002 salt master on RHEL7. We have a reactor watching and alerting via email on failures. The reactor generally works and has generated 7 valid and 2 errors today. On Jan 01 it produced 3 valid and one error.

Python trace below

2022-01-04 08:05:29,651 [salt.utils.reactor:489 ][ERROR   ][2058] Reactor 'email-on-fail' failed to execute runner 'process_minion_data.email_errors'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 468, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 496, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 70, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

Setup Reactor config (email addresses sanitized)

email-on-fail:
  runner.process_minion_data.email_errors:
    - fromaddr: XXXXXXXXXXX
    - toaddrs: XXXXXXXXXXXXX
    - data_str: {{ data|yaml_dquote }}
    - smtp_server: 127.0.0.1

(Please provide relevant configs and/or SLS files (be sure to remove sensitive info. There is no general set-up of Salt.)

Please be as specific as possible and give set-up details. RHEL7 on-prem VM.

Steps to Reproduce the behavior (Include debug logs if possible and relevant)

Expected behavior No python errors reported in salt master log

Screenshots

Versions Report

Salt Version:
          Salt: 3002.7
 
Dependency Versions:
          cffi: Not Installed
      cherrypy: unknown
      dateutil: 2.4.2
     docker-py: Not Installed
         gitdb: 0.6.4
     gitpython: 1.0.1
        Jinja2: 2.11.1
       libgit2: Not Installed
      M2Crypto: 0.35.2
          Mako: Not Installed
       msgpack: 0.6.2
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: Not Installed
      pycrypto: 2.6.1
  pycryptodome: Not Installed
        pygit2: Not Installed
        Python: 3.6.8 (default, Aug 13 2020, 07:46:32)
  python-gnupg: Not Installed
        PyYAML: 3.13
         PyZMQ: 17.0.0
         smmap: 0.9.0
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.1.4
 
System Versions:
          dist: rhel 7.9 Maipo
        locale: UTF-8
       machine: x86_64
       release: 3.10.0-1160.49.1.el7.x86_64
        system: Linux
       version: Red Hat Enterprise Linux Server 7.9 Maipo

jbg-uwaterloo avatar Jan 04 '22 16:01 jbg-uwaterloo

Similar to https://github.com/saltstack/salt/issues/52961 (salt 3000)

jbg-uwaterloo avatar Jan 04 '22 16:01 jbg-uwaterloo

Looking at this, I'm almost positive that the issue, for some reason, is that the reaction_type isn't in self.client_cache - there's code in salt/utils/reactor that removes the item from the cache if it's expired. In my limited testing, I couldn't get it to exhibit this behavior, even with a ridiculously small TTL (2s, vs default 60s). It would delete the runner from the cache & retry.

It's entirely possible, though, that there's something else at play here with the interplay of more than one runner, the time it takes to actually run that runner, etc.

If someone has some spare cycles to try and track that down, that would be .... oh nope, just figured it out!

In salt/utils/reactor.py, it calls self.populate_client_cache(low), and then it gets the runner function, and calls it with l_fun(*args, **kwargs) (seen in the stack trace)

With my 2s timer, I added this:

            import time; time.sleep(5)
            ret = l_fun(*args, **kwargs)

100% of the time I see this error. So it looks like what's happening is between grabbing runner and actually making this call the TTL expires.

One potential fix would be to retrieve the runner from client_cache and pass that to the runner function instead. It looks like wheel would suffer the same problem. Another option would be to catch the exception and maybe repopulate the cache? I'm not currently familiar with why the cache/TTL is even happening, so there may be a better way to handle that. My gut says that if we hit salt/utils/reactor::ReactWrap.run and there's a live function at the start of the call it should be useful when it's finally called... but I'm not positive.

Anyway - thanks again for the report!

waynew avatar Jan 07 '22 17:01 waynew

We're seeing the same thing in v3002.2:

2022-10-11 13:25:08,469 [salt.template    :31  ][PROFILE ][1401253] Time (in seconds) to render '/srv/salt/reactor/sync_dns.sls' using 'yaml' renderer: 0.0008473396301269531
2022-10-11 13:25:08,471 [salt.utils.reactor:489 ][ERROR   ][1401253] Reactor 'sync_dns' failed to execute runner 'state.orchestrate'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 442, in run
    react_fun = self.client_cache[low["state"]].functions.get(
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 70, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

On average, i'd say 1% of the reactor calls runs into this race condition. In our case, that's mission critical. Given there's no follow up on this issue i can only assume it isn't fixed in v3005, though we will upgrade to a latest version soon.

fwiw, i've not been able to reproduce it anymore after enabling retry for the reactor runner:

sync_dns:
  runner.state.orchestrate:
    - retry:
        attempts: 3
        interval: 1
    - args:
      - mods: sync_dns

Looking at the reactor code i would say that's purely coincidental but we've not hit the exception anymore after a few thousand test runs.

gbunt avatar Oct 12 '22 10:10 gbunt

Started seeing this 2 days ago and now I have 900+ log messages, each throwing the following:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 439, in run
    ret = l_fun(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/salt/utils/reactor.py", line 467, in runner
    return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))
  File "/usr/lib/python3.6/site-packages/salt/utils/cache.py", line 64, in __getitem__
    return dict.__getitem__(self, key)
KeyError: 'runner'

I have several events configured to run orchestrate runners and this happens with all of them. v3004.2 here.

vkotarov avatar Feb 07 '23 17:02 vkotarov

seen also: 3004.2

[ERROR ] Reactor 'handle_stats' failed to execute runner 'salt.cmd' Traceback (most recent call last): File "/usr/lib/python3.10/site-packages/salt/utils/reactor.py", line 439, in run ret = l_fun(*args, **kwargs) File "/usr/lib/python3.10/site-packages/salt/utils/reactor.py", line 467, in runner return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs)) File "/usr/lib/python3.10/site-packages/salt/utils/cache.py", line 64, in getitem return dict.getitem(self, key) KeyError: 'runner'

happening all the time

i also tried the retry approach from above - no observable effect

gmarzot avatar Jul 11 '23 20:07 gmarzot

@garethgreenaway This appears similar to https://github.com/saltstack/salt/issues/52961. I have additional logs available if required to dig into this further.

doesitblend avatar Sep 29 '23 15:09 doesitblend

I was able to fix this using @waynew's suggestion above. Here's a patch:

--- utils/reactor.py	2024-09-30 09:55:06
+++ utils/reactor.py	2024-09-30 09:54:47
@@ -461,6 +461,10 @@
         """
         Wrap RunnerClient for executing :ref:`runner modules <all-salt.runners>`
         """
+        if "runner" not in self.client_cache:
+            log.debug("reactor edge case: re-populating client_cache")
+            low = {"state": "runner"}
+            self.populate_client_cache(low)
         return self.pool.fire_async(self.client_cache["runner"].low, args=(fun, kwargs))

     def wheel(self, fun, **kwargs):

This should be expanded to include other file_clients as well. I would submit a PR but I still don't understand how salt's testing stack works, but maybe somebody can run with this.

wasabi222 avatar Sep 30 '24 12:09 wasabi222

Reactor tests in test/unit/utils/test_reactor.py, need to convert to pytest and add mock tests to ensure repopulation if specific reactor file_client is missing in client_cache.

dmurphy18 avatar Feb 21 '25 22:02 dmurphy18

Closing since PR https://github.com/saltstack/salt/pull/67764 is merged.

dmurphy18 avatar Mar 07 '25 21:03 dmurphy18