
[BUG] v3004 leaks fds/pipes causing `Too many open files` crash

Open frebib opened this issue 3 years ago • 16 comments

Description: Several nodes running salt-minion 3004 all hit this crash before the highstate completes. We've seen the crash both in cmd.run and in zmq code.

coredumpctl seems to think it's ZeroMQ holding the sockets open. There are a lot of stacks like this one in the coredump backtrace:

Stack trace of thread 59913:
#0  0x00007f715250c116 epoll_wait (libc.so.6 + 0xfe116)
#1  0x00007f7151619fe1 n/a (libzmq.so.5 + 0x2dfe1)
#2  0x00007f715165f1cb n/a (libzmq.so.5 + 0x731cb)
#3  0x00007f7152778ea7 start_thread (libpthread.so.0 + 0x8ea7)
#4  0x00007f715250bdef __clone (libc.so.6 + 0xfddef)
[ERROR   ] An exception occurred in this state: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/state.py", line 2144, in call
    ret.update(self._run_check(low))
  File "/usr/lib/python3/dist-packages/salt/state.py", line 920, in _run_check
    _ret = self._run_check_unless(low_data, cmd_opts)
  File "/usr/lib/python3/dist-packages/salt/state.py", line 1053, in _run_check_unless
    cmd = self.functions["cmd.retcode"](
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 149, in __call__
    return self.loader.run(run_func, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1201, in run
    return self._last_context.run(self._run_as, _func_or_method, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/loader/lazy.py", line 1216, in _run_as
    return _func_or_method(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/salt/modules/cmdmod.py", line 2507, in retcode
    ret = _run(
  File "/usr/lib/python3/dist-packages/salt/modules/cmdmod.py", line 525, in _run
    env_bytes, env_encoded_err = subprocess.Popen(
  File "/usr/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.9/subprocess.py", line 1722, in _execute_child
    errpipe_read, errpipe_write = os.pipe()
OSError: [Errno 24] Too many open files
...
Jan 26 14:14:49 131m16 salt-highstate[7759]: Too many open files (src/epoll.cpp:65)
Jan 26 14:15:05 131m16 systemd-coredump[109479]: Process 7759 (salt-call) of user 0 dumped core.
jgroocock@131m16:~$ sudo ls -al /proc/114133/fd
total 0
dr-x------ 2 root root  0 Jan 26 13:41 .
dr-xr-xr-x 9 root root  0 Jan 26 13:41 ..
lr-x------ 1 root root 64 Jan 26 13:43 0 -> /dev/null
lrwx------ 1 root root 64 Jan 26 13:43 1 -> 'socket:[3220729]'
lrwx------ 1 root root 64 Jan 26 13:43 10 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 100 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 101 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 102 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 103 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 104 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 105 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 106 -> 'socket:[3881074]'
lrwx------ 1 root root 64 Jan 26 13:56 107 -> 'anon_inode:[eventpoll]'
lr-x------ 1 root root 64 Jan 26 13:56 108 -> 'pipe:[3831934]'
l-wx------ 1 root root 64 Jan 26 13:56 109 -> 'pipe:[3831934]'
lrwx------ 1 root root 64 Jan 26 13:43 11 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 110 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 111 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 112 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 113 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 114 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 115 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 116 -> 'socket:[3830398]'
lrwx------ 1 root root 64 Jan 26 13:56 117 -> 'anon_inode:[eventpoll]'
lr-x------ 1 root root 64 Jan 26 13:56 118 -> 'pipe:[3854314]'
l-wx------ 1 root root 64 Jan 26 13:56 119 -> 'pipe:[3854314]'
lrwx------ 1 root root 64 Jan 26 13:43 12 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 120 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 121 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 122 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 123 -> 'anon_inode:[eventfd]'
lrwx------ 1 root root 64 Jan 26 13:56 124 -> 'anon_inode:[eventpoll]'
lrwx------ 1 root root 64 Jan 26 13:56 125 -> 'anon_inode:[eventfd]'

This was observed doing a full highstate on an unconfigured machine with several thousand states.

jgroocock@131m16:~$ systemctl show -P LimitNOFILE salt-highstate.service
524288
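For what it's worth, a rough tally of what those descriptors point at can be produced with something like the sketch below (an assumption-laden helper, not part of Salt; it expects the same /proc/<pid>/fd layout as the listing above and needs root when inspecting another user's process):

import os
import re
import sys
from collections import Counter

# Group every fd of the target process by the kind of object it points at,
# collapsing inode numbers so "socket:[3220729]" and "socket:[3881074]" count
# as the same kind. Pass the salt-call/minion pid as the first argument.
pid = sys.argv[1]
kinds = Counter()
for fd in os.listdir(f"/proc/{pid}/fd"):
    try:
        target = os.readlink(f"/proc/{pid}/fd/{fd}")
    except OSError:
        continue  # fd was closed between listdir() and readlink()
    kinds[re.sub(r"\[\d+\]", "[...]", target)] += 1

for kind, count in kinds.most_common():
    print(f"{count:5d}  {kind}")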

I've also noticed that v3004 has started dumping a handful of [ERROR ] Unable to connect pusher: Stream is closed errors into the highstate, which didn't happen in 2019.2.8. I wonder if this could be related?

[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] salt.crypt.get_rsa_pub_key: Loading public key
[DEBUG   ] Closing AsyncReqChannel instance
[DEBUG   ] Connecting the Minion to the Master publish port, using the URI: tcp://[<snip>]:4505
[DEBUG   ] <salt.transport.zeromq.PublishClient object at 0x7fda638a6f70> connecting to tcp://[<snip>]:4505
[TRACE   ] IPCClient: Connecting to socket: /var/run/salt/minion/minion_event_f9c4bf71a1_pull.ipc
[TRACE   ] IPCClient: Connecting to socket: /var/run/salt/minion/minion_event_f9c4bf71a1_pull.ipc
[ERROR   ] Unable to connect pusher: Stream is closed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/salt/utils/event.py", line 421, in connect_pull
    self.pusher.connect(timeout=timeout)
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/usr/lib/python3/dist-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/usr/lib/python3/dist-packages/salt/transport/ipc.py", line 342, in _connect
    yield self.stream.connect(sock_addr)
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/usr/lib/python3/dist-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.ext.tornado.iostream.StreamClosedError: Stream is closed
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key

Possibly points at https://github.com/saltstack/salt/commit/4cf62fbc955df7ea90d8267018ff62296e729253 ?

Setup (Please provide relevant configs and/or SLS files (be sure to remove sensitive info). There is no general set-up of Salt.)

Please be as specific as possible and give set-up details.

  • [x] on-prem machine
  • [ ] VM (Virtualbox, KVM, etc. please specify)
  • [ ] VM running on a cloud service, please be explicit and add details
  • [ ] container (Kubernetes, Docker, containerd, etc. please specify)
  • [ ] or a combination, please be explicit
  • [ ] jails if it is FreeBSD

Steps to Reproduce the behavior: salt-call state.highstate seems to trigger it. It's unclear so far whether the persistent minion process is affected.

??? Unclear as of yet.

Expected behavior: No leaked fds/pipes.

Additional context: We came from 2019.2.8 running on Python 3, which didn't exhibit this behaviour (as far as I know). I can also reproduce this on the tip of the master branch.

jgroocock@133m11:~$ salt-call -V | grep -v Not\ Installed
Salt Version:
          Salt: 3004

Dependency Versions:
          cffi: 1.14.5
      dateutil: 2.8.1
        Jinja2: 2.11.3
      M2Crypto: 0.38.0
       msgpack: 1.0.0
     pycparser: 2.20
  pycryptodome: 3.9.7
        Python: 3.9.2 (default, Feb 28 2021, 17:03:44)
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
       Tornado: 4.5.3
           ZMQ: 4.3.4

System Versions:
          dist: debian 11 bullseye
        locale: utf-8
       machine: x86_64
       release: 5.10.76-cloudflare-2021.10.8
        system: Linux
       version: Debian GNU/Linux 11 bullseye

frebib avatar Jan 26 '22 14:01 frebib

Looking at the coredump with zmq symbols from Debian's debuginfod, there are a lot of threads hung in epoll_wait, implying that something opened the pipes but never closed them.

Thread 204 (Thread 0x7f6f0cfba700 (LWP 110788)):
#0  0x00007f715250c116 in epoll_wait (epfd=974, events=events@entry=0x7f6f0cfb9200, maxevents=maxevents@entry=256, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f7151619fe1 in zmq::epoll_t::loop (this=0x110c7ed0) at src/epoll.cpp:184
#2  0x00007f715165f1cb in thread_routine (arg_=0x110c7f28) at src/thread.cpp:257
#3  0x00007f7152778ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f715250bdef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 203 (Thread 0x7f6fa7fff700 (LWP 90832)):
#0  0x00007f715250c116 in epoll_wait (epfd=682, events=events@entry=0x7f6fa7ffe200, maxevents=maxevents@entry=256, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f7151619fe1 in zmq::epoll_t::loop (this=0xb608560) at src/epoll.cpp:184
#2  0x00007f715165f1cb in thread_routine (arg_=0xb6085b8) at src/thread.cpp:257
#3  0x00007f7152778ea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
#4  0x00007f715250bdef in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
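For a quicker check without attaching a debugger, something like the following sketch counts how many live threads are parked in the kernel's epoll wait path (a heuristic only; the wchan symbol names are kernel-dependent, and reading another user's process requires root):

import os
import sys
from collections import Counter

# Count threads of a process by the kernel function they are sleeping in,
# read from /proc/<pid>/task/<tid>/wchan. Threads blocked in epoll_wait
# typically show up as "ep_poll" (an assumption; names vary by kernel).
pid = sys.argv[1]
wchans = Counter()
for tid in os.listdir(f"/proc/{pid}/task"):
    try:
        with open(f"/proc/{pid}/task/{tid}/wchan") as f:
            wchans[f.read().strip() or "(running)"] += 1
    except OSError:
        continue  # thread exited while iterating

for name, count in wchans.most_common():
    print(f"{count:4d} threads in {name}")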

frebib avatar Jan 26 '22 15:01 frebib

I reverted the following commits as suggested by @dwoz in https://github.com/saltstack/salt/issues/61565#issuecomment-1028379988 but the leak remains

ab67042c6518da2826e53dc60fe7ec0fdeb7e843 0f9c0316cc265cfe2a3b0644133e0e99ff8fd0a7 d5a16a1be56118ee570e328f9eb21b1a05e56be2 7af10f24d5d109eed310ddde8b93c091a14b6991

frebib avatar Feb 03 '22 09:02 frebib

~By some miracle #61468 seems to allow the salt-call highstate to stay at a steady 45 fds during the highstate. I'm not quite sure how this fixes it when reverting the singleton patches doesn't but 🤷🏻~ I'll have a go at maybe backporting it into v3004

frebib avatar Feb 03 '22 17:02 frebib

By some miracle #61468 seems to allow the salt-call highstate to stay at a steady 45 fds during the highstate. I'm not quite sure how this fixes it when reverting the singleton patches doesn't but 🤷🏻 I'll have a go at maybe backporting it into v3004

It makes some sense because the singletons were based on the IOLoop being passed into the transport. Often, we won't pass a loop into the init method, meaning we don't end up with a singleton. This was one of the factors that led us away from the singleton approach. In the future we're aiming to properly manage our network connections without the need for any kind of magic (e.g. singletons).
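To illustrate the cache-miss pattern being described (a toy sketch only, not Salt's actual transport code):

import itertools

# Instances cached per io_loop. Callers that don't pass a loop in get a fresh
# loop each time, miss the cache, and open a brand-new set of "sockets".
_instances = {}
_loop_ids = itertools.count()

class FakeIOLoop:
    def __init__(self):
        self.id = next(_loop_ids)

class FakeTransport:
    def __init__(self):
        self.sockets_opened = 1   # stand-in for real zmq sockets / fds

def get_transport(io_loop=None):
    if io_loop is None:
        io_loop = FakeIOLoop()    # fresh loop => fresh cache key => no reuse
    if io_loop not in _instances:
        _instances[io_loop] = FakeTransport()
    return _instances[io_loop]

loop = FakeIOLoop()
a = get_transport(loop)
b = get_transport(loop)
c = get_transport()
d = get_transport()
print(a is b)           # True: same loop passed, singleton works
print(c is d)           # False: no loop passed, two transports created
print(len(_instances))  # 3 cached instances, each holding its own sockets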

dwoz avatar Feb 05 '22 21:02 dwoz

One thing I don't understand is why this affects salt-call but not the persistent minion process. I see the spaghetti of initialisation code does a lot of stuff I don't quite understand, but beyond that the same Minion and SMinion classes are used internally. Is this something to do with the ioloop again? Is the salt-call minion creating a new ioloop for every request?

frebib avatar Feb 21 '22 12:02 frebib

I backported both #61450 and #61468 onto v3004 and the leak still persists. I don't even know how to go about debugging this any more. To be clear, this only happens with salt-call state.highstate, and seemingly only in the most awkward case to reproduce: on a blank/freshly booted node. I struggle to reproduce any kind of leak during a highstate on an already-highstated node. Our nodes are stateless, so they come up entirely blank and Salt installs/configures everything on first boot.

frebib avatar Feb 21 '22 16:02 frebib

Just a heads-up, this may not be a v3004 issue in particular. We saw the same or very similar Too many open files failures immediately after upgrading from v2019.2.8 to v3003.3, which also took us from Python 2 to 3.

While we were able to resolve this by simply upping open file limits compared to where they were on 2019.2.8, the only box it has occurred on is a heavily active master-of-masters that is also a minion of itself. Like frebib, we hit this very quickly using salt-call while bootstrapping a new, fresh box, pointed at itself as the master, with everything identical except the upgraded OS/Python/Salt versions.

We did see this crop up once more when processing many events, and doubled the limits again. Unfortunately, I don't have stack traces saved from that time, but I can tell you that they looked just like the ones above, including:

salt.ext.tornado.iostream.StreamClosedError: Stream is closed
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key

vosscodes avatar Mar 30 '22 15:03 vosscodes

I feel like I'm getting close to this one. Here are some of my findings:

  • It affects the persistent minion process and not just salt-call
  • It seems to be triggered by the installation of a package:
2022-09-01 01:01:03,024 [salt.state       :319 ][INFO    ][584383][JID: 20220901005829426597] Made the following changes:
'dbus' changed from 'absent' to '1.12.20-2'

2022-09-01 01:01:03,024 [salt.state       :1230][DEBUG   ][584383][JID: 20220901005829426597] Refreshing modules...
2022-09-01 01:01:03,026 [salt.state       :1194][INFO    ][584383][JID: 20220901005829426597] Loading fresh modules for state activity
2022-09-01 01:01:03,365 [salt.utils.schedule:32  ][TRACE   ][566777] ==== evaluating schedule now None =====
2022-09-01 01:01:04,076 [salt.loader.lazy :960 ][DEBUG   ][584383][JID: 20220901005829426597] The functions from module 'jinja' are being loaded by dir() on the loaded module
2022-09-01 01:01:04,076 [salt.utils.lazy  :99  ][DEBUG   ][584383][JID: 20220901005829426597] LazyLoaded jinja.render
2022-09-01 01:01:04,077 [salt.loader.lazy :960 ][DEBUG   ][584383][JID: 20220901005829426597] The functions from module 'yaml' are being loaded by dir() on the loaded module
2022-09-01 01:01:04,077 [salt.utils.lazy  :99  ][DEBUG   ][584383][JID: 20220901005829426597] LazyLoaded yaml.render
2022-09-01 01:01:04,079 [salt.loader.lazy :960 ][DEBUG   ][584383][JID: 20220901005829426597] The functions from module 'saltutil' are being loaded by dir() on the loaded module
2022-09-01 01:01:04,080 [salt.utils.lazy  :99  ][DEBUG   ][584383][JID: 20220901005829426597] LazyLoaded saltutil.refresh_modules
2022-09-01 01:01:04,080 [salt.loader.lazy :960 ][DEBUG   ][584383][JID: 20220901005829426597] The functions from module 'event' are being loaded by dir() on the loaded module
2022-09-01 01:01:04,080 [salt.utils.lazy  :99  ][DEBUG   ][584383][JID: 20220901005829426597] LazyLoaded event.fire
2022-09-01 01:01:04,081 [salt.utils.event :315 ][DEBUG   ][584383][JID: 20220901005829426597] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_f9c4bf71a1_pub.ipc
2022-09-01 01:01:04,081 [salt.utils.event :316 ][DEBUG   ][584383][JID: 20220901005829426597] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_f9c4bf71a1_pull.ipc
2022-09-01 01:01:04,082 [salt.transport.ipc:32  ][TRACE   ][584383] IPCClient: Connecting to socket: /var/run/salt/minion/minion_event_f9c4bf71a1_pull.ipc
2022-09-01 01:01:04,082 [salt.transport.ipc:32  ][TRACE   ][566777] IPCServer: Handling connection to address: <socket.socket fd=35, family=AddressFamily.AF_UNIX, type=SocketKind.SOCK_STREAM, proto=0, laddr=/var/run/salt/minion/minion_event_f9c4bf71a1_pull.ipc>
2022-09-01 01:01:04,082 [salt.utils.event :821 ][DEBUG   ][584383][JID: 20220901005829426597] Sending event: tag = module_refresh; data = {'_stamp': '2022-09-01T01:01:04.082756'}
2022-09-01 01:01:04,083 [salt.transport.ipc:373 ][DEBUG   ][584383][JID: 20220901005829426597] Closing IPCMessageClient instance

Apparently a pkg state that reports changes causes saltutil.refresh_modules to run, which in turn causes event.fire to be called, which creates a new IPCClient. This correlates strongly with the file descriptor leak I'm seeing.

I don't really understand the io_loop code, nor the tear-down of the related assets. Calling out to @dwoz for suggestions on what in the IPCClient may be causing a file descriptor leak. At a guess, the data is never read out of the socket, so zmq holds it open until a reader comes and empties the buffer?
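For anyone trying to reproduce this outside of a highstate, a hypothetical standalone loop along the lines of what event.fire does internally (the config path and event tag here are assumptions) might look like:

import os

import salt.config
import salt.utils.event

# Fire minion events in a loop, the same way event.fire does, and watch the
# process fd count. If IPC clients are not torn down properly, the count
# should climb with each iteration.
opts = salt.config.minion_config("/etc/salt/minion")

def fd_count():
    return len(os.listdir("/proc/self/fd"))

for i in range(100):
    with salt.utils.event.get_event(
        "minion", sock_dir=opts["sock_dir"], opts=opts, listen=False
    ) as event:
        event.fire_event({"iteration": i}, "test/fd_leak")
    print(f"after event {i}: {fd_count()} fds open")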

frebib avatar Sep 01 '22 01:09 frebib

Scratch that. The open file handles are TCP sockets to the master:

2022-09-06 12:48:35,047 [salt.loaded.int.module.event:117 ][DEBUG   ][9514] Process 9514: open files 23: [pconn(fd=16, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='<minion_ip>', port=48598), raddr=addr(ip='<master_ip>', port=4505), status='ESTABLISHED'), pconn(fd=92, family=<AddressFamily.AF_INET: 2>, type=<SocketKind.SOCK_STREAM: 1>, laddr=addr(ip='<minion_ip>', port=20882), raddr=addr(ip='<master_ip>', port=4506), status='ESTABLISHED'),  ...]

Strangely, that number only accounts for ~10% of the open fds according to /proc/<pid>/fd. I think psutil isn't telling the whole truth. I'm still not sure where they're being opened (or rather, not closed), though.

Generated with this diff

diff --git salt/modules/event.py salt/modules/event.py
index 4b655713e3a..873a06e3b34 100644
--- salt/modules/event.py
+++ salt/modules/event.py
@@ -108,14 +108,29 @@ def fire(data, tag):

         salt '*' event.fire '{"data":"my event data"}' 'tag'
     """
+    import os, psutil
+    def procstuff():
+        try:
+            pid = os.getpid()
+            proc = psutil.Process(pid=pid)
+            files = proc.connections(kind="all")
+            log.debug(f"Process {pid}: open files {len(files)}: {files}")
+        except Exception as ex:
+            log.exception(f"pid shiz failed: {ex}")
+
     try:
+        procstuff()
         with salt.utils.event.get_event(
             "minion",  # was __opts__['id']
             sock_dir=__opts__["sock_dir"],
             opts=__opts__,
             listen=False,
         ) as event:
-            return event.fire_event(data, tag)
+            procstuff()
+            fired = event.fire_event(data, tag)
+            procstuff()
+        procstuff()
+        return fired
     except Exception:  # pylint: disable=broad-except
         exc_type, exc_value, exc_traceback = sys.exc_info()
         lines = traceback.format_exception(exc_type, exc_value, exc_traceback)
diff --git salt/utils/schedule.py salt/utils/schedule.py
index 5f461a47a6c..56db117baf6 100644
--- salt/utils/schedule.py
+++ salt/utils/schedule.py
@@ -948,8 +948,6 @@ class Schedule:

         """

-        log.trace("==== evaluating schedule now %s =====", now)
-
         jids = []
         loop_interval = self.opts["loop_interval"]
         if not isinstance(loop_interval, datetime.timedelta):
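A possible explanation for the ~10% discrepancy noted above: psutil's connections() only reports sockets, so the eventfds, epoll fds, and pipes that dominate the earlier /proc listing never appear in it. A hedged variant of the procstuff() helper that also logs the total fd count (assuming the same module-level log object as in the diff above) could look like:

import os
import psutil

def procstuff():
    # connections() covers TCP/UDP/UNIX sockets only; num_fds() counts every
    # entry in /proc/<pid>/fd, including eventfds, epoll instances, and pipes.
    proc = psutil.Process(os.getpid())
    conns = proc.connections(kind="all")
    total = proc.num_fds()
    log.debug(
        "Process %s: %d fds total, %d of them sockets: %r",
        proc.pid, total, len(conns), conns,
    )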

frebib avatar Sep 06 '22 12:09 frebib

@frebib Can you confirm this is no longer an issue in 3006.x and on the head of the master branch?

dwoz avatar Aug 28 '23 05:08 dwoz

@dwoz This is a pretty tricky thing to replicate, so it might take us a while to test it. That, combined with the need to rebase our patchset on top of master/3006.x, might be a problem, but I can try. I'll get back to you.

Is there a particular patch/PR that was merged that you think might have fixed the issue?

frebib avatar Aug 30 '23 12:08 frebib

@frebib Are you able to provide the output of lsof -p <pid> while this is occurring?

dwoz avatar Oct 18 '23 23:10 dwoz

@frebib Also, if you are not able to upgrade, is it possible to get a working GDB installation on your current setup? One where you can get https://github.com/python/cpython/blob/main/Tools/gdb/libpython.py to work?

dwoz avatar Nov 02 '23 22:11 dwoz

@frebib Is this still an issue in 3006.5?

dwoz avatar Dec 16 '23 22:12 dwoz

We're still working on rolling out 3006.x. We've run into a few issues/regressions, but once we get those sorted, we'll validate this and get back to you. Sorry for the delay.

frebib avatar Jan 03 '24 09:01 frebib

@frebib There is another open file handle fix going into 3006.7 (#65932). If you let me know of any outstanding issues/regressions, I can help prioritize them.

dwoz avatar Feb 13 '24 22:02 dwoz

It seems that on 3006.5 this is no longer an issue. I don't really see more than ~24 fds open most of the time.

frebib avatar Apr 22 '24 10:04 frebib