
Use poll implementation to fix file descriptor leak

Stealthii opened this issue 2 years ago • 3 comments

Using select() here causes a file descriptor leak. This change replaces it with a poll() implementation to avoid that.

The polling object needs to be opened with an eventmask that stops it from blocking paramiko.

Fixes #51.
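A minimal sketch of the kind of swap described above (not the actual fab-classic code): a pipe's read end stands in for a paramiko channel's fileno(), and the poll object is registered with an eventmask restricted to POLLIN so it only reports readability.

```python
import os
import select

# Stand-in for a paramiko channel's fileno(); r_fd/w_fd are a plain pipe.
r_fd, w_fd = os.pipe()

# Register with an eventmask limited to POLLIN so poll() only reports
# readability and never stalls on write-side events.
poller = select.poll()
poller.register(r_fd, select.POLLIN)

os.write(w_fd, b"ready")
# Note: poll() takes a timeout in milliseconds (select.select uses seconds).
events = poller.poll(100)
for fd, event in events:
    assert fd == r_fd and event & select.POLLIN
    print(os.read(fd, 5))  # b'ready'

# Unlike select(), poll() has no FD_SETSIZE cap, and unregistering
# releases the poll object's bookkeeping for that descriptor.
poller.unregister(r_fd)
os.close(r_fd)
os.close(w_fd)
```

Unregistering and closing each descriptor when a channel finishes is what keeps the fd table from growing across many hosts.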

Stealthii avatar Apr 12 '22 08:04 Stealthii

Correct me if I'm wrong, but I think file descriptors (pipes) are leaking due to the issue addressed in this PR: https://github.com/ploxiln/fab-classic/pull/74

The leak then manifests as errors in select calls.

With that said, here are the docs for the select syscall that backs select.select: https://man7.org/linux/man-pages/man2/select.2.html

> WARNING: select() can monitor only file descriptors numbers that
> are less than FD_SETSIZE (1024)—an unreasonably low limit for
> many modern applications—and this limitation will not change.
> All modern applications should instead use [poll(2)](https://man7.org/linux/man-pages/man2/poll.2.html) or [epoll(7)](https://man7.org/linux/man-pages/man7/epoll.7.html),
> which do not suffer this limitation.

select seems to be a problematic syscall, and since "all modern applications should use poll/epoll", replacing it with select.poll seems like the right play even if select isn't the cause of the leak.
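The FD_SETSIZE ceiling is easy to demonstrate: CPython's select() wrapper refuses any descriptor number at or above 1024 before the syscall even runs, while poll() accepts arbitrary fd numbers (reporting POLLNVAL for ones that aren't open). A small sketch, using a deliberately high fd number that need not be open:

```python
import select

HIGH_FD = 5000  # arbitrary number above FD_SETSIZE; doesn't need to be open

# select.select() rejects it outright with ValueError
# ("filedescriptor out of range in select()").
try:
    select.select([HIGH_FD], [], [], 0)
    select_rejected = False
except ValueError as err:
    select_rejected = True
    print(err)

# poll() has no such ceiling: the fd number is accepted, and the kernel
# reports POLLNVAL only because this particular fd isn't actually open.
poller = select.poll()
poller.register(HIGH_FD, select.POLLIN)
events = poller.poll(0)
print(events)
```

This is why a Fabric run against thousands of hosts in parallel can trip over select() even without a leak: legitimately open descriptors simply climb past 1023.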

jealouscloud avatar Apr 20 '22 14:04 jealouscloud

@jealouscloud there may be multiple sources of file descriptor leaks - in this instance the 1024 limit is definitely hit under select() when running Fabric against a large number of hosts in parallel (5k+).

The above code for io.py has been tested in a production capacity; I extended the poll implementation to the forwarder in context_managers.py after seeing #51.

Stealthii avatar Apr 26 '22 07:04 Stealthii

> @jealouscloud there may be multiple sources of file descriptor leaks - in this instance the 1024 limit is definitely hit under select() when running Fabric against a large number of hosts in parallel (5k+).

One of the ways I troubleshot this issue was running `ls -l` against /proc/$pid/fd, where $pid is the process ID of the primary/parent fabric process.
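The same check can be scripted for watching fd counts during a run. A hypothetical helper (Linux only, since it relies on /proc):

```python
import os

def open_fd_count(pid: int) -> int:
    """Count the file descriptors a process currently holds open,
    by listing the /proc/<pid>/fd directory (Linux-specific)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

# Example: inspect our own process; a steadily climbing number across
# hosts would point at a leak.
print(open_fd_count(os.getpid()))
```

Sampling this periodically while Fabric runs makes it obvious whether descriptors are being released as hosts complete or accumulating toward the limit.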

Before noticing the cause of the leak for #74, I recall there was a big difference in fd allocation when I used --eagerly-disconnect. That may or may not have helped your case, but if it did, that might have been part of the intended functionality rather than an unintended leak of unused resources. That said, fd count absolutely scales with your -z value too, depending on what that was set to.

I'm not saying there aren't potentially multiple leaks, but I do want to add what I've seen to the conversation.

All that said, I still think this is a good/necessary change, leak or no.

jealouscloud avatar Apr 26 '22 14:04 jealouscloud