fab-classic
fab-classic copied to clipboard
Use poll implementation to fix file descriptor leak
Using select()
here causes a file descriptor leak. We've replaced this with a poll()
implementation to avoid this.
The polling object needs to be opened with an eventmask that stops it from blocking paramiko.
Fixes #51.
Correct me if I'm wrong, but I think file descriptors (pipes) are leaking due to the issue in this PR: https://github.com/ploxiln/fab-classic/pull/74
And the issue is manifesting in issues with select
calls.
With that said, here's the docs for the select
syscall that backs select.select
https://man7.org/linux/man-pages/man2/select.2.html
WARNING: select() can monitor only file descriptors numbers that
are less than FD_SETSIZE (1024)—an unreasonably low limit for
many modern applications—and this limitation will not change.
All modern applications should instead use [poll(2)](https://man7.org/linux/man-pages/man2/poll.2.html) or [epoll(7)](https://man7.org/linux/man-pages/man7/epoll.7.html),
which do not suffer this limitation.
select
seems to be a problematic syscall, and "all modern applications should use poll/epoll" so replacing it with select.poll
seems like the right play even if it isn't the cause of the leak.
@jealouscloud there may be multiple sources of file descriptor leaks - in this instance the 1024 limit is definitely hit under select()
when running Fabric against a large number of hosts in parallel (5k+).
The above code for io.py
has been tested in a production capacity, but I extended the poll implementation to the forwarder in context_managers.py
after seeing #51.
@jealouscloud there may be multiple sources of file descriptor leaks - in this instance the 1024 limit is definitely hit under
select()
when running Fabric against a large number of hosts in parallel (5k+).
One of the ways I troubleshot this issue was taking a ls -l
look at /proc/$pid/fd where $pid is the process ID of the primary/parent fabric process.
Before noticing the leak cause of the leak for #74 I recall noticing there was a big difference in fd allocation when I used --eagerly-disconnect. That may or may not have helped your case but if it did, that might have been part of intended functionality instead of an unintended leak of unused resources. That said, fd count absolutely scales with your -z value too, depending on what that might have been.
I'm not saying there aren't potentially multiple leaks but I do want to add what i've seen to the conversation.
All that said, I still think this is a good/necessary change, leak or no.