fab-classic
Something is opening file descriptors and not closing them
Easy to reproduce: just run a fabric task across 1024+ hosts. The per-process file descriptor limit is hit, and the first place fab-classic breaks is on this select:
!!! Parallel execution exception under host 'test.host':
Process test.host:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/tasks.py", line 217, in _parallel_wrap
    queue.put({'name': name, 'result': task.run(*args, **kwargs)})
  File "/home/ubuntu/repo/python-pp-fabric/fabric/tasks.py", line 168, in run
    return self.wrapped(*args, **kwargs)
  File "/home/ubuntu/repo/python-pp-fabtools/pp_fabtools/system_cmds.py", line 20, in runcmd_as_root
    cmd = sudo(cmd)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/network.py", line 700, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/operations.py", line 1130, in sudo
    capture_buffer_size=capture_buffer_size,
  File "/home/ubuntu/repo/python-pp-fabric/fabric/operations.py", line 923, in _run_command
    timeout=timeout, capture_buffer_size=capture_buffer_size)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/operations.py", line 807, in _execute
    worker.raise_if_needed()
  File "/home/ubuntu/repo/python-pp-fabric/fabric/thread_handling.py", line 26, in raise_if_needed
    six.reraise(e[0], e[1], e[2])
  File "/home/ubuntu/.virtualenvs/python-pp-fabric/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/ubuntu/repo/python-pp-fabric/fabric/thread_handling.py", line 13, in wrapper
    callable(*args, **kwargs)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/io.py", line 269, in input_loop
    r, w, x = select([f], [], [], 0.0)
ValueError: filedescriptor out of range in select()
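For reference, the select() limit is easy to hit outside of fabric as well. A minimal sketch (select_accepts is just an illustrative helper; raising the soft RLIMIT_NOFILE is only to manufacture a descriptor numbered above select()'s FD_SETSIZE, typically 1024):

```python
import os
import resource
import select

def select_accepts(fd):
    """Return False when select() rejects a descriptor, which it does
    for any fd >= FD_SETSIZE (typically 1024) regardless of rlimits."""
    try:
        select.select([fd], [], [], 0.0)
        return True
    except ValueError:
        return False

if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = 4096 if hard == resource.RLIM_INFINITY else min(hard, 4096)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    if target > 1024:
        os.dup2(0, target - 1)  # duplicate stdin onto a high fd number
        print(select_accepts(target - 1))  # fd >= FD_SETSIZE: select() refuses it
        os.close(target - 1)
```

Note that the rlimit and select()'s FD_SETSIZE are independent limits: raising the rlimit lets the process own more descriptors, but select() still cannot watch any of them numbered 1024 or higher.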
As for where this is happening, the best read I have is that sys.stdout and/or other sys-based output streams are being duplicated through the threadable calls into io.py's output/input loop functions.
I can say that this doesn't happen at all in the original fabric package. It also doesn't seem to happen under fab-classic on Python 2.7; I've tested 3.6.9 and 3.7.5, and both have this problem.
Doing something like this at the bottom of tasks.execute():
os.close(sys.stdin.fileno())
os.close(sys.stdout.fileno())
os.close(sys.stderr.fileno())
While silly, this stops the leak, though it breaks all other output of the program. There is a notable difference between fd.close() and os.close(fd): as best I can tell, fd.close() seems to leave the C-level file descriptor open in Python 3.x, but I'm not sure why yet.
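One way to test that hypothesis is to probe the kernel directly after closing (fd_is_open is a hypothetical helper, not anything in fabric):

```python
import os

def fd_is_open(fd):
    """Probe the kernel: fstat() fails with EBADF once the fd is closed."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False

r, w = os.pipe()
f = os.fdopen(r, "rb")
print(fd_is_open(r))  # True: the pipe's read end is open
f.close()
print(fd_is_open(r))  # False: in this simple case close() released the fd
os.close(w)
```

If descriptors turn out to still be open after close() in the leaking scenario, that would point at duplicates created by fork (each child inheriting copies of the parent's descriptors) rather than at close() itself.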
I've had some idea of re-implementing the parallel mode with threads, because the multiprocessing strategy doesn't work with Python 3.8 (references to functions decorated with @task can't seem to be passed to the forked process), and I think it never worked on Windows.
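A thread-based parallel mode could look roughly like this (a sketch only; run_parallel and the task(host) signature are made up for illustration, not fabric's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(task, hosts, pool_size=10):
    """Run task(host) for each host in a thread pool.

    Threads sidestep the fork-related problems: no pickling of @task
    callables, it works on Windows, and descriptors are not duplicated
    into child processes.
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = {host: pool.submit(task, host) for host in hosts}
        return {host: fut.result() for host, fut in futures.items()}
```

The trade-off is that all hosts' tasks share one interpreter, so anything that mutates global fabric state (env, output streams) would need locking.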
Separately, there is the idea of replacing select() with poll() on Unix systems, to avoid the max-fd-value problem. Here's the idea on the paramiko side: https://github.com/ploxiln/paramiko-ng/pull/30
Unfortunately I've been neglecting these two projects recently, with the usual excuse: busy at work on other stuff ...
I did think about poll as well. Maybe I'll give something a shot if you'll accept a PR.
Well, something as simple as this works in io.py:
elif waitable:
    poller = poll()
    poller.register(f)
    if poller.poll(0):  # timeout in milliseconds; 0 matches select(..., 0.0)
        byte = f.read(1)
File descriptors for the parent fab process still go over 1024, but fabric no longer bails. Not sure if something more robust than that is needed.
It looks like context_managers.py uses select() as well and could probably use the same treatment.
Yup, feel free to open a PR. But notice that the paramiko PR checks whether poll() is available and falls back to select() (primarily for Windows compatibility).