fab-classic icon indicating copy to clipboard operation
fab-classic copied to clipboard

Something is opening file descriptors and not closing them

Open brutog opened this issue 4 years ago • 4 comments

Easy to reproduce. Just run a fabric task across 1024+ hosts on. The per process file descriptor will trigger and the first place fab-classic breaks is on this select:

!!! Parallel execution exception under host 'test.host:
Process test.host:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/tasks.py", line 217, in _parallel_wrap
    queue.put({'name': name, 'result': task.run(*args, **kwargs)})
  File "/home/ubuntu/repo/python-pp-fabric/fabric/tasks.py", line 168, in run
    return self.wrapped(*args, **kwargs)
  File "/home/ubuntu/repo/python-pp-fabtools/pp_fabtools/system_cmds.py", line 20, in runcmd_as_root
    cmd = sudo(cmd)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/network.py", line 700, in host_prompting_wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/operations.py", line 1130, in sudo
    capture_buffer_size=capture_buffer_size,
  File "/home/ubuntu/repo/python-pp-fabric/fabric/operations.py", line 923, in _run_command
    timeout=timeout, capture_buffer_size=capture_buffer_size)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/operations.py", line 807, in _execute
    worker.raise_if_needed()
  File "/home/ubuntu/repo/python-pp-fabric/fabric/thread_handling.py", line 26, in raise_if_needed
    six.reraise(e[0], e[1], e[2])
  File "/home/ubuntu/.virtualenvs/python-pp-fabric/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/ubuntu/repo/python-pp-fabric/fabric/thread_handling.py", line 13, in wrapper
    callable(*args, **kwargs)
  File "/home/ubuntu/repo/python-pp-fabric/fabric/io.py", line 269, in input_loop
    r, w, x = select([f], [], [], 0.0)
ValueError: filedescriptor out of range in select()

As far as where this is happening, the best read I have on it seems like sys.stdout and/or other sys based output streams are being duplicated through the threadable call to io.py and output/input_looper functions.

I can say that this doesn't happen at all in the original fabric package. It also doesn't seem to happen under fab-classic using python2.7 - I've tested 3.6.9 and 3.7.5 and they both have this problem.

Doing something like this at the bottom of the tasks.execeute():

os.close(sys.stdin.fileno())
os.close(sys.stdout.fileno())
os.close(sys.stderr.fileno())

While silly, this stops the leak. It breaks all the other output of the program. There is a noted difference between fd.close() and os.close(fd).

The best I can tell is it seems like fd.close() leaves the C level file descriptor open in python 3.x but I'm not sure why yet.

brutog avatar Nov 18 '20 02:11 brutog

I've had some idea of re-implementing the parallel mode with threads, because the multiprocessing strategy doesn't work with python-3.8 (can't seem to pass references to functions decorated with @task to the forked process), and I think it never worked on windows.

Separately, there is the idea of replacing select() with poll(), on unix systems, to avoid the max-fd-value problem. Here's the idea on the paramiko side: https://github.com/ploxiln/paramiko-ng/pull/30

unfortunately I've been neglecting these two projects recently, with the usual excuse, busy at work on other stuff ...

ploxiln avatar Nov 18 '20 06:11 ploxiln

I did think about poll as well. Maybe I'll give something a shot if you'll accept a PR.

brutog avatar Nov 18 '20 07:11 brutog

Well, something as simple as this works in io.py:

        elif waitable:
            poller = poll()
            poller.register(f)
            if poller.poll(f.fileno()):
                byte = f.read(1)

File descriptors for the parent fab process still go over 1024 but fabric no longer bails. Not sure if something more robust than that is needed.

Looks context_managers.py uses select as well and could probably use the same treatment.

brutog avatar Nov 18 '20 08:11 brutog

Yup, feel free to open a PR. But notice that the paramiko PR checks whether poll() is available and falls back to select() (primarily for Windows compatibility).

ploxiln avatar Nov 19 '20 01:11 ploxiln