
Running out of file descriptors on update to 17.0

Open ellisonbg opened this issue 7 years ago • 16 comments

@minrk @SylvainCorlay I updated to version 17.0 using conda (it looks like it is picking it up from QuantStack). After running the notebook for a bit, I am seeing errors about running out of file descriptors:

[E 21:43:03.681 LabApp] Uncaught exception in /api/kernels/c8737a64-5828-49d5-9972-b9ee7aa6a305/channels
    Traceback (most recent call last):
      File "/Users/bgranger/anaconda3/lib/python3.6/site-packages/tornado/websocket.py", line 494, in _run_callback
      File "/Users/bgranger/anaconda3/lib/python3.6/site-packages/notebook/services/kernels/handlers.py", line 275, in open
      File "/Users/bgranger/anaconda3/lib/python3.6/site-packages/notebook/services/kernels/handlers.py", line 129, in create_stream
      File "/Users/bgranger/anaconda3/lib/python3.6/site-packages/jupyter_client/multikernelmanager.py", line 33, in wrapped
      File "/Users/bgranger/anaconda3/lib/python3.6/site-packages/jupyter_client/ioloop/manager.py", line 22, in wrapped
      File "/Users/bgranger/anaconda3/lib/python3.6/site-packages/jupyter_client/connect.py", line 559, in connect_shell
      File "/Users/bgranger/anaconda3/lib/python3.6/site-packages/jupyter_client/connect.py", line 543, in _create_connected_socket
      File "/Users/bgranger/anaconda3/lib/python3.6/site-packages/zmq/sugar/context.py", line 146, in socket
      File "/Users/bgranger/anaconda3/lib/python3.6/site-packages/zmq/sugar/socket.py", line 58, in __init__
      File "zmq/backend/cython/socket.pyx", line 330, in zmq.backend.cython.socket.Socket.__init__
    zmq.error.ZMQError: Too many open files

Any thoughts?

ellisonbg avatar Apr 25 '18 04:04 ellisonbg

I should note that, in the grand history and tradition of @fperez, I discovered this while urgently preparing a talk ;-) how else would we find bugs?

I can survive, as the notebook server will likely run long enough for an hour-long talk/demo, so there is nothing urgent to fix before my talk.

ellisonbg avatar Apr 25 '18 04:04 ellisonbg

If there is a bug, it is likely the notebook server opening sockets and not closing them. There is a maximum number of files a process can have open before the system starts refusing to open new sockets.

You can up the limit with:

ulimit -n 4096
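For reference, a quick way to inspect the limit and see how many descriptors a process actually holds (a sketch for a Unix shell; `$$` here is the current shell's own PID, so substitute the notebook server's PID in practice):

```shell
# Show the current per-process soft file-descriptor limit
ulimit -n

# Raise the soft limit for this shell session (cannot exceed the hard limit)
ulimit -n 4096 || echo "hard limit too low; ask an admin to raise it"

# Count the descriptors a process currently holds (Linux; on macOS use
# `lsof -p <PID> | wc -l` instead). $$ is this shell's own PID.
ls /proc/$$/fd | wc -l
```

Note that `ulimit -n` only affects the shell it runs in and processes started from it, so the notebook server must be launched from that same shell for the new limit to apply.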

minrk avatar Apr 25 '18 10:04 minrk

If there is an FD leak bug in pyzmq, it might be in the new asyncio integration; I will double-check to make sure that's not it. But the notebook server not closing sockets it uses has happened before, so I'm guessing that's the cause.

minrk avatar Apr 25 '18 10:04 minrk

Thanks Min, I will keep monitoring the situation and open an issue on the notebook repo.


ellisonbg avatar May 01 '18 18:05 ellisonbg

There is definitely something in v17 that leaves some file descriptors open where v16 did not.

In osBrain (https://github.com/opensistemas-hub/osbrain) the same test suite passes (https://travis-ci.org/opensistemas-hub/osbrain/builds/407155149) when pyzmq is pinned below v17 (i.e. v16), but fails (https://travis-ci.org/opensistemas-hub/osbrain/builds/405590284) with v17.

The tests that fail are precisely those checking no file descriptors are left open (all IPC sockets are closed) under different scenarios.

No Jupyter involved here.

@minrk Did you find anything suspicious in the new asyncio integration? Or have any idea about what could be causing this?

Peque avatar Jul 23 '18 13:07 Peque

This issue hit me during my demo at SciPy. It causes the notebook server to crash.

On top of that, JupyterLab tries to reopen all the tabs and recreates all the file descriptors, which makes it impossible to restart without clearing everything first.


SylvainCorlay avatar Jul 23 '18 14:07 SylvainCorlay

Cross reference: https://github.com/jupyter/notebook/issues/3748

Peque avatar Jul 30 '18 09:07 Peque

In exploring with the notebook, I have yet to be able to provoke an FD leak (launching and stopping several hundred notebooks works fine, and the count drops back down to ~30 FDs). Similarly, I cannot yet trigger an FD leak directly with pyzmq. What I can see is that a large number of FDs are created, so it's possible that there is no leak here, but instead purely an increase in the number of FDs per connection, which lowers the number of open notebooks needed to hit the existing limit (this has always been an issue on Macs, which have a low default FD limit, since 0.12 of the notebook).

@SylvainCorlay's experience supports this conclusion: restarting the notebook server did not fix the problem, which suggests it is not actually a leak (a leak accumulates over time and would be cleared by a restart), but rather a large number of FDs legitimately in use. I'll need to investigate further to identify where exactly all the FDs come from, and whether any can be avoided and/or closed earlier.
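One way to distinguish "many FDs in use" from "FDs actually leaking" is to snapshot the process's descriptor count around a suspect operation and confirm it returns to baseline. A minimal stdlib-only sketch (the helper name `open_fd_count` is made up for this example; it reads `/proc/self/fd` on Linux, `/dev/fd` on macOS/BSD):

```python
import os

def open_fd_count():
    """Count file descriptors currently open in this process."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            return len(os.listdir(fd_dir))
    raise OSError("no per-process fd directory on this platform")

before = open_fd_count()
f = open(os.devnull)        # stand-in for any operation suspected of consuming FDs
after = open_fd_count()
print(after - before)       # 1 here: exactly one descriptor was opened
f.close()                   # a non-leaky operation returns the count to baseline
```

Calling this before and after opening (and closing) a zmq socket or a notebook kernel would show whether descriptors are being released, or merely how many each connection costs.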

minrk avatar Jul 30 '18 09:07 minrk

@minrk Note that in my experience there is an FD leak. osBrain makes direct use of PyZMQ, and the current test suite is failing because it finds IPC socket files left open when they should have been cleaned up. It worked fine with v16 but does not with v17. After running the osBrain test suite locally against PyZMQ 17, 166 files are left open. If I run it again, the count goes up to 330, and so on. I clean them up manually, but otherwise I would eventually hit an FD limit.

Not sure how to create a reduced, complete, reproducible case, but if you want to run our test suite to see the results for yourself, the project is open source and uses standard Tox for running tests. You would only need to edit setup.py to remove the pyzmq<17 requirement.

Peque avatar Jul 30 '18 10:07 Peque

The osBrain case is also likely not an FD leak. Leaving IPC files around after closing sockets is not the same as leaking file descriptors, which are open files or sockets held by the process.

Your left-around files are caused by upgrading libzmq (pyzmq 17 bundles libzmq 4.2, whereas pyzmq 16 bundled libzmq 4.1): it looks like libzmq no longer unlinks IPC files on close, a deliberate upstream change. The tests should be fixed by https://github.com/opensistemas-hub/osbrain/pull/305
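The distinction is easy to see without libzmq at all: a plain Unix domain socket from the stdlib behaves the same way, in that closing the socket releases the descriptor but leaves the filesystem entry behind, and cleanup has to be an explicit unlink (as in the osBrain patch). A sketch, with a made-up socket path:

```python
import os
import socket
import tempfile

# Hypothetical endpoint, analogous to an ipc:// address
path = os.path.join(tempfile.mkdtemp(), "demo.sock")

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.bind(path)                 # creates the socket file on disk
s.close()                    # releases the file descriptor...

print(os.path.exists(path))  # ...but the file is still there: True

# Removing the file is the caller's job, just as with leftover ipc:// files
os.unlink(path)
print(os.path.exists(path))  # False
```

So a test suite that counts leftover socket *files* is measuring filesystem cleanup, not open descriptors in the process.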

minrk avatar Jul 30 '18 11:07 minrk

Thanks a lot for your help @minrk! :blush:

Peque avatar Jul 30 '18 12:07 Peque

This error hit me even when I only have 2 notebooks open in JupyterLab...

YubinXie avatar Jun 05 '19 21:06 YubinXie

This is still a problem in pyzmq 18.0.1.

jbulow avatar Jun 25 '19 12:06 jbulow

This also happens to me constantly when using ipyparallel:

[ProfileCreate] Generating default config file: '/home/basnijholt/.ipython/profile_adaptive_scheduler_132388.hpc05.hpc/ipython_config.py'
[ProfileCreate] Generating default config file: '/home/basnijholt/.ipython/profile_adaptive_scheduler_132388.hpc05.hpc/ipython_kernel_config.py'
Process Process-2:
Traceback (most recent call last):
  File "/home/basnijholt/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
  File "/home/basnijholt/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/devices/basedevice.py", line 222, in run
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/devices/monitoredqueuedevice.py", line 26, in run_device
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/devices/proxydevice.py", line 57, in _setup_sockets
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/devices/basedevice.py", line 191, in _setup_sockets
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/sugar/context.py", line 146, in socket
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/sugar/socket.py", line 59, in __init__
  File "zmq/backend/cython/socket.pyx", line 328, in zmq.backend.cython.socket.Socket.__init__
zmq.error.ZMQError: Too many open files
Too many open files (src/epoll.cpp:65)
Process Process-3:
Traceback (most recent call last):
  File "/home/basnijholt/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
  File "/home/basnijholt/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/devices/basedevice.py", line 222, in run
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/devices/monitoredqueuedevice.py", line 26, in run_device
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/devices/proxydevice.py", line 57, in _setup_sockets
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/devices/basedevice.py", line 187, in _setup_sockets
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/sugar/context.py", line 146, in socket
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/zmq/sugar/socket.py", line 59, in __init__
  File "zmq/backend/cython/socket.pyx", line 328, in zmq.backend.cython.socket.Socket.__init__
zmq.error.ZMQError: Too many open files
Process Process-1:
Traceback (most recent call last):
  File "/home/basnijholt/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/basnijholt/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/ipyparallel/controller/scheduler.py", line 805, in launch_scheduler
    loop = ioloop.IOLoop()
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/tornado/util.py", line 286, in __new__
    instance.initialize(*args, **init_kwargs)
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 253, in initialize
    loop = asyncio.new_event_loop()
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/events.py", line 762, in new_event_loop
    return get_event_loop_policy().new_event_loop()
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/events.py", line 660, in new_event_loop
    return self._loop_factory()
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/unix_events.py", line 51, in __init__
    super().__init__(selector)
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/selector_events.py", line 55, in __init__
    self._make_self_pipe()
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/selector_events.py", line 102, in _make_self_pipe
    self._ssock, self._csock = socket.socketpair()
  File "/home/basnijholt/miniconda3/lib/python3.7/socket.py", line 491, in socketpair
    a, b = _socket.socketpair(family, type, proto)
OSError: [Errno 24] Too many open files
Exception ignored in: <function BaseEventLoop.__del__ at 0x2ab118dba730>
Traceback (most recent call last):
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/base_events.py", line 625, in __del__
    self.close()
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/unix_events.py", line 55, in close
    super().close()
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/selector_events.py", line 86, in close
    self._close_self_pipe()
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/selector_events.py", line 93, in _close_self_pipe
    self._remove_reader(self._ssock.fileno())
AttributeError: '_UnixSelectorEventLoop' object has no attribute '_ssock'
[proxy:0:0@n05-13] HYDU_create_process (utils/launch/launch.c:25): pipe error (Too many open files)
[proxy:0:0@n05-13] launch_procs (pm/pmiserv/pmip_cb.c:708): create process returned error
[proxy:0:0@n05-13] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:894): launch_procs returned error
[proxy:0:0@n05-13] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@n05-13] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
Host key verification failed.
[NbConvertApp] Converting notebook into notebook
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 330, in get_event_loop
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/events.py", line 644, in get_event_loop
RuntimeError: There is no current event loop in thread 'Thread-1'.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/basnijholt/miniconda3/lib/python3.7/threading.py", line 917, in _bootstrap_inner
  File "/home/basnijholt/miniconda3/lib/python3.7/threading.py", line 865, in run
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/ipyparallel/client/client.py", line 902, in _io_main
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/ipyparallel/client/client.py", line 855, in _make_io_loop
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 335, in get_event_loop
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/events.py", line 660, in new_event_loop
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/unix_events.py", line 51, in __init__
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/selector_events.py", line 52, in __init__
  File "/home/basnijholt/miniconda3/lib/python3.7/selectors.py", line 349, in __init__
OSError: [Errno 24] Too many open files

Exception ignored in: <function BaseEventLoop.__del__ at 0x2b9ce9dc71e0>
Traceback (most recent call last):
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/base_events.py", line 625, in __del__
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/unix_events.py", line 55, in close
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/selector_events.py", line 86, in close
  File "/home/basnijholt/miniconda3/lib/python3.7/asyncio/selector_events.py", line 93, in _close_self_pipe
AttributeError: '_UnixSelectorEventLoop' object has no attribute '_ssock'
Traceback (most recent call last):
  File "run_learner.py", line 40, in <module>
    executor = connect_to_ipyparallel(profile=args.profile, n=args.n)
  File "/home/basnijholt/Work/adaptive-scheduler/adaptive_scheduler/utils.py", line 271, in connect_to_ipyparallel
    client = Client(profile=profile, **(client_kwargs or {}))
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/ipyparallel/client/client.py", line 519, in __init__
    self._connect(sshserver, ssh_kwargs, timeout)
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/ipyparallel/client/client.py", line 669, in _connect
    self._start_io_thread()
  File "/home/basnijholt/miniconda3/lib/python3.7/site-packages/ipyparallel/client/client.py", line 896, in _start_io_thread
    raise RuntimeError("IO Loop failed to start")
RuntimeError: IO Loop failed to start
[mpiexec@n05-13] HYDU_sock_write (utils/sock/sock.c:294): write error (Bad file descriptor)
[mpiexec@n05-13] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:177): unable to write data to proxy
[mpiexec@n05-13] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec@n05-13] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@n05-13] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@n05-13] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

I am not sure how to debug this...

basnijholt avatar Nov 20 '19 12:11 basnijholt

I can confirm this problem as well. I tried porting my code to reuse sockets but that did not help.

StuartFarmer avatar Mar 17 '20 21:03 StuartFarmer

Updating to PyZMQ 19.0.0 fixed this.

StuartFarmer avatar Mar 18 '20 22:03 StuartFarmer

Closing here, as it has yet to be shown that pyzmq is leaking FDs; rather, an increase in FDs per connection makes the problem more likely to surface in other applications. So far, every bit of evidence points to applications (notebook, ipyparallel, etc.) not always cleaning up sockets, and there's nothing pyzmq can do about that.

ipyparallel, at least, has gotten a lot better at cleaning up after itself, and I think jupyter-server has as well. Closing optimistically.

minrk avatar Feb 28 '24 11:02 minrk