'Too many open files' error

Open matkuki opened this issue 2 years ago • 13 comments

Hi,

I have a long-running application that has worked well, but I recently upgraded it, and it now serves more images through the Tornado web server than before. After users have connected and disconnected many times, the application crashes at random times with:

...
Too many open files (src/epoll.cpp:65)
Aborted (core dumped)

The first thing I tried was raising the maximum open file limit on the Ubuntu Linux machine the application runs on to 10 times its previous value, but without success: I still get the error.
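A limit raised in a shell or config file does not always reach a long-running process, so it can help to check the limit from inside the application itself. The following is a minimal sketch using Python's standard `resource` module (Unix only); the function names are made up for illustration. Note that a higher limit only postpones the crash if descriptors are actually leaking.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit_to_hard():
    """Try to raise the soft limit up to the hard limit; return the limits in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            # An unprivileged process may raise its soft limit up to the hard limit.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some systems reject the request; keep the current limits in that case.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("after: ", raise_soft_limit_to_hard())
```

If the values printed from inside the process are lower than expected, the configured limit is probably not being applied to the way the application is started.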

Any suggestions on what else I could try to fix this? Thanks

matkuki avatar Feb 10 '24 09:02 matkuki

Mmm, my first guess is that the websockets might not be cleaned up properly (sockets are file descriptors too).

Is there anything specific about the update? Would you expect more websocket connections after it? Or are the images served via regular http requests?

almarklein avatar Feb 10 '24 21:02 almarklein

Thanks for the quick reply.

Mmm, my first guess is that the websockets might not be cleaned up properly (sockets are file descriptors too).

Could be, yes, but currently I have no idea where this would occur.

Is there anything specific about the update? Would you expect more websocket connections after it? Or are the images served via regular http requests?

The biggest change is that previously the application was split into 2 sub-apps:

selection = flx.App(
    pages.Selection,
    title="App selection page"
)
selection.serve('Selection')
main = flx.App(
    pages.Main,
    title="Main page"
)
main.serve('Main')

and now it's just a single bigger app, Main. There has been no increase in user traffic: usually only one user is logged in. I tested with two simultaneous users to try to reproduce the crash, but couldn't. It always happens to one specific user, at random times, using Google Chrome (stock Chrome, no special plug-ins). All users access the app either locally or via VPN; there is no direct internet access.

All the new extra images (the majority are .svg) are referenced by flx.ImageWidget or in an <img> tag in a label, and they are served by Tornado:

tornado_app = flx.current_server().app
tornado_app.add_handlers(
    r".*",
    [
        ...
        (
            r'/feather/(.*)',
            tornado.web.StaticFileHandler,
            {
                'path': os.path.join(working_directory, 'feather')
            }
        ),
        ...
    ]
)

The new widgets added in the upgrade use a new method that creates an image on the fly when needed, in the Tornado directory, based on a monitored value and a limit value according to a scale (equal is black, higher is redder, lower is bluer). I tried repeatedly loading the page with about 50 of these images, logging out, and reloading the page, but could not reproduce the problem (note that I didn't do this for long, about 10-20 reloads).
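The actual image-generation code is not shown in this thread. As a purely hypothetical sketch of the idea described above, the snippet below maps a value to a colour relative to a limit and writes a small SVG, using a `with` block so the file handle is closed as soon as the image is written. The names `scale_color` and `write_indicator_svg`, the linear scaling and the SVG layout are all assumptions for illustration.

```python
def scale_color(value, limit):
    """Map a monitored value to a hex colour relative to a limit.

    Equal to the limit gives black, above it redder, below it bluer.
    The linear scaling and the clamping are assumptions for illustration.
    """
    if limit == 0:
        raise ValueError("limit must be non-zero")
    ratio = (value - limit) / abs(limit)
    ratio = max(-1.0, min(1.0, ratio))  # clamp to [-1, 1]
    red = round(255 * ratio) if ratio > 0 else 0
    blue = round(255 * -ratio) if ratio < 0 else 0
    return "#{:02x}00{:02x}".format(red, blue)


def write_indicator_svg(path, value, limit):
    """Write a 16x16 SVG square in the scaled colour and return the path."""
    svg = (
        '<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16">'
        '<rect width="16" height="16" fill="{}"/></svg>'
    ).format(scale_color(value, limit))
    # The with-block guarantees the descriptor is released after writing.
    with open(path, "w", encoding="utf-8") as f:
        f.write(svg)
    return path
```

If the real code opens files without closing them, each generated image would hold on to one descriptor, which is one of the things worth ruling out here.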

matkuki avatar Feb 10 '24 22:02 matkuki

I don't see any red flags from the info you provide. But I found this code:

import os
import subprocess

pid = os.getpid()
out = subprocess.check_output(['lsof', '-wXFn', '+p', str(pid)], stderr=subprocess.DEVNULL)
print(out.decode())

You can run that in the process itself (or from another process, using the pid of the process of interest). It produces something like the output below, which may give some hints about the kinds of files that are open.

p23679
f0
n->0x220ba76885751d5b
f1
n->0x33f28c6cd5efaf9e
f2
n->0x33f28c6cd5efaf9e
f3
nlocalhost:56063->localhost:50370
f4
n->0x531e76fc0357fe36

almarklein avatar Feb 12 '24 12:02 almarklein

Hi @almarklein

Thanks for the suggestion. Here is the output of the command when two users are accessing the app simultaneously:

~$ sudo lsof -wXFn +p 1081
p1081
fcwd
n/home/test/app
frtd
n/
ftxt
n/usr/bin/python3.6
fmem
n/lib/x86_64-linux-gnu/libresolv-2.27.so
fmem
n/lib/x86_64-linux-gnu/libnss_dns-2.27.so
fmem
n/lib/x86_64-linux-gnu/libnss_files-2.27.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/_proxy_steerable.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/_device.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/_version.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/_poll.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/utils.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/socket.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/context.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/message.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/error.cpython-36m-x86_64-linux-gnu.so
fmem
n/lib/x86_64-linux-gnu/libgcc_s.so.1
fmem
n/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/.libs/libsodium-72341b7d.so.23.2.0
fmem
n/lib/x86_64-linux-gnu/librt-2.27.so
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/.libs/libzmq-39117701.so.5.2.1
fmem
n/usr/local/lib/python3.6/dist-packages/zmq/backend/cython/constants.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/lib/python3/dist-packages/cryptography/hazmat/bindings/_openssl.abi3.so
fmem
n/usr/lib/python3/dist-packages/_cffi_backend.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/lib/python3/dist-packages/cryptography/hazmat/bindings/_constant_time.abi3.so
fmem
n/lib/x86_64-linux-gnu/libuuid.so.1.3.0
fmem
n/usr/lib/python3.6/lib-dynload/_csv.cpython-36m-x86_64-linux-gnu.so
fmem
n/lib/x86_64-linux-gnu/libtinfo.so.5.9
fmem
n/lib/x86_64-linux-gnu/libncursesw.so.5.9
fmem
n/usr/lib/python3.6/lib-dynload/_curses.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/lib/x86_64-linux-gnu/libffi.so.6.0.4
fmem
n/usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/local/lib/python3.6/dist-packages/tornado/speedups.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/lib/python3.6/lib-dynload/_json.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/lib/x86_64-linux-gnu/libssl.so.1.1
fmem
n/usr/lib/python3.6/lib-dynload/_ssl.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/lib/python3.6/lib-dynload/_asyncio.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/lib/python3.6/lib-dynload/_multiprocessing.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
fmem
n/usr/lib/python3.6/lib-dynload/_hashlib.cpython-36m-x86_64-linux-gnu.so
fmem
n/lib/x86_64-linux-gnu/liblzma.so.5.2.2
fmem
n/usr/lib/python3.6/lib-dynload/_lzma.cpython-36m-x86_64-linux-gnu.so
fmem
n/lib/x86_64-linux-gnu/libbz2.so.1.0.4
fmem
n/usr/lib/python3.6/lib-dynload/_bz2.cpython-36m-x86_64-linux-gnu.so
fmem
n/usr/lib/python3.6/lib-dynload/_opcode.cpython-36m-x86_64-linux-gnu.so
fmem
n/lib/x86_64-linux-gnu/libm-2.27.so
fmem
n/lib/x86_64-linux-gnu/libz.so.1.2.11
fmem
n/lib/x86_64-linux-gnu/libexpat.so.1.6.7
fmem
n/lib/x86_64-linux-gnu/libutil-2.27.so
fmem
n/lib/x86_64-linux-gnu/libdl-2.27.so
fmem
n/lib/x86_64-linux-gnu/libpthread-2.27.so
fmem
n/lib/x86_64-linux-gnu/libc-2.27.so
fmem
n/lib/x86_64-linux-gnu/ld-2.27.so
fmem
n/usr/lib/locale/locale-archive
fmem
n/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
f0
npipe
f1
n/home/test/app/output.txt
f2
n/home/test/app/error.txt
f3
n[eventpoll]
f4
ntype=STREAM
f5
ntype=STREAM
f6
n/dev/urandom
f7
n/dev/random
f8
n/home/test/app/log.txt
f9
ncan't identify protocol (-X specified)
f10
ncan't identify protocol (-X specified)
f11
ncan't identify protocol (-X specified)
f12
ncan't identify protocol (-X specified)
f13
ncan't identify protocol (-X specified)
f14
ncan't identify protocol (-X specified)
f15
ncan't identify protocol (-X specified)
f16
ncan't identify protocol (-X specified)
f17
ncan't identify protocol (-X specified)
f18
ncan't identify protocol (-X specified)
f30
ncan't identify protocol (-X specified)
f31
ncan't identify protocol (-X specified)
f32
ncan't identify protocol (-X specified)
f33
ncan't identify protocol (-X specified)

matkuki avatar Feb 12 '24 13:02 matkuki

Not getting close to 1000 yet :P

almarklein avatar Feb 12 '24 14:02 almarklein

I have no solution to this yet, but it's an issue that appears sporadically, seemingly only when one particular user with a Chrome browser logs on, and even then it doesn't always happen.

matkuki avatar Feb 17 '24 11:02 matkuki

I don't know what else to try, but I'm curious about what's going on. Please keep us posted!

almarklein avatar Feb 17 '24 23:02 almarklein

Hey @almarklein , I got more information on the problem. It seems that the trouble is caused by a single Windows machine that connects to the application. I tried a clean Windows machine on the same ethernet cable and it works without problems. So the user of the problem machine experimented a bit more. There are two things that stand out:

  • when the problem machine connects to the app, a lot of file handles of the following types show up when observing with sudo lsof -wXFn +p <process_id>:
f3
n[eventpoll]
f4
ntype=STREAM
f5
ntype=STREAM
...
f2156
n[eventpoll]
f2157
ntype=STREAM
f2158
ntype=STREAM
  • some of the images do not load in the browser for this problem user, which is noticeable from the missing-image icon. The set of images that do not load seems completely random, although it's the same images every time the user connects to the app.

This was tested on the problem machine on Chrome, Edge and Firefox: same behaviour on all three. Also tried disabling the anti-virus, no difference.

Does this help in any way?

matkuki avatar Feb 19 '24 20:02 matkuki

Thanks for the extra info. Strange though, and interesting how it's bound to a specific machine, but independent of the browser used ...

Perhaps, something related to the firewall or other networking settings? Maybe it somehow fails to download the image and repeatedly tries again?

almarklein avatar Feb 20 '24 08:02 almarklein

Could there be a way to detect this and disconnect the user if this happens? Or maybe disconnect all users if there are more than 500 file handles open? Is this possible from inside flexx?
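
For the detection part, one option that does not depend on flexx is to count the process's own descriptors and compare the count against a threshold. This is a minimal sketch for Linux, with hypothetical names `open_fd_count` and `check_fd_pressure`; what to do once the threshold is reached (closing sessions, logging, restarting) is left to a callback, because the flexx side of that is not covered here.

```python
import os

# Linux exposes one entry per open descriptor here; /dev/fd is a fallback.
_FD_DIR = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"


def open_fd_count():
    """Return the number of descriptors currently open in this process.

    Listing the directory briefly opens a descriptor itself, so the result
    may be one higher than the count at rest; the offset is constant.
    """
    return len(os.listdir(_FD_DIR))


def check_fd_pressure(threshold=500, on_exceeded=None):
    """Return (count, exceeded); call on_exceeded(count) at or above the threshold."""
    count = open_fd_count()
    exceeded = count >= threshold
    if exceeded and on_exceeded is not None:
        on_exceeded(count)
    return count, exceeded
```

Calling `check_fd_pressure()` periodically from the server's event loop and logging the count over time would also show whether the number grows steadily or jumps when a particular client connects.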

matkuki avatar Feb 20 '24 14:02 matkuki

@almarklein Another update: another user in the same office space but from a different computer has the exact same problem. I'm beginning to think it's more and more likely that it's a firewall issue, like you said.

matkuki avatar Feb 20 '24 17:02 matkuki

What the two machines have in common is that they are on the same network and both have WithSecure (formerly F-Secure) anti-virus installed. Although the first user disabled this anti-virus to test whether it would help, he may not have completely disabled all of its functionality. We'll try uninstalling the anti-virus software to see if that solves it.

matkuki avatar Feb 21 '24 09:02 matkuki

Uninstalled WithSecure anti-virus, no changes.

matkuki avatar Feb 23 '24 12:02 matkuki

Update: I updated the Python version from 3.6.9 to 3.11.0, and now it works in almost all browsers, including Firefox, Chrome, Chromium, Opera and Edge. Only one still has the problem: SeaMonkey.

matkuki avatar May 16 '24 22:05 matkuki