SIGABRT with gevent 25.4.1 / libev / invalid fd found
*** OS Debian 12.5, Python 3.11.2

```
apt-cache policy libevent-2.1-7
libevent-2.1-7:
  Installed: 2.1.12-stable-8
```
*** Same Python component (Linux systemd service), no change of any kind in the code base. We fully checked the `pip freeze`; the only difference is:
- before: gevent==24.11.1 : no crash
- after upgrade: gevent==25.4.1 : crash
- we reverted to gevent==24.11.1 : no more crash
*** Logs (strace; all crashes look similar). Not sure whether the "I/O watcher with invalid fd found in epoll_ctl" assertion is the cause of the SIGABRT:

```
getpid() = 1017145
close(11) = 0
futex(0x29703f0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa5b8d0, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0xa5b8d8, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_ctl(4, EPOLL_CTL_MOD, 11, {events=EPOLLIN, data={u32=11, u64=55834574859}}) = -1 EBADF (Bad file descriptor)
write(2, "python: /tmp/build/gevent/deps/libev/ev_epoll.c:134: epoll_modify: Assertion `("libev: I/O watcher with invalid fd found in epoll_ctl", (__errno_location ()) != 9 && (__errno_location ()) != 40 && (*__errno_location ()) != 22)' failed.\n", 238) = 238
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4e335ff000
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid() = 1017145
getpid() = 1017145
tgkill(1017145, 1017145, SIGABRT) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=1017145, si_uid=0} ---
+++ killed by SIGABRT +++
```
I can't see any changes in gevent that could cause something like that. libev was untouched, the hub and all switching code was untouched, etc.
I'd need to see a minimal reproducer, or at the very least a Python traceback, to be able to offer any suggestions.
The only thing I can think of is that some things, specifically all the queue classes, are now cooperative when monkey-patched, and gevent.queue.SimpleQueue is now cooperative even when not monkey-patched. So the exact timing of switches may have changed, exposing a race-condition in the code where you're closing an in-use FD.
OK, copy that; I'll report back here if I have more info.
I also have the same error when trying to reload a Flask app in debug mode. I created this repo to reproduce it: https://github.com/ddorian/flask-reload-gevent-bug The error happens on Python 3.13.3 but not on 3.12.10 on my machine.
The Flask/reload example from @ddorian is definitely a race condition. The main thread is in a loop looking for changed files; when it finds one, it closes the listening socket (thus invalidating the file descriptor). Meanwhile, there is a background thread that's actually doing the serving. It sits there repeatedly calling select.poll in a loop. If the socket.close() happens when the background thread is about to call select.poll, you get EBADF. Which libev doesn't like (when compiled with debugging enabled).
If you use the libuv loop, at least you get an exception instead of a process-ending assertion failure.
In the general case, I'm not sure there's a good way to mitigate such race conditions, but I'm still thinking about it.
@jamadden Is it a true background thread running in parallel?
Monkey-patch `socket.close()` to add the fd to a queue structure before the actual close, so it can be removed from the selector prior to every `selector.poll()`. Moreover, to make it even more robust:

- Monkey-patch `socket.close()` to NOT close the socket at all, but to add the FD, accompanied by a callback, to a to-be-closed queue consumed by the background thread.
- Wake up the select in the background thread (might require adding a process-wide wake-up pipe FD).
- The background thread consumes the FDs in the to-be-closed queue, closes the sockets, schedules the callback with the exception (if any) to resume, and goes back into the `select.poll` loop.
- The `socket.close` returns/raises and the greenlet continues.
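The deferred-close idea above could be sketched roughly like this (a minimal single-process sketch with my own names, not gevent or werkzeug API; a real version would also need the wake-up pipe and the callback/resume plumbing):

```python
import os
import queue

# To-be-closed queue, consumed by the thread that owns the poll loop.
_to_close: "queue.Queue[int]" = queue.Queue()

def deferred_close(fd: int) -> None:
    """Instead of closing here, hand the fd to the polling thread.

    A real implementation would also wake the poller (e.g. by writing
    one byte to a dedicated wake-up pipe) and block until the poller
    reports the result of the close back via a callback.
    """
    _to_close.put(fd)

def drain_closes() -> None:
    """Run in the polling thread, between poll() calls: close queued fds."""
    while True:
        try:
            fd = _to_close.get_nowait()
        except queue.Empty:
            return
        os.close(fd)
```

The point is that the close only ever happens in the thread that polls, so the poller can never be holding a registration for an already-closed fd.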
Note that none of the code being discussed is gevent; it's the stdlib socketserver and werkzeug. We thus have limited control. Also note that socketserver is implemented using (specifically) selectors.PollSelector, and this race condition specifically violates the documentation for how to use selectors which states that you must unregister FDs before closing them. Last, whether or not this problem appears depends on the event loop being used, and what backing OS-implementation that loop is using; I've only proved Linux/epoll dangerous, and haven't produced an issue on macOS.
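For reference, the usage contract the `selectors` documentation describes, and which this race violates, is that a file object must be unregistered before it is closed. A minimal sketch of the compliant ordering:

```python
import selectors
import socket

sel = selectors.DefaultSelector()
sock = socket.socket()
sel.register(sock, selectors.EVENT_READ)

# Per the selectors docs: unregister BEFORE closing, or the selector
# may keep polling a stale (or worse, reused) file descriptor.
sel.unregister(sock)
sock.close()
sel.close()
```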
If you're suggesting keeping a global list of FDs active in selectors, we already do that. And if you try to close a socket in that list, we defer it already to a safer time. (We cannot always defer it; that breaks a number of tests because it keeps the socket open longer than desired.)
But that's not a solution to this case. Sometimes FDs are going to be used again by client code, but are not currently anywhere that gevent can see them and magically know that will be the case. Consider this (unfortunately common anti-) pattern:
```python
fd = get_fd_from_somewhere()
while True:
    with selectors.DefaultSelector() as selector:
        selector.register(fd, selectors.EVENT_READ)
        ready = selector.select(timeout)
        do_stuff(ready)
```
Or the even simpler use of `select.select`, which is what the above boils down to:
```python
fd = get_fd_from_somewhere()
while True:
    ready, _, _ = select.select([fd], (), (), timeout)
    do_stuff(ready)
```
In either of those cases, fd is not in use during the call to do_stuff. And if during the time that do_stuff is executing fd gets closed, then when do_stuff returns and the loop repeats --- using the now-closed fd --- you're going to get an EBADF.
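This is easy to see without any threads at all (a minimal sketch; on Linux, the select(2) syscall rejects a closed descriptor):

```python
import errno
import os
import select

r, w = os.pipe()
os.close(r)  # simulate do_stuff() (or another thread) closing the fd

try:
    select.select([r], (), (), 0)
except OSError as e:
    # select(2) fails with EBADF for a descriptor that is no longer open
    assert e.errno == errno.EBADF
os.close(w)
```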
I believe I can check whether the fd is still open before I pass it on to libuv/libev; `fstat()` accepts any kind of file descriptor and fails with EBADF if it's invalid. However:
- That adds a relatively expensive system call when starting watchers, which we do a lot. I have grave concerns about the performance implications, especially at scales that may be difficult to test.
- There's still a race condition, it's just much smaller.
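The check I have in mind could look something like this (a sketch; `fd_is_open` is my own name, not a gevent API):

```python
import errno
import os

def fd_is_open(fd: int) -> bool:
    """Return True if fd refers to an open file description.

    fstat() accepts any kind of file descriptor; a closed or invalid
    one fails with EBADF. Note the race remains: the fd can still be
    closed (or reused!) between this check and handing it to the loop.
    """
    try:
        os.fstat(fd)
    except OSError as e:
        if e.errno == errno.EBADF:
            return False
        raise
    return True
```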
FWIW, the stdlib selectors implementations all behave differently in the above scenario, when select is called with a FD that has been registered but is now closed.
- The `KqueueSelector` ignores it, and waits the complete timeout before returning.
- The `SelectSelector` raises EBADF immediately.
- The `EpollSelector` ignores it, and waits the complete timeout before returning.
- The `PollSelector` ignores it, but returns immediately.
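The `PollSelector` behaviour falls out of poll(2): an invalid descriptor is flagged with `POLLNVAL` in `revents` immediately, rather than raising. A sketch using `select.poll` directly:

```python
import os
import select

r, w = os.pipe()
p = select.poll()
p.register(r, select.POLLIN)
os.close(r)  # fd closed after registration, as in the race above

# poll(2) returns immediately, flagging the dead fd with POLLNVAL;
# no exception is raised, which is why the poll-based selector
# can quietly carry on.
events = p.poll(1000)  # timeout in milliseconds; not actually waited
os.close(w)
```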
Also, when given an open but not connected socket, `KqueueSelector` waits the full timeout before returning with no ready FDs. In contrast, `EpollSelector` returns immediately, claiming the socket is ready to be read.
The accept loop in socketserver using the selector doesn't try to catch anything, so it just assumes (a) there are no race conditions when closing; or (b) you won't actually be using the `SelectSelector`.
If I force socketserver to use gevent.selectors.GeventSelector instead of PollSelector, I haven't been able to reproduce this problem; I haven't deeply examined why, but I'm pretty sure it's because the GeventSelector is able to manage the watchers in a smarter way.
> FWIW, the stdlib selectors implementations all behave differently
I've raised this issue with CPython in re selector closure as well: https://github.com/python/cpython/issues/91433
OK, understood, thanks!
> I've raised this issue with CPython in re selector closure
I completely agree that's a valid issue and the behaviour should be more standard. The difference is that here I'm talking about closing file objects (sockets) that are registered with a selector, and there you're talking about closing the selector itself.
It turns out that libuv actually makes that extra system call (usually 2!) every time it creates an IO watcher; that's what lets it raise an exception ultimately. The implementation is different for different platforms:
| System | Syscall |
|---|---|
| Linux | epoll_ctl (x2) |
| macOS | fstat, kevent |
| AIX | pollset_ctl (x2) |
| OS/390 | poll |
| Solaris | port_associate, port_dissociate |
| POSIX (other) | poll |
| Windows | ioctlsocket |
Given:
- Crashing the process is bad; actionable exceptions are better.
- The stdlib behaviour is all over the place. I prefer the `SelectSelector` behaviour of raising an error, but ignoring the problem seems to be more common.
- The testing system calls are cheap enough for libuv, which backs node.js, which has high performance targets.
- libuv also unconditionally makes at least one more syscall, to `ioctl`.
- libuv is the future; it's the default on Windows, and one of these days it will be the default everywhere (I'd just like to have a Cython implementation in addition to the CFFI one), so these checks are going to happen.
I'm going to add code to the libev implementations to check the FD at the same point that libuv does; I'll make sure both implementations raise the same `OSError: EBADF`. Then I'll update `gevent.select.poll`, the implementation used by a monkey-patched `PollSelector`, to catch this exception and ignore it, as the stdlib apparently does.
I can probably limit this behaviour change to just `poll`, but given the first point (crashing is bad, mmkay), I'll just make it the default unless I run into widespread problems. (And I shouldn't, or we'd be crashing all over the place; also, libuv is the default on Windows, so we'd be getting the un-catchable exception `gevent.libuv.watcher.UVFuncallError: EBADF`.)
This won't fix the race condition in a free-threaded world, just make it much shorter. But in a standard GIL build, that should be enough.
FWIW, the particular crash that @champax experienced only happens when libev is built with debugging assertions enabled. We don't distribute builds like that, so the installation of gevent had to come from somewhere else, possibly a Linux distro. I always recommend installing from PyPI.
Our issue template used to request that information, but GitHub made a change and started ignoring our issue template some time ago...