
optional daemonic threads for trio.to_thread.run_sync

Open graingert opened this issue 3 years ago • 14 comments

daemon threads skip cleanup at interpreter shutdown and don't play well with subinterpreters

trio should maintain two pools of threads, daemonic and non-daemonic, so that trio can push that choice onto users

e.g.: trio.to_thread.run_sync(fn, daemon=False)
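
For illustration, usage under this proposal might look like the following. This is a sketch only: the daemon keyword is the hypothetical parameter proposed above, not something trio.to_thread.run_sync accepts today.

    import time

    import trio

    def blocking_work():
        time.sleep(1)

    async def main():
        # daemon=False is the hypothetical keyword from this proposal: run
        # the function in a non-daemonic pool that is joined at interpreter
        # shutdown, instead of the current abandonable daemon threads.
        await trio.to_thread.run_sync(blocking_work, daemon=False)

    trio.run(main)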

graingert avatar Jun 23 '21 21:06 graingert

Can you elaborate on the actual downsides? The threads don't need cleanup, and how do they not play well with subinterpreters?

(Also, is there any reason to care about subinterpreter support in the first place?)

njsmith avatar Jun 23 '21 22:06 njsmith

The threads don't need cleanup

Some threads can still acquire resources that need to be waited for on interpreter shutdown

re subinterpreters:

10:04:46 <graingert> vstinner: are daemon threads allowed or not allowed in PEP 554 un-isolated sub-interpreters? 
10:05:03 <graingert> I'm trying to follow the allow/disallow tickets but the reverts of reverts is confusing me
10:05:10 <graingert> context is https://github.com/python-trio/trio/issues/2046
10:08:07 <vstinner> graingert: i don't recall. i strongly recommends against using them in subinterpreters

I'd really like subinterpreter support, to be able to run multi-CPU hypercorn without multiprocessing where possible

graingert avatar Jun 24 '21 09:06 graingert

Some threads can still acquire resources that need to be waited for on interpreter shutdown

Do you have an example in mind? How does this cause problems for Trio's current use of daemonic threads, and what would the alternative be?

<vstinner> graingert: i don't recall. i strongly recommends against using them in subinterpreters

@graingert linked some more context in chat: https://vstinner.github.io/threading-shutdown-race-condition.html#forbid-daemon-threads-in-subinterpreters

@richardsheridan also pointed to https://bugs.python.org/issue40234, and I think Graham Dumpleton's response is particularly relevant.

I respect @vstinner a lot but daemon threads are a totally standard thing that every threading system supports, including Python's, and like Graham points out, that's because they're kind of important? So I think Victor and the subinterpreter folks are going to have to hold their nose and figure this one out, even if it sucks.

In any case, subinterpreters are not exactly a real thing currently, and multi-core subinterpreter support is definitely not a real thing, and personally I still think the odds are heavily stacked against it ever becoming a real thing. And if it ever does become a real thing, then that will just increase the pressure on subinterpreters to support stuff like daemon threads, or at least provide new APIs to handle the cases where we use daemon threads currently. So... I don't think we should make our code more complicated and fragile now to avoid a theoretical future problem that might never even happen – if daemon threads in subinterpreters ever become a real problem for Trio, we can always revisit.

njsmith avatar Jun 24 '21 13:06 njsmith

the problem is sharpest with anyio, where asyncio users are used to asyncio.to_thread being non-daemonic and cleaned up at exit, while trio users expect the opposite

graingert avatar Jun 24 '21 13:06 graingert

I respect @vstinner a lot but daemon threads are a totally standard thing that every threading system supports, including Python's, and like Graham points out, that's because they're kind of important? So I think Victor and the subinterpreter folks are going to have to hold their nose and figure this one out, even if it sucks.

Currently, Py_EndInterpreter() causes a fatal error if a daemon thread is still running while a subinterpreter exits:

    if (tstate != interp->tstate_head || tstate->next != NULL) {
        Py_FatalError("not the last thread");
    }

It makes Py_EndInterpreter() not reliable at all: whether it succeeds depends on whether the daemon threads happen to have completed or not.

To support daemon threads in the main interpreter, ceval.c must immediately exit daemon threads as soon as they try to acquire the GIL: see tstate_must_exit() in Python/ceval_gil.h. Currently, this code is specific to the main interpreter, and so doesn't solve the Py_EndInterpreter() issue.

Another concern is that Python doesn't know whether a daemon thread is still using an object when Py_EndInterpreter() is called. Pseudo-code:

buffer = PyByteArray_AS_STRING(bytearray);  // borrowed pointer into the object
len = PyByteArray_GET_SIZE(bytearray);
Py_BEGIN_ALLOW_THREADS  // release the GIL
n = read(fd, buffer, len);  // runs with the GIL released
Py_END_ALLOW_THREADS  // acquire the GIL

If Py_EndInterpreter() destroys the daemon thread's bytearray object, read() can write into freed memory: you can get a crash. asyncio still has a corner-case issue in overlapped_dealloc():

static void
overlapped_dealloc(OverlappedObject *self)
{
    DWORD bytes;
    int err = GetLastError();

    PyObject_GC_UnTrack(self);
    if (self->pending) {
        if (check_CancelIoEx() &&
            Py_CancelIoEx(self->handle, &self->overlapped) &&
            GetOverlappedResult(self->handle, &self->overlapped, &bytes, TRUE))
        {
            /* The operation is no longer pending -- nothing to do. */
        }
        else if (_Py_IsFinalizing())
        {
            /* The operation is still pending -- give a warning.  This
               will probably only happen on Windows XP. */
            PyErr_SetString(PyExc_RuntimeError,
                            "I/O operations still in flight while destroying "
                            "Overlapped object, the process may crash");
            PyErr_WriteUnraisable(NULL);
        }
        else
        {
            /* The operation is still pending, but the process is
               probably about to exit, so we need not worry too much
               about memory leaks.  Leaking self prevents a potential
               crash.  This can happen when a daemon thread is cleaned
               up at exit -- see #19565.  We only expect to get here
               on Windows XP. */
            CloseHandle(self->overlapped.hEvent);
            SetLastError(err);
            return;
        }
    }

    CloseHandle(self->overlapped.hEvent);
    SetLastError(err);
    if (self->write_buffer.obj)
        PyBuffer_Release(&self->write_buffer);
    Py_CLEAR(self->read_buffer);
    PyTypeObject *tp = Py_TYPE(self);
    tp->tp_free(self);
    Py_DECREF(tp);
}

If Py_EndInterpreter() leaves the bytearray object alive: you leak memory.

Daemon threads provide a weird contract: don't care if Python exits, clean up everything magically, and hope that it will work. It doesn't work. Magic doesn't exist.

I would like to deny daemon threads in subinterpreters to make Py_EndInterpreter() safe again. If tomorrow, someone finds a clever trick to implement them: good. But today, there is a major design flaw.

vstinner avatar Jun 28 '21 08:06 vstinner

By the way, I'm against daemon threads in general, also in the main interpreter. There is no way to make such beasts fully safe. It's full of corner cases and makes crashes at exit way more likely.

vstinner avatar Jun 28 '21 08:06 vstinner

@vstinner Hmm, I do see what you mean. In the main interpreter, you can absolutely make them fully safe, because the OS will help you -- process exit atomically destroys all threads and objects simultaneously, so there's no way for things to leak or get used-after-free. That's the basic problem that subinterpreters have always had... you're trying to reimplement subprocess semantics in userspace, but it's never going to work as well as actual OS subprocesses, because the kernel has resources that you don't.

If you implement subinterpreters as subprocesses, then daemon threads will work fine :-). Or, I guess you could give up on Py_EndInterpreter? I can see how it might have been a good idea to design Python's semantics so it was possible to implement Py_EndInterpreter correctly, but it seems like that ship has sailed whether we like it or not.

njsmith avatar Jun 28 '21 09:06 njsmith

As a concrete example of why OS-level daemon threads are necessary: whenever asyncio or trio does a DNS lookup, it does it in a worker thread, because getaddrinfo is inherently a blocking interface and for various reasons you have to use it. This is the key reason why all async libs must have some kind of defer-to-thread support to be useful at all.

Now, what happens if you want to cancel the getaddrinfo call? You can't actually stop it from completing, because there's no OS interface for that. But you can abandon that thread to keep running in the background and ignore the result, and since getaddrinfo doesn't have any side-effects, that's basically equivalent. (If you're limiting the total number of getaddrinfo threads then this might cause some future getaddrinfo calls to slow down for a bit waiting for thread availability, but that's a minor and localized side-effect -- not a big deal.)

So here's the problem: whenever your program finishes, there might be stalled getaddrinfo threads still hanging around in the background. You don't want the program to hang indefinitely at exit waiting for them to finish. With daemon threads, this is no problem -- the OS just zaps them immediately. Without daemon threads, what do you do?
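
For concreteness, here's a minimal sketch of that abandon-on-cancel pattern using only the stdlib (not Trio's actual implementation; the function name is made up):

    import socket
    import threading

    def abandonable_getaddrinfo(host, port, timeout):
        # Run getaddrinfo in a daemon thread. On timeout we simply abandon
        # the thread: it keeps running in the background, but because it's
        # daemonic it can't block process exit.
        outcome = []
        done = threading.Event()

        def worker():
            try:
                outcome.append(socket.getaddrinfo(host, port))
            except OSError as exc:
                outcome.append(exc)
            done.set()

        threading.Thread(target=worker, daemon=True).start()
        if not done.wait(timeout):
            return None  # abandoned -- the caller treats this as cancelled
        if isinstance(outcome[0], BaseException):
            raise outcome[0]
        return outcome[0]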

njsmith avatar Jun 28 '21 12:06 njsmith

asyncio uses:

https://github.com/python/cpython/blob/efe7d08d178a7c09bcca994f2068b019c8633d83/Lib/asyncio/base_events.py#L848-L856

where passing None uses the default executor, which defaults to:

https://github.com/python/cpython/blob/efe7d08d178a7c09bcca994f2068b019c8633d83/Lib/asyncio/base_events.py#L808-L811

which since https://bugs.python.org/issue39812 does not use daemon threads
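
In other words, a minimal sketch of the behaviour those links describe:

    import asyncio
    import socket

    async def main():
        loop = asyncio.get_running_loop()
        # Passing None selects the loop's default executor: since bpo-39812
        # that's a ThreadPoolExecutor whose worker threads are non-daemonic
        # and are joined at interpreter shutdown.
        infos = await loop.run_in_executor(
            None, socket.getaddrinfo, "example.com", 443)
        print(infos[0])

    asyncio.run(main())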

graingert avatar Jun 28 '21 12:06 graingert

@vstinner Hmm, I do see what you mean. In the main interpreter, you can absolutely make them fully safe, because the OS will help you -- process exit atomically destroys all threads and objects simultaneously

No, the OS doesn't help. Python doesn't attempt to kill daemon threads in Py_Finalize(). It uses tstate_must_exit() to call pthread_exit() if a daemon thread tries to (re)acquire the GIL while Python is exiting or has already exited. This work was done because we got many crashes related to daemon threads (without subinterpreters).

The problem of tstate_must_exit() is that it currently uses a global variable to check if Python is exiting, and this function is not compatible with subinterpreters. It would be possible to make it compatible with subinterpreters. It's just that nobody decided to work on this issue yet.

Note: pthread_exit() comes with its own set of issues: see https://bugs.python.org/issue42969 and https://bugs.python.org/issue44434. There are technical reasons why daemon threads are hard to implement correctly.

vstinner avatar Jun 28 '21 23:06 vstinner

Now, what happens if you want to cancel the getaddrinfo call? You can't actually stop it from completing, because there's no OS interface for that.

pthread_kill() can be used to kill a thread, but I'm not sure there is a portable way to do it, so Python doesn't rely on it. I think that the most portable option is to spawn a child process and kill it to cancel a task.
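
A minimal sketch of that child-process approach (illustrative only, and on spawn-based platforms it needs the usual __main__ guard):

    import socket
    from multiprocessing import Process, Queue

    def _worker(q, host, port):
        try:
            q.put(socket.getaddrinfo(host, port))
        except OSError as exc:
            q.put(exc)

    def cancellable_getaddrinfo(host, port, timeout):
        # Unlike a thread, a child process really can be killed, so the
        # lookup is genuinely cancellable -- at the cost of a process per
        # call.
        q = Queue()
        p = Process(target=_worker, args=(q, host, port))
        p.start()
        p.join(timeout)
        if p.is_alive():
            p.terminate()
            p.join()
            raise TimeoutError("getaddrinfo cancelled")
        result = q.get()
        if isinstance(result, BaseException):
            raise result
        return result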

A better solution would be to use an asynchronous implementation of a DNS resolver (there are some written in Python). Or implement an asynchronous name resolver in the C library.

vstinner avatar Jun 28 '21 23:06 vstinner

pthread_kill() can be used to kill a thread

Alas, pthread_kill is not usable in practice -- especially not for killing some random C code that you don't control, like getaddrinfo. It's one of those POSIX APIs that was spec'ed because "wouldn't it be nice", but never really worked out. And yeah, there's nothing like it on Windows either.

I think that the most portable option is to spawn a child process and kill it to cancel a task.

Yeah, but no-one would accept spawning child processes for every hostname lookup :-). I guess you could get it down to 1 subprocess for all the DNS lookups in a single subinterpreter but... by the time you make that reliable, you might as well just use the subprocess for the entire subinterpreter instead?

A better solution would be to use an asynchronous implementation of a DNS resolver (there are some written in Python).

Absolutely, in theory. In practice, it turns out that this isn't really viable, because getaddrinfo has special secret knowledge about the local system's configuration. You can query a DNS server, that's not too hard. But maybe getaddrinfo has been configured to use mDNS for certain suffixes, or also does a NETBIOS query, or like... on Linux with glibc, a totally standard, supported thing to do is to drop in some random .so file and use the quirky nsswitch.conf language to instruct getaddrinfo to load that shared library into every process and use it for name resolution... it's a huge mess.

Custom DNS resolvers are totally useful as an option -- they can work great in some configurations. But you can't count on them. Pretty much everyone who's ever built a networking library has gone down the path of getting cranky about getaddrinfo, trying to kill it, and then eventually giving up. You cannot escape getaddrinfo. Even Go uses getaddrinfo in many situations, and they hate using platform C code.

Or implement an asynchronous name resolver in the C library.

This would be fabulous of course, but it's not realistic -- no major OS offers this today, none of them have plans to, and it takes years for new OS features to percolate out to users.

This work has been done because we got many crashes related to daemon threads (without subinterpeters). [...] There are technical reasons why daemon threads are hard to be implemented correctly.

Yeah, I totally believe this. I'm not claiming daemon threads are like, objectively a good thing! Everything I just wrote about is a fractal of interlocking sadness and frustration. Just... daemon threads are what the industry has collectively settled on as the frustrating feature that has to be supported to make this stuff work. Sorry to be the bearer of bad news :-/

njsmith avatar Jun 29 '21 03:06 njsmith

Daemon threads (which pthreads calls detached threads) also seem to be a feature put into POSIX "because it would be nice", without the full implications being thought through. Python presumably supported the concept simply because it existed in pthreads and sounded nice. Code executing in your address space that you can never control or stop once you start it? What could possibly go wrong? :P The concept only makes sense in the simplest of cases in C, where the code is constrained, not complex, and not calling off into other libraries that could have concurrency and global-state needs. A Python interpreter is by definition complex and has all of those.

ie: if you want a daemon thread for calling libc getaddrinfo that is as safe as possible, write a C function and spawn it as a detached thread, leaving the result somewhere and communicating a "done" event entirely from C, rather than using a Python daemon thread just because that is easy. No code author really wants to do this, but it'd make for a pretty simple async_getaddrinfo extension module that could be its own PyPI package...

I'm glad you brought up what golang does, because they're where I would've looked next, as they're understandably libc-averse. The comments atop https://golang.org/src/net/net.go explain their current logic. It is indeed a pain, and libc getaddrinfo simply cannot be avoided in significant, common real-world scenarios. The cgo getaddrinfo-calling code limits the number of getaddrinfo threads that can run at once, to avoid cascading failure (blocking the channel when there are too many): acquireThread (https://golang.org/src/net/net.go?s=21309:21329#L658) is called from the getaddrinfo-calling functions in https://golang.org/src/net/cgo_unix.go. But that's about it. Effectively this is the same thing asyncio does by using a thread pool, per graingert's links above.
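
The same limiting pattern in Python could be as simple as the following sketch (the limit of 8 is an arbitrary illustration, not a value from Go or asyncio):

    import socket
    import threading

    # Cap how many getaddrinfo calls may run at once, like Go's
    # acquireThread: once the limit is reached, further callers block
    # until a slot frees up.
    _slots = threading.BoundedSemaphore(8)

    def limited_getaddrinfo(host, port):
        with _slots:
            return socket.getaddrinfo(host, port)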

Simplest: just don't use daemon threads for Python getaddrinfo calls. Use normal threads, ideally pooled, with a hard limit such that even adding to the pending pool blocks once the limit is reached. Ordinarily getaddrinfo calls aren't supposed to take a long time. They can, but it's rare. Don't optimize for the rare case where instant application termination is desired; just allow the "hang" to happen. If this turns out to be annoying to users, they can use strace or ltrace or their platform equivalents to see what's up and point the finger at "it's always DNS" (or really the abstraction one layer up known as getaddrinfo logic). If you need a program to exit immediately without waiting for any cleanup, that's what os._exit() is for. And at the command line, that's what we're all trained to hit Ctrl-\ for when Ctrl-C doesn't satisfy our patience level.

gpshead avatar Jul 04 '21 19:07 gpshead

Code executing in your address space that you can never control or stop once you start it? What could possibly go wrong?

I mean, yes, that is the basic problem with threads :-). But what does it have to do with the daemon/joinable distinction? Joinable threads have a join handle, but that's just an event + an int. It doesn't let you affect the thread at all, and you can emulate it trivially with a daemon thread. (In fact IIRC that's exactly what CPython does -- joinable threading.Threads are detached POSIX threads underneath, right?)
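
To make that concrete, here's a minimal sketch of emulating a join handle on top of a daemon thread (the helper name is made up):

    import threading

    def spawn_joinable(fn, *args):
        # A daemon thread plus an event gives you what a join handle does:
        # wait for completion, with an optional timeout.
        finished = threading.Event()

        def wrapper():
            try:
                fn(*args)
            finally:
                finished.set()

        threading.Thread(target=wrapper, daemon=True).start()
        return finished.wait  # "join" by calling this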

AFAIK the only fundamental difference between joinable and detached threads is that if your main thread exits without waiting for the thread, then the runtime will automatically join any joinable threads, but if there are detached threads then it just _exits. Trio uses daemon threads because they're more controllable and flexible than joinable threads.

ie: if you want a safe as possible daemon thread for calling libc getaddrinfo, write a C function and spawn it as a daemon thread, leaving the result somewhere and communicating a "done" event entirely from C. Rather than using a Python daemon thread because that is easy.

It's not so much that it's easy, as that Trio already has its own high-quality and deeply-integrated threading code, and tacking on some external thread pool for a subset of operations would just make things jankier for no benefit. In fact, glibc and Windows both have built-in APIs for shoving getaddrinfo off into a thread (getaddrinfo_a and GetAddrInfoEx respectively), but I've read a lot of event loop code, and I've never seen a single program that actually uses these "convenience" APIs.

If hypothetically we did want to port Trio to work with subinterpreters, I guess the most natural way to handle this would be to move our thread cache into C, so we could keep a single process-global set of detached worker threads. Then individual subinterpreters could "check out" a thread whenever they had a job to do, do the PyThreadState dance to switch into the appropriate context while running the job, and then return the thread to the pool. But of course then you have to figure out what happens if someone tries to call Py_EndInterpreter while the thread is running, so we're kind of back where we started. I guess you could make subinterpreters reference counted? There is no API for immediately deallocating a Python object; why should there be one for deallocating a subinterpreter?

(Or just don't have subinterpreters, that also solves the problem ;-).)

Don't optimize for the rare instant application termination desire in this case, just allow the "hang" to happen. If this turns out annoying to users, they can use strace or ltrace or their platform equivalents to see what's up and point the finger at "it's always DNS (or really the abstraction one layer up known as getaddrinfo logic)". If you need a program to exit immediately without waiting for any cleanup, that's what os._exit() is for. And at the command line that's what we're all trained to hit Ctrl-\ for when Ctrl-C doesn't satisfy our patience level.

These are good excuses if we want to make excuses, but... right now things just work? Why would we choose to de-optimize and force people to run strace or call _exit explicitly when we could just... not?

(Also FWIW I've been using Linux daily for almost 25 years, and I'm only vaguely aware of the existence of Ctrl-\ :-). I think you might overestimate how widespread that knowledge is...)

njsmith avatar Jul 04 '21 20:07 njsmith