mod_wsgi icon indicating copy to clipboard operation
mod_wsgi copied to clipboard

FreeIPA + Python 3.9.8 causes deadlock

Open frenzymadness opened this issue 3 years ago • 33 comments

Hello there.

This issue probably does not belong here and I'm sorry for that but it's interesting and I'm running out of my debugging skills and you might be able to help me with that - and - I think that you experts love challenges.

I was describing my progress in https://bodhi.fedoraproject.org/updates/FEDORA-2021-e152ce5f31 but I'll try to summarize it here.

Short description: Combination of FreeIPA, httpd, mod_wsgi, and Python causes a deadlock in the threading module when systemd tries to stop httpd service - the process of stopping the httpd processes takes more than 90 seconds and all the processes are then killed.

The problem has been found by OpenQA tests in Fedora 34 when we tried to update Python from 3.9.7 to 3.9.8. I did a bisection of commits between these releases of Python and found out that the problem is caused/uncovered by this change: https://github.com/python/cpython/commit/94d19f606fa18a1c4d2faca1caf2f470a8ce6d46 I've discussed it with @vstinner and we think that the code in Python is correct - still, without this commit, the problem is not there but the code itself does look correct. There has to be something else.

Components:

  • Fedora 34
  • httpd 2.4.51 (latest version) installed from RPM package on Fedora 34
  • FreeIPA installed from RPM package the same way as it is in the tests, description here: https://openqa.fedoraproject.org/tests/1059611/modules/role_deploy_domain_controller/steps/1/src
  • mod_wsgi - installed from RPM - for testing updated to the latest version so it's newer that the version currently available in Fedora 34. I believed that some changes in the mod_wsgi might fix the issue but they don't.
  • Python - something between 3.9.7 and 3.9.8 - the reason is that I've found the commit "causing" the problem so I switch the versions.

How to reproduce it:

  1. Install FreeIPA with Python 3.9.8
  2. apachectl start
  3. Refresh web UI FreeIPA page a few times
  4. time apachectl stop

Result is:

# time apachectl stop

real	1m30.366s
user	0m0.023s
sys	0m0.031s

The log is available here: https://lbalhar.fedorapeople.org/error_log and it contains all the messages produced by httpd from start to end as described in the reproducer + some debug prints containing the file name where they come from. Note that there is no jump in the timestamps because httpd does not provide any more messages before it's killed. But the jump is in the systemd logs:

Nov 16 22:22:04 test.openqa.fedoraproject.org httpd[5648]: Server configured, listening on: port 443, port 80
Nov 16 22:22:18 test.openqa.fedoraproject.org systemd[1]: Stopping The Apache HTTP Server...
░░ Subject: A stop job for unit httpd.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A stop job for unit httpd.service has begun execution.
░░ 
░░ The job identifier is 4975.
Nov 16 22:23:48 test.openqa.fedoraproject.org systemd[1]: httpd.service: State 'stop-sigterm' timed out. Killing.
Nov 16 22:23:48 test.openqa.fedoraproject.org systemd[1]: httpd.service: Killing process 5648 (httpd) with signal SIGKILL.
Nov 16 22:23:48 test.openqa.fedoraproject.org systemd[1]: httpd.service: Killing process 5651 (httpd) with signal SIGKILL.

From the error_log, you can see that there are multiple similar processes (5659, 5660, …) but only one of them (5660) served the requests from the web UI and the same process caused the deadlock. Only this one process reached the new branch (line 1457 https://github.com/python/cpython/blob/94d19f606fa18a1c4d2faca1caf2f470a8ce6d46/Lib/threading.py#L1457) and never reached the end of the _shutdown function.

I've also added faulthandler.dump_traceback_later(20, file=tempfile.mkstemp(dir="/var/log/httpd/")[0], repeat=True) to the _shutdown function and it produces only one file with some content and the content is:

Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown

Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>

Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown

Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>

Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown

Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>

Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>
Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown

Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>

Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>

So the one process deadlocks on lock.acquire(), line: https://github.com/python/cpython/blob/94d19f606fa18a1c4d2faca1caf2f470a8ce6d46/Lib/threading.py#L1470

From what I can see in the code of the threading module and the _shutdown function it seems that the process serving the requests from web UI is the only one where the _shutdown function is not called by the main thread. We have tried a blind fix - import threading as soon as possible (in site.py) and something similar is implemented here but it did not help.

I've tried to overload _thread.start_new_thread in threading.py but it produces no results so I think that threads are created in a different way here.

I've tried to use gdb to see what is happening in the locked process and there were two threads but one of them was caused by the faulthandler - when removed, there is a single thread with this stack:

#0  0x00007ff4f0f26a8a in __futex_abstimed_wait_common64 () from target:/lib64/libpthread.so.0
#1  0x00007ff4f0f22c98 in __new_sem_wait_slow64.constprop.0 () from target:/lib64/libpthread.so.0
#2  0x00007ff4efd2242c in PyThread_acquire_lock_timed () from target:/lib64/libpython3.9.so.1.0
#3  0x00007ff4efdae50c in acquire_timed () from target:/lib64/libpython3.9.so.1.0
#4  0x00007ff4efeabdab in lock_PyThread_acquire_lock () from target:/lib64/libpython3.9.so.1.0
#5  0x00007ff4efdb7a54 in method_vectorcall_VARARGS_KEYWORDS () from target:/lib64/libpython3.9.so.1.0
#6  0x00007ff4efce0231 in _PyEval_EvalFrameDefault () from target:/lib64/libpython3.9.so.1.0
#7  0x00007ff4efcd8e93 in function_code_fastcall () from target:/lib64/libpython3.9.so.1.0
#8  0x00007ff4f00472bb in ShutdownInterpreter_call () from target:/etc/httpd/modules/mod_wsgi_python3.so
#9  0x00007ff4efd2d336 in _PyObject_MakeTpCall () from target:/lib64/libpython3.9.so.1.0
#10 0x00007ff4efe03f76 in PyObject_VectorcallMethod () from target:/lib64/libpython3.9.so.1.0
#11 0x00007ff4efe4dd62 in wait_for_thread_shutdown () from target:/lib64/libpython3.9.so.1.0
#12 0x00007ff4efe7fb5f in Py_EndInterpreter () from target:/lib64/libpython3.9.so.1.0
#13 0x00007ff4f0045f2d in Interpreter_dealloc () from target:/etc/httpd/modules/mod_wsgi_python3.so
#14 0x00007ff4efcf0167 in free_keys_object () from target:/lib64/libpython3.9.so.1.0
#15 0x00007ff4f004e88c in wsgi_python_child_cleanup () from target:/etc/httpd/modules/mod_wsgi_python3.so
#16 0x00007ff4f0f5bcae in apr_pool_destroy () from target:/lib64/libapr-1.so.0
#17 0x00007ff4f052a52f in clean_child_exit () from target:/etc/httpd/modules/mod_mpm_event.so
#18 0x00007ff4f052c96a in child_main () from target:/etc/httpd/modules/mod_mpm_event.so
#19 0x00007ff4f052d29f in make_child () from target:/etc/httpd/modules/mod_mpm_event.so
#20 0x00007ff4f052d341 in startup_children () from target:/etc/httpd/modules/mod_mpm_event.so
#21 0x00007ff4f052e14f in event_run () from target:/etc/httpd/modules/mod_mpm_event.so
#22 0x000056514d371428 in ap_run_mpm ()
#23 0x000056514d35f484 in main ()

Any help is much appreciated. Cc @tiran @encukou

frenzymadness avatar Nov 16 '21 23:11 frenzymadness

For a start, don't use Python sub interpreters. I see the call to Py_EndInterpreter() which indicates you are. Sub interpreters can have issues with various third party modules which use C extensions which aren't implemented correctly so as to work in sub interpreters. Use of sub interpreters should thus always be avoided if possible. It looks like you might not also be using daemon mode, since Apache should kill off daemon processes forcibly after 5 seconds when being shutdown.

That said, what is the mod_wsgi configuration for Apache being used. Ideally you would be using mod_wsgi daemon mode, where that WSGI application is the only application running in that daemon process group, and use of the main interpreter context in that daemon process group is forced by setting the application group to %{GLOBAL}.

Also, does the WSGI application create its own background threads or use the atexit module to register cleanup actions, such as stopping background threads on process shutdown. Use of atexit is not recommended and no longer works in certain contexts under mod_wsgi in recent Python versions. To work around that mod_wsgi now has its own notification system for letting an application know it is being shutdown so it can stop background threads and avoid thread problems on shutdown.

Note that in future because of changes in how Python supports the use of sub interpreters, there is a chance that use of sub interpreters by mod_wsgi may have to be dropped, thus ensuring you are using a separate daemon process group and forcing the use of the main interpreter context is preferred as that is all that may be available in the future.

Anyway, will look in more detail when get a chance.

GrahamDumpleton avatar Nov 16 '21 23:11 GrahamDumpleton

To make one thing clear which may be relevant. All request handler threads in mod_wsgi are external threads, they are not Python threads, thus the Python interpreter would never be in a position of having to stop them. That the code is therefore ending up in Thread._wait_for_tstate_lock() which is called from Thread._stop() indicates that there must have been threads created by the WSGI application otherwise there would be no Python threads to stop.

My assumption therefore is that the problem is that the WSGI application is creating a background thread (possibly even forcibly marking it as non daemon, although changes to sub interpreters may mean all Python threads in sub interpreters are now created as non daemon regardless) but not doing anything to try and stop it when the process is being signalled to shutdown, or it tries to use atexit to stop it but due to potential issues with it, it never gets told to stop, although that should only be case if main interpreter is used, which you aren't. End result is that background thread is still running but interpreter shutdown blocks on waiting for it to stop, but the thread never does.

If you are able to somehow instrument things I would be dumping out a list of Python native threads still running, with stack traces if possible, so can work out where they originate. Dump out whether they are tagged as daemon or non daemon threads.

To stop the threads you can try and use atexit but that doesn't work in main interpreter context, as am suggesting you should use, as is called too late and I can't in mod_wsgi simulate what it does for main interpreter so shutdown earlier. In sub interpreters atexit isn't even triggered (don't think that has changed in new Python versions), but mod_wsgi does trigger atexit cleanup itself and is able to do that at a time where can actually be used to stop background threads.

Rather than try and rely on atexit which can't always work, it is better to use:

import mod_wsgi

def shutdown_background_threads(*args):
    ... do stuff to shutdown any application specific background threads

mod_wsgi.subscribe_shutdown(shutdown_background_threads)

Notes about mod_wsgi.subscribe_shutdown() can be found in:

  • https://modwsgi.readthedocs.io/en/master/release-notes/version-4.8.0.html#new-features

There is a chance that this has worked previously because all background threads would have been created as daemon threads and the interpreter shutdown wouldn't wait on them. I recollect some discussion about sub interpreters which meant that any background threads in sub interpreters would be forced to be non daemon threads even if explicitly set to be daemon. If this proposed change as I recollect it was made in Python 3.9 that would explain the problem as the sub interpreter would now wait on all background threads created in Python and block sub interpreter deletion. Thus it is now imperative that a WSGI application if it creates background threads, to subscribe to process shutdown and do whatever it can to ensure the background thread stops.

Part of the reason for expecting that sub interpreters in Python would become useless to mod_wsgi in the future was bound up in this change to make background threads in sub interpreters always non daemon and force sub interpreter to wait on them when deleted. I didn't realise this change ended up getting made if it did. One expected impact of the change was process hanging on process shutdown when sub interpreters were used and background threads were created as WSGI applications don't usually stop daemon threads, but now they would have to be changed to do so.

GrahamDumpleton avatar Nov 17 '21 00:11 GrahamDumpleton

Thanks a lot for the quick response. I have to analyze it because I haven't worked with sub interpreters yet. In the meantime, here is the source code of FreeIPA and its wsgi configuration: https://pagure.io/freeipa/blob/master/f/install/share/ipa.conf.template

frenzymadness avatar Nov 17 '21 00:11 frenzymadness

That configuration looks buggy to me. You have set:

WSGIProcessGroup ipa
WSGIApplicationGroup ipa

under:

<Location "/ipa/session/login_x509">

This means that for that one URL path you are running a separate instance of IPA.

That is, in the one daemon process group process you have two IPA instances running, one in the main interpreter context for the bulk of requests, and one just for handling requests sent to "/ipa/session/login_x509".

Not knowing if that was intended I would say you should not have:

WSGIProcessGroup ipa
WSGIApplicationGroup ipa

under that Location directive and would just rely on single instance in daemon process, using main interpreter context, as set by:

# Configure mod_wsgi handler for /ipa
WSGIDaemonProcess ipa processes=$WSGI_PROCESSES threads=1 maximum-requests=500 \
   user=ipaapi group=ipaapi display-name=%{GROUP} socket-timeout=2147483647 \
   lang=C.UTF-8 locale=C.UTF-8
 WSGIScriptAlias /ipa /usr/share/ipa/wsgi.py process-group=ipa \
   application-group=%{GLOBAL}

What doesn't make sense is that since daemon is otherwise being used for both instances, then the daemon process group processes shouldn't hang around for anything more than 5 seconds since the way Apache manages them (not under mod_wsgi control), it should forcibly kill off the processes if they don't exit themselves within 5 seconds.

Where the embedded mode instances are coming from may be because you are using:

AddHandler wsgi-script .py

In the Apache logs there is:

[Mon Nov 15 09:51:38.203979 2021] [wsgi:info] [pid 8379:tid 8379] mod_wsgi (pid=8379): Destroy interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.
[Mon Nov 15 09:51:38.204094 2021] [wsgi:info] [pid 8379:tid 8379] mod_wsgi (pid=8379): End interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.

Thus /ipa/wsgi/plugins.py is a distinct WSGI script and this is going to run in embedded mode and a sub interpreter.

So summarising issues as I see it.

  • You have two instances of IPA running in daemon process group, one in main interpreter context and one in ipa sub interpreter context. Not sure if this is intentional.

  • You have additional WSGI scripts mapped using AddHandler wsgi-script .py which are running in embedded mode and in a sub interpreter. It looks like one of those is creating a Python background thread, but not subscribing to process shutdown to somehow flag to the background thread that it should stop. The result is that if Python 3.9 is indeed now marking all background threads in sub interpreters as non daemon (foreground threads), they will block deletion of the Python sub interpreter and because that WSGI script is being handled in embedded mode, it doesn't get the benefit of the forced 5 second shutdown of daemon mode instances of WSGI applications.

My recommendations therefore are:

  • If you don't need two separate IPA instances (one for special URL path of /ipa/session/login_x509) then delete the WSGIProcessGroup and WSGIApplicationGroup under that Location directive.

  • If the standalone WSGI scripts mapped using AddHandler wsgi-script .py can run in sub interpreters of daemon process group used by IPA, then force them to run there by setting a default of:

    WSGIProcessGroup ipa
    

    outside of all Location and Directory directives after WSGIScriptAlias. Note, leave process-group on WSGIScriptAlias as setting both process-group and application-group on WSGIScriptAlias has a side effect of preloading the WSGI script on process start, which is why that has probably been done.

    By forcing the standalone WSGI scripts into the daemon process group, the process should get killed after 5 seconds even though thread shutdown hangs on deleting the sub interpreters.

  • Add code to the standalone WSGI scripts to register for mod_wsgi process shutdown event and flag any background thread to stop.

  • If the standalone WSGI scripts can't run in sub interpreter of same process as IPA instances, create a new daemon process group and delegate them to run in that instead.

GrahamDumpleton avatar Nov 17 '21 01:11 GrahamDumpleton

The relevant Python issue about changes to daemon threads in sub interpreters is:

  • https://bugs.python.org/issue40234

The outcome of that suggests that default behaviour would be the same and daemon threads are allowed.

If that is the case, then could only end up with blocking being seen if the Python thread was explicitly created as a non daemon thread, unless something is trying to shutdown daemon threads anyway, which is quite possible if ThreadPoolExecutor is attempting to do that now when sub interpreter shutdown. Either way, if the thread is not designed to shutdown and is stuck doing something, the whole sub interpreter deletion would hang.

GrahamDumpleton avatar Nov 17 '21 02:11 GrahamDumpleton

In sub interpreters atexit isn't even triggered (don't think that has changed in new Python versions)

Py_EndInterpreter() calls atexit callbacks since Python 3.8. The atexit state is now per-interpreter since Python 3.10.

vstinner avatar Nov 17 '21 12:11 vstinner

In the logs, I see 3 Python threads:

  • Thread A calls threading._shutdown()
  • Thread B: <no Python frame>
  • Thread C: <no Python frame>

It's strange to have two Python threads with no frame. What are these threads? Are they still running?

Timeout (0:00:20)!
Thread 0x00007f8aefae58c0 (most recent call first):
  File "/usr/lib64/python3.9/threading.py", line 1482 in _shutdown

Thread 0x00007f8ac67f4640 (most recent call first):
<no Python frame>

Thread 0x00007f8ac47f0640 (most recent call first):
<no Python frame>

vstinner avatar Nov 17 '21 12:11 vstinner

To stop the threads you can try and use atexit

atexit callbacks are called after threading._shutdown() in Py_Finalize() and Py_EndInterpreter().

vstinner avatar Nov 17 '21 12:11 vstinner

Oh, this is a fun bug. I'll apply Graham's suggestions and see if they help to address the issue.

tiran avatar Nov 18 '21 16:11 tiran

Oh, this is a fun bug. I'll apply Graham's suggestions and see if they help to address the issue.

Great, thanks. If you send me a PR or something I'll happily test it for you.

frenzymadness avatar Nov 18 '21 20:11 frenzymadness

I had a very interesting debug session with @frenzymadness who is able to reproduce the FreeIPA hang at exit in Apache mod_wsgi. I used gdb on Python 3.9. Random notes:

  • it happens in Py_EndInterpreter() : it's a subinterpreter

  • At the C level, the process has 2 threads:

    • (1) a Python thread calling threading._shutdown()
    • (2) the faulthandler C thread
  • interp.tstate_head list contains 3 Python threads:

    • thread (A) is the list head, is calling threading._shutdown(), thread. If I recall correctly, it was not tracked by the threading
    • thread (B) has on_delete=NULL and frame=NULL: it looks like a zombie thread, on_delete=NULL means that it's not tracked by the threading module
    • thread (C) has on_delete=<release_sentinel> and frame=NULL: on_delete=<release_sentinel> means that the thread is tracked the threading module
  • In short => the threading module is only tracking a single thread, and it's not the current thread.

  • When the bug triggers, if _main_thread.ident == get_ident(): is false in threading.py.

  • PyInterpreterState: interp.num_threads = 0: it means that the only the main thread is running, other Python threads already completed.

vstinner avatar Nov 19 '21 12:11 vstinner

I added debug traces in Python itself in functions like Py_NewInterpreter(), Py_EndInterpreter() and new_threadstate(). I modified Python to create tstate identifiers which are globally unique, to ease debugging, rather than being per-intepreter.

Simplified code:

  • mod_wsgi calls Py_NewInterpreter() which creates tstate 3

  • import threading is called in tstate 3, so threading._main_thread is tstate 3

  • new_threadstate() is called: tstate 5 is created

  • Py_EndInterpreter() is called on tstate 5

    • Py_EndInterpreter() calls threading._shutdown() which hangs
    • tstate 3 still exists, it may explain the issue
    • threading._main_thread is not the current thread, so _main_thread._stop() is not called in Python 3.9.8 because of the https://github.com/python/cpython/commit/38c67738c64304928c68d5c2bd78bbb01d979b94 change

In a Python sub-interpreter, it's really important that all tstate (Python thread state) are deleted by Py_EndInterpreter(), otherwise the function fails with a fatal error:

void
Py_EndInterpreter(PyThreadState *tstate)
{
    ...

    wait_for_thread_shutdown(tstate);

    _PyAtExit_Call(tstate->interp);

    if (tstate != interp->tstate_head || tstate->next != NULL) {
        Py_FatalError("not the last thread");
    }

    ...
}

Apache httpd mod_wsgi (simplified) logs of the sub-interpreter:

(1) Create a sub-interpreter, create tstate 3, import threading module in tstate 3

[pid 42790:tid 42906] ssl_engine_kernel.c(415): [client ::1:44912] AH02034: Subsequent (No.5) HTTPS request received for child 81 (server test.openqa.fedoraproject.org:443), referer: https://test.openqa.fedoraproject.org/ipa/ui/
[pid 42790] new_threadstate() -> interp=0x562f6d0fbcc0 tstate=0x562f6d0950e0 (tstate->id=3)
[pid 42790] Py_NewInterpreter() -> interp=0x562f6d0fbcc0 tstate=0x562f6d0950e0 (tstate->id=3)
[pid 42790:tid 42906] mod_wsgi (pid=42790): Create interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.
[pid 42790] threading.py: threading.py imported: _main_thread=<_MainThread(MainThread, started 140660598314560)>

(2) Create tstate 5, call Py_EndInterpreter() on tstate 5: tstate 3 still exists!

[pid 42790:tid 42790] event.c(569): wake up listener
[pid 42790:tid 42790] h2_workers.c(326): h2_workers: cleanup 2 workers idling
[pid 42790:tid 42790] h2_workers.c(356): h2_workers: cleanup all workers terminated
[pid 42790:tid 42790] h2_workers.c(359): h2_workers: cleanup zombie workers joined

[pid 42790] new_threadstate() -> interp=0x562f6d0fbcc0 tstate=0x562f6d0a7100 (tstate->id=5)
[pid 42790:tid 42790] mod_wsgi (pid=42790): Destroying interpreters.
[pid 42790:tid 42790] mod_wsgi (pid=42790): Destroy interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.
[pid 42790:tid 42790] mod_wsgi (pid=42790): End interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.
[pid 42790] Py_EndInterpreter(0x562f6d0a7100): interp=0x562f6d0fbcc0

(3) Py_EndInterprer() calls threading._shutdown(): _main_thread.ident is not equal to get_ident()

[pid 42790:tid 42790] [pid 42790] threading.py: threading.py - _shutdown function called
[pid 42790:tid 42790] [pid 42790] threading.py: _main_thread.ident == get_ident()? False
[pid 42790:tid 42790] [pid 42790] threading.py: lock.acquire() call to mimic Thread.join()

vstinner avatar Nov 19 '21 14:11 vstinner

Py_EndInterpreter() is called on tstate 5 Py_EndInterpreter() calls threading._shutdown() which hangs tstate 3 still exists, it may explain the issue

If I revert the threading._shutdown() change (https://github.com/python/cpython/commit/38c67738c64304928c68d5c2bd78bbb01d979b94), threading._shutdown() no longer hangs, and I see that tstate 3 and tstate 5 are deleted as expected by Py_EndInterpreter():

(1) Py_EndInterpreter() is called

[pid 43361:tid 43361] mod_wsgi (pid=43361): End interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.
[pid 43361] Py_EndInterpreter(0x55e3aabfb100): interp=0x7fdb84001420

[pid 43361:tid 43361] [pid 43361] threading.py: threading.py - _shutdown function called
[pid 43361:tid 43361] [pid 43361] threading.py: _main_thread.ident == get_ident()? False
[pid 43361:tid 43361] [pid 43361] threading.py: CASE3: _main_thread._tstate_lock is locked (regular code path)
[pid 43361:tid 43361] [pid 43361] threading.py: before _shutdown_locks.clear() call
[pid 43361:tid 43361] [pid 43361] threading.py: exit _shutdown()

[pid 43361] tstate_delete_common(0x7fdb84002a40) (tstate->id=3)
[pid 43361] tstate_delete_common(0x55e3aabfb100) (tstate->id=5)
[pid 43361:tid 43361] mod_wsgi (pid=43361): Cleanup interpreter ''.

(1) Py_Finalized() is called (I guess)

[pid 43361:tid 43361] mod_wsgi (pid=43361): Terminating Python.
[pid 43361:tid 43361] [pid 43361] threading.py: threading.py - _shutdown function called
[pid 43361:tid 43361] [pid 43361] threading.py: _main_thread.ident == get_ident()? True
[pid 43361:tid 43361] [pid 43361] threading.py: CASE3: _main_thread._tstate_lock is locked (regular code path)
[pid 43361:tid 43361] [pid 43361] threading.py: before _shutdown_locks.clear() call
[pid 43361:tid 43361] [pid 43361] threading.py: exit _shutdown()

[pid 43361] _PyThreadState_DeleteExcept(0x55e3aaabb920) (tstate->id=1)
[pid 43361] tstate_delete_common(0x55e3aaabb920) (tstate->id=1)
[pid 43361:tid 43361] mod_wsgi (pid=43361): Python has shutdown.

vstinner avatar Nov 19 '21 14:11 vstinner

Logs with mod_wsgi debug traces: WSGIVerboseDebugging On.

  • Py_NewInterpreter() is called in Apache thread 1 with Python tstate 3 (0x7fb35c004840)
  • Py_EndInterpreter() is called in Apache thread 0 with Python tstate 5 (0x561e2d20f100), while the Python tstate 3 still exists
(1) Py_NewInterpreter() called by thread 44535: create tstate 3 (0x7fb35c004840) binded to Apache thread 1

[pid 44364] new_threadstate() -> interp=0x7fb35c003220 tstate=0x7fb35c004840 (tstate->id=3)
[pid 44364] Py_NewInterpreter() -> interp=0x7fb35c003220 tstate=0x7fb35c004840 (tstate->id=3)
[pid 44364:tid 44535] mod_wsgi (pid=44364): Create interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.
[pid 44364:tid 44535] src/server/wsgi_interp.c(1549): mod_wsgi (pid=44364): Bind thread state for thread 1 against interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.

(1) Py_EndInterpreter() called by thread 44364: create tstate 5 (0x561e2d20f100) binded to Apache thread 0

[pid 44364] new_threadstate() -> interp=0x7fb35c003220 tstate=0x561e2d20f100 (tstate->id=5)
[pid 44364:tid 44364] src/server/wsgi_interp.c(2620): mod_wsgi (pid=44364): Create thread state for thread 0 against interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.
[pid 44364:tid 44364] mod_wsgi (pid=44364): Destroying interpreters.
[pid 44364:tid 44364] mod_wsgi (pid=44364): Destroy interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.
[pid 44364:tid 44364] mod_wsgi (pid=44364): End interpreter 'test.openqa.fedoraproject.org|/ipa/wsgi/plugins.py'.
[pid 44364] Py_EndInterpreter(0x561e2d20f100): interp=0x7fb35c003220
[pid 44364:tid 44364] mod_wsgi (pid=44364): Cleanup interpreter ''. 

[pid 44364:tid 44364] mod_wsgi (pid=44364): Terminating Python.
[pid 44364:tid 44364] mod_wsgi (pid=44364): Python has shutdown.

vstinner avatar Nov 19 '21 15:11 vstinner

The threading._shutdown() now waits until a Python thread completes at the C level: when the thread_bootstrap() function (previously known as t_bootstrap()) completes. It changed to fix a race condition, see:

  • https://github.com/python/cpython/commit/468e5fec8a2f534f1685d59da3ca4fad425c38dd
  • https://bugs.python.org/issue36402

This change was made in 2019. Recently (Python 3.9.8), threading._shutdown() became stricter:

  • https://github.com/python/cpython/commit/38c67738c64304928c68d5c2bd78bbb01d979b94
  • https://bugs.python.org/issue1596321

Maybe FreeIPA still relies on the old Python threading behavior: consider that a thread completes when it completed at the Python level.

When using subinterpreters, it's really important that a thread completed at the C level: the new Python threading behavior. Otherwise, Py_EndInterpreter() fails with a Python fatal error, as I explained previously.

It's important to delete most threads before threading._shutdown() is called, otherwise threading._shutdown() can hang even if the thread is considered as "completed" at the Python level. The "tstate" must be deleted.


It seems like mod_wsgi now has a mechanism to delete all Python tstate by registering itself in atexit: the C function ShutdownInterpreter_call() deletes all tstate with:

        /* Delete remaining thread states. */

        PyThreadState_Swap(NULL);

        tstate = PyInterpreterState_ThreadHead(tstate->interp);

        while (tstate) {
            tstate_next = PyThreadState_Next(tstate);
            if (tstate != tstate_save) {
                PyThreadState_Swap(tstate);
                PyThreadState_Clear(tstate);
                PyThreadState_Swap(NULL);
                PyThreadState_Delete(tstate);
            }
            tstate = tstate_next;
        }
        tstate = tstate_save;

        PyThreadState_Swap(tstate);

Maybe it's too late: atexit callbacks are only called after threading._shutdown().

vstinner avatar Nov 19 '21 15:11 vstinner

For sub interpreters, mod_wsgi triggers atexit() callbacks itself earlier before interpreter deletion. But then, how atexit callbacks work has changed many times over the years between Python versions, so how I do that may not be working anymore and a new hack is required to workaround Python C APIs not providing a way for embedded systems to do it when they need to.

GrahamDumpleton avatar Nov 19 '21 23:11 GrahamDumpleton

So looks like the hack was to replace threading._shutdown(). See:

https://github.com/GrahamDumpleton/mod_wsgi/blob/develop/src/server/wsgi_interp.c#L519 https://github.com/GrahamDumpleton/mod_wsgi/blob/develop/src/server/wsgi_interp.c#L211

But the order was that I would only call atexit() after calling original function I was wrapping. That seems to now be the source of the problems if threading._shutdown() has changed.

Probably take me a few hours to get my head around this new mess. :-(

GrahamDumpleton avatar Nov 19 '21 23:11 GrahamDumpleton

I think the only solution to all of this is for mod_wsgi to not bother trying to delete sub interpreters or shutdown the main Python interpreter. Instead mod_wsgi would just call C exit() and bypass all that. For cleanup actions on process shutdown people would have to use the mod_wsgi mechanism for getting notification of process shutdown instead of atexit().

FWIW, from memory uWSGI never tried to stop the Python interpreter properly and always just called C exit(), thus you were always forced in uWSGI to use its own process shutdown notification, where as mod_wsgi tried to rely on atexit(), even though that meant in older Python versions that for sub interpreters mod_wsgi had to trigger atexit() callbacks itself since sub interpreters didn't do that.

As things progress it just appears that it is going to be too hard to combine external C threads with sub interpreters and sub interpreter deletion because of how the use case for sub interpreters in CPython is changing.

GrahamDumpleton avatar Nov 20 '21 01:11 GrahamDumpleton

Latest develop version of mod_wsgi in GitHub no longer attempts to destroy Python sub interpreters or main Python interpreter on shutdown and will instead just let the process exit without attempting any cleanup.

pip install --no-cache-dir -U https://github.com/GrahamDumpleton/mod_wsgi/archive/refs/heads/develop.zip

Details in:

  • https://github.com/GrahamDumpleton/mod_wsgi/blob/develop/docs/release-notes/version-4.9.1.rst

I can't see any way of being able to safely delete Python thread state objects before destroying the Python sub interpreters if that is what is now required as can't absolutely guarantee the threads using those thread state objects aren't still running. This is the case whether they be externally create C threads used by mod_wsgi to handle requests, or background daemon threads created by a WSGI application.

Yes this meant that technically if such a thread became running when the interpreters were being destroyed that the process could crash, but this was deemed acceptable compared to having the process hang as could now occur.

GrahamDumpleton avatar Nov 20 '21 23:11 GrahamDumpleton

Hi @GrahamDumpleton, thanks for looking into the matter!

Is it safe to omit the interpreter shutdown? As far as I understand the system, Python also runs finalizers (try/finally, __exit__, __del__, weakref hooks) during the interpreter shutdown phase. The change may cause issues like not flushing buffered writes to file.

By the way, there is an unofficial API to run all atexit hooks manually:

import atexit
import mod_wsgi

def exit_handler(event, **kwargs):
  atexit._run_exitfuncs()

mod_wsgi.subscribe_shutdown(exit_handler)

tiran avatar Nov 21 '21 18:11 tiran

I don't know whether has always existed as an issue with atexit._run_exitfuncs(), only older implementations, or maybe was only an issue when one had to use sys.exitfunc(), but at one point I recollect this didn't deregister callbacks once called, meaning it wasn't really safe to call more than one.

That said, I have been thinking whether in new scheme whereby skip destroying interpreter whether should still call this. Even when destroying sub interpreters I have been calling it since prior to 3.7 it wasn't called for sub interpreters by Python core. I think from 3.7 at least it would deregister the callbacks once called, so it being called more than once since 3.7, since I didn't stop calling it, hasn't caused any issues that know of. I am breaking the notional contract though that supposedly Python applies to it, which is only called after non daemon threads are stopped.

As to flushing buffers, C exit() will at least flush output buffers for C libraries, so would only be an issue where Python is buffering itself. For any other Python level constructs, nothing I can do.

Anyway, can't see any other viable options since I am not convinced I can safely cleanup the thread states for request handler threads before destroying the sub interpreter. Plus there could be daemon threads still running which I can't cleanup anyway and that will still result in things hanging if destroying the sub interpreter is going to hang if any thread states still exist as it seems is occurring. Maybe that restriction should only have been getting applied if the sub interpreter was created with that option which prohibits daemon threads.

GrahamDumpleton avatar Nov 21 '21 21:11 GrahamDumpleton

Some more issues to complicate this. From mod_wsgi code:

    /*
     * In Python 2.5.1 an exit function is no longer used to
     * shutdown and wait on non daemon threads which were created
     * from Python code. Instead, in Py_Main() it explicitly
     * calls 'threading._shutdown()'. Thus need to emulate this
     * behaviour for those versions.
     */

#if PY_MAJOR_VERSION < 3 || (PY_MAJOR_VERSION == 3 && PY_MINOR_VERSION < 4)

I think the version in the comment here is wrong, but then the conditional is also confusing as it suggests prior to Python 3.4, so shouldn't have been 3.5.1 unless the conditional was wrong and was supposed to be <= 4. Either way, this seems to indicate that Python 2.7 relies on atexit to wait on non daemon threads. So for Python 2.7 invoking atexit functions would cause waiting on the non daemon threads where it doesn't for supported Python 3.X versions.

And then:

    /*
     * Invoke exit functions by calling sys.exitfunc() for
     * Python 2.X and atexit._run_exitfuncs() for Python 3.X.
     * Note that in Python 3.X we can't call this on main Python
     * interpreter as for Python 3.X it doesn't deregister
     * functions as called, so have no choice but to rely on
     * Py_Finalize() to do it for the main interpreter. Now
     * that simplified GIL state API usage sorted out, this
     * should be okay.
     */

This confirms an earlier comment that after atexit registered callbacks were called they weren't deregistered for some Python 3.X versions, means that one could trigger them twice if called atexit prior to destroying the interpreter. So would need to know which versions that occurs for and see whether is only now for unsupported Python 3.X versions.

Have also found other things in mod_wsgi code which is confusing and can't remember the exact history of. Namely there are actually two places where thread states for sub interpreters are deleted.

One is immediately before destroying the sub interpreter.

    /* If we own it, we destroy it. */

    if (self->owner) {
        /*
         * We need to destroy all the thread state objects
         * associated with the interpreter. If there are
         * background threads that were created then this
         * may well cause them to crash the next time they
         * try to run. Only saving grace is that we are
         * trying to shutdown the process.
         */

#if PY_MAJOR_VERSION < 3 || (PY_MAJOR_VERSION == 3 && PY_MINOR_VERSION < 4)

but this is only done for Python 2.7, something changed during Python 3.X versions. The comment does warn though about concerns about process crashing as a result of doing this.

The other place where delete thread states is after had called threading._shutdown() and atexit.run_exitfuncs() when started doing that explicitly for Python sub interpreters.

Anyway, you can hopefully appreciate how messy things are with atexit and thread shutdown over time with different Python versions and why this have never given any joy in trying to handle it all. I would have to go through and look really close at what all supported Python versions do and see if is a sane way to handle it all, perhaps deleting some code which goes back to very early Python 2.X versions so can better see what is going on. Also, way back in time there was an attempt to actually delete Python sub interpreters and recreate them as a way of reloading an application without stopping the process. Sub interpreters proved to leak too many resources and many C extensions would break if this was done so disabled that feature, but the code was never refactored to eliminate the mess around modelling sub interpreters as referenced counted resources to managed deletion. Since don't need that, things could be ever further simplified, but is still a lot of rewriting.

GrahamDumpleton avatar Nov 22 '21 09:11 GrahamDumpleton

FWIW, these are all the Python versions mod_wsgi possibly still works with:

Python :: 2.6
Python :: 2.7
Python :: 3.3
Python :: 3.4
Python :: 3.5
Python :: 3.6
Python :: 3.7
Python :: 3.8
Python :: 3.9
Python :: 3.10

I don't know that want to drop Python 2.7 just yet, but could drop Python 2.6 and Python 3.3 through 3.5, presuming some of those old versions do actually still work. Not tested them, but code in mod_wsgi hasn't really changed much last few years where it would matter so they could actually still work.

GrahamDumpleton avatar Nov 22 '21 09:11 GrahamDumpleton

Reviewing code.

Python 2.7 atexit calling is pure Python.

  • https://github.com/python/cpython/blob/v2.7.18/Lib/atexit.py#L22

It pops handlers off the list when calling so they are thus removed.

Python 3.6 atexit calling is C code. In head of 3.6, it does clear atexit callbacks list once called.

  • https://github.com/python/cpython/blob/3.6/Modules/atexitmodule.c#L108

Going back further even Python 3.0 did it, so comment above about list not being cleared is possibly related to early dev or release candidate versions of Python 3.0. Maybe I raised the issue at the time and implementation was changed.

As to atexit being use to wait for threads to shutdown, that looks to have changed at some point in the lifetime of Python 2.7.

https://github.com/python/cpython/blob/v2.7.18/Python/pythonrun.c#L419-L430

This code says:

    wait_for_thread_shutdown();

    /* The interpreter is still entirely intact at this point, and the
     * exit funcs may be relying on that.  In particular, if some thread
     * or exit func is still waiting to do an import, the import machinery
     * expects Py_IsInitialized() to return true.  So don't say the
     * interpreter is uninitialized until after the exit funcs have run.
     * Note that Threading.py uses an exit func to do a join on all the
     * threads created thru it, so this also protects pending imports in
     * the threads created via Threading.
     */
    call_sys_exitfunc();

The comment is actually wrong and threading.py does not use atexit in head of Python 2.7, instead wait_for_thread_shutdown() does that just beforehand.

In fact the use of wait_for_thread_shutdown() was added in Python 2.6.5.

https://github.com/python/cpython/blob/v2.6.5/Python/pythonrun.c#L392

I can't remember the overlap between Python 2.6 and 3.X versions but possibly when this change was done in 3.X it was back ported to 2.6 for consistency. Means that is safe to assume that with versions of Python that will be used, that atexit isn't going to wait on threads to shutdown even if support Python 2.7. That makes things easier.

The problem is still that threading._shutdown() expects thread states to be deleted which still don't really understand how is enforced. It is strange that with this behaviour that the function is even public within threading module and not some hidden C function. Anyway, still can't attempt to call it.

Best could possibly do would be as follows.

  • Trigger mod_wsgi shutdown notification event and trigger registered handlers for it.
  • Call atexit handlers for threads where Python supports it.
  • Run sys.__run_exitfuncs().
  • Exit process without destroying interpreters.

GrahamDumpleton avatar Nov 22 '21 09:11 GrahamDumpleton

In contrast, summarising what was being done since have better understanding now, ignoring funny variations for old Python versions, we have:

  • Trigger mod_wsgi shutdown notification event and trigger registered handlers for it.
  • Run sys._run_exitfuncs() if is a sub interpreter (before non daemon thread shutdown as had no choice for Python <3.7 since atexit wasn't called for sub interpreters before 3.7).
  • Delete mod_wsgi thread states for external C threads used for request handlers (except the main thread).
  • Delete sub interpreters, which results in non daemon thread shutdown (but not daemon threads).
  • Delete main interpreters.

Which as we are seeing potentially results in the following.

  • If a daemon thread is not stopped using mod_wsgi shutdown handler or threading atexit callback then in 3.9+ process will hang on shutdown when threading._shutdown() is called.

That sys._run_exitfuncs() was only called for sub interpreters meant there was an inconsistency with behaviour compared to the Python main interpreter. For historical reasons (atexit callbacks were not originally deregistered when called in very early Python 3.0 alpha/beta versions) it wasn't done for that case, but was never revisited when Python 3.0 fixed issue.

What I may do at this point is try and throw out all the old crufty code which has accrued due to supporting so many variations of things in very old Python versions, thus simplifying the mod_wsgi code. I'll then try and verify again the expectation that problem is caused by daemon threads which aren't stopped. I'll have to look more closely at what is happening with the thread state for the main thread as well, since one would be created for that (in addition to request handler threads), when WSGI script preloading occurs.

GrahamDumpleton avatar Nov 22 '21 10:11 GrahamDumpleton

The Python finalization is a complex task. I took some notes about recent changes: https://pythondev.readthedocs.io/finalization.html

You're free to call os._exit() to skip any kind of cleanup: simple and fast. The risk is if an application relies on Python finalizer methods (__del__()) to flush some buffers on disk or to advertize somehow clients that the server is shutting down. It's hard to say in general what are applications expectations when a WSGI server restarts.

About threads which can hang at exit, there are different cases:

  • (A) Regular Python thread which hangs for whatever reason
  • (B) Python thread state which still exists in memory but doesn't run any code
  • (C) Daemon thread

For case (A), you can use a Python debugger, faulthandler, gdb, whatever you want to see what's going on.

mod_wsgi seems to be affected by case (B) where Python thread states are still attached to Apache threads at Python exit, even if these threads are idle (don't run any Python code). See my debug notes in this comment: https://github.com/GrahamDumpleton/mod_wsgi/issues/730#issuecomment-974155542 To debug that, you can iterate on interp->tstate_state.

For case (C), daemon threads, they are ignored by threading._shutdown() and the Python finalization is now safe for the main interpreter (*). For sub interpreters, using daemon threads is dangerous and should be avoided since threading._shutdown() don't wait for them, whereas Py_EndInterpreter() fails with a fatal error if daemon threads still exist at exit...

(*) About "safe": daemon threads still running at Python exit call pthread_exit() which can cause new problems in some specific cases (Python embedded in a C++ application): https://bugs.python.org/issue42969 ...

The Python finalization problem is not specific to sub-interpreters. If a Python thread never stops, case (A), a "regular Python" also hangs at exit. It's the expected behavior. By the way, there is no way to kill a thread on a portable way.

There is no silver bullet. What can help is to have better documentation and explain how to debug these issues.

vstinner avatar Nov 22 '21 16:11 vstinner

To make clear one point. Am not calling os._exit() and no intention to. That actually calls C _exit() which doesn't result in C level i/o buffers to be flushed and whatever else it doesn't do.

For mod_wsgi daemon mode I would end up calling C exit() explicitly as mod_wsgi controls process lifecycle. For embedded mode mod_wsgi doesn't control process lifecycle and as such can only skip interpreter deletion and return back to Apache module lifecycle management code, with Apache then calling C exit() at some point, or any other mechanism it may use, such as dropping out bottom of main().

GrahamDumpleton avatar Nov 22 '21 19:11 GrahamDumpleton

Can someone summarise the outcome on this given I see a change was made in CPython to revert some change made in it?

I need to determine if I revert completely the change I made in mod_wsgi develop branch so that still attempt to destroy the interpreters on process shutdown, or at the least make the option opposite and be "skip destruction" instead . In other words preserve old behaviour but allow one to opt into not destroying interpreters on process shutdown.

GrahamDumpleton avatar Jan 04 '22 00:01 GrahamDumpleton

Since have no idea what final outcome of this is, I reverted the change in mod_wsgi which would mean the interpreter is not deleted on process shutdown. The same behaviour as before is kept by default meaning it will attempt to delete the interpreter. I flipped the default for the WSGIDestroyInterpreter direct to default to On, so it can be turned off if set to Off. Details in:

  • https://modwsgi.readthedocs.io/en/master/release-notes/version-4.9.1.html#features-changed

GrahamDumpleton avatar May 02 '22 06:05 GrahamDumpleton

It seems the problem still exists with mod_wsgi 4.9.1 and Python 3.11.

What we've been doing in Fedora since this issue first appeared is reverting the upstream commit that we identified as causing this. We have that reversion in our Python 3.9 packages and our Python 3.10 packages, which were the current versions in all Fedora branches till recently.

A couple of days ago we landed Python 3.11 in Fedora Rawhide. Our Python 3.11 package doesn't have that reversion any more (I'm not sure why it's been dropped, will follow that up with our Python maintainers), and this issue has showed up again. So it's been there all along, really, and it bit us again as soon as we dropped the reversion.

AdamWill avatar Jun 23 '22 19:06 AdamWill

I'm not sure why it's been dropped, will follow that up with our Python maintainers

Because I assumed this has been solved in mod_wsgi 4.9.1.

hroncok avatar Jun 23 '22 23:06 hroncok

Nope. It was made an optional change in 4.9.1 because although I asked on this discussion above for an update, no one provided one so did not feel at all comfortable making the change permanent.

You can set:

WSGIDestroyInterpreter Off

if you want the behaviour of not attempting to destroy Python on process shutdown, with the understanding that this may cause problems. More details were in the release notes.

  • https://modwsgi.readthedocs.io/en/master/release-notes/version-4.9.1.html#features-changed

GrahamDumpleton avatar Jun 23 '22 23:06 GrahamDumpleton

I should add that without a proper explanation as I was seeking about the issues/changes in CPython, to me it looks like the change is a breaking change in CPython that will affect any embedded system making use of Python sub interpreters, if not any use of multithreading by way of external C threads.

GrahamDumpleton avatar Jun 23 '22 23:06 GrahamDumpleton