gh-118727: Don't drop the GIL in `drop_gil()` unless the current thread holds it
drop_gil() assumes that its caller is attached, which means that the current thread holds the GIL if and only if the GIL is enabled, and the enabled-state of the GIL won't change. This isn't true, though, because detach_thread() calls _PyEval_ReleaseLock() after detaching and _PyThreadState_DeleteCurrent() calls it after removing the current thread from consideration for stop-the-world requests (effectively detaching it).
Fix this by remembering whether or not a thread acquired the GIL when it last attached, in PyThreadState._status.holds_gil, and check this in drop_gil() instead of gil->enabled. As part of this, add an explicit thread_dying argument to drop_gil(), separating that concept from whether or not it's safe to dereference tstate.
This fixes a crash in test_multiprocessing_pool_circular_import(), so I've reenabled it.
- Issue: gh-118727
I realized while writing up an explanation of my changes that the commit I just pushed isn't quite right. I changed drop_gil() to always take a non-NULL tstate by adding an explicit thread_dying argument. But in this case, tstate can be a dangling pointer so it's actually important to not dereference it. I'll come up with a different approach.
This should be ready now - I updated the description to reflect the changes.
ASAN failure looks like I need to move the write to holds_gil after the _PyThreadState_MustExit() check.
The ASAN issue should be fixed. I also remembered that the only reason I split out drop_gil_impl() was so that I could call the meat of drop_gil() after the GIL had been disabled (looking back, that was obviously a sign that something wasn't right...). Since drop_gil() no longer cares if the GIL is enabled, I folded it back in, and _PyEval_DisableGIL() can now call drop_gil() normally.
(moved from the issue where I accidentally posted this)
test_importlib got stuck in the tsan build again, which this should've fixed. I'll see what I can figure out.
Does applying @suppress_immortalization() make a difference. I'm not suggesting that as a solution (there's probably a deeper issue to solve), but whether or not the decorator makes a difference might be helpful to know.
@swtaarrs - I observed a hang when running pool_in_threads.py locally with this PR. (It may or may not be the same as the TSan hang).
I think there's an issue with FORCE_SWITCHING and the dynamic enabling of the GIL:
- A thread disables (and releases) the GIL via
_PyEval_DisableGIL. If_PY_GIL_DROP_REQUEST_BITis set, it waits in theFORCE_SWITCHINGsection until another thread acquires the GIL - A second thread acquires the GIL. However, it sees that the GIL is now disabled and releases the GIL without notifying the first thread
I think there's an issue with
FORCE_SWITCHINGand the dynamic enabling of the GIL
Thanks, good catch. I should probably pull drop_gil_impl() back out for use by _PyEval_DisableGIL() then. The FORCE_SWITCHING code doesn't apply once the GIL is disabled, and nothing else in drop_gil() applies to _PyEval_DisableGIL() either. I'll also take another look at everything at the end of take_gil() that's skipped in this case.
The commit I just pushed should address the FORCE_SWITCHING issue by reverting my drop_gil_impl() change. I also added a line to clear _PY_GIL_DROP_REQUEST_BIT when disabling the GIL, so that thread doesn't keep hitting its eval_breaker once it resumes executing.
I'm running the tsan tests in a loop locally to see if it hangs again.
@ericsnowcurrently - would you like to re-review this or should I go ahead and merge it?
I'll take a quick look tomorrow.
@ericsnowcurrently will you have time to take a look today?
Sorry for the delay. I don't have any objections. If you both are confident about the solution then I'm on board. 😄
Great! As a heads up, there is a different bug that affects the same test case (https://github.com/python/cpython/issues/119369).
Thanks @swtaarrs for the PR, and @colesbury for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13. 🐍🍒⛏🤖
GH-119474 is a backport of this pull request to the 3.13 branch.