cpython icon indicating copy to clipboard operation
cpython copied to clipboard

gh-118727: Don't drop the GIL in `drop_gil()` unless the current thread holds it

Open swtaarrs opened this issue 1 year ago • 7 comments

drop_gil() assumes that its caller is attached, which means that the current thread holds the GIL if and only if the GIL is enabled, and the enabled-state of the GIL won't change. This isn't true, though, because detach_thread() calls _PyEval_ReleaseLock() after detaching and _PyThreadState_DeleteCurrent() calls it after removing the current thread from consideration for stop-the-world requests (effectively detaching it).

Fix this by remembering whether or not a thread acquired the GIL when it last attached, in PyThreadState._status.holds_gil, and check this in drop_gil() instead of gil->enabled. As part of this, add an explicit thread_dying argument to drop_gil(), separating that concept from whether or not it's safe to dereference tstate.

This fixes a crash in test_multiprocessing_pool_circular_import(), so I've reenabled it.

  • Issue: gh-118727

swtaarrs avatar May 07 '24 23:05 swtaarrs

I realized while writing up an explanation of my changes that the commit I just pushed isn't quite right. I changed drop_gil() to always take a non-NULL tstate by adding an explicit thread_dying argument. But in this case, tstate can be a dangling pointer so it's actually important to not dereference it. I'll come up with a different approach.

swtaarrs avatar May 08 '24 19:05 swtaarrs

This should be ready now - I updated the description to reflect the changes.

swtaarrs avatar May 08 '24 21:05 swtaarrs

ASAN failure looks like I need to move the write to holds_gil after the _PyThreadState_MustExit() check.

swtaarrs avatar May 08 '24 23:05 swtaarrs

The ASAN issue should be fixed. I also remembered that the only reason I split out drop_gil_impl() was so that I could call the meat of drop_gil() after the GIL had been disabled (looking back, that was obviously a sign that something wasn't right...). Since drop_gil() no longer cares if the GIL is enabled, I folded it back in, and _PyEval_DisableGIL() can now call drop_gil() normally.

swtaarrs avatar May 09 '24 00:05 swtaarrs

(moved from the issue where I accidentally posted this)

test_importlib got stuck in the tsan build again, which this should've fixed. I'll see what I can figure out.

swtaarrs avatar May 09 '24 23:05 swtaarrs

Does applying @suppress_immortalization() make a difference. I'm not suggesting that as a solution (there's probably a deeper issue to solve), but whether or not the decorator makes a difference might be helpful to know.

ericsnowcurrently avatar May 10 '24 16:05 ericsnowcurrently

@swtaarrs - I observed a hang when running pool_in_threads.py locally with this PR. (It may or may not be the same as the TSan hang).

I think there's an issue with FORCE_SWITCHING and the dynamic enabling of the GIL:

  • A thread disables (and releases) the GIL via _PyEval_DisableGIL. If _PY_GIL_DROP_REQUEST_BIT is set, it waits in the FORCE_SWITCHING section until another thread acquires the GIL
  • A second thread acquires the GIL. However, it sees that the GIL is now disabled and releases the GIL without notifying the first thread

colesbury avatar May 10 '24 18:05 colesbury

I think there's an issue with FORCE_SWITCHING and the dynamic enabling of the GIL

Thanks, good catch. I should probably pull drop_gil_impl() back out for use by _PyEval_DisableGIL() then. The FORCE_SWITCHING code doesn't apply once the GIL is disabled, and nothing else in drop_gil() applies to _PyEval_DisableGIL() either. I'll also take another look at everything at the end of take_gil() that's skipped in this case.

swtaarrs avatar May 13 '24 15:05 swtaarrs

The commit I just pushed should address the FORCE_SWITCHING issue by reverting my drop_gil_impl() change. I also added a line to clear _PY_GIL_DROP_REQUEST_BIT when disabling the GIL, so that thread doesn't keep hitting its eval_breaker once it resumes executing.

I'm running the tsan tests in a loop locally to see if it hangs again.

swtaarrs avatar May 13 '24 18:05 swtaarrs

@ericsnowcurrently - would you like to re-review this or should I go ahead and merge it?

colesbury avatar May 21 '24 22:05 colesbury

I'll take a quick look tomorrow.

ericsnowcurrently avatar May 21 '24 22:05 ericsnowcurrently

@ericsnowcurrently will you have time to take a look today?

swtaarrs avatar May 23 '24 18:05 swtaarrs

Sorry for the delay. I don't have any objections. If you both are confident about the solution then I'm on board. 😄

ericsnowcurrently avatar May 23 '24 18:05 ericsnowcurrently

Great! As a heads up, there is a different bug that affects the same test case (https://github.com/python/cpython/issues/119369).

colesbury avatar May 23 '24 18:05 colesbury

Thanks @swtaarrs for the PR, and @colesbury for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13. 🐍🍒⛏🤖

miss-islington-app[bot] avatar May 23 '24 20:05 miss-islington-app[bot]

GH-119474 is a backport of this pull request to the 3.13 branch.

bedevere-app[bot] avatar May 23 '24 20:05 bedevere-app[bot]