A GC triggered while a thread is blocked on the GIL can freeze Julia forever
Affects: PythonCall
Describe the bug
The current implementation allows more than one Julia thread to attempt to lock the GIL directly. That call blocks the entire Julia OS thread if the GIL is already locked, so it can hang Julia forever: the blocked thread never reaches a safepoint, and GC cannot proceed.
Here is a very simple reproduction, adapted ever so slightly from the documentation:
- The docs suggest running this example: https://juliapy.github.io/PythonCall.jl/stable/pythoncall/#jl-multi-threading
- Here, we adapt it by only adding a call to trigger GC while the GIL is held:
julia> using PythonCall
julia> PythonCall.GIL.@unlock Threads.@threads for i in 1:4
           PythonCall.GIL.@lock (GC.gc(); pyimport("time").sleep(5))
       end
..........................
.... hanging forever .....
..........................
Your system
julia> versioninfo()
Julia Version 1.10.2+RAI
Commit 52dcbad168* (2025-03-27 18:26 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin24.3.0)
CPU: 12 × Apple M2 Max
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 4 default, 0 interactive, 4 GC (on 8 virtual cores)
Environment:
JULIA_SSL_CA_ROOTS_PATH =
julia> Pkg.status()
Status `~/.julia/environments/v1.10/Project.toml`
[69d22d85] About v1.0.2
[1520ce14] AbstractTrees v0.4.5
[6e4b80f9] BenchmarkTools v1.6.0
[a2441757] Coverage v1.6.1
⌅ [f68482b8] Cthulhu v2.16.5
[31a5f54b] Debugger v0.7.13
[ab62b9b5] DeepDiffs v1.2.0
[fb4d412d] FixedPointDecimals v0.6.3
[c27321d9] Glob v1.3.1
[92ed2492] HeapSnapshotUtils v0.1.0 `https://github.com/RelationalAI/HeapSnapshotUtils.jl#main`
[7ec9b9c5] Humanize v1.0.0
[5903a43b] Infiltrator v1.9.1
⌅ [70703baa] JuliaSyntax v0.4.10
[1fcbbee2] LookingGlass v0.3.3 `~/.julia/dev/LookingGlass`
[bdcacae8] LoopVectorization v0.12.172
[1914dd2f] MacroTools v0.5.16
[85b6ec6f] MethodAnalysis v0.4.13
[e4faabce] PProf v3.2.0
[14b8a8f1] PkgTemplates v0.7.56
[c46f51b8] ProfileView v1.10.1
[92933f4c] ProgressMeter v1.10.4
[9c30249a] RAI v0.2.9
[817f1d60] ReTestItems v1.31.0
[295af30f] Revise v3.8.0
[aa65fe97] SnoopCompile v3.0.2 `~/work/jl_depots/raicode2/dev/SnoopCompile`
[e2b509da] SnoopCompileCore v3.0.0 `~/.julia/dev/SnoopCompile/SnoopCompileCore`
[ac92255e] Speculator v0.2.0
[1e6cf692] TestEnv v1.102.1 `~/work/jl_depots/raicode3/dev/TestEnv`
[e689c965] Tracy v0.1.4
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`
I believe the correct solution is that we should adjust the lock and unlock functions to protect access to the GIL behind a Julia ReentrantLock, which is a cooperative lock and will cooperate with the runtime: https://github.com/JuliaPy/PythonCall.jl/blob/08157ccb94f94491854ceac139c20955849475e2/src/GIL/GIL.jl#L23-L30
This way, only one Julia OS thread will ever attempt to hold the GIL, since only one Task is allowed to proceed to call C.PyGILState_Ensure().
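As a rough sketch of that idea (not PythonCall's actual implementation), the gate could look like the following. `PY_GATE`, `with_python`, and the `acquire_gil` keyword are hypothetical names introduced here for illustration. The default `acquire_gil` is a stand-in that just calls `f`, so the sketch runs without Python; in real use one would presumably pass the package's GIL-locking function instead.

```julia
# Hypothetical global gate; not part of PythonCall.
const PY_GATE = ReentrantLock()

# Take the Julia-side gate first, then hand `f` to `acquire_gil`, which is
# expected to take the GIL, call `f`, and release the GIL again.
# The default stand-in simply calls `f`.
function with_python(f; acquire_gil = g -> g())
    lock(PY_GATE)          # cooperative: a waiting task yields instead of blocking its thread
    try
        return acquire_gil(f)
    finally
        unlock(PY_GATE)
    end
end

const inside = Threads.Atomic{Int}(0)      # tasks currently past the gate
const max_inside = Threads.Atomic{Int}(0)  # high-water mark of `inside`

function work()
    n = Threads.atomic_add!(inside, 1) + 1  # atomic_add! returns the old value
    Threads.atomic_max!(max_inside, n)
    sleep(0.01)                             # yield while still holding the gate
    Threads.atomic_sub!(inside, 1)
    return nothing
end

@sync for _ in 1:8
    Threads.@spawn with_python(work)
end
```

With the gate in place, `max_inside[]` should never exceed 1, meaning at most one task at a time reaches the point where the GIL would be requested.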
CC: @kpamnany, since you've fixed similar kinds of hangs in other packages. Does a global ReentrantLock to protect the GIL seem like the right fix to you?
I encountered this and implemented a ReentrantLock. It seems to have helped significantly.
See also https://stackoverflow.com/a/49202868/10313003
Yes, using a Julia lock to protect access to Python will prevent contention on the GIL.
A while ago I did some experimenting with combining Julia locks with the GIL, as you are suggesting. In this case, adding a ReentrantLock to lock() and @lock will help, and we'd also need to unlock it in unlock() and @unlock. But we'd also need it to be locked and unlocked appropriately whenever the GIL is locked or unlocked Python-side, which isn't possible.
If you don't do this, then you quickly get into scenarios where Julia is deadlocked waiting for the Julia lock because it (and the GIL) was locked Julia-side, then the GIL was released Python-side (but the Julia lock is still locked), then some Julia code tries to lock the GIL and the Julia lock, which is already locked.
Really I think we just want to be able to use the GIL like any other Julia lock, but the GIL API has no trylock-like function that we could use to test whether the GIL is available and yield to another task if it is not. All we can do is wait for the GIL, which blocks the thread entirely.
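To illustrate what a trylock-style call would enable, here is a minimal sketch of the poll-and-yield pattern using an ordinary Julia lock. `cooperative_lock` and `STAND_IN` are hypothetical names, and the `ReentrantLock` only stands in for the GIL, which offers no such call today. Julia's own `lock` on a `ReentrantLock` already waits cooperatively, so the explicit loop is purely illustrative.

```julia
# Hypothetical helper: acquire `l` by polling, never blocking the OS thread.
function cooperative_lock(f, l::ReentrantLock)
    while !trylock(l)
        yield()            # give other tasks a chance to run, then retry
    end
    try
        return f()
    finally
        unlock(l)
    end
end

const STAND_IN = ReentrantLock()   # stands in for the GIL in this sketch
const results = Int[]

@sync for i in 1:4
    # push! is safe here because it only runs while STAND_IN is held
    Threads.@spawn cooperative_lock(() -> push!(results, i), STAND_IN)
end
```

The key property is that a task that fails to get the lock gives up its turn rather than parking its OS thread inside a C call.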
The GIL functionality in PythonCall is all pretty experimental at this stage. Buyer beware. So for now I think I'd rather leave the exposed functionality as simple as possible and leave extra handling like this up to the user as needed. If you have concrete, reliable ways to make this work better, please let me know. I think the only solution might be to ask Python to add PyGILState_TryEnsure().
I'll also note that this is all super flaky because Julia is allowed to migrate tasks between threads, so the original code is buggy anyway: within @unlock, the GIL locking and the GIL unlocking might occur on different threads. Really you need to do this all within sticky tasks.
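For reference, a sticky task can be created roughly as follows. This is a sketch using only Base; `spawn_sticky` is a hypothetical helper name, and the GIL-taking and GIL-releasing calls would go inside the function passed to it so that both happen on the same thread.

```julia
# Hypothetical helper: run `f` as a sticky task, which the scheduler
# keeps on the thread where it first runs.
function spawn_sticky(f)
    t = Task(f)
    t.sticky = true    # set explicitly for clarity
    schedule(t)
    return t
end

t = spawn_sticky() do
    before = Threads.threadid()
    sleep(0.01)        # a yield point, where a non-sticky task could migrate
    after = Threads.threadid()
    (before, after)
end

before, after = fetch(t)
```

By contrast, tasks started with `Threads.@spawn` are not sticky and may resume on a different thread after any yield point.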
> If you don't do this, then you quickly get into scenarios where Julia is deadlocked waiting for the Julia lock because it (and the GIL) was locked Julia-side, then the GIL was released Python-side (but the Julia lock is still locked), then some Julia code tries to lock the GIL and the Julia lock, which is already locked.
I'm not seeing how this can happen unless you're referring to a situation where Python calls back into Julia? Is this possible?
I might be misremembering the details but it was something like that. Julia can call into Python and Python can call into Julia, you can flip-flop as much as you like.
In any case, I did try a bunch of locking strategies and in all cases I could think of I could produce a deadlock quite easily.
The problem I encountered was when using TensorStore, which is mainly a C++ code base. They also release the GIL on that side. I really should figure out direct bindings from Julia to C++ in that case.
That said, adding a ReentrantLock did decrease the frequency of deadlocks for me. I would be happy to submit that as a pull request so we can test it.
Yeah go for it.
Huh... So code other than the code that locked the GIL can unlock it? That ... doesn't seem great.
If so, then yes the concerns you raised make a lot of sense. Thank you @cjdoris