Deadlock involving __dealloc__

Open astoff opened this issue 2 years ago • 3 comments

I'm facing a deadlock which I think cannot be avoided from the Python API alone. The situation is as follows:

  1. A LuaRuntime is running on thread B.
  2. It calls a Python function. The argument is a Lua object.
  3. The Lua object gets diligently converted to a pure-Python data structure.
  4. The Python data structure is passed to thread A.
  5. Thread B blocks until thread A returns something.

I've traced the execution and the deadlock happens on this line of _LuaObject.__dealloc__:

https://github.com/scoder/lupa/blob/67ba19406d6587139fc2779f549e9543da68ff11/lupa/_lupa.pyx#L823

My assumption is that the "diligent conversion" in step 3 generates some garbage that gets collected a bit later, while thread A is running. As evidence for this theory:

  • I haven't observed a deadlock yet if I call gc.collect(0) between steps 3 and 4.
  • I observe no deadlock if I remove the __dealloc__ method.
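
To make the suspected mechanism concrete, here is a minimal pure-Python model of it. It does not use Lupa at all: `runtime_lock` is a hypothetical stand-in for the runtime's lock, and `finalizer_on_thread_a` stands in for `_LuaObject.__dealloc__` being triggered on thread A. A timeout is used only so that the demo terminates; with a plain blocking acquire, both threads would wait on each other forever.

```python
import threading

runtime_lock = threading.RLock()  # stand-in for the per-runtime lock (assumption)
result = {}

def finalizer_on_thread_a():
    # Models __dealloc__ running on thread A: it needs the runtime lock.
    # The timeout exists only so this demo finishes; a blocking acquire
    # would never return here.
    acquired = runtime_lock.acquire(timeout=0.2)
    result["acquired"] = acquired
    if acquired:
        runtime_lock.release()

def lua_call_on_thread_b():
    # Thread B holds the lock for the whole duration of the "Lua" call ...
    with runtime_lock:
        a = threading.Thread(target=finalizer_on_thread_a)
        a.start()
        # ... and waits for thread A while still holding it.
        a.join()

b = threading.Thread(target=lua_call_on_thread_b)
b.start()
b.join()
print(result["acquired"])  # False: thread A could not get the lock
```

The re-entrancy of the lock does not help in this model, because the second acquisition comes from a different thread than the one holding it.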

(As a data point for you, in case it is hard to find a fully general solution to this issue: my Lua runs are relatively short, and I don't mind imperfect garbage collection as in the second point above.)

astoff avatar Nov 03 '23 20:11 astoff

Hmm, interesting case. That makes locking the runtime appear much less appealing inside of __dealloc__. It's difficult to avoid, though.

The lock that Lupa uses is re-entrant, so this rarely poses problems. It probably requires a setup as in your case to trigger it: passing Python objects over to other Lua threads inside of a Lua call.

One possible solution that comes to my mind is to request the lock in a non-blocking way in cases where we cannot afford waiting (in __dealloc__ specifically). If we fail to acquire it, we add the object to a list in the LuaRuntime and clean up that list the next time, right before we release the lock (i.e. when we definitely own it anyway). That would probably keep some objects alive longer than necessary, and there might also be situations where this has other unexpected effects (we're dealing with threads here, so anything can happen, really). But it should still work as before in most cases and only behave differently in cases that are currently deadlock-prone already, where it probably provides an improvement.
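
The idea above could be sketched roughly as follows in plain Python. This is only an illustration of the "try the lock, otherwise defer" pattern, not Lupa's implementation: `RuntimeModel`, `release_ref`, `unlock`, `pending` and `released` are hypothetical names, and appending to `released` stands in for actually freeing a Lua reference.

```python
import threading

class RuntimeModel:
    """Hypothetical stand-in for a runtime guarded by a re-entrant lock."""

    def __init__(self):
        self.lock = threading.RLock()
        self.pending = []    # references whose release was deferred
        self.released = []   # references that were actually released
        self._pending_guard = threading.Lock()  # protects self.pending

    def release_ref(self, ref):
        # Called from a finalizer, so it must never block on the runtime lock.
        if self.lock.acquire(blocking=False):
            self.released.append(ref)
            self.unlock()
        else:
            # Someone else owns the runtime: remember the reference for later.
            with self._pending_guard:
                self.pending.append(ref)

    def unlock(self):
        # Drain deferred references while the caller still owns the lock.
        with self._pending_guard:
            deferred, self.pending = self.pending, []
        self.released.extend(deferred)
        self.lock.release()

rt = RuntimeModel()
rt.lock.acquire()  # "thread B" enters Lua and holds the lock

# A finalizer fires on another thread; it returns instead of blocking.
t = threading.Thread(target=rt.release_ref, args=("ref-1",))
t.start()
t.join()
deferred_before = list(rt.pending)

rt.unlock()  # the deferred reference is released right before unlocking
```

A reference that gets deferred just after the list was drained stays in `pending` until the next unlock, which matches the caveat that some objects may live longer than necessary.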

scoder avatar Nov 04 '23 11:11 scoder

I agree with what you wrote, and I can contribute a test case which currently fails for me:

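from threading import Thread

from lupa import LuaRuntime


def test_lupa_gc_deadlock():

    def assert_no_deadlock(thread):
        # Runs as a Python callback inside the Lua call below.
        thread.start()
        thread.join(1)
        assert not thread.is_alive(), "thread didn't finish"

    def trigger_gc(ref):
        # Drops the only reference to the Lua table from the other thread.
        del ref[0]

    lua = LuaRuntime()
    ref = [lua.eval("{}")]
    lua.execute(
        "f,x=...; f(x)",
        assert_no_deadlock,
        Thread(target=trigger_gc, args=[ref]),
    )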

astoff avatar Nov 06 '23 21:11 astoff

I started fixing this, but the test that you provided is still failing. Might just be a missing cleanup call somewhere.

See https://github.com/scoder/lupa/pull/255

scoder avatar Jan 28 '24 12:01 scoder