cpython icon indicating copy to clipboard operation
cpython copied to clipboard

gh-120321: Lock gen_send

Open Fidget-Spinner opened this issue 1 year ago • 9 comments

Adds per-object lock for gen_send to fix crashes.

  • Issue: gh-120321

Fidget-Spinner avatar Jun 10 '24 18:06 Fidget-Spinner

What's the root cause of https://github.com/python/cpython/issues/120321? Until you know that, how do you know this fixes the problem?

This feels like slapping a lock on something and hoping it works.

markshannon avatar Jun 11 '24 11:06 markshannon

What's the root cause of #120321? Until you know that, how do you know this fixes the problem?

The crash's root cause is here https://github.com/python/cpython/pull/120327/files#diff-2a0c449b68605ebd0872fd232e60ce7e838a77782a6d2e364764f99514fb508aR219

Those 3 lines can race and cause tstate->exc_info along with gi_exc_state to become invalid.

I locked the whole thing because TSAN is warning not just those few lines. It warns that the _PyFrame_StackPush and all the other stack accessor functions in there are racing as well. Though I don't think that results in a crash.

Fidget-Spinner avatar Jun 11 '24 11:06 Fidget-Spinner

What about all the other places that tstate->exc_info is updated when we push or pop a generator frame?

I suspect that this doesn't crash only because specialization is turned off for free-threading:

def gen():
    while True:
        yield

def my_next(it):
    for val in it:
        return value
    raise StopIteration

it = gen()
with concurrent.futures.ThreadPoolExecutor() as executor:
    while True:
        _ = executor.submit(lambda: my_next(it))

markshannon avatar Jun 11 '24 11:06 markshannon

I suspect that this doesn't crash only because specialization is turned off for free-threading:

Yeah the entire gen_send has multiple races in multiple places. Likewise for the specializing interpreter. So the entire thing needs to be locked (what the GIL does at the moment).

Fidget-Spinner avatar Jun 11 '24 12:06 Fidget-Spinner

gen_send isn't the problem. It contains an instance of the problematic code, but there are instances in bytecodes.c in _SEND and _FOR_ITER_GEN`.

Pushing and popping generator frames needs to be locked. The best way to do that would be to move popping and pushing into its own helper function.

markshannon avatar Jun 11 '24 13:06 markshannon

I agree that pushing and popping frames need fixing, but gen_send is still problematic even with that. It has non-atomic accesses of fields on tstate and the frame which another thread could be concurrently modifying. In fact even after factoring out the generator frame pushing and locking just that, I still get crashes from the other issues in gen send.

Fidget-Spinner avatar Jun 11 '24 13:06 Fidget-Spinner

OOI, what are the other crashes?

markshannon avatar Jun 11 '24 13:06 markshannon

A selection of them after I've locked just the frame pushing

python: Objects/genobject.c:235: gen_send_ex2_unlocked: Assertion `gen->gi_exc_state.previous_item == NULL' failed

./Include/internal/pycore_frame.h:313: _PyFrame_GetGenerator: Assertion `frame->owner == FRAME_OWNED_BY_GENERATOR' failed.

./Include/internal/pycore_frame.h:77: _PyFrame_GetCode: Assertion `PyCode_Check(f->f_executable)' failed.

Fidget-Spinner avatar Jun 12 '24 14:06 Fidget-Spinner

I wonder if we really only need the locks up to the transition to gen->gi_frame_state = FRAME_EXECUTING;. That would seem to work better with fixing SEND in the opcode as well as doing the unlocking there isn't going to be as easy.

I think other than the asserts that only leaves if (FRAME_STATE_SUSPENDED(gen->gi_frame_state)) { as the lone racing read. I'm not sure that this is fully really safe even w/ the GIL though. It seems like _PyEval_EvalFrameDefault wouldn't release the GIL between yielding a value and getting back to here, but something that's hooked the evaluation function could indeed release the GIL. So this is really already racing against another thread resuming the generator already (and you shouldn't get an interpreter crash, just odd results from trying to run a generator from multiple threads at the same time, which isn't really a sane thing to do anyway).

DinoV avatar Jun 12 '24 23:06 DinoV

What's the status of this PR? It now has multiple merge conflicts.

vstinner avatar Nov 06 '24 12:11 vstinner