gh-120321: Lock gen_send
Adds per-object lock for gen_send to fix crashes.
- Issue: gh-120321
What's the root cause of https://github.com/python/cpython/issues/120321? Until you know that, how do you know this fixes the problem?
This feels like slapping a lock on something and hoping it works.
What's the root cause of #120321? Until you know that, how do you know this fixes the problem?
The crash's root cause is here https://github.com/python/cpython/pull/120327/files#diff-2a0c449b68605ebd0872fd232e60ce7e838a77782a6d2e364764f99514fb508aR219
Those 3 lines can race and cause tstate->exc_info along with gi_exc_state to become invalid.
I locked the whole thing because TSAN is warning not just those few lines. It warns that the _PyFrame_StackPush and all the other stack accessor functions in there are racing as well. Though I don't think that results in a crash.
What about all the other places that tstate->exc_info is updated when we push or pop a generator frame?
I suspect that this doesn't crash only because specialization is turned off for free-threading:
def gen():
while True:
yield
def my_next(it):
for val in it:
return value
raise StopIteration
it = gen()
with concurrent.futures.ThreadPoolExecutor() as executor:
while True:
_ = executor.submit(lambda: my_next(it))
I suspect that this doesn't crash only because specialization is turned off for free-threading:
Yeah the entire gen_send has multiple races in multiple places. Likewise for the specializing interpreter. So the entire thing needs to be locked (what the GIL does at the moment).
gen_send isn't the problem. It contains an instance of the problematic code, but there are instances in bytecodes.c in _SEND and _FOR_ITER_GEN`.
Pushing and popping generator frames needs to be locked. The best way to do that would be to move popping and pushing into its own helper function.
I agree that pushing and popping frames need fixing, but gen_send is still problematic even with that. It has non-atomic accesses of fields on tstate and the frame which another thread could be concurrently modifying. In fact even after factoring out the generator frame pushing and locking just that, I still get crashes from the other issues in gen send.
OOI, what are the other crashes?
A selection of them after I've locked just the frame pushing
python: Objects/genobject.c:235: gen_send_ex2_unlocked: Assertion `gen->gi_exc_state.previous_item == NULL' failed
./Include/internal/pycore_frame.h:313: _PyFrame_GetGenerator: Assertion `frame->owner == FRAME_OWNED_BY_GENERATOR' failed.
./Include/internal/pycore_frame.h:77: _PyFrame_GetCode: Assertion `PyCode_Check(f->f_executable)' failed.
I wonder if we really only need the locks up to the transition to gen->gi_frame_state = FRAME_EXECUTING;. That would seem to work better with fixing SEND in the opcode as well as doing the unlocking there isn't going to be as easy.
I think other than the asserts that only leaves if (FRAME_STATE_SUSPENDED(gen->gi_frame_state)) { as the lone racing read. I'm not sure that this is fully really safe even w/ the GIL though. It seems like _PyEval_EvalFrameDefault wouldn't release the GIL between yielding a value and getting back to here, but something that's hooked the evaluation function could indeed release the GIL. So this is really already racing against another thread resuming the generator already (and you shouldn't get an interpreter crash, just odd results from trying to run a generator from multiple threads at the same time, which isn't really a sane thing to do anyway).
What's the status of this PR? It now has multiple merge conflicts.