TaskStatus.started() being a no-op if cancelled can cause deadlocks
`TaskStatus.started()` starts with the logic:

```python
# If the old nursery is cancelled, then quietly quit now; the child
# will eventually exit on its own, and we don't want to risk moving
# children that might have propagating Cancelled exceptions into
# a place with no cancelled cancel scopes to catch them.
assert self._old_nursery._cancel_status is not None
if self._old_nursery._cancel_status.effectively_cancelled:
    return
```
However, this logic does not hold if the function calling started() then proceeds to do its work in a shielded cancel scope. This describes trio-asyncio, and caused the deadlock reported in https://github.com/python-trio/trio-asyncio/issues/115.
trio-asyncio can work around this by shielding the call to start(), but maybe Trio could also reconsider the logic in started()? I think at minimum it's safe to move the task if the new nursery is also effectively cancelled. It should also be safe to move the task if no Cancelled exception has actually been raised yet, but that's hard to track and maybe not worth it?
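To make the workaround concrete, here is a minimal sketch. This is illustrative, not trio-asyncio's actual code: `serve` is a made-up stand-in for a task that calls `started()` and then finishes its remaining work under a shield.

```python
import trio


async def serve(*, task_status=trio.TASK_STATUS_IGNORED):
    # Hypothetical stand-in for the trio-asyncio pattern: report readiness,
    # then keep working under a shield until the caller says we can finish.
    finish = trio.Event()
    task_status.started(finish)
    with trio.CancelScope(shield=True):
        # If started() was a silent no-op, the caller never receives our
        # Event and we deadlock here.
        await finish.wait()


async def main():
    with trio.CancelScope() as outer:
        outer.cancel()  # simulate starting the service from a cancelled scope
        async with trio.open_nursery() as nursery:
            # Workaround: shield the start() call itself, so the internal
            # nursery that start() uses is not effectively cancelled when
            # started() runs and the task really is reparented.
            with trio.CancelScope(shield=True):
                finish = await nursery.start(serve)
            finish.set()


trio.run(main)
```

Without the shield around `nursery.start(serve)`, this sketch hangs for the same reason as the trio-asyncio report: `started()` returns without reparenting, `start()` keeps waiting for `serve`, and `serve` keeps waiting for the caller.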
Maybe `started()` should raise an exception, `RuntimeError` or whatever, if the call to `start()` has ever been in the cancelled state?
i noticed this too after hitting that trio-asyncio bug a few months ago. at the time, i wrote up a report and a corresponding patchset to propose, but i got busy and never got around to cleaning up the patchset and posting both. here's that old report (it partially repeats what you wrote above, but i'm sharing it anyway because it notes some other symptoms alongside the deadlock and compares some alternatives for fixing this):
`task_status.started()` doesn't reparent a task if the task is effectively cancelled at the time it calls `started()`. this causes some problems:
- deadlocks:

  ```python
  from contextlib import asynccontextmanager

  import trio
  from trio import CancelScope


  @asynccontextmanager
  async def open_foo():
      async def serve(*, task_status):
          send_channel, receive_channel = trio.open_memory_channel(100)
          with receive_channel:
              task_status.started(send_channel)
              with CancelScope(shield=True):
                  # Shield to ensure that even if we get a cancel request, we keep
                  # processing until the queue is empty and closed.
                  #
                  # This is important because we need to guarantee that everything added
                  # to the queue before send_channel is closed will get processed
                  # (including things added to the queue by cleanup code). (Indeed, this
                  # behavior is service-nursery-like.)
                  async for item in receive_channel:
                      print(f"{item=!r}")

      async with trio.open_nursery() as nursery:
          send_channel = await nursery.start(serve)
          with send_channel:
              # In real code we'd yield an object with methods that add items to
              # the queue.
              yield


  async def main():
      with CancelScope() as cancel_scope:
          cancel_scope.cancel()
          async with open_foo():
              print("this should get printed and then the program should exit")


  trio.run(main)
  ```

  (in real code using trio-asyncio, i encountered this deadlock. the example above is a minimal reproducer for https://github.com/python-trio/trio-asyncio/issues/115 without all of the trio-asyncio stuff.)

  in `serve`, when `started` returns it indicates to `serve` that `start` has been unblocked, meaning that the context manager `open_foo()` is guaranteed to close `send_channel`. with this knowledge, `serve` knows that it's safe for it to shield until both `send_channel` is closed and the channel buffer is empty.

  the bug is that the `started()` call returns without reparenting the `serve` task. that is, `started()` succeeds and tells `serve` that `start()` has been unblocked, but `start()` has not actually been unblocked.

  thus `start()` and `serve` get stuck in a deadlock waiting for each other. this deadlock is also immune to KI/SIGINT; you have to hit it with SIGTERM/etc. instead.
- if a task is effectively cancelled when it calls `started()` but becomes not effectively cancelled before cancellation is delivered to it, `trio` can mistakenly deliver cancellation to it. e.g.:

  ```python
  async def foo(*, task_status):
      task_status.started()
      # This task is no longer effectively cancelled.
      await trio.lowlevel.checkpoint()
      print("this should get printed")


  async def main():
      async with trio.open_nursery() as nursery:
          with CancelScope() as cancel_scope:
              cancel_scope.cancel()
              await nursery.start(foo)
  ```
- exceptions raised after `task_status.started()` can come out of the wrong place. e.g.:

  ```python
  async def foo(*, task_status):
      eventual_parent_nursery = trio.lowlevel.current_task().eventual_parent_nursery
      task_status.started()
      raise RuntimeError(
          f"I should come out of {eventual_parent_nursery!r}.__aexit__, not start()"
      )


  async def main():
      async with trio.open_nursery() as nursery:
          nursery.cancel_scope.cancel()
          try:
              await nursery.start(foo)
          except RuntimeError:
              assert False
  ```
- `Task.eventual_parent_nursery` can violate its documentation and `Task.parent_nursery` can violate the documentation of `start`. e.g.:

  ```python
  async def foo(eventual_parent_nursery, *, task_status):
      task_status.started()
      assert trio.lowlevel.current_task().parent_nursery is eventual_parent_nursery
      assert trio.lowlevel.current_task().eventual_parent_nursery is None


  async def main():
      async with trio.open_nursery() as nursery:
          nursery.cancel_scope.cancel()
          await nursery.start(foo, nursery)
  ```
- #2544 is also in part due to this. its `started` call returns without reparenting `t`, so `start` blocks until `t` finishes completely. then, after `t` finishes, `start` raises `Cancelled` from its `tmp_nursery.__aexit__`.

  DECEMBER EDIT: the internal nursery of `start()` should no longer raise `Cancelled` from its `tmp_nursery.__aexit__` (#1696). so one of the two bugs causing #2544 has been fixed, and the remaining problem causing it is that `started()` is being a no-op.
a comment in `task_status.started` discusses why it has the current logic:
https://github.com/python-trio/trio/blob/4f17d2b2693d961f2c12df76a5ce4f80ec3f3178/trio/_core/_run.py#L791-L796
proposal: make `task_status.started()` never be a no-op return. it should either reparent and return, or fail to reparent and raise. there should be no third option where it returns success but secretly failed.
there are some different possible approaches to doing that:
- let `task_status.started` succeed when called under an effectively cancelled scope (`started()` isn't a checkpoint, after all, so it has no right to raise `Cancelled`).

  to replace https://github.com/python-trio/trio/blob/4f17d2b2693d961f2c12df76a5ce4f80ec3f3178/trio/_core/_run.py#L791-L796, i.e. to prevent teleporting an in-flight `Cancelled` to somewhere in the tree without the correct cancel scope to catch it, have `task_status.started` raise `RuntimeError` if it's called while handling `Cancelled`(s). (see the sketch after this list for one way that check could look.)
- don't let `task_status.started` succeed when called under an effectively cancelled scope. have it raise (`Cancelled`? `RuntimeError`? idk).

  this is more limiting on users than the prior option. Trio would probably need to make `started()` raise if the current scope has ever been effectively cancelled (not just if it is currently effectively cancelled) because there could be `Cancelled`s in `sys.exc_info()` waiting to raise up.

  at the moment i am hesitant about this second option because it feels to me that if the following is working as intended

  ```python
  async def foo():
      pass

  with CancelScope() as cs:
      cs.cancel()
      async with trio.open_nursery() as nursery:
          nursery.start_soon(foo)
      print("this is printed because no cancel points were passed in foo")

  # and the below behaves identically in terms of cancel points (ref:
  # https://github.com/python-trio/trio/issues/1457#issuecomment-681091005)
  with CancelScope() as cs:
      cs.cancel()
      await foo()
      print("this is printed because no cancel points were passed in foo")
  ```

  then the following should also work

  ```python
  async def foo(*, task_status=trio.TASK_STATUS_IGNORED):
      task_status.started()
      await trio.lowlevel.checkpoint()

  with CancelScope() as cs:
      cs.cancel()
      async with trio.open_nursery() as nursery:
          await nursery.start(foo)
          print("this is printed because no cancel points were passed in foo's startup")
  ```

  the first option would let this work but the second option would have this raise something instead.
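for concreteness, here's a rough sketch of how "called while handling `Cancelled`(s)" could be detected for the first option. this is not trio's actual implementation; the helper name `_handling_cancelled` is made up, and it assumes Python 3.11+ for the built-in `BaseExceptionGroup`:

```python
import sys

import trio


def _handling_cancelled() -> bool:
    # Illustrative helper, not part of trio: look at the exception currently
    # being handled (if any) and walk its __context__ chain, checking for an
    # in-flight trio.Cancelled, possibly wrapped in an exception group.
    exc = sys.exc_info()[1]
    while exc is not None:
        if isinstance(exc, trio.Cancelled):
            return True
        if isinstance(exc, BaseExceptionGroup) and exc.subgroup(trio.Cancelled):
            return True
        exc = exc.__context__
    return False


# Paraphrase of how started() could use it (not the real code path):
#
#   if _handling_cancelled():
#       raise RuntimeError(
#           "task_status.started() called while a Cancelled is propagating; "
#           "refusing to reparent the task"
#       )
```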
having the guarantee that [`await start(afn)` won't visit a cancel point if `afn`'s startup code doesn't] seems useful when implementing objects with behavior similar to service nurseries. in such situations, `like_a_service_nursery().__aenter__` needs to start the background task(s) unconditionally (even if `__aenter__` (and thus `started()`) is in a cancelled scope) and not cancel them until the `async with like_a_service_nursery()` block's body exits, because there may be a shield in the body!
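to make that shape concrete, here's a rough sketch; `like_a_service_nursery`, `background`, and `stop_requested` are made-up names, and the shield around the `start()` call is the workaround needed with current trio (under the first option it could be dropped):

```python
from contextlib import asynccontextmanager

import trio


@asynccontextmanager
async def like_a_service_nursery():
    # Illustrative sketch only: __aenter__ must start the background task
    # unconditionally, even from an already-cancelled scope, and must not
    # cancel it until the body has exited, because the body may be shielded.
    stop_requested = trio.Event()

    async def background(*, task_status=trio.TASK_STATUS_IGNORED):
        task_status.started()
        with trio.CancelScope(shield=True):
            # Keep running regardless of outside cancellation; we only stop
            # once the body of the async-with block has finished.
            await stop_requested.wait()

    async with trio.open_nursery() as nursery:
        # With current trio, shield the start() call so that started()
        # actually reparents the task (see the workaround above).
        with trio.CancelScope(shield=True):
            await nursery.start(background)
        try:
            yield
        finally:
            stop_requested.set()  # only now may the background task wind down
```

with this shape, entering `async with like_a_service_nursery():` from an already-cancelled scope still starts the background task and still lets the (possibly shielded) body run before the task is told to stop.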
here's a proposed patchset for the first option: https://github.com/python-trio/trio/compare/master...3e87b5c6852268825d176df87d2a4ad718a5ee49
DECEMBER EDIT: i've rebased that patchset onto master, adapted it for #1696 and cleaned it up. i will open a PR to propose it.