trio icon indicating copy to clipboard operation
trio copied to clipboard

Add exception message for easier debugging with `Cancelled`

Open Zac-HD opened this issue 9 months ago • 10 comments

Occasionally we end up digging through logs trying to work out why something was cancelled, and eventually it occurred to me that Trio could help with this. I don't have a complete design proposal, but it would be great to handle each of three cases:

  • a deadline was reached.
  • a sibling task in the Nursery raised an exception, including the name of the erroring task
  • someone called CancelScope.cancel(), including the name of the task in which this was called
    • add a new reason: str | None = None param, so that users can explicitly annotate why they cancelled something

I think we could represent this as a string argument to Cancelled like any other exception message; the only tricky part will be threading it through all the cancellation-delivery machinery 🙂

Comments and design input most welcome, of course!

Zac-HD avatar Mar 27 '25 06:03 Zac-HD

Might be better to not use a string argument, instead pass like an enum + task name, then have __str__ build the arg. That way it's introspectable if need be, and it might be better performant since it doesn't need to build the string 99% of the time.

TeamSpen210 avatar Mar 27 '25 06:03 TeamSpen210

I'm not so sure being introspectable is that important for figuring out why something was cancelled. Probably being more varied in what information we show (setting __context__ for case 2? Maybe stitching together more stack frames? I'm not sure.) is better especially for debugging. I guess that doesn't detract from providing this as an enum though.

A5rocks avatar Mar 27 '25 07:03 A5rocks

I'm somewhat inclined against setting __context__, because it's so easy for this to be lost/overwritten as exceptions are caught and re-raised.

An example usecase here is "I have dozens of Cancelled in an ExceptionGroup, does anything stand out?" -> set aside those cancelled due to a sibling task, inspect remainder. That does argue for @TeamSpen210's idea of passing multiple arguments and assembling the __str__ later - we could also store them as attributes, and then easily split exception groups.

I really love "collect and show everything" as an error-handling default, but when you get to 150k line tracebacks, programmatically determining what we can safely omit becomes very very important 😅

Zac-HD avatar Mar 27 '25 07:03 Zac-HD

Alright, that makes sense as to why not to add more context for case 2, probably case 3 too. Too bad since a stack trace of the task that called .cancel would function as a reason without having to opt in.

I think providing at least an attribute with an enum is a good idea. I'm not sure what the best strategy for sum types in Python is (because some enum members sound like they would need extra context), but maybe we can get away with .reason: tuple[Literal[CancelReason.DeadlineReached]] | tuple[Literal[CancelReason.NurseryShutdown], str] | .... Or just have an extra_information attribute that takes in whatever.

A5rocks avatar Mar 27 '25 15:03 A5rocks

I'm now thinking of having three attributes source: Literal["deadline", "nursery", "explicit"], source_task: str, and reason: str | None; I also tend to stick to Literal over enums these days.

Storing tracebacks makes me pretty nervous; it's easy to blow up memory use due to delayed garbage collection of all the extra objects they keep references to. It does sound nice to have as an opt-in debug mode, but I don't actually have a specific case in mind where it'd help.

Zac-HD avatar Mar 27 '25 18:03 Zac-HD

Storing tracebacks makes me pretty nervous; it's easy to blow up memory use due to delayed garbage collection of all the extra objects they keep references to. It does sound nice to have as an opt-in debug mode, but I don't actually have a specific case in mind where it'd help.

There are also use cases that rely on the GC so object finalizers get fired (including one of my own apps).

mikenerone avatar Mar 31 '25 20:03 mikenerone

There are also use cases that rely on the GC so object finalizers get fired (including one of my own apps).

This is not strictly speaking supported behavior for Python, and risky to rely on - you can get broken by CPython updates, or library changes, or PyPy having different GC semantics, etc. - but I agree that it's reasonable for Trio to avoid breaking such use-cases if we can.

Zac-HD avatar Mar 31 '25 21:03 Zac-HD

On Gitter @oremanj and @arthur-tacca mentioned that they think cancellation reasons are incorrectly implemented here. Specifically what I get is that cancellation should print the innermost (? i.e. the first cancel status w/ a reason's reason) reason at the point where we print out the traceback.

I think these are the implementation changes we would do:

  • remove the attribute from cancelled exceptions
  • look up the cancel reason in __repr__

@Zac-HD and @jakkdl do you think that's missing something/worse than current behavior?

A5rocks avatar Jul 18 '25 09:07 A5rocks

My suggestion was that it should be the outer most cancelled scope, not inner most and not (as in the current implementation) the first one that was triggered. The logic is, the outermost scope is the once that the Cancelled exception will ultimate bubble up to so that is the more honest reason about why the code is cancelled now.

This collapsed section has code to show the exact logic I'm suggesting
def get_enclosing_cancel_scopes() -> Iterable[trio.CancelScope]:
    status = current_task()._cancel_status
    while status is not None:
        yield status._scope
        status = status.parent

def get_cancelled_scope() -> "trio.CancelScope | None":
    outermost_cancelled = None
    for scope in get_enclosing_cancel_scopes():
        if scope.cancel_called:
            outermost_cancelled = scope
        if scope.shield:
            break
    assert outermost_cancelled is not None
    return outermost_cancelled

The distinction is important if a scope is cancelled and then another, more outer, scope is cancelled before the current cancellation finishes being handled - for example, it is cancelled from another task before this one gets a chance to run and handle the cancellation, or another scope is cancelled in a finally: or except Cancelled block.

This collapsed section has code showing the destination cancel scope changing.

These two functions can be added to test_cancelled.py and right now the assertions in them would flag.

async def test_cancel_reason_nested() -> None:
    with trio.CancelScope() as outer_cs:
        with trio.CancelScope() as inner_cs:
            inner_cs.cancel(reason="wrong")
            try:
                await trio.lowlevel.checkpoint()
            except Cancelled as ex:
                outer_cs.cancel(reason="right")
                assert get_cancelled_scope()._cancel_reason.reason == "right"  # This passes
                assert ex.reason == "right"  # This fails - returns "wrong"
                raise


async def test_cancel_reason_nested_with_checkpoint() -> None:
    with trio.CancelScope() as outer_cs:
        with trio.CancelScope() as inner_cs:
            inner_cs.cancel(reason="wrong")
            try:
                await trio.lowlevel.checkpoint()
            finally:
                outer_cs.cancel(reason="right")
                try:
                    await trio.lowlevel.checkpoint()
                except Cancelled as ex:
                    assert get_cancelled_scope()._cancel_reason.reason == "right"  # This passes
                    # The following line fails with "wrong" - surprising IMO given that outer scope
                    # was already cancelled when the cancelled exception was created.
                    assert ex.reason == "right"
                raise

I also suggested changing the API slightly: instead of including the string in the Cancelled repr, it would be a separate function you could call get_current_cancellation_reason() (or it would be part of the cancellation scope, and you would use get_cancelled_scope() like in the above code snippet. I can see that there is an advantage to putting the message in the Cancelled repr: that's arguably a simpler API and would be printed in any traceback that includes it. But Cancelled exceptions are usually caught (by a scope) so it doesn't seem that common to me that you'd actually see a traceback with a Cancelled error (at least not one that you'd care about). But I could be wrong about that(?)

One danger from mixing the above idea with putting the message in Cancelled repr: by the time you come to print out the Cancelled, especially as part of a traceback, it may well be that you're outside the scope that caused it and the cancellation reason for the current execution point has changed (or not cancelled at all). So the two ideas (seprate API to get reason, dynamically chosen reason) probably have to go together.

arthur-tacca avatar Jul 19 '25 09:07 arthur-tacca

I wasn't really convinced by your examples at first, but after reading through gitter @oremanj managed to get through to me. This (modified) example in particular:

async def test_cancel_reason_nested() -> None:
    with trio.CancelScope() as outer_cs:
        try:
            with trio.CancelScope() as inner_cs:
                inner_cs.cancel(reason="inner")
                try:
                    await trio.lowlevel.checkpoint()
                finally:
                    outer_cs.cancel(reason="outer")

        except trio.Cancelled as ex:
            assert ex.reason == "outer"  # fails
            raise

But Cancelled exceptions are usually caught (by a scope) so it doesn't seem that common to me that you'd actually see a traceback with a Cancelled error (at least not one that you'd care about). But I could be wrong about that(?)

My impression from others is that it's fairly common to see Cancelled in tracebacks, e.g. with trio.fail_after you get them in the __context__ of the TooSlowError. But it may well be that it's not usually hugely important to see the reason in tracebacks and it's fine to leave it as a dynamic inspection tool - but that requires being aware of it.

One danger from mixing the above idea with putting the message in Cancelled repr: by the time you come to print out the Cancelled, especially as part of a traceback, it may well be that you're outside the scope that caused it and the cancellation reason for the current execution point has changed (or not cancelled at all). So the two ideas (seprate API to get reason, dynamically chosen reason) probably have to go together.

hm. We can definitely update the reason of the inner scope when we cancel the outer, but less sure of a good&reliable way of getting that into the __repr__.

I'm not sure I'll have the time to do a PR for this personally, so others are very welcome.

jakkdl avatar Jul 22 '25 13:07 jakkdl