tractor icon indicating copy to clipboard operation
tractor copied to clipboard

`Context` semantics for cross-actor-task cancellation and overruns

Open goodboy opened this issue 2 years ago • 0 comments

Finally!

This is a rigorous rework of our Context and ContextCancelled semantics to be more like trio.CancelScope for cross-actor linked tasks!

High level mechanics:

  • if any actor-task requests a remote actor to be cancelled, either by calling Actor.cancel() or the internal Actor._cancel_stask() (which is called by Context.cancel()), that requesting actor's uid is set on the target far-end Context._cancel_called_remote and any eventually raised ContextCancelled.canceller: tuple.
  • this allows any actor receiving a ContextCancelled error-msg to check if they were the cancellation "requester", in which case the error is not raised locally and instead the cancellation is "caught" (and silently ignored inside the Portal.open_context() block, and possibly @tractor.context callee func) just like normal trio cancel scopes: if you call cs.cancel() inside some scope or nursery block.
  • if any received ContextCancelled.canceller != current_actor().uid, then the error is raised thus making it explicit that the cancellation was not anticipated by the cancel-receiving (actor) task and thus can be subsequently handled as an (unanticipated) remote cancellation.

Also included is new support and public API for allowing contexts (and their streams) to "allow overruns" (since backpressure isn't really possible without a msg/transport protocol extension): enabling the case where some sender is pushing msgs faster then the other side is receiving them and no error is received on the sender side.

Normally (and by default) the rx side will receive (via RemoteActorError msg) and subsequently raise a StreamOverrun that's been relayed from the sender. The receiver then can decide how to the handle the condition; previously any sent msg delivered that caused an overrun would be discarded.

Instead we now offer a allow_overruns: bool to the Context.open_stream() API which adds a new mode with the following functionality:

  • if an overrun condition is hit, we spawn a single "overflow drainer" task in the Context._scope_nursery: trio.Nursery which runs Context._drain_overflows() (not convinced of naming yet btw :joy:):
  • this task has one simple job: it does the work of calling await self._send_chan(msg) for every queued up overrun-condition msg (stored in the new Context._overflow_q: deque) in a blocking fashion but without blocking the RPC msg loop.
  • when the ._in_overrun: bool is already set we presume the task is already running and instead new overflow msgs are queued and a warning log msg emitted.
  • on Context exit any existing drainer task is cancelled and lingering msg discarded but reported in a warning log msg.

Details of implementation:

  • move all Context related code to a new ._context mod.
  • extend the test suite to include full coverage of all cancel request and overruns (ignoring) cases.
  • add ContextCancelled.canceller: tuple[str, str].
  • expose new Context.cancel_called, .cancelled_caught and .cancel_called_remote properties.
  • add Context._deliver_msg() as the factored content from what was Actor._push_result() which now calls the former.
  • add a new mk_context() -> Context factory.

Possibly still todo:

  • [ ] is maybe no_overruns or ignore_overruns or something else a better var name?
  • [ ] better StreamOverrun msg content?
  • [ ] make Context.result() return None if the default is never set, more like a function that has no return?
  • [ ] drop lingering ._debug.breakpoint() comments from ._context.py?
  • [ ] ensure we don't need the cs.shield = True that was inside Context.cancel()?
  • [ ] get more confident about dropping the previous Context._send_chan.aclose() calls (currently commented out)?
  • [ ] not use @dataclass for Context and instead go msgspec.Struct?
  • [ ] KINDA IMPORTANT maybe we should allow relaying the source error traceback to cancelled context tasks which were not the cause of their own cancellation?
    • i.e. the ContextCancelled.canceller != current_actor().uid case where it might be handy to know why, for eg., some parent actor cancelled some other child that you were attached to, instead of just seeing that it was cancelled by someone else?
  • [ ] apparently on self-cancels (which i still don't think is fully tested..?) we can get a relayed error that has no .canceller set?
    • see this example from the new remote annotation testing in piker screenshot-2023-12-21_15-31-04
      • which when you inspect the input err to Context._maybe_raise_remote_err() screenshot-2023-12-21_15-30-53

goodboy avatar Apr 14 '23 21:04 goodboy