Very WIP: Architecture for robust cancellation
Introduction
This commit is a first sketch of what I would like to do for robust cancellation (i.e. "making ^C just work"). At this point it's more of a sketch than a real PR, but I think enough of the design is in place for a design discussion.
The first thing I should say is that the goal of this PR is, very narrowly, to make ^C work well. As part of that, we're taking a bit of a step towards structured concurrency, but I am not intending this PR to be a full implementation of that.
Given that some of this has been beaten to death in previous issues, I will also not do my usual motivation overview, instead jumping straight into the implementation. As I said, the motivation is just to make ^C work reliably at this point.
Setting the stage
Broadly speaking, when we're trying to cancel a task, it will be in one of two categories:
- Waiting for some other operation to complete (e.g. an IO operation, another task, an external event, etc.). Here, the actual cancellation itself is not so difficult (after all, the task is not running, but suspended in a somewhat well-defined place). However, robust cancellation requires us to potentially propagate the cancellation signal down the wait tree, since the operation we actually want to cancel may not be the root task, but may instead be some operation being performed by the task we're waiting on (and we'd prefer not to leak those operations and have rogue tasks going around performing potentially side-effecting operations).
- Currently running and doing some computation. The core problem here is not really one of propagation (after all, the long-running computation is probably what we want to cancel), but rather how to perform the cancellation without state corruption. A lot of the crashiness of our existing ^C implementation comes from simply injecting an exception in places that are not expecting to handle it.
For a full solution to the problem, we need to have an answer for both of these points. I will begin with the second, since the first builds upon it.
Cancellation points
This PR introduces the concept of a cancellation request and a cancellation point.
Each task has a cancellation_request field that can be set externally (e.g. by ^C).
Any task performing computation should regularly check this field and abort its
computation if a cancellation request is pending.
For this purpose, the PR provides the @cancel_check macro. This macro turns a pending
cancellation request into a well-modeled exception. Package authors should insert a
call to the macro into any long-running loops. However, there is of course some overhead
to the check, and it is therefore inappropriate for tight inner loops.
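For illustration, a cooperative loop might look like the following sketch (process_all and expensive_step are hypothetical placeholders, not part of this PR):

function process_all(items)
    for item in items
        # Turns a pending cancellation request into a well-modeled exception.
        Base.@cancel_check
        expensive_step(item)  # hypothetical unit of work per iteration
    end
end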
We attempt to address this with compiler support. Note that this part is currently only partially implemented, so the following describes the design rather than the current state of the PR. Consider the @cancel_check macro:
macro cancel_check()
    quote
        local req = Core.cancellation_point!()
        if req !== nothing
            throw(conform_cancellation_request(req))
        end
    end
end
where cancellation_point! is a new intrinsic that defines a cancellation point. The
compiler is semantically permitted to extend the cancellation point across any following
effect_free calls (note that for transitivity reasons, the effect is not exactly the same,
but is morally equivalent). Upon passing a cancellation_point!, the system will
set the current task's reset_ctx to this cancellation point. If a cancellation request
occurs before the reset_ctx is cleared, the task's execution will be reset to the
nearest cancellation point. I proposed this mechanism in #52291.
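To make the reset_ctx lifecycle concrete, here is an illustrative view of execution around an extended cancellation point (this is not the actual lowering; reset_ctx handling lives in the runtime):

req = Core.cancellation_point!()  # the task's reset_ctx now refers to this point
req === nothing || throw(conform_cancellation_request(req))
# ... effect-free computation: if a cancellation request arrives here, the
# runtime may reset execution (longjmp-style) back to the point above ...
# reset_ctx is cleared once the effect-free region ends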
Additionally, the reset_ctx can in principle be used to establish scoped cancellation
handlers for external C libraries as well, although I suspect that there are not many
C libraries that are actually reset-safe in the required manner (since allocation is not).
Note that cancellation_point! is also intended to be a yield point in order to facilitate
the ^C mechanism described below. However, this is not currently implemented.
Structured cancellation
Turning our attention now to the first of the two cases mentioned above, we tweak the task's
existing queue reference to become a generic (atomic) "waitee" reference. The wait queue is
required to be obtainable from the waitee object via the new waitqueue generic function.
To cancel a waiter waiting for a waitable waitee object, we
- Set the waiter's cancellation request
- Load the waitee and call a new generic function cancel_wait!, which shall do whatever synchronization and internal bookkeeping is required to remove the task from the wait queue and then resume the task.
- The waiter resumes in the wait code. It may now decide how and whether to propagate the cancellation to the object it was just waiting on. Note that this may involve re-queuing a wait (to wait for the cancellation of waitee to complete).
The idea here is that this provides a well-defined context for cancellation-propagation logic to run. I wanted to avoid having any cancellation propagation logic run in parallel with actual wait code.
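As a sketch of the canceler's side of this protocol (request_cancellation! is an illustrative name, not the PR's API; the memory-ordering subtleties are discussed further down):

function request_cancellation!(t::Task, req)
    # Make the request visible to the task being cancelled.
    @atomic t.cancellation_request = req
    # If the task is blocked, load whatever it is waiting on ...
    waitee = @atomic t.waitee
    if waitee !== nothing
        # ... and dequeue/resume it; the waiter then wakes up inside the wait
        # code and decides how and whether to propagate the cancellation.
        cancel_wait!(waitee, t)
    end
end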
How the cancellation propagates is a bit of a policy question and not one that I intend to fully address in this PR. My plan is to implement a basic state machine that works well for ^C (by requesting safe cancellation immediately and then requesting increasingly unsafe modes of cancellation upon timeout or repeated ^C), but I anticipate that external libraries will want to create their own cancellation request state machines, which the system supports. The implementation is incomplete, so I will not describe it here yet.
One may note that there are a significant number of additional fully dynamic dispatches
in this scheme (at least waitqueue and cancel_wait!, and possibly more in the future).
However, note that these dynamic dispatches are confined to the cancellation path, which
is not throughput-sensitive (but is latency sensitive).
^C handling
The handling of ^C is delegated to a dedicated task that gets notified from the signal handler when a SIGINT is received (similar to the existing profile listener task). There is a little bit of an additional wrinkle in that we need some logic to kick a computational task out to its nearest cancellation point if we do not have any idle threads. This logic is not yet implemented.
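Schematically, the listener amounts to something like the following (sigint_event, foreground_task, and request_cancellation! are illustrative stand-ins, not the PR's actual internals):

function sigint_listener()
    while true
        # Woken from the C signal handler when SIGINT arrives.
        wait(sigint_event)
        # Start with safe cancellation; escalation on timeout or repeated ^C
        # is handled by the cancellation request state machine.
        request_cancellation!(foreground_task, CANCEL_REQUEST_SAFE)
    end
end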
Examples to try
julia> sleep(1000)
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
[1] macro expansion
@ ./condition.jl:134 [inlined]
[2] _trywait(t::Timer)
@ Base ./asyncevent.jl:195
[3] wait
@ ./asyncevent.jl:204 [inlined]
[4] sleep(sec::Int64)
@ Base ./asyncevent.jl:322
[5] top-level scope
@ REPL[1]:1
julia> collatz(n) = (n & 1) == 1 ? (3n + 1) : (n÷2)
collatz (generic function with 1 method)
julia> function find_collatz_counterexample()
           i = 1
           while true
               j = i
               while true
                   @Base.cancel_check
                   j = collatz(j)
                   j == 1 && break
                   j == i && error("$j is a collatz counterexample")
               end
               i += 1
           end
       end
find_collatz_counterexample (generic function with 1 method)
julia> find_collatz_counterexample()
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
[1] macro expansion
@ ./condition.jl:134 [inlined]
[2] find_collatz_counterexample()
@ Main ./REPL[2]:6
[3] top-level scope
@ REPL[3]:1
julia> wait(@async sleep(100))
^CERROR: TaskFailedException
Stacktrace:
[1] wait(t::Task; throw::Bool)
@ Base ./task.jl:367
[2] wait(t::Task)
@ Base ./task.jl:360
[3] top-level scope
@ REPL[4]:0
[4] macro expansion
@ task.jl:729 [inlined]
nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
[1] macro expansion
@ ./condition.jl:134 [inlined]
[2] _trywait(t::Timer)
@ Base ./asyncevent.jl:195
[3] wait
@ ./asyncevent.jl:204 [inlined]
[4] sleep
@ ./asyncevent.jl:322 [inlined]
[5] (::var"#2#3")()
@ Main ./REPL[4]:1
julia> @sync begin
           @async sleep(100)
           @async find_collatz_counterexample()
       end
^CERROR: nested task error: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
[1] macro expansion
@ ./task.jl:1234 [inlined]
[2] _trywait(t::Timer)
@ Base ~/julia-cancel/usr/share/julia/base/asyncevent.jl:195
[3] wait
@ ./asyncevent.jl:203 [inlined]
[4] sleep
@ ./asyncevent.jl:321 [inlined]
[5] (::var"#45#46")()
@ Main ./REPL[26]:3
...and 1 more exception.
Stacktrace:
[1] sync_cancel!(c::Channel{Any}, t::Task, cr::Any, c_ex::CompositeException)
@ Base ~/julia-cancel/usr/share/julia/base/task.jl:1454
[2] sync_end(c::Channel{Any})
@ Base ~/julia-cancel/usr/share/julia/base/task.jl:608
[3] macro expansion
@ ./task.jl:663 [inlined]
[4] (::var"#43#44")()
@ Main ./REPL[5]
As noted above, @Base.cancel_check is not intended to be required in the inner loop.
Rather, the compiler is expected to extend the cancellation point from the start of the loop
to the entire function. However, this is not yet implemented.
I hope this will fix (more of a checklist for me at this point): #4037 #6283 #25790 #29369 #36379 #42072 #43451 #45055 #47839 #50045 #56462 #56545 #58105 #58849 #58689
Closes #49541
Part of #33248 #52291
Refs, but yet to be decided: #58259 #35026 #39699
TODO
- [ ] Audit all uses of wait()
- [x] Look into libuv write cancellation
- [ ] Implement libuv write cancellation on Windows
- [ ] Submit libuv write cancellation PR upstream
- [ ] ~~Look into BLAS cancellation~~ punted
- [ ] (Maybe in a future PR) More compiler optimizations for cancellation_point!
- [ ] Merge asymmetric fences
- [ ] (Maybe in a future PR) Optimize reset_ctx establishment speed
- [ ] (Separate PR) Pointer-int-union optimizations
- [ ] Implement the "unfriendly" cancellation types
- [ ] Early interruption of inference
- [ ] Outlined cancellation handlers
Is the cancellation check independent of yielding? I think in most code, authors will not want to think about dedicated cancellation handling and will just want their code to "behave well" when cancelled. This is analogous to how most authors don't care about task scheduling details, but want their task to "play fair" by yielding. Should authors, when writing loops, insert both yields and cancellation checks? Or do the checks yield? Or should we have a function/macro that both checks and yields?
Edit: I have missed that this is already answered:
Note that cancellation_point! is also intended to be a yield point in order to faciliate the ^C mechanism described below
Sorry for the noise
Just want to say that I really like this approach! ❤️
I especially like the design of making Base.@cancel_check a point that the task can longjmp back to from effect-free code, as it (presumably) doesn't require waiting until said code finishes to hit a cancellation exit (as shown with the Collatz example). I am a bit confused how this will be implemented - will the ^C handling logic detect when a task is executing code within an effect-free region? Is the idea that reset_ctx (which is stored on the task) is cleared by the compiler once the effect-free region is left (which is also presumably a point where the task cooperatively checks for a cancellation request), and so the ^C handling logic knows when it can just forcibly suspend and reset the task via longjmp to reset_ctx?
Regarding async logic implemented in Base, what is the intended policy for whether to cancel just the waiter, or waiter+waitee? Should we expect that all async resources from Base will cancel on/within any async call, for predictability? That is to say, if we have a producer-consumer setup on a Channel, can I expect that both sides will receive an exception once/while they interact with the Channel? Similarly wondering about this for Threads.Condition, Base.Event, etc. If the answer is "yes", will there be a way to opt-out of this (in the case that the resource needs to continue operating normally to allow surrounding library logic to cancel itself)?
Regarding more than just ^C, can we expect that SIGTERM (and maybe also SIGSTOP and other fun signals) will one day invoke this logic as well? I can imagine that when trying to terminate a complex application, having the first course of action be to cancel ongoing work is conducive to a safe and expedient shutdown. We would of course want to then do the finalizer and atexit dance, which should hopefully now be able to do their jobs without concern for resources being locked or otherwise unavailable for cleanup.
Regarding timeouts and other structured cancellation, will it be possible to target a cancellation request at a particular task? I can imagine that this will avoid having to implement APIs like wait(obj; timeout), as we can just wrap the wait(obj) call with some logic that will send a cancellation request to just that task if the timeout expires before wait returns. This would also make it easy for libraries like Dagger to request arbitrary user code (running within a Dagger-launched task) to cancel when Dagger decides it's desirable (possibly without direct user input).
Aside: I do think it's worth thinking more on whether we want users to have a way to target the cancellation at a library/task/arbitrary machinery, but as mentioned, this is mostly an orthogonal concern.
I am a bit confused how this will be implemented - will the ^C handling logic detect when a task is executing code within an effect-free region?
Kind of. The canceler checks reset_ctx. If non-null, it sends a signal to the thread, which then longjmps to reset_ctx if still non-null and if cancellation_request is set on the currently running task.
Is the idea that reset_ctx (which is stored on the task) is cleared by the compiler once the effect-free region is left (which is also presumably a point where the task cooperatively checks for a cancellation request)
Yes
Regarding async logic implemented in Base, what is the intended policy for whether to cancel just the waiter, or waiter+waitee?
Policy decision by the async library, so I'm not really expressing a preference. For now, wait cancels the waitee and there's wait_nocancel to opt out. In the future there could be fancier APIs for cancellation scope.
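A usage sketch of that opt-out (the task setup is illustrative):

t = Threads.@spawn background_service()  # hypothetical long-lived work
# A cancellation request then aborts only this wait; it is not
# propagated into t.
wait_nocancel(t)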
Should we expect that all async resources from Base will cancel on/within any async call, for predictability?
I was at this point not expecting cancellation to propagate through channels and conditions - rather I was expecting that it would cancel the wait on those objects and then the thrown exception might potentially cancel the expected producer in its cleanup scope - however, that's a bit of an orthogonal API design question that I don't have a strong opinion on.
Regarding more than just ^C, can we expect that SIGTERM (and maybe also SIGSTOP and other fun signals)
Maybe - I could imagine SIGTERM trying to cancel all tasks in the system simultaneously with this mechanism - I don't know if the tree-based cancellation makes sense there, but it could be useful for graceful shutdown.
Regarding timeouts and other structured cancellation, will it be possible to target a cancellation request at a particular task?
The PR provides a cancel! API.
I guess I should have said that I want the cancellation point to be a preemption point rather than a yield point. We don't currently have that concept, so it's a bit of an open question whether those are different, but I wanted to be precise.
Capturing some slack discussion with @vtjnash. This reflects my best understanding, but @vtjnash was trying to make a larger point that I don't quite understand.
- Are plain wait, yield, yieldto, etc. cancellation points?
Probably not. The correctness of these functions depends on being paired with a unique schedule that resumes them, with correctness guarantees that need to be enforced at a higher level. There does not seem to be any good way to make these automatic cancellation points.
@vtjnash provided the example ct = current_task(); t = @task(yieldto(ct); nothing); yieldto(t); wait(t). This task t cannot be canceled, because it's not at a cancellation point.
- Are locks cancellation points?
I think we need to have both versions, with the user selecting the appropriate one.
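For concreteness, the two variants might look like this (the cancellable name is purely illustrative):

lock(l)              # not a cancellation point; blocks until acquired
lock_cancellable(l)  # a cancellation point; may throw a CancellationRequest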
- How is the cancellation guaranteed without seq_cst on the waitee field (which would be expensive)?
I think some variant of the following works:
Cancelling thread:

    while (true)
        jl_atomic_store_relaxed(&t->cancellation_request, req);
        SYS_membarrier   /* asymmetric fence: the expensive half lives on this (rare) path */
        if cancel_wait!(jl_atomic_load_acquire(&t->waitee), t)
            break
        end
    end

Waiting thread:

    jl_atomic_store_release(&t->waitee, wait);
    barrier();   /* cheap compiler barrier, paired with the SYS_membarrier above */
    if (cancelled(jl_atomic_load_relaxed(&t->cancellation_request))) throw();
    wait();
- What happens to threads that haven't started yet?
I think this is a cancellation point and the thread dies. This is different from @async wait() because you can't put try/catch around it.
- Does having the waitee field prevent GC of events that will never fire?
I think it can be a weakref.
- Libuv does not support write cancellation
Probably should be fixed; ignore for now.
- The term "safe cancellation" is inappropriate, because it can fail and hang, which doesn't feel very safe.
I'm not attached to the term. The naming was due to the possibility of introducing more unsafe cancellation variants (to be used on timeout or repeated ^C) that, while being more likely to succeed, could leave the system in an inconsistent state. Useful for poking around when debugging, but not semantically sound.
Now with compiler and reset_ctx support, courtesy of Claude (note the absence of explicit cancellation points inside the inner loop):
julia> collatz(n) = (n & 1) == 1 ? (3n + 1) : (n÷2)
collatz (generic function with 1 method)
julia> function find_collatz_counterexample_inner()
           i = 1
           while true
               j = i
               while true
                   j = collatz(j)
                   j == 1 && break
                   j == i && return j
               end
               i += 1
           end
       end
find_collatz_counterexample_inner (generic function with 1 method)
julia> function find_collatz_counterexample2()
           @Base.cancel_check
           return find_collatz_counterexample_inner()
       end
find_collatz_counterexample2 (generic function with 1 method)
julia> find_collatz_counterexample2()
^CERROR: CancellationRequest: Safe Cancellation (CANCEL_REQUEST_SAFE)
Stacktrace:
[1] handle_cancellation!(_req::Any)
@ Base ./task.jl:1423
[2] macro expansion
@ ./condition.jl:133 [inlined]
[3] find_collatz_counterexample2()
@ Main ./REPL[2]:2
[4] top-level scope
@ REPL[3]:1