kotlinx.coroutines CancellationException can be thrown not only to indicate cancellation

trafficstars

https://betterprogramming.pub/the-silent-killer-thats-crashing-your-coroutines-9171d1e8f79b raises a good point: CancellationException should indicate that the coroutine in which it is raised is cancelled, but it's not always actually the case:

    fun testFoo() {
        val scope = CoroutineScope(Dispatchers.IO)
        val f = scope.async {
            delay(1000)
        }
        scope.cancel()
        var result = false
        runBlocking {
            launch {
                f.await() // throws CancellationException, silently stopping the `launch` coroutine
                result = true
            }.join()
        }
        check(result) { "issue" }
    }

Here, the programmer didn't make the mistake of catching a CancellationException and rethrowing it somewhere else, but still, the CancellationException leaked from a different scope.

Mar 06 '23 10:03 dkhalanskyjb

Need to dive deep into the implementation to see if this is possible, but it looks like modifying Deferred to throw a non-CancellationException may not be a good solution.

Consider this simplified scenario:

runBlocking {
  val result = async(Dispatchers.Default) {
    delay(5.seconds)
  }
  launch(Dispatchers.Unconfined) {
    result.await()
  }
  delay(1.seconds)
  cancel() // cancel the common scope
}

Here, we possibly have a race: the global scope is cancelled, which causes the coroutines in async and launch also to be cancelled. However, it can happen that this sequence of events happens before launch gets cancelled:

async gets cancelled,
the Deferred learns about the cancellation exception,
await finishes with a non-CancellationException in the launch,
which causes the launch to fail with a non-CancellationException,
with which the entire runBlocking fails!

So, the code can fail with one exception or rarely with another. This is just one simplified example of a broader class of races that we could introduce.

Mar 09 '23 12:03 dkhalanskyjb

await (probably another function since this one is taken) should return kotlin.Result

if await resumes with CE, then it means the current coroutine was cancelled;
if await resumes with Result, then async was completed before the current coroutine was cancelled, and it represents a value/CE/failure of the async.

Mar 09 '23 13:03 dovchinnikov

The above could be implemented as an extension function:

suspend fun Deferred<T>.awaitResult() : Result<T> {
  return try {
    Result.success(await())
  }
  catch (ce: CancellationException) {
    // this can "override" the original CE, but the current coroutine cancellation is preferred, so it's okay
    currentCoroutineContext().job.ensureActive()
    Result.failure(t)
  }
  catch (t: Throwable) {
    // resume awaitResult with the original throwable result to give a chance to handle it
    // even if the current coroutine is cancelled concurrently
    Result.failure(t)
  }
}

Mar 09 '23 13:03 dovchinnikov

This is a robust solution, but non-Kotlin-style. Kotlin tries to keep hierarchies of return types as thin as possible to save the users the boilerplate of unpacking the results. For example, Map<Int, String?>::get doesn't return an Optional<String?>, it just returns String?.

Mar 09 '23 13:03 dkhalanskyjb

Well, this is as minimal as Value | Throwable union can be at the moment

Mar 09 '23 13:03 dovchinnikov

Just a Value (with the implicit Throwable that can surface at any point in a JVM program) is even more minimal.

Mar 09 '23 13:03 dkhalanskyjb

But then you lose info about the origin of the Throwable, which is the very problem here. To check the origin, one has to catch the CE and use ensureActive manually.

Mar 09 '23 13:03 dovchinnikov

Another approach is to expose a subclass of CE, which will mean "another coroutine was cancelled":

try {
  await()
}
catch (dce: DeferredCancellationException) { // name TBD
  // deferred was cancelled
}
catch (ce: CancellationException) {
  // this coroutine was cancelled
}

Mar 09 '23 13:03 dovchinnikov

Another approach is to expose a subclass of CE, which will mean "another coroutine was cancelled":

What should happen if the code in async fails with a DeferredCancellationException?

Mar 12 '23 08:03 fvasco

If you are using deferred.await() in our code and you want to distinguish the case where the coroutine you're awaiting for was cancelled or not, you already can do it by explicitly checking deferred.isCancelled. However, I don't think the need to make this distinction happens very often.

IMO, the great takeaway from the original post on the "The Silent Killer" is this pattern:

try {
    doStuff()
} catch (error: Throwable) {
    coroutineContext.ensureActive()
    handleError(error)
}

That should be the recommended pattern to use instead of trying to catch CancellationException and ignoring it. That is a clear documentation thing, but we might also consider doing some utilities to help with it.

However, there is also one thing that can be fixed in the coroutines library itself in a backwards compatible way. The current state of affairs is that the coroutine that dies by a rogue CancellationException indeed does so completely silently, even though it is not cancelled itself. This can be clearly demonstrated by the following code:

suspend fun main() {
    val cancelledDeferred = CompletableDeferred<Int>()
    cancelledDeferred.cancel()
    supervisorScope {
        // This coroutine fails in a regular way
        launch(CoroutineName("one")) {
            error("Regular failure")
        }
        // This coroutine fails while waiting on cancelled coroutine
        launch(CoroutineName("two")) {
            cancelledDeferred.await()
        }
    }
}

Playground link

Neither coroutine "one" nor coroutine "two" is cancelled, both fail with exception, yet only the exception for the failure of the first coroutine is shown. We can consider changing this logic to report that the second coroutine has failed with CancelletationException, too.

Mar 13 '23 09:03 elizarov

The current state of affairs is that the coroutine that dies by a rogue CancellationException indeed does so completely silently, even though it is not cancelled itself

It can be handled by treating CE as failure if the current coroutine was not actually cancelled. The coroutine can be cancelled concurrently, though, but the concurrent cancellation behaviour with a caught CE should be equivalent to the behaviour with a caught error("Regular failure").

Treating CEs in a non-cancelled coroutine as failure would imply that the manual cancellation should be done differently:

// instead of this:
launch {
  throw CancellationException() // would be a failure
}

// one would need to use this:
launch {
  coroutineContext.cancel()
  coroutineContext.ensureActive() // this CE will be caught in cancelled state -> not a failure
}

Now, a function for cancel() + ensureActive() which returns Nothing is useful regardless of how this would be resolved: cancel() (comparing to throw CancellationException()) triggers the cancellation of the child coroutines right away without waiting for CE to be caught by the coroutine.

Apr 20 '23 14:04 dovchinnikov

If you are using deferred.await() in our code and you want to distinguish the case where the coroutine you're awaiting for was cancelled or not, you already can do it by explicitly checking deferred.isCancelled. However, I don't think the need to make this distinction happens very often.

IMO, the great takeaway from the original post on the "The Silent Killer" is this pattern:
try {
    doStuff()
} catch (error: Throwable) {
    coroutineContext.ensureActive()
    handleError(error)
}
That should be the recommended pattern to use instead of trying to catch CancellationException and ignoring it. That is a clear documentation thing, but we might also consider doing some utilities to help with it.

Please don't make that the recommended pattern. The point of CancellationException is that it's the only exception that safely bubbles up to the next checkpoint (be that launch or some try block with a custom subclass of CancellationException). This bubbling behavior is very important. The proposed pattern is bad because intermediate try-catch layers will believe there is an error instead of controlled backtracking. We use CancellationException as a signal for safe abortion in our codebase because that's the only exception which gets actively ignored by intermediate code layers.

The proper solution is to fix the APIs where possible and introduce new fixed APIs where backwards compatibility is needed. The ensureActive() workaround is imprecise and not a true causal relationship.

Apr 28 '23 07:04 wkornewald

I was debugging something like this for two days(in a production environment, had to add many log/counter and deploy and wait a reproducer, do that again for at least 20 times, VERY bad experience), even suspected at one point that it was a compiler BUG...

// simplified reproducer, the original is more complicated
// thus it's even harder to debug such kind of control-flow issue
suspend fun structuredConcurrency(toBeProcessedData:Collection<Any>){
    coroutineScope{
        for(data in toBeProcessedData)
            launch{handleData(data)}
    }
    println("all data was processed")
}

the all data was processed got printed so I assume that all data was processed(seriously, is this code looks anything wrong for anybody? but guess what, it's not, and NO EXCEPTIONS IS LOGGED! it looks like a mysterious break was in the for loop, the coroutine just vanishes... no return, no exception, how is that possible...

well, this is because deep down in the handleData(data) callstack, someone introduced withTimeout... a reasonable change, but blow up the whole thing :(

the point is, throwing CancellationException IS NOT THE SAME as cancelling a coroutine! ANYTHING CAN THROW ANY EXCEPTION, isn't that the point of structured concurrency? if we call something, it has to EITHER RETURN OR THROW, not vanishing without a trace...

Anyway I diverge, I agree with @elizarov "We can consider changing this logic to report that the second coroutine has failed with CancelletationException, too", and here's my advice:

when cancelling a coroutine, the automatically produced CancellationException should have a private mark on it / or the cancelling coroutine should have a private mark of the current propagating CancellationException
if the marked CancellationException propagated, and is in the correct structured concurrency way, keep the current behavior
if any other exception propagated, or if the afore mentioned CancellationException is not propagating in the correct structured concurrency way(for example the case mentioned in the original post, or the user grab that reference and throwing them in another coroutine etc), treat them as error(throw outwards and don't swallow them
please don't complicate things by introducing Result or un-subclassing TimeoutCancellationException from CancellationException(backwards incompatible and very hard to find in tons of catch(e:CancellationException){...}

Jul 19 '23 05:07 revintec

@revintec I think what you describe is a different issue: https://github.com/Kotlin/kotlinx.coroutines/issues/1374

Jul 24 '23 09:07 dkhalanskyjb

Here are my 5 cents.

When you are waiting for coroutine you might face two different states:

Coroutine completed and returned a value
Coroutine was stopped. Be it error or scope cancellation: it doesn't matter.

Should I expect the latter? It depends on how my code is structured.

suspend fun doAll() {
  coroutineScope { 
    async { 
      
    }.await() // I do not want to catch CE here, it doesn't make sense.
  }
}

But from the other hand:

suspend fun welcomeUser(getUserNameProcess:Deferred<String>) {
  try {
    val userName = getUserNameProcess.await()
    println("...$userName")
  }catch (_:CancellationException) {
    println("Failed to get user name, it seems you just cancelled this process!")
  }
}

See: coroutines are like processes. Sometimes you do not expect them to die (if this process is an integral part of your application). Sometimes you are ok with it (if they are really "external" processes).

I'd say that CE shouldn't cross the scope boundaries. If you are waiting for Job from another scope -- your own scope shouldn't be affected by foreign CE. Imagine your process killed because you waited for some other process that got SIGKILL.

I wish we had

sealed class AwaitResult<T> {
  abstract fun unwrap(): T
  class Stopped<T>(val e: CancellationException) : AwaitResult<T>() {
    override fun unwrap(): T = throw e
  }

  class Ok<T>(val v: T) : AwaitResult<T>() {
    override fun unwrap(): T = v
  }
}

suspend fun <T> Deferred<T>.newAwait(): AwaitResult<T> = ; /**/

suspend fun doAll() {
  coroutineScope {
    async {

    }.newAwait().unwrap() // I do not want to catch CE here, it doesn't make sense.
  }
}

suspend fun welcomeUser(getUserNameProcess: Deferred<String>) {
  // I do care about result here. I am not interested in catching foreign CE though
  when (val userName = getUserNameProcess.newAwait()) {
    is AwaitResult.Ok -> println("Welcome $userName")
    is AwaitResult.Stopped -> println("Failed to get user name, it seems you just cancelled this process!")
  }
}

Nov 28 '23 19:11 throwable-one

@throwable-one, it seems your problem is not with how exceptions are handled in the library overall but with the specific behavior of await. Here's one way to implement the function you want:

@OptIn(ExperimentalCoroutinesApi::class)
suspend fun <T> Deferred<T>.newAwait(): AwaitResult<T> {
    join()
    return when (val exception = getCompletionExceptionOrNull()) {
        null -> AwaitResult.Ok(getCompleted())
        is CancellationException -> AwaitResult.Stopped(exception)
        else -> throw exception
    }
}

Nov 29 '23 09:11 dkhalanskyjb

2 cents on the design, please destroy everything I say.

AFAIU CancellationException is just a signal to terminate a cancelled coroutine promptly without breaking the exceptions effect system (aka run the finally blocks). If there were no exceptions, then a coroutine could be terminated by simply suspending it without resuming it. However, currently kotlinx coroutines

uses CancellationException not only as a signal, but also as a result (the result of a cancelled deferred coroutine can be CancellationException).
doesn't check for the origin of the CancellationException (even me manually throwing a CancellationException can result in a silent coroutine death)

fun main() = runBlocking {
    launch {
        throw CancellationException()
    }
    println("Hello, world!")
}

Kotlinx coroutines also encodes three possible states:

Finished(success)
Finished(failure)
Finished(cancelled)

with a single cause: Throwable? where the cancelled and failure states are unionized in the cause != null (cancelled=cause is CancellationException). In practice it makes sense, as the result of a couroutine can be technically an exception or a result, but IMHO it's important to remember that we use an exception just because we are forced to do so just for the cancellation signal to not break the exception system. Once that the exception reaches the coroutine it should be unpacked to its actual meaning: is this the cancellation signal? then the coroutine state changes to cancelled. Is this not the cancellation signal? The the coroutine state changes to completed exceptionally. Again, if there were no exceptions and we implemented cancellation by just suspending the coroutine, we would likely have the cancelled state as a seperate one instead of encoding it as cause is CancellationException.

If the above makes sense, then kotlinx coroutines could:

not use CancellationException as the result of a deferred coroutine. For example, if the deferred coroutine is cancelled, then await could throw a CancelledCoroutineException (a failure).
add a way to identity the correct cancellation signals, for example, adding a token to each CancellationException which allows the coroutine completed with it to verify the token and update the state in response.

In an ideal world, CancellationException could be an internal implementation detail of kotlinx coroutines which is used to implement the signal transparently. In practice, developers catching Throwable will always also catch the CancellationException, so they have to rethrow it.

Instead of

interface Job {
    public fun cancel(cause: CancellationException? = null)
}

we could have

internal class CancellationException: Throwable

interface Job {
    public fun cancel(message: String? = null, cause: Throwable? = null)
}

fun Throwable.rethrowIfCancellationSignal() {
  if (this is CancellationException) { throw this }
}

The state of the coroutine conceptually would be:

sealed class State {
    data class Completed(val value: Any?): State()
    data class Failed(val cause: Throwable): State()
    data class Cancelled(val message: String, val cause: Throwable?): State()
}

and Deferred.await would do

fun await() = when(state) {
    is State.Cancelled -> throw CoroutineCancelledException(state.message, state.cause)
    is State.Completed -> state.value
    is State.Failed -> throw state.cause
}

This is just to say that IMHO this should not be "solved" by the developer using the coroutineContext.ensureActive() pattern, which is just a way of condensing the following logic:

if this coroutine is cancelled, suppress the other exception or cancellation signal incoming and replace it with my cancellation signal.
if this coroutine is not cancelled, then the exception is either a failure or a cancellation signal of another coroutine (how did it get there though!), handle them separately.

This is not a great pattern IMHO, and should be handled by the library if possible, even if getting there with backwards compatibility could be painful.

Feb 14 '24 10:02 francescotescari

Overall, I agree with your points.

should be handled by the library if possible, even if getting there with backwards compatibility could be painful.

This is the main problem. Yes, this would be painful, to the point I personally would call hopeless. kotlinx-coroutines guarantees, for example, that if library A depends on B, where B only uses stable APIs, and C depends on kotlinx-coroutines, then upgrading the version of kotlinx-coroutines should not require recompiling the library B. So, even if we somehow magically manage to make all code automatically behave strictly better after it's recompiled, there's a whole world of binary compatibility headaches of the form "what if A sends a Deferred to B," or "what if B returns a Deferred to A," or "what if B calls cancel, but A checks the cancellation exception," and so on, and so on. There's a plethora of interactions we'd need to consider to even hope to do this correctly. By my estimates, this would amount to dropping everything else we're doing and brainstorming just this task, and even then, there wouldn't be a guarantee that we'd be able to evolve this (really ubiquitous) API.

If there ever is a kotlinx.coroutines 2.0 that is allowed to break compatibility freely (NB: I don't mean indiscriminately), I'm certain this is an area we'd do differently, more in line with what you're saying.

if this coroutine is cancelled, suppress the other exception or cancellation signal incoming and replace it with my cancellation signal.

I agree that the pattern proposed above is suboptimal. Here's a better one:

try {
  // do anything
} catch (e: Throwable) {
  if (e is CancellationException) currentCoroutineContext().ensureActive()
  handleException(e)
}

The pattern without if (e is CancellationException) isn't in line with the prompt cancellation guarantee that a lot of our functions give. In essence, prompt cancellation states that cancellation of the calling coroutine takes precedence over normal values, but not over exceptions. Without the if, the cancellation exception really does suppress the other exception, which is inconsistent and probably non-desirable.

Feb 14 '24 11:02 dkhalanskyjb

kotlinx.coroutines kotlinx.coroutines copied to clipboard

CancellationException can be thrown not only to indicate cancellation

kotlinx.coroutines
kotlinx.coroutines copied to clipboard