
Parallelism Compensation for CoroutineDispatcher and runBlocking

Open vsalavatov opened this issue 9 months ago • 3 comments

This is a follow-up on #4084 and #3983 and an implementation of a different approach.

In #4084 the main idea was to preserve parallelism by transferring the computational permit away from the runBlocking thread and returning it afterwards. That approach failed because there are scenarios in which the permit must not be released at all; otherwise we may introduce a deadlock.
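For reference, the underlying starvation problem can be reproduced with a small, simplified snippet (this is an illustration, not the exact scenario from #4084); with a hard parallelism limit of 1 it deadlocks outright, which also shows why a permit-transfer scheme must be careful about when the permit may be given away:

```kotlin
import kotlinx.coroutines.*

// Simplified illustration: runBlocking on a dispatcher thread parks the
// thread while still holding the dispatcher's only permit, so the inner
// task can never be scheduled.
fun main() {
    val single = Dispatchers.Default.limitedParallelism(1)
    runBlocking {
        withContext(single) {
            runBlocking { // parks the only worker of `single`
                launch(single) { println("never runs") }.join() // deadlocks
            }
        }
    }
}
```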

In this solution we implement a parallelism compensation mechanism. The main difference from the previous patch is that runBlocking does not release its computational permit; instead, it increases the allowed parallelism limit of the associated coroutine dispatcher for the duration of the park() (i.e., while it has nothing to process). An important detail is that while it is easy to increase the parallelism limit, it is difficult to decrease it back while making sure that the effective parallelism stays within the limit. In this solution we allow coroutine dispatchers to exceed the instantaneous parallelism limit, with the property that they take care of eventually adjusting the effective parallelism back to the limit. In other words, runBlocking can increase the parallelism limit immediately, but it can only request that the limit be reduced back.
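The asymmetry between raising the limit and requesting it back can be modeled with two counters. The following toy sketch uses made-up names, not the actual CoroutineScheduler fields, and folds in the "cancel a pending request instead of raising the limit" shortcut described below:

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Toy model of the compensation bookkeeping (names are hypothetical).
// The limit can be raised immediately, but lowering it only files a
// request that CPU workers honor later.
class CompensationState(private val targetParallelism: Int) {
    private val compensated = AtomicInteger(0)            // extra permits granted
    private val decompensationRequests = AtomicInteger(0) // pending reductions

    val effectiveLimit: Int
        get() = targetParallelism + compensated.get()

    // Called when a runBlocking thread parks: allow one more CPU worker,
    // or cancel a pending decompensation request if one exists.
    fun compensate() {
        while (true) {
            val pending = decompensationRequests.get()
            if (pending > 0) {
                if (decompensationRequests.compareAndSet(pending, pending - 1)) return
            } else {
                compensated.incrementAndGet()
                return
            }
        }
    }

    // Called when the runBlocking thread unparks: the limit can only be
    // requested back, not lowered synchronously.
    fun requestDecompensation() {
        decompensationRequests.incrementAndGet()
    }

    // Called by a CPU worker after each task: should it release its permit
    // to satisfy a pending decompensation request?
    fun shouldReleasePermit(): Boolean {
        while (true) {
            val pending = decompensationRequests.get()
            if (pending == 0) return false
            if (decompensationRequests.compareAndSet(pending, pending - 1)) {
                compensated.decrementAndGet()
                return true
            }
        }
    }
}
```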

Parallelism compensation may in principle cause unlimited inflation of the scheduler's thread pool; in that case we are willing to trade thread starvation for OOMs.

It is easy to notice that parallelism compensation cannot be implemented for LimitedDispatcher, because it would break its contract of a "hard" parallelism limit. So parallelism compensation is not a general property of coroutine dispatchers, but we can certainly implement it for Dispatchers.Default and Dispatchers.IO. As an alternative to .limitedParallelism, .softLimitedParallelism is introduced, which supports parallelism compensation.
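Assuming the proposed API mirrors the shape of .limitedParallelism (which this patch proposes but which is not a released kotlinx.coroutines API), the contrast would look like:

```kotlin
import kotlinx.coroutines.*

val hard = Dispatchers.Default.limitedParallelism(4)
// Hard cap: never runs more than 4 tasks at once, so compensation
// would violate its contract.

val soft = Dispatchers.Default.softLimitedParallelism(4)
// Soft cap: targets 4, but may temporarily exceed it while a runBlocking
// on one of its threads is parked; the scheduler later shrinks the
// effective parallelism back to 4.
```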

Performance considerations:

  • The intuition behind increasing the parallelism limit only for the duration of the park() is that while the thread of runBlocking is parked, it is not actually using its permit: the thread is not executing and is not taking up a CPU (of course, this is not exactly true, but it should be close enough to reality). So with compensation, the "factual" parallelism is expected to stay close to the target parallelism.
  • The implementation of "decompensation" in CoroutineScheduler works like this: after completing each CPU task, a worker holding a CPU permit checks whether it should release it. If there are pending decompensation requests, it releases its permit. The "hot path" for CPU workers now includes checking the cpuDecompensationRequests atomic, which is expected to be non-zero only rarely and to change infrequently, so it should have negligible performance impact in the "good" case when no runBlocking runs on dispatcher threads.
  • There is also an optimisation for the "runBlocking-heavy" case: before increasing the parallelism limit, runBlocking first checks whether it can instead remove one pending decompensation request. If it succeeds, some CPU worker simply keeps running instead of releasing its permit and then reacquiring it (not always exactly so, but it is a good approximation). Making the "runBlocking-heavy" case fast is not a goal, though: code that ends up in such a situation cannot be (the most) performant anyway.
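The decompensation check on the worker's hot path could be sketched like this (all names besides cpuDecompensationRequests are placeholders for scheduler internals, not the actual CoroutineScheduler code):

```kotlin
// Sketch of a CPU worker's loop with the decompensation check.
fun cpuWorkerLoop() {
    while (true) {
        val task = findCpuTask() ?: break
        task.run()
        // Hot path: a single atomic read in the common (no-runBlocking) case.
        if (cpuDecompensationRequests.get() > 0 && tryAcknowledgeDecompensation()) {
            releaseCpuPermit() // effective parallelism shrinks back toward the limit
            break              // continue as a non-CPU worker or park
        }
    }
}
```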

One of the potential issues I see with this approach is that the theoretically unlimited inflation of the coroutine scheduler thread pool caused by runBlocking (even if it happens only occasionally) may hit performance of CPU workers due to longer searches in task stealing.

I ran the existing benchmarks and didn't see anything suspicious in the results, but I'm not a JMH guru; this probably requires additional research.

The main development branch is here; this branch was prepared specifically for the MR. I don't expect this to be merged as a real MR; it is rather a reference implementation.

vsalavatov · May 23 '24 15:05