
Parallelism Compensation for CoroutineDispatcher and runBlocking

Open vsalavatov opened this issue 9 months ago • 3 comments

This is a follow-up on #4084 and #3983 and an implementation of a different approach.

In #4084 the main idea was to preserve parallelism by transferring the computational permit away from the runBlocking thread and returning it afterwards. That approach failed because there are scenarios in which the permit must not be released at all; otherwise we may introduce a deadlock.
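For reference, the underlying starvation problem can be reproduced with a small, simplified snippet (this is an illustration, not the exact scenario from #4084); with a hard parallelism limit of 1 it deadlocks outright, which also shows why a permit-transfer scheme must be careful about when the permit may be given away:

```kotlin
import kotlinx.coroutines.*

// Simplified illustration: runBlocking on a dispatcher thread parks the
// thread while still holding the dispatcher's only permit, so the inner
// task can never be scheduled.
fun main() {
    val single = Dispatchers.Default.limitedParallelism(1)
    runBlocking {
        withContext(single) {
            runBlocking { // parks the only worker of `single`
                launch(single) { println("never runs") }.join() // deadlocks
            }
        }
    }
}
```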

In this solution we implement a parallelism compensation mechanism. The main difference from the previous patch is that runBlocking does not release its computational permit; instead, it increases the allowed parallelism limit of the associated coroutine dispatcher for the duration of the park() (i.e., while it has nothing to process). An important detail is that while it is easy to increase the parallelism limit, it is difficult to decrease it back while making sure that the effective parallelism stays within the limit. In this solution we allow coroutine dispatchers to exceed the instantaneous parallelism limit, with the property that they take care of eventually adjusting the effective parallelism back to the limit. In other words, runBlocking can increase the parallelism limit immediately, but it can only request that the limit be reduced back.
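The asymmetry between raising the limit and requesting it back can be modeled with two counters. The following toy sketch uses made-up names, not the actual CoroutineScheduler fields, and folds in the "cancel a pending request instead of raising the limit" shortcut described below:

```kotlin
import java.util.concurrent.atomic.AtomicInteger

// Toy model of the compensation bookkeeping (names are hypothetical).
// The limit can be raised immediately, but lowering it only files a
// request that CPU workers honor later.
class CompensationState(private val targetParallelism: Int) {
    private val compensated = AtomicInteger(0)            // extra permits granted
    private val decompensationRequests = AtomicInteger(0) // pending reductions

    val effectiveLimit: Int
        get() = targetParallelism + compensated.get()

    // Called when a runBlocking thread parks: allow one more CPU worker,
    // or cancel a pending decompensation request if one exists.
    fun compensate() {
        while (true) {
            val pending = decompensationRequests.get()
            if (pending > 0) {
                if (decompensationRequests.compareAndSet(pending, pending - 1)) return
            } else {
                compensated.incrementAndGet()
                return
            }
        }
    }

    // Called when the runBlocking thread unparks: the limit can only be
    // requested back, not lowered synchronously.
    fun requestDecompensation() {
        decompensationRequests.incrementAndGet()
    }

    // Called by a CPU worker after each task: should it release its permit
    // to satisfy a pending decompensation request?
    fun shouldReleasePermit(): Boolean {
        while (true) {
            val pending = decompensationRequests.get()
            if (pending == 0) return false
            if (decompensationRequests.compareAndSet(pending, pending - 1)) {
                compensated.decrementAndGet()
                return true
            }
        }
    }
}
```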

Parallelism compensation may in principle cause unlimited inflation of the scheduler's thread pool; in that case we are willing to trade thread starvation for OOMs.

It is easy to notice that parallelism compensation cannot be implemented for LimitedDispatcher, because it would break its contract of a "hard" parallelism limit. So parallelism compensation is not a general property of coroutine dispatchers, but we can certainly implement it for Dispatchers.Default and Dispatchers.IO. As an alternative to .limitedParallelism, .softLimitedParallelism is introduced, which supports parallelism compensation.
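Assuming the proposed API mirrors the shape of .limitedParallelism (which this patch proposes but which is not a released kotlinx.coroutines API), the contrast would look like:

```kotlin
import kotlinx.coroutines.*

val hard = Dispatchers.Default.limitedParallelism(4)
// Hard cap: never runs more than 4 tasks at once, so compensation
// would violate its contract.

val soft = Dispatchers.Default.softLimitedParallelism(4)
// Soft cap: targets 4, but may temporarily exceed it while a runBlocking
// on one of its threads is parked; the scheduler later shrinks the
// effective parallelism back to 4.
```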

Performance considerations:

  • The intuition behind increasing the parallelism limit only for the duration of the park() is that while the thread of runBlocking is parked, it is not actually using its permit: the thread is not executing and is not taking up a CPU (of course, this is not exactly true, but it should be close enough to reality). So with compensation, the "factual" parallelism is expected to stay close to the target parallelism.
  • The implementation of "decompensation" in CoroutineScheduler works like this: after completing each CPU task, a worker holding a CPU permit checks whether it should release it. If there are pending decompensation requests, it releases its permit. The "hot path" for CPU workers now includes checking the cpuDecompensationRequests atomic, which is expected to be non-zero only rarely and to change infrequently, so it should have negligible performance impact in the "good" case when no runBlocking runs on dispatcher threads.
  • There is also an optimisation for the "runBlocking-heavy" case: before increasing the parallelism limit, runBlocking first checks whether it can instead remove one pending decompensation request. If it succeeds, some CPU worker simply keeps running instead of releasing its permit and then reacquiring it (not always exactly so, but it is a good approximation). Making the "runBlocking-heavy" case fast is not a goal, though: code that ends up in such a situation cannot be (the most) performant anyway.
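The decompensation check on the worker's hot path could be sketched like this (all names besides cpuDecompensationRequests are placeholders for scheduler internals, not the actual CoroutineScheduler code):

```kotlin
// Sketch of a CPU worker's loop with the decompensation check.
fun cpuWorkerLoop() {
    while (true) {
        val task = findCpuTask() ?: break
        task.run()
        // Hot path: a single atomic read in the common (no-runBlocking) case.
        if (cpuDecompensationRequests.get() > 0 && tryAcknowledgeDecompensation()) {
            releaseCpuPermit() // effective parallelism shrinks back toward the limit
            break              // continue as a non-CPU worker or park
        }
    }
}
```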

One of the potential issues I see with this approach is that the theoretically unlimited inflation of the coroutine scheduler thread pool caused by runBlocking (even if it happens only occasionally) may hit performance of CPU workers due to longer searches in task stealing.

I ran the existing benchmarks and didn't see anything suspicious in the results, but I'm not a JMH guru; this probably requires additional research.

The main development branch is here; this branch was prepared specifically for the MR. I don't expect this to be merged as a real MR; it is rather a reference implementation.

vsalavatov · May 23 '24 15:05