Optimize `CancellableContinuationImpl.invokeOnCancellation(..)` for `Segment`s

Open · ndkoval opened this issue 3 years ago

The current `Semaphore` implementation uses `Segment`s to store waiting continuations. Moreover, the upcoming new channels and mutex algorithms also use segments to store waiters. When suspending, a cancellation handler should be provided via `cont.invokeOnCancellation { ... }`; it cleans up the corresponding slot in the segment and physically removes the segment from the linked list once it consists only of cancelled cells. However, this cancellation handler requires an allocation every time.
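
For illustration, here is a minimal sketch of this pattern; `SemaphoreSegmentSketch`, `storeWaiter`, and `onCancelledSlot` are hypothetical names standing in for the real internals:

```kotlin
import kotlinx.coroutines.CancellableContinuation
import kotlinx.coroutines.suspendCancellableCoroutine

// Hypothetical stand-in for the segment class used by the real Semaphore.
class SemaphoreSegmentSketch(size: Int) {
    private val slots = arrayOfNulls<CancellableContinuation<Unit>>(size)

    fun storeWaiter(index: Int, cont: CancellableContinuation<Unit>) {
        slots[index] = cont // a later `release` resumes this waiter
    }

    fun onCancelledSlot(index: Int, cause: Throwable?) {
        // Clean the slot; a segment consisting only of cancelled cells
        // is physically unlinked from the list.
        slots[index] = null
    }
}

suspend fun acquireSlow(segment: SemaphoreSegmentSketch, index: Int): Unit =
    suspendCancellableCoroutine { cont ->
        segment.storeWaiter(index, cont)
        // This lambda captures `segment` and `index`, so a fresh handler
        // object is allocated on every suspension: the cost this issue
        // aims to eliminate.
        cont.invokeOnCancellation { cause ->
            segment.onCancelledSlot(index, cause)
        }
    }
```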

To reduce the memory pressure, we can store the segment along with the slot index directly in `CancellableContinuationImpl`, treating them as a cancellation handler instruction and thus eliminating the allocations for the corresponding cancellation handlers. For this purpose, we:

  1. Allow storing a `Segment` in the `state` field, similarly to `CancelHandler`. On cancellation, the `Segment.invokeOnCancellation(index, cause)` function is called.
  2. Store the slot index in the existing `decision` integer field, extending its purpose correspondingly (see the sketch after this list).
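
A rough sketch of the resulting shape, with simplified names and without the lock-free atomics and bit-packing of the real `CancellableContinuationImpl`:

```kotlin
// Illustrative sketch; the real implementation is lock-free and packs
// the decision bits together with the slot index into a single field.
abstract class Segment {
    // Invoked on cancellation instead of a heap-allocated handler;
    // `index` identifies the waiter's slot within this segment.
    abstract fun invokeOnCancellation(index: Int, cause: Throwable?)
}

class CancellableContinuationSketch {
    // The `state` field may now hold a Segment, similarly to CancelHandler.
    private var state: Any? = null

    // The slot index rides in the existing `decision` integer field.
    private var decisionAndIndex: Int = NO_INDEX

    fun invokeOnCancellation(segment: Segment, index: Int) {
        decisionAndIndex = index // no handler object is allocated here
        state = segment
    }

    fun cancel(cause: Throwable?) {
        val s = state
        if (s is Segment) s.invokeOnCancellation(decisionAndIndex, cause)
    }

    private companion object {
        const val NO_INDEX = -1
    }
}
```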

The benchmark below (see the next comment) shows a significant reduction in the allocation rate.

ndkoval · Dec 14 '21

As `Semaphore` leverages this optimization, I added a simple sequential benchmark to show the impact. The results are below.

WITHOUT optimization:

Benchmark                                                                   Mode  Cnt         Score           Error   Units
SequentialSemaphoreAsMutexBenchmark.benchmark                               avgt   10         0.123 ±         0.007    s/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate                avgt   10       340.849 ±        20.797  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate.norm           avgt   10  64500711.033 ±         4.138    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space       avgt   10       389.221 ±       620.649  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space.norm  avgt   10  74868326.400 ± 119718113.259    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.count                     avgt   10         5.000                  counts
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.time                      avgt   10        57.000                      ms

WITH optimization:

Benchmark                                                                   Mode  Cnt         Score           Error   Units
SequentialSemaphoreAsMutexBenchmark.benchmark                               avgt   10         0.123 ±        0.004    s/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate                avgt   10       213.820 ±       10.688  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate.norm           avgt   10  40500711.033 ±        4.138    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space       avgt   10       157.394 ±      501.946  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space.norm  avgt   10  30303846.400 ± 96795349.649    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Old_Gen          avgt   10        ≈ 10⁻⁴                 MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Old_Gen.norm     avgt   10        25.778 ±      123.241    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.count                     avgt   10         2.000                 counts
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.time                      avgt   10        22.000                     ms
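
The benchmark body itself is not quoted in this thread; a minimal JMH sketch of such a sequential semaphore-as-mutex workload (the class body and iteration count here are assumptions) could look like this:

```kotlin
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Semaphore
import org.openjdk.jmh.annotations.Benchmark
import org.openjdk.jmh.annotations.BenchmarkMode
import org.openjdk.jmh.annotations.Mode
import org.openjdk.jmh.annotations.Scope
import org.openjdk.jmh.annotations.State

private const val N = 100_000 // assumed iteration count

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
open class SequentialSemaphoreAsMutexBenchmarkSketch {
    @Benchmark
    fun benchmark(): Unit = runBlocking {
        // The single permit starts out taken, so every acquire() below
        // suspends and registers a cancellation handler.
        val s = Semaphore(permits = 1, acquiredPermits = 1)
        val job = launch {
            repeat(N) {
                s.release()
                s.acquire()
            }
        }
        repeat(N) {
            s.acquire() // suspends; ping-pongs with the launched coroutine
            s.release()
        }
        job.join()
    }
}
```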

ndkoval · Dec 14 '21

Will be delivered along with #3103

ndkoval · Nov 28 '22

Let's keep the separation into two commits: the first that fixes/adds benchmarks, and the second that optimizes the cancellation handling mechanism.

ndkoval · Feb 13 '23

Could you please show before/after on `ChannelSinkBenchmark`?

qwwdfsad · Feb 13 '23

Below are the results on my laptop (MacBook Pro 16-inch, 2021, Apple M1 Max, 64 GB, OpenJDK 64-Bit Server VM Zulu19.32+13-CA).

WITHOUT the optimization:

Benchmark                                                                             Mode  Cnt        Score         Error   Units
ChannelSinkBenchmark.channelPipeline                                                  avgt    5        1.375 ±       0.023   ms/op
ChannelSinkBenchmark.channelPipeline:·gc.alloc.rate.norm                              avgt    5   668370.560 ±     256.516    B/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal                                    avgt    5        1.756 ±       0.012   ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal:·gc.alloc.rate.norm                avgt    5   668468.144 ±     296.998    B/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals                                   avgt    5        2.501 ±       0.123   ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals:·gc.alloc.rate.norm               avgt    5  1668726.477 ±     115.551    B/op
ChannelSinkNoAllocationsBenchmark.channelPipeline                                     avgt    5        6.081 ±       0.140   ms/op
ChannelSinkNoAllocationsBenchmark.channelPipeline:·gc.alloc.rate.norm                 avgt    5  3426068.483 ±     354.374    B/op

WITH the optimization:

Benchmark                                                                              Mode  Cnt        Score         Error   Units
ChannelSinkBenchmark.channelPipeline                                                   avgt    5        1.248 ±       0.013   ms/op
ChannelSinkBenchmark.channelPipeline:·gc.alloc.rate.norm                               avgt    5   488344.851 ±     129.550    B/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal                                     avgt    5        1.681 ±       0.031   ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal:·gc.alloc.rate.norm                 avgt    5   488460.940 ±     258.884    B/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals                                    avgt    5        2.518 ±       0.027   ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals:·gc.alloc.rate.norm                avgt    5  1493167.804 ±     116.414    B/op
ChannelSinkNoAllocationsBenchmark.channelPipeline                                      avgt    5        5.971 ±       0.760   ms/op
ChannelSinkNoAllocationsBenchmark.channelPipeline:·gc.alloc.rate.norm                  avgt    5  1025957.479 ±     274.255    B/op    
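
For reference, `ChannelSinkBenchmark` pumps elements through a pipeline of chained channels; a condensed sketch of that pattern (the exact stages and element count here are assumptions):

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce
import kotlinx.coroutines.runBlocking

// Stage 1: produce a range of numbers.
@OptIn(ExperimentalCoroutinesApi::class)
fun CoroutineScope.range(count: Int): ReceiveChannel<Int> = produce {
    for (i in 0 until count) send(i)
}

// Stage 2: a filtering stage chained onto the previous channel.
@OptIn(ExperimentalCoroutinesApi::class)
fun CoroutineScope.filterOdd(numbers: ReceiveChannel<Int>): ReceiveChannel<Int> =
    produce {
        for (n in numbers) if (n % 2 != 0) send(n)
    }

fun main() = runBlocking {
    // Each send/receive on a rendezvous channel may suspend, and each
    // suspension used to allocate a cancellation handler; the optimization
    // stores the segment and slot index in the continuation instead.
    var sum = 0L
    for (n in filterOdd(range(1_000_000))) sum += n
    println(sum)
}
```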

ndkoval · Feb 15 '23

These benchmarks do not show a performance improvement, but they clearly show reduced allocations (more than 3x on `ChannelSinkNoAllocationsBenchmark`).

ndkoval · Feb 15 '23

Nice! I'm looking into that

qwwdfsad · Feb 15 '23