Optimize `CancellableContinuationImpl.invokeOnCancellation(..)` for `Segment`s
The current semaphore implementation uses `Segment`s for storing waiting continuations. Moreover, the upcoming new channels and mutex algorithms also use segments to store waiters. When suspending, a cancellation handler should be provided via `cont.invokeOnCancellation { ... }`; it cleans up the corresponding slot in the segment and physically removes the segment from the linked list once it becomes full of cancelled cells. However, this cancellation handler requires an allocation on every suspension.
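For illustration, here is a minimal sketch of that pattern. `WaiterSegment`, `trySetWaiter`, and `cleanSlot` are made-up stand-ins for the real segment machinery; `invokeOnCancellation` is the actual `CancellableContinuation` API:

```kotlin
import kotlinx.coroutines.suspendCancellableCoroutine
import java.util.concurrent.atomic.AtomicReferenceArray
import kotlin.coroutines.resume

// Hypothetical, simplified stand-in for a segment: a fixed-size array of
// waiter slots (the real Segment also participates in a linked list).
class WaiterSegment(size: Int) {
    private val slots = AtomicReferenceArray<Any?>(size)

    fun trySetWaiter(index: Int, waiter: Any): Boolean =
        slots.compareAndSet(index, null, waiter)

    // Clears a cancelled slot; the real segment would additionally remove
    // itself from the linked list once all of its cells are cancelled.
    fun cleanSlot(index: Int) = slots.set(index, null)
}

// The pre-optimization pattern: the cancellation handler passed to
// invokeOnCancellation is a fresh lambda object on every suspension.
suspend fun suspendInSlot(segment: WaiterSegment, index: Int): Unit =
    suspendCancellableCoroutine { cont ->
        if (segment.trySetWaiter(index, cont)) {
            cont.invokeOnCancellation { segment.cleanSlot(index) } // <- allocation
        } else {
            cont.resume(Unit) // slot already taken; proceed without waiting
        }
    }
```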
To reduce the memory pressure, we can store the segment along with the slot index in `CancellableContinuationImpl` directly, as a cancellation handler instruction, thus eliminating allocations for the corresponding cancellation handlers. For this purpose, we:
- Allow storing `Segment` in the `state` field, similarly to `CancelHandler`. On cancellation, the `Segment.invokeOnCancellation(index, cause)` function is called (a simplified sketch of this shape follows the list).
- Store the slot index in the existing `decision` integer field, extending its purpose correspondingly.
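Here is a minimal sketch of that shape using simplified stand-in types; the real `CancellableContinuationImpl` is lock-free and multiplexes these fields across many more states, so only the names mentioned above come from the actual change:

```kotlin
// Simplified stand-ins; the point is that cancellation support now costs
// two plain field writes instead of a handler-object allocation.
abstract class Segment {
    // Cleans the cell at `index` and, if the segment became full of
    // cancelled cells, removes it from the linked list.
    abstract fun invokeOnCancellation(index: Int, cause: Throwable?)
}

class SimplifiedCancellableContinuation {
    private var state: Any? = null // can hold a CancelHandler or a Segment
    private var decision = 0       // repurposed to also carry the slot index

    // Allocation-free registration: store the segment and the slot index.
    fun invokeOnCancellation(segment: Segment, index: Int) {
        state = segment
        decision = index
    }

    fun cancel(cause: Throwable?) {
        (state as? Segment)?.invokeOnCancellation(decision, cause)
    }
}
```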
The benchmark below (see the comment) shows a significant allocation rate reduction.
As `Semaphore` leverages this optimization, I added a simple sequential benchmark to show the impact. The results are below.
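For context, a plausible shape for such a benchmark body (a hypothetical sketch, not necessarily the committed code): two coroutines ping-pong on a `Semaphore(1)` used as a mutex, so that each `acquire()` genuinely suspends and registers a waiter in a segment slot.

```kotlin
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Semaphore

// Hypothetical sketch: the semaphore starts with zero available permits,
// so the main coroutine suspends first; the two coroutines then alternate.
fun main() = runBlocking {
    val semaphore = Semaphore(permits = 1, acquiredPermits = 1)
    val iterations = 100_000
    val peer = launch {
        repeat(iterations) {
            semaphore.release() // resume the waiter below
            semaphore.acquire() // then suspend until it releases back
        }
    }
    repeat(iterations) {
        semaphore.acquire()     // suspends: the waiter is stored in a segment
        semaphore.release()
    }
    peer.join()
}
```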
WITHOUT optimization:
```
Benchmark                                                                    Mode  Cnt          Score           Error   Units
SequentialSemaphoreAsMutexBenchmark.benchmark                                avgt   10          0.123 ±         0.007    s/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate                 avgt   10        340.849 ±        20.797  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate.norm            avgt   10   64500711.033 ±         4.138    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space        avgt   10        389.221 ±       620.649  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space.norm   avgt   10   74868326.400 ± 119718113.259    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.count                      avgt   10          5.000                  counts
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.time                       avgt   10         57.000                      ms
```
WITH optimization:
```
Benchmark                                                                    Mode  Cnt          Score           Error   Units
SequentialSemaphoreAsMutexBenchmark.benchmark                                avgt   10          0.123 ±         0.004    s/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate                 avgt   10        213.820 ±        10.688  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.alloc.rate.norm            avgt   10   40500711.033 ±         4.138    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space        avgt   10        157.394 ±       501.946  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Eden_Space.norm   avgt   10   30303846.400 ±  96795349.649    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Old_Gen           avgt   10         ≈ 10⁻⁴                  MB/sec
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.churn.G1_Old_Gen.norm      avgt   10         25.778 ±       123.241    B/op
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.count                      avgt   10          2.000                  counts
SequentialSemaphoreAsMutexBenchmark.benchmark:·gc.time                       avgt   10         22.000                      ms
```
Will be delivered along with #3103
Let's keep the separation into two commits: the first fixes/adds benchmarks, and the second optimizes the cancellation handling mechanism.
Could you please show before/after on `ChannelSinkBenchmark`?
See the results below, obtained on my laptop (MacBook Pro 16-inch, 2021, Apple M1 Max, 64 GB; OpenJDK 64-Bit Server VM, Zulu19.32+13-CA).
WITHOUT the optimization:
```
Benchmark                                                                Mode  Cnt         Score      Error   Units
ChannelSinkBenchmark.channelPipeline                                     avgt    5         1.375 ±    0.023   ms/op
ChannelSinkBenchmark.channelPipeline:·gc.alloc.rate.norm                 avgt    5    668370.560 ±  256.516    B/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal                       avgt    5         1.756 ±    0.012   ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal:·gc.alloc.rate.norm   avgt    5    668468.144 ±  296.998    B/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals                      avgt    5         2.501 ±    0.123   ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals:·gc.alloc.rate.norm  avgt    5   1668726.477 ±  115.551    B/op
ChannelSinkNoAllocationsBenchmark.channelPipeline                        avgt    5         6.081 ±    0.140   ms/op
ChannelSinkNoAllocationsBenchmark.channelPipeline:·gc.alloc.rate.norm    avgt    5   3426068.483 ±  354.374    B/op
```
WITH the optimization:
```
Benchmark                                                                Mode  Cnt         Score      Error   Units
ChannelSinkBenchmark.channelPipeline                                     avgt    5         1.248 ±    0.013   ms/op
ChannelSinkBenchmark.channelPipeline:·gc.alloc.rate.norm                 avgt    5    488344.851 ±  129.550    B/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal                       avgt    5         1.681 ±    0.031   ms/op
ChannelSinkBenchmark.channelPipelineOneThreadLocal:·gc.alloc.rate.norm   avgt    5    488460.940 ±  258.884    B/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals                      avgt    5         2.518 ±    0.027   ms/op
ChannelSinkBenchmark.channelPipelineTwoThreadLocals:·gc.alloc.rate.norm  avgt    5   1493167.804 ±  116.414    B/op
ChannelSinkNoAllocationsBenchmark.channelPipeline                        avgt    5         5.971 ±    0.760   ms/op
ChannelSinkNoAllocationsBenchmark.channelPipeline:·gc.alloc.rate.norm    avgt    5   1025957.479 ±  274.255    B/op
```
These benchmarks do not show a performance improvement, but they clearly show reduced allocations (more than 3x on `ChannelSinkNoAllocationsBenchmark`).
Nice! I'm looking into that.