kotlinx-io
Support ByteBuffer as a backing storage on JVM
java.nio.ByteBuffer is THE data container in Java NIO APIs. Those who need features provided only by the NIO APIs (like non-blocking sockets) are doomed to use ByteBuffer for data transfer. Those who need better performance or I/O interfaces unavailable in the Java standard library will end up using libraries that might roll out their own data containers but usually still allow wrapping or directly using ByteBuffer (as Netty and Aeron do).
It's possible to wrap a heap-allocated byte array (the backing storage for kotlinx-io segments) into a HeapByteBuffer, but the use of heap buffers comes with a cost. The majority of NIO API calls eventually perform a native call. If such a call (for example, a native wrapper for POSIX write) needs data, then NIO will supply it in the form of DirectByteBuffer or a memory address extracted from the DirectByteBuffer. If a user had provided DirectByteBuffer, then that buffer will be used, but if it was a HeapByteBuffer, then its content will be copied into an internal cached DirectByteBuffer instance and only then passed to the native API. If the buffer is empty, then the copying cost could be neglected, but as the buffer grows, it starts playing a more significant role in overall performance.
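To make the heap/direct distinction concrete, here is a minimal Kotlin sketch that uses only standard java.nio calls. The helper names wrapHeap, copyToDirect and writeFully are made up for illustration and are not part of kotlinx-io.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.WritableByteChannel

// Wraps an existing array without copying; the result is a heap buffer.
fun wrapHeap(data: ByteArray): ByteBuffer = ByteBuffer.wrap(data)

// Copies the array into freshly allocated native memory; the result is a direct buffer.
fun copyToDirect(data: ByteArray): ByteBuffer {
    val direct = ByteBuffer.allocateDirect(data.size)
    direct.put(data)
    direct.flip() // switch from writing to reading: position = 0, limit = data.size
    return direct
}

// Drains the buffer into a channel. As described above, for channels backed by
// native calls a heap buffer passed here is first copied into an internal
// direct buffer, while a direct buffer is used as is.
fun writeFully(channel: WritableByteChannel, buffer: ByteBuffer) {
    while (buffer.hasRemaining()) {
        channel.write(buffer)
    }
}
```

Whether the buffer is heap or direct can be checked with ByteBuffer.isDirect; the extra copy discussed above only applies to the heap case.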
Besides performance issues with the NIO API, a buffer residing in native memory is a necessity when it comes to implementing a Java API for native IO APIs that are not yet supported, such as io_uring, send with the MSG_ZEROCOPY flag, epoll in edge-triggered mode, etc. The only available option for allocating such a buffer and using it across the wide range of JVM versions supported by Kotlin is DirectByteBuffer.
Unfortunately, using direct byte buffers is not always an option:
- some APIs don't directly support it on JVM (like MessageDigest)
- manipulations with the buffer itself work significantly slower on Android
So the only viable option might be to support both byte-arrays and ByteBuffers as a backing storage and provide a way to choose what particular implementation to use when starting an app.
Tasks:
- [x] investigate ByteBuffers advantages/need to support it in kotlinx-io
- [x] publish results of BB performance investigation
- [x] evaluate kotlinx-io performance with DirectByteBuffer
- [x] publish performance characteristics of kotlinx-io w/ BB as a backing storage on JVM
- [x] refactor the library to allow using different Segment implementations
- [x] implement DirectByteBuffer-backed segments
- [x] investigate JDK22 MemorySegments usage instead of BB
- [x] implement polymorphic segment
- [x] port some benchmarks to Android
- [x] evaluate baseline performance on Android
- [x] evaluate DirectByteBuffers performance on Android
- [ ] investigate R8 features/capabilities/issues
- [ ] finalize and publish a design
- [ ] test-library support for multiple segment types
- [ ] tune performance (rewrite UTF8-manipulation routines, for example)
The https://github.com/Kotlin/kotlinx-io/issues/135 will be done in the context of this project (at least partially).
“manipulations with the buffer itself works significantly slower on Android” On Android, a DirectByteBuffer is actually a non-movable byte array allocated on the Dalvik heap; it is never moved by the GC, so we don't need to pay for an extra copy on native IO. Is there any benchmark indicating that we should not use DirectByteBuffer on Android? @fzhinkin
@VDostoyevskiy some time ago, I ran several kotlinx-io benchmarks on Android and saw a significant slowdown when DirectByteBuffer was used as a backing storage (compared to the baseline with ByteArray as a backing storage). At first glance, it looked like ART's JIT failed to inline ByteBuffer's methods. The current plan is to run an extended set of benchmarks on the device to verify the previous observation; I'm hoping to get down to it by the end of the week. I'll publish the results as soon as it's done.
Slowly processing the task list.
publish results of BB performance investigation
Benchmarking results published here: https://github.com/fzhinkin/kotlinx-io-supplementary-benchmarks#kotlinx-io-supplementary-benchmarks
tl;dr On JVM, writing or reading a direct byte buffer via a channel is usually faster than the corresponding operation involving byte arrays and java.io streams. The unexpected twist: on Android, byte buffer-based operations are ridiculously fast compared to their array-based java.io counterparts.
As mentioned, the next step toward deciding if and how byte buffers should be supported is to plug them into the library and run our benchmarks to see how that affects performance.
All the tables listed below are available as a Google Docs spreadsheet here: https://docs.google.com/spreadsheets/d/19krIuAKL7zVv8zFMKtUeGtCAZcuqRkPp7QWvZPRa784/edit?usp=sharing
Raw benchmarking results: https://github.com/Kotlin/kotlinx-io/tree/design/dbb/docs/design/byte-buffers/benchmarking
The hypothesis to check is that at least on JVM (or maybe even on Android) we can replace byte arrays storing the data in kotlinx.io.Segment with direct java.nio.ByteBuffer buffers without losing performance.
I used code from several git branches for the analysis:
- develop branch as the baseline
- private/segments-public-api branch as an intermediate step before swapping byte arrays with byte buffers; this branch refactors segments and, after cleanup and review, will be integrated to partially cover #135; it replaces explicit access to segment data with API calls, which simplified further integration of byte buffers; despite it being an intermediate branch, I added it to the results to show how these particular changes affect overall performance;
- private/dbb-benchmarking branch, built upon private/segments-public-api, where segments' byte arrays were replaced with byte buffers;
- private/dbb-benchmarking-unsafe - the same as above, but with some operations using Unsafe for reading from and writing into a ByteBuffer (more on that later).
The first two tables below represent results collected using a "core" subset of kotlinx-io benchmarks and their versions ported to androidx-benchmark.
The Improvement column contains the speedup relative to the baseline (develop branch-based, in all cases), computed as 100% * (baseline - alternative) / baseline. If the code in an alternative branch performs better than the baseline, this value is positive; otherwise, it is negative. N/A means that comparison results are not available (because the CIs for the means overlapped and I can't say which result is actually better; yes, it's not the best way to check the results' significance).
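The Improvement rule can be sketched in a few lines of Kotlin. This is an illustration of the formula and the CI-overlap check as described above, not the script actually used to build the tables; the Measurement type and the improvementPercent name are hypothetical, and treating touching intervals as overlapping is an assumption.

```kotlin
// A measurement: mean time and the half-width of its confidence interval,
// both in the same unit (for example, nanoseconds).
data class Measurement(val mean: Double, val ciHalfWidth: Double) {
    val low: Double get() = mean - ciHalfWidth
    val high: Double get() = mean + ciHalfWidth
}

// Returns the improvement in percent, or null (shown as N/A in the tables)
// when the confidence intervals overlap and no winner can be named.
fun improvementPercent(baseline: Measurement, alternative: Measurement): Double? {
    val overlap = baseline.low <= alternative.high && alternative.low <= baseline.high
    if (overlap) return null
    return 100.0 * (baseline.mean - alternative.mean) / baseline.mean
}
```

For example, improvementPercent(Measurement(78.690, 2.792), Measurement(48.928, 0.194)) gives about 37.8, which matches the DecimalLongBenchmark row for the segments public API branch, while Measurement(8.677, 0.031) against Measurement(8.823, 0.197) gives null because the intervals overlap.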
JVM results
| Benchmark | Parameters | Baseline, avg. time | 0.999-CI | Seg. public API, avg. time | 0.999-CI | Improvement | DirectBB, avg. time | 0.999-CI | Improvement | DirectBB w/ Unsafe, avg. time | 0.999-CI | Improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| kx.io.b.BufferReadNewByteArray.benchmark | size=1 | 8.677 ns | ±0.031 ns | 8.823 ns | ±0.197 ns | N/A | 18.803 ns | ±0.354 ns | -116.7 % | 19.826 ns | ±0.994 ns | -128.5 % |
| kx.io.b.BufferReadNewByteArray.benchmark | size=1024 | 70.959 ns | ±1.791 ns | 70.490 ns | ±0.055 ns | N/A | 74.719 ns | ±2.332 ns | N/A | 71.761 ns | ±1.373 ns | N/A |
| kx.io.b.BufferReadNewByteArray.benchmark | size=24576 | 2.092 us | ±0.014 us | 2.119 us | ±0.013 us | -1.3 % | 2.126 us | ±0.008 us | -1.6 % | 2.111 us | ±0.006 us | -0.9 % |
| kx.io.b.BufferReadWriteByteArray.benchmark | size=1 | 7.618 ns | ±0.138 ns | 7.446 ns | ±0.016 ns | 2.3 % | 20.003 ns | ±0.072 ns | -162.6 % | 18.876 ns | ±3.276 ns | -147.8 % |
| kx.io.b.BufferReadWriteByteArray.benchmark | size=1024 | 32.496 ns | ±2.299 ns | 30.441 ns | ±1.666 ns | N/A | 40.562 ns | ±2.334 ns | -24.8 % | 35.648 ns | ±2.632 ns | N/A |
| kx.io.b.BufferReadWriteByteArray.benchmark | size=24576 | 654.400 ns | ±12.697 ns | 670.191 ns | ±20.315 ns | N/A | 692.064 ns | ±33.039 ns | N/A | 703.603 ns | ±53.146 ns | N/A |
| kx.io.b.DecimalLongBenchmark.benchmark | value='-9223372036854775806' | 78.690 ns | ±2.792 ns | 48.928 ns | ±0.194 ns | 37.8 % | 58.192 ns | ±0.156 ns | 26.0 % | 50.903 ns | ±0.113 ns | 35.3 % |
| kx.io.b.DecimalLongBenchmark.benchmark | value='9223372036854775806' | 76.946 ns | ±12.448 ns | 48.157 ns | ±0.094 ns | 37.4 % | 57.621 ns | ±3.014 ns | 25.1 % | 49.928 ns | ±0.234 ns | 35.1 % |
| kx.io.b.DecimalLongBenchmark.benchmark | value='1' | 10.490 ns | ±0.020 ns | 9.207 ns | ±0.226 ns | 12.2 % | 10.912 ns | ±0.048 ns | -4.0 % | 9.706 ns | ±0.155 ns | 7.5 % |
| kx.io.b.HexadecimalLongBenchmark.benchmark | value='9223372036854775806' | 49.183 ns | ±0.177 ns | 33.212 ns | ±0.140 ns | 32.5 % | 37.907 ns | ±0.691 ns | 22.9 % | 32.799 ns | ±0.179 ns | 33.3 % |
| kx.io.b.HexadecimalLongBenchmark.benchmark | value='1' | 14.334 ns | ±0.029 ns | 11.669 ns | ±0.153 ns | 18.6 % | 15.609 ns | ±0.044 ns | -8.9 % | 11.420 ns | ±0.055 ns | 20.3 % |
| kx.io.b.IndexOfBenchmark.benchmark | params='128:0:-1' | 24.181 ns | ±0.051 ns | 24.824 ns | ±0.111 ns | -2.7 % | 35.645 ns | ±0.103 ns | -47.4 % | 31.842 ns | ±0.074 ns | -31.7 % |
| kx.io.b.IndexOfBenchmark.benchmark | params='128:0:7' | 5.989 ns | ±0.020 ns | 5.769 ns | ±0.036 ns | 3.7 % | 18.053 ns | ±15.223 ns | N/A | 5.716 ns | ±0.033 ns | 4.6 % |
| kx.io.b.IndexOfBenchmark.benchmark | params='128:0:100' | 19.668 ns | ±0.113 ns | 20.692 ns | ±0.110 ns | -5.2 % | 26.891 ns | ±0.035 ns | -36.7 % | 26.173 ns | ±0.157 ns | -33.1 % |
| kx.io.b.IndexOfBenchmark.benchmark | params='128:8128:100' | 27.298 ns | ±0.407 ns | 27.097 ns | ±0.206 ns | N/A | 34.514 ns | ±0.254 ns | -26.4 % | 31.714 ns | ±0.060 ns | -16.2 % |
| kx.io.b.IndexOfBenchmark.benchmark | params='24576:0:-1' | 3.600 us | ±0.252 us | 3.744 us | ±0.010 us | N/A | 5.614 us | ±0.118 us | -56.0 % | 5.599 us | ±0.013 us | -55.5 % |
| kx.io.b.IndexOfByteString.benchmark | params='1024:2' | 1.470 us | ±0.003 us | 1.289 us | ±0.013 us | 12.4 % | 1.351 us | ±0.004 us | 8.1 % | 1.047 us | ±0.002 us | 28.8 % |
| kx.io.b.IndexOfByteString.benchmark | params='8192:2' | 11.658 us | ±0.033 us | 10.370 us | ±0.036 us | 11.0 % | 10.642 us | ±0.113 us | 8.7 % | 8.253 us | ±0.012 us | 29.2 % |
| kx.io.b.IndexOfByteString.benchmark | params='10000:2' | 13.983 us | ±0.021 us | 12.418 us | ±0.089 us | 11.2 % | 13.180 us | ±0.141 us | 5.7 % | 10.140 us | ±0.096 us | 27.5 % |
| kx.io.b.IndexOfByteString.benchmark | params='10000:8' | 29.332 us | ±0.709 us | 26.705 us | ±0.078 us | 9.0 % | 43.945 us | ±0.104 us | -49.8 % | 25.455 us | ±0.055 us | 13.2 % |
| kx.io.b.ByteBenchmark.benchmark | | 3.230 ns | ±0.142 ns | 3.034 ns | ±0.063 ns | N/A | 6.284 ns | ±0.060 ns | -94.6 % | 3.287 ns | ±0.061 ns | N/A |
| kx.io.b.IntBenchmark.benchmark | | 3.862 ns | ±0.007 ns | 3.530 ns | ±0.009 ns | 8.6 % | 4.124 ns | ±0.020 ns | -6.8 % | 4.102 ns | ±0.015 ns | -6.2 % |
| kx.io.b.IntLeBenchmark.benchmark | | 4.072 ns | ±0.025 ns | 3.741 ns | ±0.011 ns | 8.1 % | 4.206 ns | ±0.025 ns | -3.3 % | 4.208 ns | ±0.038 ns | -3.3 % |
| kx.io.b.LongBenchmark.benchmark | | 6.068 ns | ±0.051 ns | 5.284 ns | ±0.043 ns | 12.9 % | 4.101 ns | ±0.014 ns | 32.4 % | 4.138 ns | ±0.013 ns | 31.8 % |
| kx.io.b.LongLeBenchmark.benchmark | | 6.843 ns | ±0.096 ns | 6.283 ns | ±0.016 ns | 8.2 % | 4.977 ns | ±0.010 ns | 27.3 % | 5.158 ns | ±0.010 ns | 24.6 % |
| kx.io.b.ShortBenchmark.benchmark | | 3.333 ns | ±0.016 ns | 3.334 ns | ±0.047 ns | N/A | 4.109 ns | ±0.004 ns | -23.3 % | 4.098 ns | ±0.013 ns | -22.9 % |
| kx.io.b.ShortLeBenchmark.benchmark | | 3.381 ns | ±0.030 ns | 3.404 ns | ±0.020 ns | N/A | 4.134 ns | ±0.042 ns | -22.3 % | 4.220 ns | ±0.140 ns | -24.8 % |
| kx.io.b.Utf8LineBenchmark.benchmark | length=17, separator='LF' | 43.060 ns | ±0.096 ns | 42.598 ns | ±0.851 ns | N/A | 52.416 ns | ±0.200 ns | -21.7 % | 45.684 ns | ±0.678 ns | -6.1 % |
| kx.io.b.Utf8LineBenchmark.benchmark | length=17, separator='CRLF' | 44.468 ns | ±0.096 ns | 43.776 ns | ±0.137 ns | 1.6 % | 51.660 ns | ±0.095 ns | -16.2 % | 44.960 ns | ±1.098 ns | N/A |
| kx.io.b.Utf8LineStrictBenchmark.benchmark | length=17, separator='LF' | 43.517 ns | ±0.267 ns | 44.060 ns | ±0.149 ns | -1.2 % | 51.769 ns | ±1.095 ns | -19.0 % | 45.266 ns | ±0.696 ns | -4.0 % |
| kx.io.b.Utf8LineStrictBenchmark.benchmark | length=17, separator='CRLF' | 44.092 ns | ±0.165 ns | 43.578 ns | ±0.362 ns | N/A | 52.709 ns | ±0.638 ns | -19.5 % | 45.587 ns | ±0.364 ns | -3.4 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='ascii', length=20 | 36.375 ns | ±0.499 ns | 34.739 ns | ±0.176 ns | 4.5 % | 41.216 ns | ±0.107 ns | -13.3 % | 35.824 ns | ±1.050 ns | N/A |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='ascii', length=2000 | 1.576 us | ±0.008 us | 1.634 us | ±0.008 us | -3.7 % | 1.755 us | ±0.005 us | -11.4 % | 1.753 us | ±0.006 us | -11.3 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='ascii', length=200000 | 180.333 us | ±0.497 us | 181.497 us | ±0.270 us | -0.6 % | 180.052 us | ±0.588 us | N/A | 179.222 us | ±0.553 us | 0.6 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='utf8', length=20 | 79.311 ns | ±3.425 ns | 91.642 ns | ±0.409 ns | -15.5 % | 106.419 ns | ±4.981 ns | -34.2 % | 85.575 ns | ±0.869 ns | -7.9 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='utf8', length=2000 | 9.013 us | ±0.077 us | 9.389 us | ±0.035 us | -4.2 % | 10.150 us | ±0.033 us | -12.6 % | 8.534 us | ±0.088 us | 5.3 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='utf8', length=200000 | 913.628 us | ±29.243 us | 932.226 us | ±22.271 us | N/A | 1051.384 us | ±4.137 us | -15.1 % | 900.200 us | ±1.462 us | N/A |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='sparse', length=20 | 54.438 ns | ±0.394 ns | 53.043 ns | ±0.121 ns | 2.6 % | 63.510 ns | ±0.107 ns | -16.7 % | 54.060 ns | ±0.124 ns | N/A |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='sparse', length=2000 | 2.225 us | ±0.011 us | 2.137 us | ±0.020 us | 3.9 % | 2.289 us | ±0.007 us | -2.9 % | 2.329 us | ±0.015 us | -4.7 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='sparse', length=200000 | 234.449 us | ±0.697 us | 255.164 us | ±1.010 us | -8.8 % | 229.776 us | ±1.495 us | 2.0 % | 246.223 us | ±7.980 us | -5.0 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='2bytes', length=20 | 144.603 ns | ±0.679 ns | 99.734 ns | ±0.383 ns | 31.0 % | 110.526 ns | ±0.511 ns | 23.6 % | 98.628 ns | ±0.297 ns | 31.8 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='2bytes', length=2000 | 10.409 us | ±0.435 us | 7.970 us | ±0.030 us | 23.4 % | 8.655 us | ±0.272 us | 16.8 % | 7.650 us | ±0.028 us | 26.5 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='2bytes', length=200000 | 1077.305 us | ±3.276 us | 844.984 us | ±1.783 us | 21.6 % | 865.833 us | ±13.570 us | 19.6 % | 792.607 us | ±1.371 us | 26.4 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='3bytes', length=20 | 147.238 ns | ±2.388 ns | 114.884 ns | ±0.582 ns | 22.0 % | 128.365 ns | ±3.056 ns | 12.8 % | 115.223 ns | ±0.667 ns | 21.7 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='3bytes', length=2000 | 11.732 us | ±0.019 us | 9.599 us | ±0.024 us | 18.2 % | 10.880 us | ±0.025 us | 7.3 % | 9.498 us | ±0.018 us | 19.0 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='3bytes', length=200000 | 1204.980 us | ±3.702 us | 989.178 us | ±3.210 us | 17.9 % | 1.109 ms | ±0.001 ms | 8.0 % | 983.792 us | ±1.027 us | 18.4 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='4bytes', length=20 | 93.512 ns | ±0.391 ns | 82.132 ns | ±0.910 ns | 12.2 % | 97.382 ns | ±0.254 ns | -4.1 % | 81.331 ns | ±0.402 ns | 13.0 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='4bytes', length=2000 | 8.347 us | ±0.063 us | 7.161 us | ±0.032 us | 14.2 % | 8.130 us | ±0.033 us | 2.6 % | 7.100 us | ±0.021 us | 14.9 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='4bytes', length=200000 | 873.595 us | ±3.063 us | 750.184 us | ±2.393 us | 14.1 % | 813.925 us | ±4.698 us | 6.8 % | 719.284 us | ±1.231 us | 17.7 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='bad', length=20 | 95.384 ns | ±0.355 ns | 101.927 ns | ±2.935 ns | -6.9 % | 110.901 ns | ±1.039 ns | -16.3 % | 102.409 ns | ±0.212 ns | -7.4 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='bad', length=2000 | 7.536 us | ±0.059 us | 8.653 us | ±0.310 us | -14.8 % | 9.930 us | ±0.655 us | -31.8 % | 9.156 us | ±0.074 us | -21.5 % |
| kx.io.b.Utf8StringBenchmark.benchmark | encoding='bad', length=200000 | 793.557 us | ±6.575 us | 877.934 us | ±13.025 us | -10.6 % | 951.408 us | ±8.023 us | -19.9 % | 667.837 us | ±1.239 us | 15.8 % |
The segments-public-api branch performs better, or at least not worse, in almost all cases except string encoding/decoding. In these cases, the slowdown was mostly caused by switching from direct indexation into the Segment's array to indirect indexation (where a user passes a logical index into the span of the Segment's array that holds data, and accessor methods add limit/pos to it; i.e. what was fun get(idx) = data[idx] became fun get(idx) = data[pos + idx]). That is something that could be reverted to direct indexation at the expense of API ease of use.
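The two indexation styles can be sketched as follows. ArraySegment and its accessors are a simplified, hypothetical stand-in for illustration and do not reproduce the actual Segment API from either branch.

```kotlin
// Hypothetical, simplified segment: valid data lives in data[pos until limit].
class ArraySegment(val data: ByteArray, val pos: Int, val limit: Int) {
    // Direct indexation: the caller works with absolute array indices
    // and has to account for pos and limit itself.
    fun getDirect(absoluteIndex: Int): Byte = data[absoluteIndex]

    // Indirect indexation: the caller passes a logical index into the span,
    // and the accessor adds pos on every call.
    fun getLogical(index: Int): Byte = data[pos + index]
}

// Sums the readable bytes using direct indexation.
fun sumDirect(s: ArraySegment): Int {
    var sum = 0
    for (i in s.pos until s.limit) sum += s.getDirect(i)
    return sum
}

// Sums the same bytes using indirect indexation.
fun sumLogical(s: ArraySegment): Int {
    var sum = 0
    for (i in 0 until s.limit - s.pos) sum += s.getLogical(i)
    return sum
}
```

Both loops visit the same bytes; the indirect variant is easier to call correctly but performs one extra addition per access, which is the cost discussed above.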
Unfortunately, the dbb-benchmarking branch showed a significant performance drop in almost all scenarios where a Segment's data was accessed at shorter-than-int granularity. Various factors affect that result: the need to perform type checks on every call to inlined ByteBuffer methods (to ensure that the receiver is an instance of DirectByteBuffer), range checks requiring access to the byte buffer's state, and more code generated for every segment access (for UTF-8 string encoding, it increases register pressure in JIT-compiled code and leads to more spills/fills being emitted).
To shrink the performance gap between the byte array and byte buffer based implementations, I tried to use Unsafe for accessing the memory region assigned to a DirectByteBuffer (the private/dbb-benchmarking-unsafe branch). The use of Unsafe allows bypassing type checks (the target type is a Long with an address) and range checks (it's unsafe, right? :) and results in better performance for string encoding/decoding cases (in some cases it now outperforms the develop branch). I concentrated on string ops performance and didn't check IndexOf methods, thus their performance remained poor in that branch.
Android results
For Android, I ported most of the core benchmarks to androidx-benchmark (android-benchmarks branch forked from develop), forked the corresponding "JVM" branches and merged the androidx branch into each of them (private/segments-public-api-android and private/dbb-benchmarking-android branches).
Below are results gathered from a device:
| Benchmark | Parameters | Baseline, avg. time | 0.999-CI | Seg. public API, avg. time | 0.999-CI | Improvement | DirectBB, avg. time | 0.999-CI | Improvement |
|---|---|---|---|---|---|---|---|---|---|
| kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteByteArray | size=1 | 102.622 ns | ±0.060 ns | 105.319 ns | ±0.067 ns | -2.6 % | 262.792 ns | ±0.620 ns | -156.1 % |
| kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteByteArray | size=1024 | 330.932 ns | ±0.471 ns | 331.846 ns | ±0.938 ns | N/A | 469.693 ns | ±1.693 ns | -41.9 % |
| kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteByteArray | size=24576 | 4.595 us | ±0.036 us | 4.852 us | ±0.041 us | -5.6 % | 6.976 us | ±0.057 us | -51.8 % |
| kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteNewByteArray | size=1 | 188.783 ns | ±7.791 ns | 190.824 ns | ±6.285 ns | N/A | 365.952 ns | ±22.233 ns | -93.8 % |
| kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteNewByteArray | size=1024 | 2.900 us | ±0.058 us | 2.953 us | ±0.046 us | N/A | 3.351 us | ±0.062 us | -15.6 % |
| kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteNewByteArray | size=24576 | 35.271 us | ±0.196 us | 35.465 us | ±0.310 us | N/A | 38.625 us | ±0.538 us | -9.5 % |
| kx.io.b.a.DecimalLongBenchmark.decLongRW | value='-9223372036854775806' | 750.906 ns | ±0.505 ns | 381.942 ns | ±0.165 ns | 49.1 % | 1034.825 ns | ±8.274 ns | -37.8 % |
| kx.io.b.a.DecimalLongBenchmark.decLongRW | value='9223372036854775806' | 760.751 ns | ±0.397 ns | 406.559 ns | ±0.214 ns | 46.6 % | 1073.903 ns | ±0.721 ns | -41.2 % |
| kx.io.b.a.DecimalLongBenchmark.decLongRW | value='1' | 124.628 ns | ±0.064 ns | 118.894 ns | ±0.116 ns | 4.6 % | 185.450 ns | ±0.111 ns | -48.8 % |
| kx.io.b.a.HexadecimalLongBenchmark.hexLongRW | value='9223372036854775806' | 544.825 ns | ±0.293 ns | 279.604 ns | ±0.126 ns | 48.7 % | 704.703 ns | ±0.818 ns | -29.3 % |
| kx.io.b.a.HexadecimalLongBenchmark.hexLongRW | value='1' | 163.418 ns | ±0.076 ns | 218.349 ns | ±0.141 ns | -33.6 % | 219.523 ns | ±0.157 ns | -34.3 % |
| kx.io.b.a.IndexOfBenchmark.indexOf | params='128:0:-1' | 341.251 ns | ±0.236 ns | 334.579 ns | ±0.111 ns | 2.0 % | 1899.041 ns | ±10.569 ns | -456.5 % |
| kx.io.b.a.IndexOfBenchmark.indexOf | params='128:0:7' | 57.001 ns | ±0.036 ns | 48.672 ns | ±0.173 ns | 14.6 % | 150.282 ns | ±0.083 ns | -163.6 % |
| kx.io.b.a.IndexOfBenchmark.indexOf | params='128:0:100' | 274.937 ns | ±0.123 ns | 267.822 ns | ±0.053 ns | 2.6 % | 1500.502 ns | ±1.361 ns | -445.8 % |
| kx.io.b.a.IndexOfBenchmark.indexOf | params='128:8128:100' | 299.373 ns | ±0.116 ns | 282.945 ns | ±0.165 ns | 5.5 % | 1528.276 ns | ±1.129 ns | -410.5 % |
| kx.io.b.a.IndexOfBenchmark.indexOf | params='24576:0:-1' | 57.637 us | ±0.023 us | 57.624 us | ±0.046 us | N/A | 355.938 us | ±0.144 us | -517.6 % |
| kx.io.b.a.IndexOfByteString.indexOf | params='1024:2' | 15.666 us | ±0.103 us | 7.041 us | ±0.031 us | 55.1 % | 38.340 us | ±0.195 us | -144.7 % |
| kx.io.b.a.IndexOfByteString.indexOf | params='8192:2' | 124.209 us | ±0.239 us | 54.966 us | ±0.024 us | 55.7 % | 305.591 us | ±0.179 us | -146.0 % |
| kx.io.b.a.IndexOfByteString.indexOf | params='10000:2' | 150.761 us | ±0.120 us | 66.998 us | ±0.028 us | 55.6 % | 372.751 us | ±0.290 us | -147.2 % |
| kx.io.b.a.IndexOfByteString.indexOf | params='10000:8' | 250.041 us | ±1.042 us | 148.644 us | ±0.065 us | 40.6 % | 852.738 us | ±1.098 us | -241.0 % |
| kx.io.b.a.IntegerValuesBenchmark.byteRW | | 36.208 ns | ±0.015 ns | 25.294 ns | ±0.024 ns | 30.1 % | 53.608 ns | ±0.021 ns | -48.1 % |
| kx.io.b.a.IntegerValuesBenchmark.intRW | | 43.176 ns | ±0.015 ns | 33.737 ns | ±0.015 ns | 21.9 % | 55.921 ns | ±0.020 ns | -29.5 % |
| kx.io.b.a.IntegerValuesBenchmark.intLeRW | | 42.785 ns | ±0.031 ns | 33.179 ns | ±0.024 ns | 22.5 % | 54.095 ns | ±0.027 ns | -26.4 % |
| kx.io.b.a.IntegerValuesBenchmark.longLeRW | | 56.786 ns | ±0.042 ns | 49.917 ns | ±0.486 ns | 12.1 % | 64.354 ns | ±0.067 ns | -13.3 % |
| kx.io.b.a.IntegerValuesBenchmark.longRW | | 46.210 ns | ±0.024 ns | 40.503 ns | ±0.052 ns | 12.3 % | 56.023 ns | ±0.036 ns | -21.2 % |
| kx.io.b.a.IntegerValuesBenchmark.shortLeRW | | 41.546 ns | ±0.023 ns | 27.153 ns | ±0.024 ns | 34.6 % | 55.189 ns | ±0.025 ns | -32.8 % |
| kx.io.b.a.IntegerValuesBenchmark.shortRW | | 42.347 ns | ±0.030 ns | 26.361 ns | ±0.020 ns | 37.8 % | 57.663 ns | ±0.064 ns | -36.2 % |
| kx.io.b.a.Utf8LineBenchmarks.readLine | length=17, separator='LF' | 783.308 ns | ±12.860 ns | 787.348 ns | ±14.717 ns | N/A | 1596.158 ns | ±115.792 ns | -103.8 % |
| kx.io.b.a.Utf8LineBenchmarks.readLine | length=17, separator='CRLF' | 829.941 ns | ±12.269 ns | 843.887 ns | ±36.394 ns | N/A | 1673.793 ns | ±79.104 ns | -101.7 % |
| kx.io.b.a.Utf8LineBenchmarks.readLineStrict | length=17, separator='LF' | 780.764 ns | ±13.163 ns | 802.789 ns | ±40.353 ns | N/A | 1609.004 ns | ±125.967 ns | -106.1 % |
| kx.io.b.a.Utf8LineBenchmarks.readLineStrict | length=17, separator='CRLF' | 893.010 ns | ±18.179 ns | 844.291 ns | ±16.967 ns | 5.5 % | 1675.612 ns | ±32.163 ns | -87.6 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='ascii', length=20 | 648.041 ns | ±13.030 ns | 681.464 ns | ±12.634 ns | -5.2 % | 1297.408 ns | ±34.402 ns | -100.2 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='ascii', length=2000 | 28.898 us | ±0.432 us | 29.776 us | ±0.432 us | -3.0 % | 86.367 us | ±0.979 us | -198.9 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='ascii', length=200000 | 2.733 ms | ±0.034 ms | 2.777 ms | ±0.033 ms | N/A | 6.401 ms | ±0.080 ms | -134.2 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='utf8', length=20 | 1.047 us | ±0.014 us | 1.092 us | ±0.017 us | -4.3 % | 2.078 us | ±0.109 us | -98.5 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='utf8', length=2000 | 83.576 us | ±2.004 us | 86.883 us | ±1.699 us | N/A | 189.267 us | ±1.966 us | -126.5 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='utf8', length=200000 | 8.096 ms | ±0.162 ms | 8.117 ms | ±0.140 ms | N/A | 15.218 ms | ±0.214 ms | -88.0 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='sparse', length=20 | 752.265 ns | ±17.496 ns | 788.823 ns | ±16.983 ns | -4.9 % | 1497.804 ns | ±44.346 ns | -99.1 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='sparse', length=2000 | 28.831 us | ±0.568 us | 29.218 us | ±0.592 us | N/A | 88.670 us | ±1.441 us | -207.6 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='sparse', length=200000 | 2.692 ms | ±0.045 ms | 2.658 ms | ±0.032 ms | N/A | 6.387 ms | ±0.075 ms | -137.3 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='2bytes', length=20 | 1.159 us | ±0.022 us | 1.234 us | ±0.026 us | -6.4 % | 2.316 us | ±0.072 us | -99.9 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='2bytes', length=2000 | 79.157 us | ±2.158 us | 77.157 us | ±1.695 us | N/A | 169.439 us | ±1.949 us | -114.1 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='2bytes', length=200000 | 7.409 ms | ±0.149 ms | 7.276 ms | ±0.145 ms | N/A | 13.839 ms | ±0.167 ms | -86.8 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='3bytes', length=20 | 1.413 us | ±0.026 us | 1.449 us | ±0.030 us | N/A | 3.224 us | ±0.127 us | -128.1 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='3bytes', length=2000 | 99.928 us | ±1.888 us | 96.950 us | ±1.831 us | N/A | 227.922 us | ±2.983 us | -128.1 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='3bytes', length=200000 | 9.156 ms | ±0.114 ms | 9.142 ms | ±0.174 ms | N/A | 19.725 ms | ±0.269 ms | -115.4 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='4bytes', length=20 | 1.034 us | ±0.016 us | 1.100 us | ±0.061 us | N/A | 2.347 us | ±0.157 us | -127.0 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='4bytes', length=2000 | 64.705 us | ±1.421 us | 65.686 us | ±1.500 us | N/A | 172.299 us | ±2.715 us | -166.3 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='4bytes', length=200000 | 6.099 ms | ±0.086 ms | 6.021 ms | ±0.095 ms | N/A | 13.489 ms | ±0.163 ms | -121.2 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='bad', length=20 | 991.498 ns | ±36.972 ns | 1035.136 ns | ±52.328 ns | N/A | 1536.603 ns | ±136.426 ns | -55.0 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='bad', length=2000 | 66.600 us | ±1.704 us | 68.397 us | ±2.087 us | N/A | 111.606 us | ±2.014 us | -67.6 % |
| kx.io.b.a.Utf8Benchmark.readWriteString | encoding='bad', length=200000 | 6.618 ms | ±0.125 ms | 6.864 ms | ±0.169 ms | N/A | 9.545 ms | ±0.104 ms | -44.2 % |
Results suggest that switching to direct byte buffers on Android would lead to a significant performance drop.
The collected results are not in byte buffers' favor (especially on Android); however, it might not be as bad as it seems in the context of a particular application. Also, these results correspond to kotlinx.io.Buffer performance and, as was shown previously, direct byte buffers show some performance improvement when it comes to I/O operations.
To check these two statements, I added kotlinx-io support to kotlinx.serialization (to its fork: https://github.com/fzhinkin/kotlinx.serialization) and added benchmarks to see how well kotlinx-io performs in JSON-serialization scenarios (and scenarios where the serialized data is then sent to a file).
It would be fair to blame me for checking one of the worst-performing scenarios (string encoding), but JSON is an extremely popular serialization format and it's crucial to show good results when using kotlinx-io in the context of JSON serialization.
Below are results collected for both JVM and Android (Subset of serialization benchmarks ported to androidx-benchmark: https://github.com/fzhinkin/kotlinx-serialization-android-benchmarks) by running benchmarks against aforementioned branches (in fact, there were 4 separate branches where utf8-code-point writing was made public: private/dev-for-serialization, private/public-segments-api-for-serialization, private/dbb-benchmarking-for-serialization and private/dbb-benchmarking-unsafe-for-serialization).
JVM results
| Benchmark | Baseline, avg. time | 0.999-CI | Seg. public API, avg. time | 0.999-CI | Improvement | DirectBB, avg. time | 0.999-CI | Improvement | DirectBB w/ Unsafe, avg. time | 0.999-CI | Improvement |
|---|---|---|---|---|---|---|---|---|---|---|---|
| k.b.j.CitmBenchmark.encodeCitmKotlinxIo | 3.012 ms | ±0.040 ms | 2.711 ms | ±0.064 ms | 10.0 % | 3.587 ms | ±0.867 ms | N/A | 2.752 ms | ±0.056 ms | 8.7 % |
| k.b.j.CitmBenchmark.encodeCitmKotlinxIoFile | 3.024 ms | ±0.045 ms | 2.816 ms | ±0.105 ms | 6.9 % | 3.691 ms | ±0.828 ms | N/A | 2.768 ms | ±0.042 ms | 8.5 % |
| k.b.j.CitmBenchmark.encodeCitmKotlinxIoFileChannel | 2.806 ms | ±0.039 ms | 2.525 ms | ±0.033 ms | 10.0 % | 4.012 ms | ±0.101 ms | -43.0 % | 2.461 ms | ±0.035 ms | 12.3 % |
| k.b.j.CitmBenchmark.encodeCitmKotlinxIoileChannel | 2.770 ms | ±0.046 ms | 2.532 ms | ±0.054 ms | 8.6 % | 3.374 ms | ±0.816 ms | N/A | 2.486 ms | ±0.042 ms | 10.3 % |
| k.b.j.JacksonComparisonBenchmark.kotlinSmallToKotlinxIo | 191.203 ns | ±2.059 ns | 198.177 ns | ±3.050 ns | -3.6 % | 195.991 ns | ±22.415 ns | N/A | 172.998 ns | ±17.318 ns | N/A |
| k.b.j.JacksonComparisonBenchmark.kotlinToKotlinxIo | 1.826 us | ±0.038 us | 1.640 us | ±0.043 us | 10.2 % | 1.622 us | ±0.066 us | 11.2 % | 1.765 us | ±0.061 us | N/A |
| k.b.j.JacksonComparisonBenchmark.kotlinToKotlinxIoFile | 2.374 us | ±0.035 us | 2.139 us | ±0.044 us | 9.9 % | 2.028 us | ±0.034 us | 14.6 % | 2.093 us | ±0.040 us | 11.8 % |
| k.b.j.JacksonComparisonBenchmark.kotlinToKotlinxIoFileChannel | 1.959 us | ±0.059 us | 1.927 us | ±0.007 us | N/A | 1.856 us | ±0.028 us | 5.2 % | 1.916 us | ±0.019 us | N/A |
| k.b.j.TwitterBenchmark.encodeTwitterKotlinxIo | 147.486 us | ±2.530 us | 130.995 us | ±2.066 us | 11.2 % | 147.387 us | ±7.603 us | N/A | 137.207 us | ±1.966 us | 7.0 % |
| k.b.j.TwitterBenchmark.encodeTwitterKotlinxIoFile | 143.767 us | ±0.698 us | 137.564 us | ±2.993 us | 4.3 % | 144.946 us | ±7.920 us | N/A | 135.599 us | ±2.992 us | 5.7 % |
| k.b.j.TwitterBenchmark.encodeTwitterKotlinxIoFileChannel | 134.166 us | ±2.064 us | 124.670 us | ±2.548 us | 7.1 % | 130.584 us | ±5.718 us | N/A | 126.153 us | ±1.929 us | 6.0 % |
| k.b.j.TwitterFeedBenchmark.encodeTwitterKotlinxIo | 2.064 ms | ±0.215 ms | 1.916 ms | ±0.209 ms | N/A | 2.823 ms | ±0.570 ms | N/A | 2.023 ms | ±0.351 ms | N/A |
| k.b.j.TwitterFeedBenchmark.encodeTwitterKotlinxIoFile | 1.914 ms | ±0.042 ms | 1.894 ms | ±0.125 ms | N/A | 2.593 ms | ±0.891 ms | N/A | 1.743 ms | ±0.036 ms | 9.0 % |
| k.b.j.TwitterFeedBenchmark.encodeTwitterKotlinxIoFileChannel | 1.813 ms | ±0.028 ms | 1.632 ms | ±0.031 ms | 10.0 % | 2.600 ms | ±0.726 ms | -43.5 % | 1.683 ms | ±0.049 ms | 7.2 % |
Android results
| Benchmark | Baseline, avg. time | 0.999-CI | Seg. public API, avg. time | 0.999-CI | Improvement | DirectBB, avg. time | 0.999-CI | Improvement |
|---|---|---|---|---|---|---|---|---|
| o.e.Benchmarks.citm | 26.505 ms | ±1.471 ms | 26.740 ms | ±1.812 ms | N/A | 35.307 ms | ±2.286 ms | -33.2 % |
| o.e.Benchmarks.citmFile | 27.038 ms | ±1.647 ms | 27.048 ms | ±1.709 ms | N/A | 35.528 ms | ±2.409 ms | -31.4 % |
| o.e.Benchmarks.citmFileChannel | 26.383 ms | ±1.389 ms | 25.932 ms | ±1.253 ms | N/A | 33.768 ms | ±2.132 ms | -28.0 % |
| o.e.Benchmarks.twitterMacro | 10.769 ms | ±0.456 ms | 11.089 ms | ±0.460 ms | N/A | 18.691 ms | ±0.581 ms | -73.6 % |
| o.e.Benchmarks.twitterMacroFile | 11.399 ms | ±0.431 ms | 11.652 ms | ±0.463 ms | N/A | 19.281 ms | ±0.679 ms | -69.2 % |
| o.e.Benchmarks.twitterMacroFileChannel | 11.918 ms | ±0.582 ms | 11.847 ms | ±0.693 ms | N/A | 19.183 ms | ±1.039 ms | -60.9 % |
| o.e.Benchmarks.twitter | 915.832 us | ±40.304 us | 926.380 us | ±40.005 us | N/A | 1546.619 us | ±51.164 us | -68.9 % |
| o.e.Benchmarks.twitterFile | 954.142 us | ±41.837 us | 971.910 us | ±39.710 us | N/A | 1547.613 us | ±56.647 us | -62.2 % |
| o.e.Benchmarks.twitterFileChannel | 854.305 us | ±38.670 us | 851.514 us | ±37.706 us | N/A | 1423.378 us | ±42.370 us | -66.6 % |
On JVM, byte buffer-backed segments perform better only in conjunction with Unsafe access (and that's a separate topic to discuss); without it, there are some scenarios where they are better as well as scenarios where they are worse. On Android, everything is much simpler: byte buffers are always worse, even if the only non-byte-buffer-based solution is to copy the data (like in *FileChannel benchmarks).
[Instead of] Conclusion
I don't have a particular conclusion about the use of direct byte buffers on JVM: to squeeze the maximum performance from them, we have to use Unsafe (the sun.misc/jdk.internal one), and its future in the JDK is not that bright (and I was not able to beat ByteBuffers with MemorySegments created from them).
For Android, it seems like there are no benefits from switching to ByteBuffer, even though buffer-based I/O (via NIO channels) seems to be much faster compared to I/O operations involving heap-residing containers (but the use of off-heap data may still have some benefits).
I've also checked whether having multiple segment types affects the performance when only one type is actually in use (the assumption is that at least the JVM will employ class hierarchy analysis (CHA) to avoid redundant type checks).
There's a branch (that won't compile to any target except JVM) private/polymorphic-segments where Segment was turned into an abstract class with two implementations - one with ByteArray inside (based on the private/segments-public-api branch) and another with the ByteBuffer inside (based on private/dbb-benchmarking branch). For the benchmarking purposes, ByteBuffer-backed segments were never loaded during the experiments (verified with class loading logs).
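The shape of such a hierarchy might look like the sketch below. It is a deliberately reduced illustration: SketchSegment, ByteArraySegment and ByteBufferSegment are hypothetical names, and the real classes in the private/polymorphic-segments branch carry far more state and operations.

```kotlin
import java.nio.ByteBuffer

// Hypothetical stand-in for a polymorphic Segment; not the library's actual class.
abstract class SketchSegment {
    abstract val capacity: Int
    abstract fun getByte(index: Int): Byte
    abstract fun setByte(index: Int, value: Byte)
}

// Variant backed by a heap-allocated byte array.
class ByteArraySegment(size: Int) : SketchSegment() {
    private val data = ByteArray(size)
    override val capacity: Int get() = data.size
    override fun getByte(index: Int): Byte = data[index]
    override fun setByte(index: Int, value: Byte) { data[index] = value }
}

// Variant backed by a direct ByteBuffer; uses absolute get/put,
// so the buffer's position is never changed.
class ByteBufferSegment(size: Int) : SketchSegment() {
    private val data: ByteBuffer = ByteBuffer.allocateDirect(size)
    override val capacity: Int get() = data.capacity()
    override fun getByte(index: Int): Byte = data.get(index)
    override fun setByte(index: Int, value: Byte) { data.put(index, value) }
}

// Code written against the abstract type works with either implementation.
fun fill(segment: SketchSegment, value: Byte) {
    for (i in 0 until segment.capacity) segment.setByte(i, value)
}
```

If only ByteArraySegment is ever loaded, calls through SketchSegment have a single possible target, which is the situation the CHA assumption above relies on.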
I won't post a large table as above, will just briefly summarize results:
- on JVM, there is no significant difference between results gathered for private/segments-public-api and private/polymorphic-segments branches: that's good, presence of ByteBuffer-backed segments won't affect those who don't need them;
- on Android, the situation is different: use of polymorphic segments makes performance worse. It's true even w/ R8 applied with a config that allows treating ByteBuffer-backed segments allocation path unreachable.
JVM benchmarking results are here and Android benchmarking results are here.
With all that being said about the performance aspect of ByteBuffers support, it's also worth mentioning that ByteBuffers on JVM and native-pointer-based segments on native would help with supporting memory-mapped files. With array-backed segments only, memory-mapped files would require an additional class/interface. With polymorphic segments, we could (not without caveats) wrap a ByteBuffer or mmaped ptr into a segment as a whole.
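On the JVM side, the memory-mapping part already yields a ByteBuffer, as the following sketch shows using only standard java.nio calls. The mapReadOnly helper is a made-up name, and how such a buffer would actually be wrapped into a segment is left open here, as it is in the discussion above.

```kotlin
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Path
import java.nio.file.StandardOpenOption

// Maps the whole file read-only. The returned MappedByteBuffer is a direct
// ByteBuffer, so a ByteBuffer-backed segment could, in principle, wrap it
// without copying. The mapping stays valid after the channel is closed.
fun mapReadOnly(path: Path): ByteBuffer =
    FileChannel.open(path, StandardOpenOption.READ).use { channel ->
        channel.map(FileChannel.MapMode.READ_ONLY, 0L, channel.size())
    }
```

One of the caveats hinted at above is lifetime: the mapped region is released only when the buffer is garbage collected, so a segment wrapping it would have to outlive every reader of that data.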
Probably we can support ByteBuffers on JVM without hurting performance on Android by publishing a multi-release jar with a baseline implementation remaining the same (byte-array backed) but with polymorphic segments and BB-support enabled for, let's say, JDK9 and onwards. Android tooling ignores MRJ-stuff while dexing, so the trick might work. 👿
I don't think it's a solution we should/could stick to, but that could solve an issue.