kotlinx-io Support ByteBuffer as a backing storage on JVM


java.nio.ByteBuffer is THE data container in Java NIO APIs. Those who need features provided only by the NIO APIs (like non-blocking sockets) are doomed to use ByteBuffer for transferring data. Those who need better performance or IO interfaces unavailable in the Java standard library will end up using libraries that might roll out their own data containers but usually still allow wrapping or directly using a ByteBuffer (as Netty and Aeron do).

It's possible to wrap a heap-allocated byte array (the backing storage for kotlinx-io segments) into a HeapByteBuffer, but the use of heap buffers comes at a cost. The majority of NIO API calls eventually perform a native call. If such a call (for example, a native wrapper for POSIX write) needs data, then NIO will supply it in the form of a DirectByteBuffer or a memory address extracted from a DirectByteBuffer. If the user provided a DirectByteBuffer, that buffer is used as is, but if it was a HeapByteBuffer, its content is first copied into an internal cached DirectByteBuffer instance and only then passed to the native API. If the buffer is small, the copying cost is negligible, but as the buffer grows, it starts playing a more significant role in overall performance.
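To make the two kinds of buffers concrete, here is a minimal Java sketch that writes the same bytes to a file through a FileChannel, once from a heap buffer wrapping an array and once from a direct buffer. The class and method names (HeapVsDirectWrite, writeAndReadBack, and so on) are hypothetical and exist only for this illustration; the extra copy for the heap buffer happens inside the JDK and is not visible in the code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of kotlinx-io.
final class HeapVsDirectWrite {
    // Wraps the array without copying; the result is a heap buffer.
    static ByteBuffer heapBuffer(byte[] data) {
        return ByteBuffer.wrap(data);
    }

    // Copies the array into native memory; the result is a direct buffer.
    static ByteBuffer directBuffer(byte[] data) {
        ByteBuffer direct = ByteBuffer.allocateDirect(data.length);
        direct.put(data);
        direct.flip(); // switch from writing into the buffer to reading from it
        return direct;
    }

    // Writes all remaining bytes of the buffer to a temp file and reads them back.
    static byte[] writeAndReadBack(ByteBuffer source) {
        try {
            Path file = Files.createTempFile("bb-demo", ".bin");
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE)) {
                while (source.hasRemaining()) {
                    channel.write(source);
                }
            }
            byte[] content = Files.readAllBytes(file);
            Files.delete(file);
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both calls produce the same file content; the difference is only in how the data reaches the native write call.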

Besides performance issues with the NIO API, a buffer residing in native memory is a necessity when it comes to implementing a Java API for not yet supported native IO APIs such as io_uring, send with the MSG_ZEROCOPY flag, epoll in edge-triggered mode, etc. The only available option for allocating such a buffer and using it across the wide range of JVM versions supported by Kotlin is DirectByteBuffer.

Unfortunately, using direct byte buffers is not always an option:

some APIs don't directly support them on JVM (like MessageDigest)
manipulations with the buffer itself work significantly slower on Android

So the only viable option might be to support both byte arrays and ByteBuffers as a backing storage and provide a way to choose which implementation to use when starting an app.

Tasks:

[x] investigate ByteBuffers advantages/need to support it in kotlinx-io
[x] publish results of BB performance investigation
[x] evaluate kotlinx-io performance with DirectByteBuffer
[x] publish performance characteristics of kotlinx-io w/ BB as a backing storage on JVM
[x] refactor the library to allow using different Segment implementations
[x] implement DirectByteBuffer-backed segments
[x] investigate JDK22 MemorySegments usage instead of BB
[x] implement polymorphic segment
[x] port some benchmarks to Android
[x] evaluate baseline performance on Android
[x] evaluate DirectByteBuffers performance on Android
[ ] investigate R8 features/capabilities/issues
[ ] finalize and publish a design
[ ] test-library support for multiple segment types
[ ] tune performance (rewrite UTF8-manipulation routines, for example)

Nov 24 '23 12:11 fzhinkin

https://github.com/Kotlin/kotlinx-io/issues/135 will be done in the context of this project (at least partially).

Nov 24 '23 12:11 fzhinkin

“manipulations with the buffer itself works significantly slower on Android” On Android, a DirectByteBuffer is actually a non-movable byte array allocated on the Dalvik heap; it is still never copied during GC, so we don't need to pay for an extra copy for native IO. Does any benchmark indicate that we should not use DirectByteBuffer on Android? @fzhinkin

Dec 03 '23 19:12 revonateB0T

@VDostoyevskiy some time ago, I ran several kotlinx-io benchmarks on Android and saw a significant slowdown when DirectByteBuffer was used as a backing storage (compared to the baseline with ByteArray as a backing storage). At first glance, it looked like ART's JIT failed to inline ByteBuffer's methods. The current plan is to run an extended set of benchmarks on the device to verify the previous observation; I'm hoping to get down to it by the end of the week. I'll publish the results as soon as it's done.

Dec 04 '23 09:12 fzhinkin

Slowly processing the task list.

publish results of BB performance investigation

Benchmarking results published here: https://github.com/fzhinkin/kotlinx-io-supplementary-benchmarks#kotlinx-io-supplementary-benchmarks

tl;dr On JVM, writing or reading a direct byte buffer via a channel is usually faster than the corresponding operation involving byte arrays and java.io streams. The unexpected twist: on Android, byte buffer-based operations are ridiculously fast compared to their array-based java.io counterparts.

Dec 18 '23 17:12 fzhinkin

As was mentioned, the next step toward deciding if and how byte buffers should be supported is to plug them into the library and run our benchmarks to see how they affect the performance.

All the tables listed below are available as a Google Docs spreadsheet here: https://docs.google.com/spreadsheets/d/19krIuAKL7zVv8zFMKtUeGtCAZcuqRkPp7QWvZPRa784/edit?usp=sharing

Raw benchmarking results: https://github.com/Kotlin/kotlinx-io/tree/design/dbb/docs/design/byte-buffers/benchmarking

The hypothesis to check is that at least on JVM (or maybe even on Android) we can replace the byte arrays storing the data in kotlinx.io.Segment with direct java.nio.ByteBuffer buffers without losing performance.

I used code from several git branches for the analysis:

develop branch as the baseline
private/segments-public-api branch as an intermediate step before swapping byte arrays with byte buffers; this branch refactors segments and, after cleanup and review, will be integrated to partially cover #135; it replaced explicit access to segments' data with API calls, which simplified further integration of byte buffers; despite it being an intermediate branch, I added it to the results to show how these particular changes affect overall performance;
private/dbb-benchmarking branch built upon segments-public-api, where segments' byte arrays were replaced with byte buffers;
private/dbb-benchmarking-unsafe - the same as above, but with some operations using Unsafe for reading from and writing into a ByteBuffer (more on that later).

The first two tables below represent results collected using a "core" subset of kotlinx-io benchmarks and their versions ported to androidx-benchmark.

The Improvement column contains the speedup relative to the baseline (develop branch-based, in all cases), computed as 100% * (baseline - alternative) / baseline. If the code in an alternative branch performs better than the baseline, this value is positive; otherwise, it is negative. N/A means that comparison results are not available (because the CIs for the mean overlapped and I can't say which result is actually better; yes, it's not the best way to check the results' significance).
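As a worked illustration of how the Improvement column can be derived, here is a small Java sketch. The class and method names are hypothetical, and the exact overlap rule (strict comparison of interval bounds on the rounded values) is an assumption; the tables were most likely produced from unrounded data.

```java
import java.util.Locale;

// Hypothetical helper that reproduces the "Improvement" column for one row.
final class Improvement {
    // 100% * (baseline - alternative) / baseline; positive means the alternative is faster.
    static double percent(double baseline, double alternative) {
        return 100.0 * (baseline - alternative) / baseline;
    }

    // True if [meanA - ciA, meanA + ciA] and [meanB - ciB, meanB + ciB] overlap.
    // Assumption: intervals that merely touch are not treated as overlapping.
    static boolean intervalsOverlap(double meanA, double ciA, double meanB, double ciB) {
        return meanA - ciA < meanB + ciB && meanB - ciB < meanA + ciA;
    }

    // Returns "N/A" when the confidence intervals overlap, the formatted percentage otherwise.
    static String report(double baseline, double baselineCi,
                         double alternative, double alternativeCi) {
        if (intervalsOverlap(baseline, baselineCi, alternative, alternativeCi)) {
            return "N/A";
        }
        return String.format(Locale.ROOT, "%.1f %%", percent(baseline, alternative));
    }
}
```

For example, the DecimalLongBenchmark row with a baseline of 78.690 ns ±2.792 ns and a segments-public-api result of 48.928 ns ±0.194 ns gives 37.8 %, while the BufferReadNewByteArray size=1 row (8.677 ns ±0.031 ns against 8.823 ns ±0.197 ns) gives N/A because the intervals overlap.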

JVM results

Benchmark	Parameters	Baseline, avg. time	0.999-CI	Seg. public API, avg. time	0.999-CI	Improvement	DirectBB, avg. time	0.999-CI	Improvement	DirectBB w/ Unsafe, avg. time	0.999-CI	Improvement
kx.io.b.BufferReadNewByteArray.benchmark	size=1	8.677 ns	±0.031 ns	8.823 ns	±0.197 ns	N/A	18.803 ns	±0.354 ns	-116.7 %	19.826 ns	±0.994 ns	-128.5 %
kx.io.b.BufferReadNewByteArray.benchmark	size=1024	70.959 ns	±1.791 ns	70.490 ns	±0.055 ns	N/A	74.719 ns	±2.332 ns	N/A	71.761 ns	±1.373 ns	N/A
kx.io.b.BufferReadNewByteArray.benchmark	size=24576	2.092 us	±0.014 us	2.119 us	±0.013 us	-1.3 %	2.126 us	±0.008 us	-1.6 %	2.111 us	±0.006 us	-0.9 %
kx.io.b.BufferReadWriteByteArray.benchmark	size=1	7.618 ns	±0.138 ns	7.446 ns	±0.016 ns	2.3 %	20.003 ns	±0.072 ns	-162.6 %	18.876 ns	±3.276 ns	-147.8 %
kx.io.b.BufferReadWriteByteArray.benchmark	size=1024	32.496 ns	±2.299 ns	30.441 ns	±1.666 ns	N/A	40.562 ns	±2.334 ns	-24.8 %	35.648 ns	±2.632 ns	N/A
kx.io.b.BufferReadWriteByteArray.benchmark	size=24576	654.400 ns	±12.697 ns	670.191 ns	±20.315 ns	N/A	692.064 ns	±33.039 ns	N/A	703.603 ns	±53.146 ns	N/A
kx.io.b.DecimalLongBenchmark.benchmark	value='-9223372036854775806'	78.690 ns	±2.792 ns	48.928 ns	±0.194 ns	37.8 %	58.192 ns	±0.156 ns	26.0 %	50.903 ns	±0.113 ns	35.3 %
kx.io.b.DecimalLongBenchmark.benchmark	value='9223372036854775806'	76.946 ns	±12.448 ns	48.157 ns	±0.094 ns	37.4 %	57.621 ns	±3.014 ns	25.1 %	49.928 ns	±0.234 ns	35.1 %
kx.io.b.DecimalLongBenchmark.benchmark	value='1'	10.490 ns	±0.020 ns	9.207 ns	±0.226 ns	12.2 %	10.912 ns	±0.048 ns	-4.0 %	9.706 ns	±0.155 ns	7.5 %
kx.io.b.HexadecimalLongBenchmark.benchmark	value='9223372036854775806'	49.183 ns	±0.177 ns	33.212 ns	±0.140 ns	32.5 %	37.907 ns	±0.691 ns	22.9 %	32.799 ns	±0.179 ns	33.3 %
kx.io.b.HexadecimalLongBenchmark.benchmark	value='1'	14.334 ns	±0.029 ns	11.669 ns	±0.153 ns	18.6 %	15.609 ns	±0.044 ns	-8.9 %	11.420 ns	±0.055 ns	20.3 %
kx.io.b.IndexOfBenchmark.benchmark	params='128:0:-1'	24.181 ns	±0.051 ns	24.824 ns	±0.111 ns	-2.7 %	35.645 ns	±0.103 ns	-47.4 %	31.842 ns	±0.074 ns	-31.7 %
kx.io.b.IndexOfBenchmark.benchmark	params='128:0:7'	5.989 ns	±0.020 ns	5.769 ns	±0.036 ns	3.7 %	18.053 ns	±15.223 ns	N/A	5.716 ns	±0.033 ns	4.6 %
kx.io.b.IndexOfBenchmark.benchmark	params='128:0:100'	19.668 ns	±0.113 ns	20.692 ns	±0.110 ns	-5.2 %	26.891 ns	±0.035 ns	-36.7 %	26.173 ns	±0.157 ns	-33.1 %
kx.io.b.IndexOfBenchmark.benchmark	params='128:8128:100'	27.298 ns	±0.407 ns	27.097 ns	±0.206 ns	N/A	34.514 ns	±0.254 ns	-26.4 %	31.714 ns	±0.060 ns	-16.2 %
kx.io.b.IndexOfBenchmark.benchmark	params='24576:0:-1'	3.600 us	±0.252 us	3.744 us	±0.010 us	N/A	5.614 us	±0.118 us	-56.0 %	5.599 us	±0.013 us	-55.5 %
kx.io.b.IndexOfByteString.benchmark	params='1024:2'	1.470 us	±0.003 us	1.289 us	±0.013 us	12.4 %	1.351 us	±0.004 us	8.1 %	1.047 us	±0.002 us	28.8 %
kx.io.b.IndexOfByteString.benchmark	params='8192:2'	11.658 us	±0.033 us	10.370 us	±0.036 us	11.0 %	10.642 us	±0.113 us	8.7 %	8.253 us	±0.012 us	29.2 %
kx.io.b.IndexOfByteString.benchmark	params='10000:2'	13.983 us	±0.021 us	12.418 us	±0.089 us	11.2 %	13.180 us	±0.141 us	5.7 %	10.140 us	±0.096 us	27.5 %
kx.io.b.IndexOfByteString.benchmark	params='10000:8'	29.332 us	±0.709 us	26.705 us	±0.078 us	9.0 %	43.945 us	±0.104 us	-49.8 %	25.455 us	±0.055 us	13.2 %
kx.io.b.ByteBenchmark.benchmark		3.230 ns	±0.142 ns	3.034 ns	±0.063 ns	N/A	6.284 ns	±0.060 ns	-94.6 %	3.287 ns	±0.061 ns	N/A
kx.io.b.IntBenchmark.benchmark		3.862 ns	±0.007 ns	3.530 ns	±0.009 ns	8.6 %	4.124 ns	±0.020 ns	-6.8 %	4.102 ns	±0.015 ns	-6.2 %
kx.io.b.IntLeBenchmark.benchmark		4.072 ns	±0.025 ns	3.741 ns	±0.011 ns	8.1 %	4.206 ns	±0.025 ns	-3.3 %	4.208 ns	±0.038 ns	-3.3 %
kx.io.b.LongBenchmark.benchmark		6.068 ns	±0.051 ns	5.284 ns	±0.043 ns	12.9 %	4.101 ns	±0.014 ns	32.4 %	4.138 ns	±0.013 ns	31.8 %
kx.io.b.LongLeBenchmark.benchmark		6.843 ns	±0.096 ns	6.283 ns	±0.016 ns	8.2 %	4.977 ns	±0.010 ns	27.3 %	5.158 ns	±0.010 ns	24.6 %
kx.io.b.ShortBenchmark.benchmark		3.333 ns	±0.016 ns	3.334 ns	±0.047 ns	N/A	4.109 ns	±0.004 ns	-23.3 %	4.098 ns	±0.013 ns	-22.9 %
kx.io.b.ShortLeBenchmark.benchmark		3.381 ns	±0.030 ns	3.404 ns	±0.020 ns	N/A	4.134 ns	±0.042 ns	-22.3 %	4.220 ns	±0.140 ns	-24.8 %
kx.io.b.Utf8LineBenchmark.benchmark	length=17, separator='LF'	43.060 ns	±0.096 ns	42.598 ns	±0.851 ns	N/A	52.416 ns	±0.200 ns	-21.7 %	45.684 ns	±0.678 ns	-6.1 %
kx.io.b.Utf8LineBenchmark.benchmark	length=17, separator='CRLF'	44.468 ns	±0.096 ns	43.776 ns	±0.137 ns	1.6 %	51.660 ns	±0.095 ns	-16.2 %	44.960 ns	±1.098 ns	N/A
kx.io.b.Utf8LineStrictBenchmark.benchmark	length=17, separator='LF'	43.517 ns	±0.267 ns	44.060 ns	±0.149 ns	-1.2 %	51.769 ns	±1.095 ns	-19.0 %	45.266 ns	±0.696 ns	-4.0 %
kx.io.b.Utf8LineStrictBenchmark.benchmark	length=17, separator='CRLF'	44.092 ns	±0.165 ns	43.578 ns	±0.362 ns	N/A	52.709 ns	±0.638 ns	-19.5 %	45.587 ns	±0.364 ns	-3.4 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='ascii', length=20	36.375 ns	±0.499 ns	34.739 ns	±0.176 ns	4.5 %	41.216 ns	±0.107 ns	-13.3 %	35.824 ns	±1.050 ns	N/A
kx.io.b.Utf8StringBenchmark.benchmark	encoding='ascii', length=2000	1.576 us	±0.008 us	1.634 us	±0.008 us	-3.7 %	1.755 us	±0.005 us	-11.4 %	1.753 us	±0.006 us	-11.3 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='ascii', length=200000	180.333 us	±0.497 us	181.497 us	±0.270 us	-0.6 %	180.052 us	±0.588 us	N/A	179.222 us	±0.553 us	0.6 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='utf8', length=20	79.311 ns	±3.425 ns	91.642 ns	±0.409 ns	-15.5 %	106.419 ns	±4.981 ns	-34.2 %	85.575 ns	±0.869 ns	-7.9 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='utf8', length=2000	9.013 us	±0.077 us	9.389 us	±0.035 us	-4.2 %	10.150 us	±0.033 us	-12.6 %	8.534 us	±0.088 us	5.3 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='utf8', length=200000	913.628 us	±29.243 us	932.226 us	±22.271 us	N/A	1051.384 us	±4.137 us	-15.1 %	900.200 us	±1.462 us	N/A
kx.io.b.Utf8StringBenchmark.benchmark	encoding='sparse', length=20	54.438 ns	±0.394 ns	53.043 ns	±0.121 ns	2.6 %	63.510 ns	±0.107 ns	-16.7 %	54.060 ns	±0.124 ns	N/A
kx.io.b.Utf8StringBenchmark.benchmark	encoding='sparse', length=2000	2.225 us	±0.011 us	2.137 us	±0.020 us	3.9 %	2.289 us	±0.007 us	-2.9 %	2.329 us	±0.015 us	-4.7 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='sparse', length=200000	234.449 us	±0.697 us	255.164 us	±1.010 us	-8.8 %	229.776 us	±1.495 us	2.0 %	246.223 us	±7.980 us	-5.0 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='2bytes', length=20	144.603 ns	±0.679 ns	99.734 ns	±0.383 ns	31.0 %	110.526 ns	±0.511 ns	23.6 %	98.628 ns	±0.297 ns	31.8 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='2bytes', length=2000	10.409 us	±0.435 us	7.970 us	±0.030 us	23.4 %	8.655 us	±0.272 us	16.8 %	7.650 us	±0.028 us	26.5 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='2bytes', length=200000	1077.305 us	±3.276 us	844.984 us	±1.783 us	21.6 %	865.833 us	±13.570 us	19.6 %	792.607 us	±1.371 us	26.4 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='3bytes', length=20	147.238 ns	±2.388 ns	114.884 ns	±0.582 ns	22.0 %	128.365 ns	±3.056 ns	12.8 %	115.223 ns	±0.667 ns	21.7 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='3bytes', length=2000	11.732 us	±0.019 us	9.599 us	±0.024 us	18.2 %	10.880 us	±0.025 us	7.3 %	9.498 us	±0.018 us	19.0 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='3bytes', length=200000	1204.980 us	±3.702 us	989.178 us	±3.210 us	17.9 %	1.109 ms	±0.001 ms	8.0 %	983.792 us	±1.027 us	18.4 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='4bytes', length=20	93.512 ns	±0.391 ns	82.132 ns	±0.910 ns	12.2 %	97.382 ns	±0.254 ns	-4.1 %	81.331 ns	±0.402 ns	13.0 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='4bytes', length=2000	8.347 us	±0.063 us	7.161 us	±0.032 us	14.2 %	8.130 us	±0.033 us	2.6 %	7.100 us	±0.021 us	14.9 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='4bytes', length=200000	873.595 us	±3.063 us	750.184 us	±2.393 us	14.1 %	813.925 us	±4.698 us	6.8 %	719.284 us	±1.231 us	17.7 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='bad', length=20	95.384 ns	±0.355 ns	101.927 ns	±2.935 ns	-6.9 %	110.901 ns	±1.039 ns	-16.3 %	102.409 ns	±0.212 ns	-7.4 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='bad', length=2000	7.536 us	±0.059 us	8.653 us	±0.310 us	-14.8 %	9.930 us	±0.655 us	-31.8 %	9.156 us	±0.074 us	-21.5 %
kx.io.b.Utf8StringBenchmark.benchmark	encoding='bad', length=200000	793.557 us	±6.575 us	877.934 us	±13.025 us	-10.6 %	951.408 us	±8.023 us	-19.9 %	667.837 us	±1.239 us	15.8 %

The segments-public-api branch performs better, or at least not worse, in almost all cases except string encoding/decoding. In these cases, the slowdown was mostly caused by switching from direct indexation into the Segment's array to indirect indexation (where a user passes a logical index into the span of the Segment's array that holds the data, and accessor methods add limit/pos to it; i.e. what was fun get(idx) = data[idx] became fun get(idx) = data[pos + idx]). That is something that could be reverted to direct indexation at the cost of API ease of use.
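The difference between the two indexation schemes can be sketched in Java as follows. This is only an illustration with hypothetical names (SegmentIndexing, getAbsolute, getRelative), not the actual Segment API from any of the branches.

```java
// Hypothetical illustration of direct vs. indirect indexation into a segment's array.
final class SegmentIndexing {
    private final byte[] data;
    private final int pos;   // index of the first readable byte
    private final int limit; // index right after the last readable byte

    SegmentIndexing(byte[] data, int pos, int limit) {
        this.data = data;
        this.pos = pos;
        this.limit = limit;
    }

    // Number of readable bytes in the segment.
    int size() {
        return limit - pos;
    }

    // Direct indexation: the caller works with absolute array indices.
    byte getAbsolute(int arrayIndex) {
        return data[arrayIndex];
    }

    // Indirect indexation: the caller passes a logical index in [0, size),
    // and the accessor adds pos on every call.
    byte getRelative(int index) {
        return data[pos + index];
    }
}
```

With pos = 1, getAbsolute(1) and getRelative(0) read the same byte; the relative form is easier to use correctly but performs one extra addition per access.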

Unfortunately, the dbb-benchmarking branch showed a significant performance drop in almost all scenarios where a Segment's data was accessed at shorter-than-int granularity. Various factors affect that result: the need to perform type checks on every call to inlined ByteBuffer methods (to ensure that the receiver is an instance of DirectByteBuffer), range checks requiring access to the byte buffer's state, and more code generated for every segment access (for UTF-8 string encoding, it increases register pressure in JIT-compiled code and leads to more spills/fills being emitted).

To shrink the performance gap between the byte array and byte buffer based implementations, I tried using Unsafe to access the memory region assigned to a DirectByteBuffer (the private/dbb-benchmarking-unsafe branch). Using Unsafe bypasses type checks (the target is a Long holding an address) and range checks (it's unsafe, right? :) and results in better performance for string encoding/decoding cases (in some cases it now outperforms the develop branch). I concentrated on string ops performance and didn't check the IndexOf methods, so their performance remained poor in that branch.

Android results

For Android, I ported most of the core benchmarks to androidx-benchmark (the android-benchmarks branch forked from develop), forked the corresponding "JVM" branches and merged the androidx branch into each of them (the private/segments-public-api-android and private/dbb-benchmarking-android branches).

Below are results gathered from a device:

Benchmark	Parameters	Baseline, avg. time	0.999-CI	Seg. public API, avg. time	0.999-CI	Improvement	DirectBB, avg. time	0.999-CI	Improvement
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteByteArray	size=1	102.622 ns	±0.060 ns	105.319 ns	±0.067 ns	-2.6 %	262.792 ns	±0.620 ns	-156.1 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteByteArray	size=1024	330.932 ns	±0.471 ns	331.846 ns	±0.938 ns	N/A	469.693 ns	±1.693 ns	-41.9 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteByteArray	size=24576	4.595 us	±0.036 us	4.852 us	±0.041 us	-5.6 %	6.976 us	±0.057 us	-51.8 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteNewByteArray	size=1	188.783 ns	±7.791 ns	190.824 ns	±6.285 ns	N/A	365.952 ns	±22.233 ns	-93.8 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteNewByteArray	size=1024	2.900 us	±0.058 us	2.953 us	±0.046 us	N/A	3.351 us	±0.062 us	-15.6 %
kx.io.b.a.ByteArrayReadWriteBenchmarks.readWriteNewByteArray	size=24576	35.271 us	±0.196 us	35.465 us	±0.310 us	N/A	38.625 us	±0.538 us	-9.5 %
kx.io.b.a.DecimalLongBenchmark.decLongRW	value='-9223372036854775806'	750.906 ns	±0.505 ns	381.942 ns	±0.165 ns	49.1 %	1034.825 ns	±8.274 ns	-37.8 %
kx.io.b.a.DecimalLongBenchmark.decLongRW	value='9223372036854775806'	760.751 ns	±0.397 ns	406.559 ns	±0.214 ns	46.6 %	1073.903 ns	±0.721 ns	-41.2 %
kx.io.b.a.DecimalLongBenchmark.decLongRW	value='1'	124.628 ns	±0.064 ns	118.894 ns	±0.116 ns	4.6 %	185.450 ns	±0.111 ns	-48.8 %
kx.io.b.a.HexadecimalLongBenchmark.hexLongRW	value='9223372036854775806'	544.825 ns	±0.293 ns	279.604 ns	±0.126 ns	48.7 %	704.703 ns	±0.818 ns	-29.3 %
kx.io.b.a.HexadecimalLongBenchmark.hexLongRW	value='1'	163.418 ns	±0.076 ns	218.349 ns	±0.141 ns	-33.6 %	219.523 ns	±0.157 ns	-34.3 %
kx.io.b.a.IndexOfBenchmark.indexOf	params='128:0:-1'	341.251 ns	±0.236 ns	334.579 ns	±0.111 ns	2.0 %	1899.041 ns	±10.569 ns	-456.5 %
kx.io.b.a.IndexOfBenchmark.indexOf	params='128:0:7'	57.001 ns	±0.036 ns	48.672 ns	±0.173 ns	14.6 %	150.282 ns	±0.083 ns	-163.6 %
kx.io.b.a.IndexOfBenchmark.indexOf	params='128:0:100'	274.937 ns	±0.123 ns	267.822 ns	±0.053 ns	2.6 %	1500.502 ns	±1.361 ns	-445.8 %
kx.io.b.a.IndexOfBenchmark.indexOf	params='128:8128:100'	299.373 ns	±0.116 ns	282.945 ns	±0.165 ns	5.5 %	1528.276 ns	±1.129 ns	-410.5 %
kx.io.b.a.IndexOfBenchmark.indexOf	params='24576:0:-1'	57.637 us	±0.023 us	57.624 us	±0.046 us	N/A	355.938 us	±0.144 us	-517.6 %
kx.io.b.a.IndexOfByteString.indexOf	params='1024:2'	15.666 us	±0.103 us	7.041 us	±0.031 us	55.1 %	38.340 us	±0.195 us	-144.7 %
kx.io.b.a.IndexOfByteString.indexOf	params='8192:2'	124.209 us	±0.239 us	54.966 us	±0.024 us	55.7 %	305.591 us	±0.179 us	-146.0 %
kx.io.b.a.IndexOfByteString.indexOf	params='10000:2'	150.761 us	±0.120 us	66.998 us	±0.028 us	55.6 %	372.751 us	±0.290 us	-147.2 %
kx.io.b.a.IndexOfByteString.indexOf	params='10000:8'	250.041 us	±1.042 us	148.644 us	±0.065 us	40.6 %	852.738 us	±1.098 us	-241.0 %
kx.io.b.a.IntegerValuesBenchmark.byteRW		36.208 ns	±0.015 ns	25.294 ns	±0.024 ns	30.1 %	53.608 ns	±0.021 ns	-48.1 %
kx.io.b.a.IntegerValuesBenchmark.intRW		43.176 ns	±0.015 ns	33.737 ns	±0.015 ns	21.9 %	55.921 ns	±0.020 ns	-29.5 %
kx.io.b.a.IntegerValuesBenchmark.intLeRW		42.785 ns	±0.031 ns	33.179 ns	±0.024 ns	22.5 %	54.095 ns	±0.027 ns	-26.4 %
kx.io.b.a.IntegerValuesBenchmark.longLeRW		56.786 ns	±0.042 ns	49.917 ns	±0.486 ns	12.1 %	64.354 ns	±0.067 ns	-13.3 %
kx.io.b.a.IntegerValuesBenchmark.longRW		46.210 ns	±0.024 ns	40.503 ns	±0.052 ns	12.3 %	56.023 ns	±0.036 ns	-21.2 %
kx.io.b.a.IntegerValuesBenchmark.shortLeRW		41.546 ns	±0.023 ns	27.153 ns	±0.024 ns	34.6 %	55.189 ns	±0.025 ns	-32.8 %
kx.io.b.a.IntegerValuesBenchmark.shortRW		42.347 ns	±0.030 ns	26.361 ns	±0.020 ns	37.8 %	57.663 ns	±0.064 ns	-36.2 %
kx.io.b.a.Utf8LineBenchmarks.readLine	length=17, separator='LF'	783.308 ns	±12.860 ns	787.348 ns	±14.717 ns	N/A	1596.158 ns	±115.792 ns	-103.8 %
kx.io.b.a.Utf8LineBenchmarks.readLine	length=17, separator='CRLF'	829.941 ns	±12.269 ns	843.887 ns	±36.394 ns	N/A	1673.793 ns	±79.104 ns	-101.7 %
kx.io.b.a.Utf8LineBenchmarks.readLineStrict	length=17, separator='LF'	780.764 ns	±13.163 ns	802.789 ns	±40.353 ns	N/A	1609.004 ns	±125.967 ns	-106.1 %
kx.io.b.a.Utf8LineBenchmarks.readLineStrict	length=17, separator='CRLF'	893.010 ns	±18.179 ns	844.291 ns	±16.967 ns	5.5 %	1675.612 ns	±32.163 ns	-87.6 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='ascii', length=20	648.041 ns	±13.030 ns	681.464 ns	±12.634 ns	-5.2 %	1297.408 ns	±34.402 ns	-100.2 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='ascii', length=2000	28.898 us	±0.432 us	29.776 us	±0.432 us	-3.0 %	86.367 us	±0.979 us	-198.9 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='ascii', length=200000	2.733 ms	±0.034 ms	2.777 ms	±0.033 ms	N/A	6.401 ms	±0.080 ms	-134.2 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='utf8', length=20	1.047 us	±0.014 us	1.092 us	±0.017 us	-4.3 %	2.078 us	±0.109 us	-98.5 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='utf8', length=2000	83.576 us	±2.004 us	86.883 us	±1.699 us	N/A	189.267 us	±1.966 us	-126.5 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='utf8', length=200000	8.096 ms	±0.162 ms	8.117 ms	±0.140 ms	N/A	15.218 ms	±0.214 ms	-88.0 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='sparse', length=20	752.265 ns	±17.496 ns	788.823 ns	±16.983 ns	-4.9 %	1497.804 ns	±44.346 ns	-99.1 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='sparse', length=2000	28.831 us	±0.568 us	29.218 us	±0.592 us	N/A	88.670 us	±1.441 us	-207.6 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='sparse', length=200000	2.692 ms	±0.045 ms	2.658 ms	±0.032 ms	N/A	6.387 ms	±0.075 ms	-137.3 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='2bytes', length=20	1.159 us	±0.022 us	1.234 us	±0.026 us	-6.4 %	2.316 us	±0.072 us	-99.9 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='2bytes', length=2000	79.157 us	±2.158 us	77.157 us	±1.695 us	N/A	169.439 us	±1.949 us	-114.1 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='2bytes', length=200000	7.409 ms	±0.149 ms	7.276 ms	±0.145 ms	N/A	13.839 ms	±0.167 ms	-86.8 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='3bytes', length=20	1.413 us	±0.026 us	1.449 us	±0.030 us	N/A	3.224 us	±0.127 us	-128.1 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='3bytes', length=2000	99.928 us	±1.888 us	96.950 us	±1.831 us	N/A	227.922 us	±2.983 us	-128.1 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='3bytes', length=200000	9.156 ms	±0.114 ms	9.142 ms	±0.174 ms	N/A	19.725 ms	±0.269 ms	-115.4 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='4bytes', length=20	1.034 us	±0.016 us	1.100 us	±0.061 us	N/A	2.347 us	±0.157 us	-127.0 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='4bytes', length=2000	64.705 us	±1.421 us	65.686 us	±1.500 us	N/A	172.299 us	±2.715 us	-166.3 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='4bytes', length=200000	6.099 ms	±0.086 ms	6.021 ms	±0.095 ms	N/A	13.489 ms	±0.163 ms	-121.2 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='bad', length=20	991.498 ns	±36.972 ns	1035.136 ns	±52.328 ns	N/A	1536.603 ns	±136.426 ns	-55.0 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='bad', length=2000	66.600 us	±1.704 us	68.397 us	±2.087 us	N/A	111.606 us	±2.014 us	-67.6 %
kx.io.b.a.Utf8Benchmark.readWriteString	encoding='bad', length=200000	6.618 ms	±0.125 ms	6.864 ms	±0.169 ms	N/A	9.545 ms	±0.104 ms	-44.2 %

The results suggest that switching to direct byte buffers on Android would lead to a significant performance drop.

The collected results are not in byte buffers' favor (especially on Android); however, it might not be as bad as it seems in the context of a particular application. Also, these results correspond to kotlinx.io.Buffer performance, and as was shown previously, direct byte buffers show some performance improvement when it comes to I/O operations.

To check these two statements, I added kotlinx-io support to kotlinx.serialization (to its fork: https://github.com/fzhinkin/kotlinx.serialization) and added benchmarks to see how well kotlinx-io performs in JSON-serialization scenarios (and scenarios where the serialized data is then sent to a file).

It would be fair to blame me for checking one of the worst performing scenarios (string encoding), but JSON is an extremely popular serialization format and it's crucial to show good results when using kotlinx-io in the context of JSON serialization.

Below are results collected for both JVM and Android (a subset of serialization benchmarks ported to androidx-benchmark: https://github.com/fzhinkin/kotlinx-serialization-android-benchmarks) by running benchmarks against the aforementioned branches (in fact, there were 4 separate branches where utf8-code-point writing was made public: private/dev-for-serialization, private/public-segments-api-for-serialization, private/dbb-benchmarking-for-serialization and private/dbb-benchmarking-unsafe-for-serialization).

JVM results

Benchmark	Baseline, avg. time	0.999-CI	Seg. public API, avg. time	0.999-CI	Improvement	DirectBB, avg. time	0.999-CI	Improvement	DirectBB w/ Unsafe, avg. time	0.999-CI	Improvement
k.b.j.CitmBenchmark.encodeCitmKotlinxIo	3.012 ms	±0.040 ms	2.711 ms	±0.064 ms	10.0 %	3.587 ms	±0.867 ms	N/A	2.752 ms	±0.056 ms	8.7 %
k.b.j.CitmBenchmark.encodeCitmKotlinxIoFile	3.024 ms	±0.045 ms	2.816 ms	±0.105 ms	6.9 %	3.691 ms	±0.828 ms	N/A	2.768 ms	±0.042 ms	8.5 %
k.b.j.CitmBenchmark.encodeCitmKotlinxIoFileChannel	2.806 ms	±0.039 ms	2.525 ms	±0.033 ms	10.0 %	4.012 ms	±0.101 ms	-43.0 %	2.461 ms	±0.035 ms	12.3 %
k.b.j.CitmBenchmark.encodeCitmKotlinxIoileChannel	2.770 ms	±0.046 ms	2.532 ms	±0.054 ms	8.6 %	3.374 ms	±0.816 ms	N/A	2.486 ms	±0.042 ms	10.3 %
k.b.j.JacksonComparisonBenchmark.kotlinSmallToKotlinxIo	191.203 ns	±2.059 ns	198.177 ns	±3.050 ns	-3.6 %	195.991 ns	±22.415 ns	N/A	172.998 ns	±17.318 ns	N/A
k.b.j.JacksonComparisonBenchmark.kotlinToKotlinxIo	1.826 us	±0.038 us	1.640 us	±0.043 us	10.2 %	1.622 us	±0.066 us	11.2 %	1.765 us	±0.061 us	N/A
k.b.j.JacksonComparisonBenchmark.kotlinToKotlinxIoFile	2.374 us	±0.035 us	2.139 us	±0.044 us	9.9 %	2.028 us	±0.034 us	14.6 %	2.093 us	±0.040 us	11.8 %
k.b.j.JacksonComparisonBenchmark.kotlinToKotlinxIoFileChannel	1.959 us	±0.059 us	1.927 us	±0.007 us	N/A	1.856 us	±0.028 us	5.2 %	1.916 us	±0.019 us	N/A
k.b.j.TwitterBenchmark.encodeTwitterKotlinxIo	147.486 us	±2.530 us	130.995 us	±2.066 us	11.2 %	147.387 us	±7.603 us	N/A	137.207 us	±1.966 us	7.0 %
k.b.j.TwitterBenchmark.encodeTwitterKotlinxIoFile	143.767 us	±0.698 us	137.564 us	±2.993 us	4.3 %	144.946 us	±7.920 us	N/A	135.599 us	±2.992 us	5.7 %
k.b.j.TwitterBenchmark.encodeTwitterKotlinxIoFileChannel	134.166 us	±2.064 us	124.670 us	±2.548 us	7.1 %	130.584 us	±5.718 us	N/A	126.153 us	±1.929 us	6.0 %
k.b.j.TwitterFeedBenchmark.encodeTwitterKotlinxIo	2.064 ms	±0.215 ms	1.916 ms	±0.209 ms	N/A	2.823 ms	±0.570 ms	N/A	2.023 ms	±0.351 ms	N/A
k.b.j.TwitterFeedBenchmark.encodeTwitterKotlinxIoFile	1.914 ms	±0.042 ms	1.894 ms	±0.125 ms	N/A	2.593 ms	±0.891 ms	N/A	1.743 ms	±0.036 ms	9.0 %
k.b.j.TwitterFeedBenchmark.encodeTwitterKotlinxIoFileChannel	1.813 ms	±0.028 ms	1.632 ms	±0.031 ms	10.0 %	2.600 ms	±0.726 ms	-43.5 %	1.683 ms	±0.049 ms	7.2 %

Android results

Benchmark	Baseline, avg. time	0.999-CI	Seg. public API, avg. time	0.999-CI	Improvement	DirectBB, avg. time	0.999-CI	Improvement
o.e.Benchmarks.citm	26.505 ms	±1.471 ms	26.740 ms	±1.812 ms	N/A	35.307 ms	±2.286 ms	-33.2 %
o.e.Benchmarks.citmFile	27.038 ms	±1.647 ms	27.048 ms	±1.709 ms	N/A	35.528 ms	±2.409 ms	-31.4 %
o.e.Benchmarks.citmFileChannel	26.383 ms	±1.389 ms	25.932 ms	±1.253 ms	N/A	33.768 ms	±2.132 ms	-28.0 %
o.e.Benchmarks.twitterMacro	10.769 ms	±0.456 ms	11.089 ms	±0.460 ms	N/A	18.691 ms	±0.581 ms	-73.6 %
o.e.Benchmarks.twitterMacroFile	11.399 ms	±0.431 ms	11.652 ms	±0.463 ms	N/A	19.281 ms	±0.679 ms	-69.2 %
o.e.Benchmarks.twitterMacroFileChannel	11.918 ms	±0.582 ms	11.847 ms	±0.693 ms	N/A	19.183 ms	±1.039 ms	-60.9 %
o.e.Benchmarks.twitter	915.832 us	±40.304 us	926.380 us	±40.005 us	N/A	1546.619 us	±51.164 us	-68.9 %
o.e.Benchmarks.twitterFile	954.142 us	±41.837 us	971.910 us	±39.710 us	N/A	1547.613 us	±56.647 us	-62.2 %
o.e.Benchmarks.twitterFileChannel	854.305 us	±38.670 us	851.514 us	±37.706 us	N/A	1423.378 us	±42.370 us	-66.6 %

On JVM, byte buffer-backed segments perform better only in conjunction with Unsafe access (and that's a separate topic to discuss); without it, there are some scenarios where they are better as well as scenarios where they are worse. On Android, everything is much simpler: byte buffers are always worse, even if the only non-byte buffer based solution is to copy the data (like in the *FileChannel benchmarks).

[Instead of] Conclusion

I don't have a definite conclusion about using direct byte buffers on JVM: to squeeze the max performance out of them, we have to use Unsafe (the sun.misc/jdk.internal one), and its future in the JDK is not that bright (and I was not able to beat ByteBuffers with MemorySegments created from them).

For Android, it seems like there are no benefits from switching to ByteBuffer even though buffer-based I/O (via NIO channels) seems to be much faster compared to I/O operations involving heap-residing containers (but use of off-heap data may still have some benefits).

Jan 08 '24 17:01 fzhinkin

I've also checked whether having multiple segment types affects the performance when only one type is actually in use (the assumption is that at least the JVM will employ CHA to avoid redundant type checks).

There's a branch (that won't compile to any target except JVM) private/polymorphic-segments where Segment was turned into an abstract class with two implementations - one with ByteArray inside (based on the private/segments-public-api branch) and another with the ByteBuffer inside (based on private/dbb-benchmarking branch). For the benchmarking purposes, ByteBuffer-backed segments were never loaded during the experiments (verified with class loading logs).
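The shape of such a hierarchy can be sketched in Java roughly as follows. All names here (DemoSegment, ArraySegment, BufferSegment) are hypothetical and much simpler than the real Segment class in the private/polymorphic-segments branch; the sketch only shows one abstract type with an array-backed and a direct-buffer-backed implementation.

```java
import java.nio.ByteBuffer;

// Hypothetical, simplified segment abstraction with two backing storages.
abstract class DemoSegment {
    abstract byte get(int index);
    abstract void set(int index, byte value);
    abstract int capacity();
}

// Byte-array-backed implementation (the only one loaded in the experiment described above).
final class ArraySegment extends DemoSegment {
    private final byte[] data;

    ArraySegment(int capacity) {
        this.data = new byte[capacity];
    }

    @Override byte get(int index) { return data[index]; }
    @Override void set(int index, byte value) { data[index] = value; }
    @Override int capacity() { return data.length; }
}

// Direct-ByteBuffer-backed implementation using absolute get/put.
final class BufferSegment extends DemoSegment {
    private final ByteBuffer data;

    BufferSegment(int capacity) {
        this.data = ByteBuffer.allocateDirect(capacity);
    }

    @Override byte get(int index) { return data.get(index); }
    @Override void set(int index, byte value) { data.put(index, value); }
    @Override int capacity() { return data.capacity(); }
}
```

As long as only one subclass has been loaded, a JIT compiler that relies on class hierarchy analysis can treat calls through DemoSegment as calls to that single implementation, which is the assumption the experiment was meant to check.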

I won't post a large table as above and will just briefly summarize the results:

on JVM, there is no significant difference between results gathered for private/segments-public-api and private/polymorphic-segments branches: that's good, presence of ByteBuffer-backed segments won't affect those who don't need them;
on Android, the situation is different: use of polymorphic segments makes performance worse. It's true even w/ R8 applied with a config that allows treating ByteBuffer-backed segments allocation path unreachable.

JVM benchmarking results are here and Android benchmarking results are here.

Feb 01 '24 13:02 fzhinkin

With all that being said about the performance aspect of ByteBuffer support, it's also worth mentioning that ByteBuffers on JVM and native-pointer-based segments on native would help with supporting memory-mapped files. With array-backed segments only, memory-mapped files would require an additional class/interface. With polymorphic segments, we could (not without caveats) wrap a ByteBuffer or an mmapped pointer into a segment as a whole.
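On JVM, a memory-mapped file is already exposed as a ByteBuffer subclass, which is why ByteBuffer-backed segments fit this use case. A minimal Java sketch, with a hypothetical class name and helper, that maps a temp file and reads its bytes through the mapping:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo, not part of kotlinx-io.
final class MappedFileDemo {
    // Writes content to a temp file, maps the file read-only and reads it back via the mapping.
    static byte[] readThroughMapping(byte[] content) {
        try {
            Path file = Files.createTempFile("mmap-demo", ".bin");
            file.toFile().deleteOnExit(); // a mapped file may not be deletable right away
            Files.write(file, content);
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                MappedByteBuffer mapped =
                        channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
                byte[] result = new byte[mapped.remaining()];
                mapped.get(result); // bulk read from the mapped region
                return result;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The MappedByteBuffer returned by FileChannel.map could be handed to a buffer-backed segment as is, whereas an array-backed segment would need the data copied into its array first.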

May 07 '24 11:05 fzhinkin

Probably we can support ByteBuffers on JVM without hurting performance on Android by publishing a multi-release jar with a baseline implementation remaining the same (byte-array backed) but with polymorphic segments and BB-support enabled for, let's say, JDK9 and onwards. Android tooling ignores MRJ-stuff while dexing, so the trick might work. 👿

I don't think it's a solution we should/could stick to, but that could solve an issue.

Jun 10 '24 11:06 fzhinkin
