
Give option to reuse byte[] or add apis supporting ByteBuffer

Open vachagan-balayan-bullish opened this issue 1 month ago • 11 comments

It is absurd to me that people to this day make the exact same mistakes every serialization framework has made for 20 years. We know at this point that the two biggest allocation sources in almost every application are serialization/deserialization and logging. For the love of god, implement some basic mechanisms to avoid these. You have already done the hardest part.

1) The easiest, laziest way: add an optional argument so I can choose where you allocate byte[] from, if you absolutely have to allocate those.

Instead of

public inline fun <reified T> kotlinx.serialization.BinaryFormat.encodeToByteArray(value: T): kotlin.ByteArray { /* compiled code */ }

add an optional supplier (or a separate method) that allows me to control how those allocations happen; maybe I have a way to reuse byte arrays in a thread-safe way.

supplier: (Int) -> ByteArray = { ByteArray(it) }

Something like this:

inline fun <reified T> kotlinx.serialization.BinaryFormat.encodeToByteArray(
    value: T,
    supplier: (Int) -> ByteArray = { ByteArray(it) }
): Int

inline fun <reified T> kotlinx.serialization.BinaryFormat.decodeFromByteArray(bytes: ByteArray, from: Int, to: Int): T
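A caller-side sketch of what could back such a supplier (the pool class, its name, and its sizing policy are all hypothetical, not part of any existing API): a thread-local buffer that grows geometrically and is handed back for reuse.

```kotlin
// Hypothetical thread-local pool whose supply() matches the proposed
// supplier: (Int) -> ByteArray shape. Not part of any existing API.
object ThreadLocalByteArrayPool {
    private val local: ThreadLocal<ByteArray> = ThreadLocal.withInitial { ByteArray(256) }

    // Returns a buffer of at least `size` bytes, reusing the thread's
    // cached array whenever it is already large enough.
    fun supply(size: Int): ByteArray {
        var buf = local.get()
        if (buf.size < size) {
            // Grow geometrically so repeated calls amortize to O(1) allocations.
            buf = ByteArray(maxOf(size, buf.size * 2))
            local.set(buf)
        }
        return buf
    }
}
```

With the proposed overload this would be passed as `encodeToByteArray(value, ThreadLocalByteArrayPool::supply)`, and the returned Int would tell the caller how many of the buffer's bytes are valid.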

2) or give an option for the user to supply a buffer


public inline fun <reified T> kotlinx.serialization.BinaryFormat.encodeTo(value: T, buffer: ByteBuffer) {/* compiled code */ }

public inline fun <reified T> kotlinx.serialization.BinaryFormat.decodeFrom(buffer: ByteBuffer): T { /* compiled code */ }
public inline fun <reified T> kotlinx.serialization.BinaryFormat.decodeFrom(buffer: ByteBuffer, supplier: () -> T): T { /* compiled code */ }
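The `supplier: () -> T` variant is about reusing the decoded instance itself. A standalone sketch of that pattern for one hand-written mutable class (the class, `decodeTickInto`, and its layout are illustrative, not kotlinx.serialization API):

```kotlin
import java.nio.ByteBuffer

// Illustrative mutable "struct": all fields are var so an instance can be refilled.
class Tick(var price: Long = 0, var qty: Int = 0)

// Hypothetical decode that fills a caller-supplied instance instead of
// allocating a new one on every call.
fun decodeTickInto(buffer: ByteBuffer, supplier: () -> Tick): Tick {
    val t = supplier()            // caller decides: pooled, thread-local, or fresh
    t.price = buffer.getLong()    // relative reads advance the buffer's position
    t.qty = buffer.getInt()
    return t
}
```

A decoder generated at compile time could fill fields the same way, so a hot path decodes thousands of messages into the same instance.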

It's fine if you do some minor allocations while you do the work, but by far the biggest garbage is the byte[] or the instances of the objects themselves; intermediary things get optimised away by the JVM very well.

I think it would be better solved by kotlinx-io. @fzhinkin Do you think we can already use it?

sandwwraith avatar Oct 24 '25 10:10 sandwwraith

@vachagan-balayan-bullish In principle it would be good to avoid allocating buffers etc. The best way would be to encode to streams (that can decide what to do with the data). However, there are limitations due to the multiplatform nature of the library - this is where kotlinx-io would come in.

Note that a callback for allocating a ByteArray would not work as the size of the data is not known ahead of time. ByteBuffers are not available cross-platform (which is where kotlinx-io comes in to provide cross-platform versions of them).
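The "size not known ahead of time" point is easy to see with any variable-length encoding, e.g. a protobuf-style varint, where the byte count depends on the value being written. A minimal standalone illustration (not kotlinx.serialization internals):

```kotlin
// Protobuf-style unsigned varint: 7 payload bits per byte, high bit = "more follows".
// The encoded size (1..10 bytes for a Long) is only known once the value is seen,
// which is exactly why a pre-sized allocation callback cannot work in general.
fun varintSize(value: Long): Int {
    var v = value
    var n = 1
    while (v and 0x7FL.inv() != 0L) {  // more than 7 significant bits left?
        v = v ushr 7                   // unsigned shift: negative values take 10 bytes
        n++
    }
    return n
}
```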

pdvrieze avatar Oct 27 '25 09:10 pdvrieze

IMO you shouldn't make assumptions about what the user is using. The simplest API is byte[] with from/to offsets; if you support that, they can build streams or whatever they use on top of it. Or at least ByteBuffers. But please don't do streams.

IO streams are a much higher-level API; you are assuming that your user is using those. What if I am using something like ZMQ sockets? Or Chronicle Wire? Or Aeron UDP? None of them deal with the bunch of badly written interfaces from the Java SDK.

The common language is byte[] or ByteBuffers. Even with ByteBuffers it's not that trivial: Chronicle and Adaptive have their own versions of a ByteBuffer, and they have zero common interfaces. So ideally it's byte[] with offsets.

I love that kotlinx-serialization is a proper library where I do not need a bunch of other fancy things to just get simple serialization working; I love that it does the work at compile time. Let's keep it that way and just give your users options.

I'm sure they will build all sorts of streams and whatever on top of reusable ByteBuffers and reusable instances... and performance will be crazy.

The problem with ByteArray is that you can't reliably make a reasonable encoding API, because you can't know the array size in advance to allocate it. And ByteBuffers are JVM-only.

kotlinx-io provides a reasonable KMP Buffer implementation, so in terms of abstraction level it will be the same.

sandwwraith avatar Nov 06 '25 05:11 sandwwraith

Well, I'm not sure how the current implementations are done, but I imagine you have some buffer that you control/create/reuse (thread-local); when the user invokes encodeToByteArray(value: T), you dump its contents into a new byte[] and return it.

So instead of allocating a new byte array, just ask my code if I have one for you. Here is the interface I proposed (feel free to improve/modify/suggest better):

inline fun <reified T> BinaryFormat.encodeToByteArray(
    value: T,
    supplier: (Int) -> ByteArray = { ByteArray(it) }
): Int

Once you have done your internal serialization and know the size of the thing, you invoke my supplier with your size, supplier(size), and you get a byte array (or ByteBuffer) where you write everything you want; you then return the actual number of bytes you wrote. This is how most serialization libraries work, with a bunch of buffer copies...

But if you are one of those brave souls who want to do zero copy, I would say invoke the supplier with a big size from the get-go, skip the first 4 bytes for the size (or whatever your format is), do the serialization, then go back, write the size, and return the total bytes...

It's my problem as the user to provide you with a big enough byte array or ByteBuffer... at Chronicle we have elastic byte buffers which expand as you use them... so it's my problem; I just need the option to make it my problem instead of watching my application produce garbage non-stop...
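Under the stated assumption that the format already serializes into an internal scratch buffer, the "ask my code once the size is known" flow could be sketched like this (all names hypothetical; `payload` stands in for the already-serialized bytes):

```kotlin
// Hypothetical shape of the proposed flow: serialize into an internal scratch
// buffer first, then ask the caller's supplier for a destination of the
// now-known size and copy exactly once.
fun encodeWithSupplier(
    payload: ByteArray,
    supplier: (Int) -> ByteArray = { ByteArray(it) },
): Pair<ByteArray, Int> {
    val size = payload.size                  // only known after serialization
    val dest = supplier(size)                // caller may hand back a pooled array
    require(dest.size >= size) { "supplier returned fewer than $size bytes" }
    System.arraycopy(payload, 0, dest, 0, size)
    return dest to size                      // the array plus the number of valid bytes
}
```

With the default supplier this degenerates to today's behaviour (fresh allocation); with a pooling supplier the only remaining garbage is whatever the format allocates internally.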

IMO, since you do a lot of work at compile time, you can distinguish between classes that have variable-length fields in them and the ones that don't (true structs); those true structs can be zero-copy and take advantage of full reuse (assuming all fields are mutable). I would absolutely dump whatever I'm using and use kotlinx.serialization if I could have this. Zero-garbage, zero-copy struct serialization...

So in a nutshell, a few huge deals when it comes to a serialization library:

  1. simple maintainable format
  2. can it do zero garbage (and this is not just because of gcs, garbage = eviction from caches and lots of bad things)
  3. can it do zero copy

When it comes to kotlinx-serialization: 0) 10/10, you nailed this.

  1. you can absolutely do this by adding a few more overloads of your existing methods and letting the user supply the reusable buffer and the reusable object. Then the user can do zero garbage.

  2. you can absolutely do this if the data structure has no variable-length fields, which is IMO easy to find out at compile time. So if a data class is fixed-length, this is fairly easy.

Is there some very basic serializer example for developers who want to contribute/create their own?

I would like to try and implement a very specific binary serializer that only works for very specific data classes.

  • all fields are fixed length
  • all fields are var (no immutability nonsense)
  • fixed-length CharSequence (define the length via an annotation at compile time)
  • support sealed classes
  • no backwards compatibility of any kind (make V2 of your message and deal with migrations at application level)

I want to try and make this zero garbage and zero copy and then compare how much faster my stuff works.
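The fixed-length, all-var "true struct" described above can be modelled by hand to show what such a generated codec might boil down to (the class, field layout, and offsets are all assumptions, not anything kotlinx.serialization generates today):

```kotlin
// Hand-written model of a fixed-layout "struct" codec: every field is var and
// fixed-size, so offsets are compile-time constants and neither encode nor
// decode allocates anything.
class Order(var id: Long = 0, var qty: Int = 0, var side: Byte = 0)

const val ORDER_SIZE = 8 + 4 + 1  // assumed layout: id @ 0, qty @ 8, side @ 12

fun encodeOrder(o: Order, dest: ByteArray, offset: Int): Int {
    var v = o.id                  // little-endian long at offset 0
    for (i in 0 until 8) { dest[offset + i] = (v and 0xFFL).toByte(); v = v ushr 8 }
    var q = o.qty                 // little-endian int at offset 8
    for (i in 0 until 4) { dest[offset + 8 + i] = (q and 0xFF).toByte(); q = q ushr 8 }
    dest[offset + 12] = o.side
    return ORDER_SIZE
}

fun decodeOrder(src: ByteArray, offset: Int, into: Order): Order {
    var v = 0L
    for (i in 7 downTo 0) v = (v shl 8) or (src[offset + i].toLong() and 0xFFL)
    into.id = v
    var q = 0
    for (i in 3 downTo 0) q = (q shl 8) or (src[offset + 8 + i].toInt() and 0xFF)
    into.qty = q
    into.side = src[offset + 12]
    return into
}
```

Encoding writes into a caller-owned array at a caller-chosen offset, and decoding refills a caller-owned instance, which is the zero-garbage round trip argued for above.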

The problem with ByteArray is that you can't reliably make a reasonable encoding API, because you can't know the array size in advance to allocate it. And ByteBuffers are JVM-only.

Looking at the code, you guys are using a byte array output stream which literally just doubles the array and System.arraycopy's every time it's not enough, and towards the end copies into a correctly sized byte array one last time to return to the user. So you already have that problem.

Multiplatform does not matter; you do the same thing for any platform. byte[] is a concept present on every platform.

Also, you can't "do this" in the IO layer; it's a different problem. You already allocate lots of these in the serialization method itself. Everything else is reusable.

So in a nutshell, a few huge deals when it comes to a serialization library:

First of all, kotlinx.serialization is a serialization framework that is both data (what you are storing) and format (how you are storing - e.g. Json/XML/Protobuf/whatever) independent.

  1. simple maintainable format

This is to some degree subjective. Depending on your perspective it could also be dependent on the data that you are serializing.

  1. can it do zero garbage (and this is not just because of gcs, garbage = eviction from caches and lots of bad things)

This would be hard to implement as there are plenty of temporary objects. However, escape analysis in the runtime should be able to capture it and elide it in many cases.

  1. can it do zero copy

This is only possible if you can pre-allocate your storage for all dynamically sized elements (read arrays/lists/maps). It can be done, using a two-pass approach. First calculate the needed size, then allocate, then serialize. Unfortunately calculating the needed size is not much less complex than serialization itself. Note that this holds true both for the serialization as well as the deserialization case.
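The two-pass approach can be illustrated for a length-prefixed string (a minimal sketch under that two-pass assumption; a real format would need a size pass mirroring the whole serializer):

```kotlin
// Two-pass encoding of a length-prefixed UTF-8 string: pass 1 computes the exact
// size, pass 2 writes into an exactly-sized array. Note the size pass has to do
// UTF-8 work too, mirroring "sizing is not much less complex than serializing".
fun sizeOf(s: String): Int =
    4 + s.toByteArray(Charsets.UTF_8).size   // 4-byte length prefix + payload

fun encodeString(s: String): ByteArray {
    val payload = s.toByteArray(Charsets.UTF_8)
    val out = ByteArray(4 + payload.size)    // single exact allocation, no doubling
    var len = payload.size                   // little-endian length prefix
    for (i in 0 until 4) { out[i] = (len and 0xFF).toByte(); len = len ushr 8 }
    payload.copyInto(out, destinationOffset = 4)
    return out
}
```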

When it comes to kotlinx-serialization: 0) 10/10, you nailed this.

  1. you can absolutely do this by adding a few more overloads of your existing methods and letting the user supply the reusable buffer and the reusable object. Then the user can do zero garbage.

There is no fundamental issue why formats couldn't support writing to buffers of some sort. The practical reason is multiplatform and the need to support different contexts. This is also where kotlinx.io comes in.

  1. you can absolutely do this if the data structure has no variable-length fields, which is IMO easy to find out at compile time. So if a data class is fixed-length, this is fairly easy.

kotlinx.serialization is a generic system and must support variable-length data. This is also fairly common, as even strings are variable-length. For text formats (such as JSON), even numeric data is variable-length.

Finally, I am not clear what the issue with an OutputStream-like abstraction is in this context. A buffer would not be the best abstraction, as in a network context there is no reason to keep the buffer around rather than sending it to the receiver as soon as a full packet is available.

pdvrieze avatar Nov 07 '25 09:11 pdvrieze

By "simple maintainable format" I mean how annoying it is to define and change new types: is it some XML monstrosity, or as simple as what you do (slap an annotation on and it's done by the compiler)? That's what you nailed.

There is zero reason why you can't have the insane performance of something like FlatBuffers or SBE with the ease of use of adding an annotation and letting a compiler do the heavy lifting of schemas and serializer code. That would be the ultimate serialization library (yes, from my perspective it's a library, and I would only use it as a library; if I can use it without anything forcing me to use kotlinx-io or some "framework", then it's a library).

I understand you want to target multiplatform; again, nothing in what I suggested limits that. You'd just define some common interface of what a ByteBuffer is, and each platform would have its own version of it; but byte[] is something that every platform speaks.
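The "common interface with per-platform versions" idea could look roughly like this in common code (an illustrative sketch with invented names; kotlinx-io's Buffer already plays this role in practice):

```kotlin
// Illustrative common abstraction over per-platform buffers: common code would
// see only the interface, while each platform supplies its own implementation
// (on the JVM it could wrap java.nio.ByteBuffer, on Native a raw allocation).
interface WritableBuffer {
    fun writeByte(b: Byte)
    fun writeBytes(src: ByteArray, fromIndex: Int, toIndex: Int)
}

// Trivial ByteArray-backed implementation, enough for the sketch.
class ArrayBuffer(capacity: Int) : WritableBuffer {
    val data = ByteArray(capacity)
    var position = 0
        private set

    override fun writeByte(b: Byte) { data[position++] = b }

    override fun writeBytes(src: ByteArray, fromIndex: Int, toIndex: Int) {
        src.copyInto(data, destinationOffset = position, startIndex = fromIndex, endIndex = toIndex)
        position += toIndex - fromIndex
    }
}
```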

My experience across many projects at this point is that most companies/products start with something convenient, like JSON in the old days, or protobuf/gRPC these days, and soon probably kotlinx.serialization; then as the products mature and performance inevitably starts to really matter, they discover how much their serialization hits them. Then they do some benchmarks and inevitably move to something like FlatBuffers, SBE or Chronicle Wire. Why not implement something that is great at both ease of use and performance?

Given that Kotlin Multiplatform also targets things like Native, performance is going to be a huge deal for those users; they are going to like the ease of use but hack together a different solution that gives them the performance.

I'll try to implement what I suggested using the existing codecs when I have some free time. The hard part is already done in the core module.

That's just my two cents; I just think it's a missed opportunity. I've seen countless serialization frameworks/formats come and go, and the ones that survived to this day are those that offer some ease of use while nailing the performance.