scala-offheap

UTF8 string serialization

Open velvia opened this issue 9 years ago • 8 comments

Strings form a large portion of many objects. Just storing a pointer to the on-heap String object is not a practical way to reduce GC pressure. Instead, how about having a UTF8-based string wrapper class that can offer support for basic operations:

equals()
startsWith()
maybe contains()

Other, more complex methods can be delegated to the native Java/Scala String class by decoding to an on-heap string on demand, but the above would offer enough support for simple things like HTTP or JSON parsing.

The goal is to allow basic, fast string operations without the expensive conversion and object allocation needed to decode UTF-8 bytes into Java's native UTF-16 String representation.
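To make the proposal concrete, here is a minimal sketch of such a wrapper. It is an illustration only, not the scala-offheap API: the class name `OffheapUtf8` and all of its methods are hypothetical, and it assumes a direct `ByteBuffer` as a stand-in for whatever off-heap allocation the library would actually use. Comparisons are plain byte comparisons with no Unicode normalization.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper (not part of scala-offheap): a view over UTF-8 bytes
// stored in off-heap (direct) memory.
final class OffheapUtf8 {
    // Direct buffer with position 0 and limit == byte length; only absolute
    // gets are used, so the position never moves.
    private final ByteBuffer bytes;

    private OffheapUtf8(ByteBuffer bytes) { this.bytes = bytes; }

    // Encodes the string once and copies the bytes off-heap.
    static OffheapUtf8 of(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer direct = ByteBuffer.allocateDirect(encoded.length);
        direct.put(encoded);
        direct.flip();
        return new OffheapUtf8(direct);
    }

    int length() { return bytes.limit(); }

    // True if `other` occurs at byte offset `from` of this string.
    private boolean regionMatches(int from, OffheapUtf8 other) {
        if (from < 0 || from + other.length() > length()) return false;
        for (int i = 0; i < other.length(); i++) {
            if (bytes.get(from + i) != other.bytes.get(i)) return false;
        }
        return true;
    }

    boolean contentEquals(OffheapUtf8 other) {
        return length() == other.length() && regionMatches(0, other);
    }

    boolean startsWith(OffheapUtf8 prefix) { return regionMatches(0, prefix); }

    // Naive O(n*m) scan; byte-level matching is sound for valid UTF-8
    // because lead bytes and continuation bytes never overlap.
    boolean contains(OffheapUtf8 needle) {
        for (int i = 0; i + needle.length() <= length(); i++) {
            if (regionMatches(i, needle)) return true;
        }
        return false;
    }

    // Delegation path: decode to an on-heap String only when needed.
    String decode() {
        return StandardCharsets.UTF_8.decode(bytes.duplicate()).toString();
    }
}
```

With this shape, something like `OffheapUtf8.of("GET /index.html").startsWith(OffheapUtf8.of("GET "))` never allocates a `String`, while `decode()` remains available for the more complex operations.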

velvia avatar Mar 18 '15 16:03 velvia

I think that having support for offheap strings in the API is a great idea. I'm not sure about details of the implementation yet, but I'll update the issue once I have some more concrete thoughts on the topic.

densh avatar Mar 18 '15 23:03 densh

I agree that the conversion to String is indeed expensive and incurs an unnecessary object allocation. Still, you'll have a hard time beating the performance of String#equals() since the JVM has an intrinsic method that uses SSE4.2 instructions to do the comparison. You might be able to use Arrays.equals (which is also intrinsic) but then you'd incur an allocation since you need to create a byte array from off heap memory. I'm curious to see what you come up with. :smile:
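For reference, the `Arrays.equals` route described above might look like the sketch below. The class and method names are hypothetical, and a direct `ByteBuffer` is assumed as the off-heap storage. The point it illustrates is the allocation: every comparison first copies both off-heap regions into fresh on-heap arrays.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper illustrating the copy-then-compare approach.
final class CopyThenCompare {
    // Test/demo helper: stores a string's UTF-8 bytes in a direct buffer.
    static ByteBuffer offheapUtf8(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer direct = ByteBuffer.allocateDirect(encoded.length);
        direct.put(encoded);
        direct.flip();
        return direct;
    }

    // Copies the remaining bytes into a new on-heap array (one allocation).
    static byte[] toHeap(ByteBuffer offheap) {
        byte[] copy = new byte[offheap.remaining()];
        // duplicate() keeps the caller's position untouched.
        offheap.duplicate().get(copy);
        return copy;
    }

    // Two allocations and two copies per call, then Arrays.equals.
    static boolean equalsViaArrays(ByteBuffer a, ByteBuffer b) {
        return Arrays.equals(toHeap(a), toHeap(b));
    }
}
```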

andresilva avatar Mar 20 '15 23:03 andresilva

Unsafe has copyMemory (a memcpy equivalent); too bad it doesn't have a memcmp equivalent... :(
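In the absence of a memcmp-style primitive, one option that stays in pure Java is to compare eight bytes at a time and finish with a byte-wise tail. The sketch below assumes direct `ByteBuffer`s with the same byte order rather than raw `Unsafe` addresses, and the names are hypothetical. Whether it gets anywhere near the `String#equals` intrinsic would have to be benchmarked.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical memcmp-style equality check over off-heap buffers.
final class WordwiseEquals {
    // Test/demo helper: stores a string's UTF-8 bytes in a direct buffer.
    static ByteBuffer offheapUtf8(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        ByteBuffer direct = ByteBuffer.allocateDirect(encoded.length);
        direct.put(encoded);
        direct.flip();
        return direct;
    }

    // Compares the remaining bytes of both buffers without allocating.
    // Assumes both buffers use the same byte order.
    static boolean bytesEqual(ByteBuffer a, ByteBuffer b) {
        int n = a.remaining();
        if (n != b.remaining()) return false;
        int pa = a.position();
        int pb = b.position();
        int i = 0;
        // Main loop: one 8-byte word per iteration, absolute reads only.
        for (; i + Long.BYTES <= n; i += Long.BYTES) {
            if (a.getLong(pa + i) != b.getLong(pb + i)) return false;
        }
        // Tail: the remaining 0 to 7 bytes.
        for (; i < n; i++) {
            if (a.get(pa + i) != b.get(pb + i)) return false;
        }
        return true;
    }
}
```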

velvia avatar Mar 21 '15 03:03 velvia

JNI might be the answer here. Since we don't need to copy any data over (the data is already effectively allocated on the C heap), we shouldn't have much performance overhead. Of course, we need to benchmark to validate this.

densh avatar Mar 21 '15 16:03 densh

Hi Denys,

With the jemalloc JNI binding, we can also add utility functions to expose low-level operations, or potentially SIMD instructions. For the latter, I think we might have to be careful about the chipset family of the target platforms. I can dig into some of the HotSpot code from OpenJDK and check their implementation. For now I can put this work into a parallel branch while we flesh out the jemalloc binding, and plan to include it in the JNI library that houses jemalloc.

arosenberger avatar Jun 13 '15 17:06 arosenberger

@arosenberger Please don't use GPL code bases as a reference. We use the Scala license (a 3-clause BSD derivative) for our code and can only borrow implementation ideas from software with a compatible license. Otherwise we might get into legal trouble some day, even if we don't borrow any code. (Note to self: this really needs to be documented somewhere.)

densh avatar Jun 13 '15 18:06 densh

@arosenberger I think that we need to concentrate on getting 0.1 out before we proceed with this. I'm afraid there are lots of corner cases in string support and it will take a while to get it right.

densh avatar Jun 13 '15 19:06 densh

Thanks for the heads up on the GPL. I'll focus on finishing up jemalloc and adding the ArrayOps methods from the other issues. We can revisit this one down the road.

arosenberger avatar Jun 13 '15 20:06 arosenberger