
Vert.x pooled allocator should be the same as Netty's

Open franz1981 opened this issue 1 year ago • 19 comments

Currently, vertx (see https://github.com/eclipse-vertx/vert.x/blob/adbe97600cc6215f15ce2fac629c89f93618ca8f/src/main/java/io/vertx/core/buffer/impl/VertxByteBufAllocator.java#L25) is NOT reusing Netty's PooledByteBufAllocator.DEFAULT, which causes the creation of additional thread-local direct buffers and arenas, enlarging the RSS footprint of a vertx application for no reason.

What's the reason behind this choice @vietj?

The reason it should be changed is to "ease" the life of users and libraries that allocate Netty direct buffers through the Netty allocator and can end up allocating new arenas because of this.

If the aforementioned pool re-uses the Netty one, it clearly gets some additional contention, but it will save memory, which seems a reasonable trade-off.
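
For illustration, here is a minimal plain-Netty sketch (no Vert.x types) of the duplication described above: a second pooled allocator owns its own arenas and thread-local caches, so direct memory gets reserved on top of whatever PooledByteBufAllocator.DEFAULT already holds.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

// Minimal sketch: each PooledByteBufAllocator instance owns its own arenas and
// thread-local caches, so allocating from two pooled allocators reserves direct
// memory twice instead of sharing it.
public class ArenaDuplicationDemo {

  public static void main(String[] args) {
    // The shared Netty allocator that libraries typically use.
    PooledByteBufAllocator shared = PooledByteBufAllocator.DEFAULT;
    // A separate pooled allocator, as the linked Vert.x code creates instead of reusing DEFAULT.
    PooledByteBufAllocator separate = new PooledByteBufAllocator(true);

    ByteBuf a = shared.directBuffer(64 * 1024);
    ByteBuf b = separate.directBuffer(64 * 1024);

    // Each allocator reports its own arenas and its own pooled direct memory:
    // nothing is shared between the two, even though both are "pooled".
    System.out.println("shared arenas   = " + shared.metric().numDirectArenas()
        + ", used direct = " + shared.metric().usedDirectMemory());
    System.out.println("separate arenas = " + separate.metric().numDirectArenas()
        + ", used direct = " + separate.metric().usedDirectMemory());

    a.release();
    b.release();
  }
}
```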

franz1981 avatar Mar 26 '24 13:03 franz1981

With this as is, it is impossible to use the new adaptive allocator too.

@franz1981 this might affect quarkus too

Context: https://github.com/netty/netty/pull/13976

zekronium avatar Jul 30 '24 03:07 zekronium

The reality seems more complex, see https://github.com/eclipse-vertx/vert.x/blob/7fdc3980c4e70bb71d3b27d471a41f17db581bca/vertx-core/src/main/java/io/vertx/core/net/impl/NetServerImpl.java#L516-L518

In short: with SSL we use a different allocator (here -> https://github.com/eclipse-vertx/vert.x/blob/master/vertx-core/src/main/java/io/vertx/core/buffer/impl/PartialPooledByteBufAllocator.java), which uses the custom vertx allocator mentioned at https://github.com/eclipse-vertx/vert.x/issues/5168#issue-2208289433, while without SSL we just use the default Netty allocator.

To enable the adaptive one we should "at least" move to a different entry point and pick up the Netty-configured "default" one, i.e. https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/ByteBufAllocator.java#L24
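
A small sketch of that entry point, assuming a Netty 4.1.x release recent enough to recognize adaptive as a value of io.netty.allocator.type: code that goes through ByteBufAllocator.DEFAULT picks up whatever allocator Netty was configured with, while code hard-coding PooledByteBufAllocator.DEFAULT does not.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

// Sketch of the "different entry point" mentioned above: ByteBufAllocator.DEFAULT
// is resolved from the io.netty.allocator.type system property, so code that uses
// it can be switched to the adaptive allocator without code changes.
public class DefaultAllocatorEntryPoint {

  public static void main(String[] args) {
    // Typically set on the command line, e.g. -Dio.netty.allocator.type=adaptive;
    // printed here only to show which knob ByteBufAllocator.DEFAULT reacts to.
    System.out.println("io.netty.allocator.type = "
        + System.getProperty("io.netty.allocator.type", "<unset, Netty picks its default>"));

    ByteBufAllocator configured = ByteBufAllocator.DEFAULT;
    System.out.println("configured allocator: " + configured.getClass().getSimpleName());

    ByteBuf buf = configured.directBuffer(256);
    buf.release();
  }
}
```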

franz1981 avatar Jul 30 '24 11:07 franz1981

@vietj @cescoffier this is something to fix on quarkus as well, I think

This can affect quarkus as well if we don't reuse the allocator configured on the Netty instance, which could be different from the singleton at PooledByteBufAllocator.DEFAULT, causing both to be allocated (and the RSS to be paid twice).

e.g. https://github.com/quarkusio/quarkus/blob/261cc877718ef24dd681cb1f3bb1547208535fca/independent-projects/vertx-utils/src/main/java/io/quarkus/vertx/utils/AppendBuffer.java#L136

@vietj @geoand In the case above what's the "idiomatic" way for quarkus to obtain the configured allocator? I cannot see any chain of calls to get there, TBH.
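
For reference, at the plain Netty level the allocator actually configured for a connection is reachable from the Channel, e.g. ctx.alloc() inside a handler. The sketch below only shows that Netty idiom; the open question is precisely that Vert.x doesn't expose an equivalent public call at the point where code like AppendBuffer allocates.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// Netty-level sketch: use the allocator configured on the channel instead of
// hard-coding PooledByteBufAllocator.DEFAULT, so pooled, adaptive, or a wrapper
// allocator is honored transparently.
public class AllocatorAwareHandler extends ChannelInboundHandlerAdapter {

  @Override
  public void channelRead(ChannelHandlerContext ctx, Object msg) {
    // Allocates through whatever allocator the channel was configured with.
    ByteBuf scratch = ctx.alloc().directBuffer(256);
    try {
      // ... fill and write the buffer ...
    } finally {
      scratch.release();
    }
    ctx.fireChannelRead(msg);
  }
}
```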

franz1981 avatar Jul 30 '24 12:07 franz1981

In the case above what's the "idiomatic" way for quarkus to obtain the configured allocator? I cannot see any chain of calls to get there, TBH.

I am not sure I follow this... What exactly are you after here?

geoand avatar Jul 30 '24 12:07 geoand

Yep, so, in short:

  1. based on the SSL configuration, Vertx can use its own PartialPooledByteBufAllocator.INSTANCE or the Netty PooledByteBufAllocator.DEFAULT
  2. Quarkus AppendBuffer (and other code paths) just use the Netty one i.e. PooledByteBufAllocator.DEFAULT
  3. If SSL is configured, the RSS of the off-heap arenas doubles, because the vertx and quarkus direct allocators are different

To solve this, we should obtain from vertx itself the allocator it is configured to use, so we won't duplicate the RSS. Makes sense @geoand?

franz1981 avatar Jul 30 '24 12:07 franz1981

Sure, that makes sense, but it won't really involve any Quarkus APIs or anything - we just need the extension code to do the proper thing

geoand avatar Jul 30 '24 12:07 geoand

IDK @geoand, it probably requires some API change down in Vertx, to make sure we can query the configured allocator from vertx, or do you have a better idea in mind? I'm very open, I don't know yet how to solve it, TBH
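
Something along these lines, purely hypothetical (none of these methods exist in Vert.x today), just to make the idea concrete:

```java
import io.netty.buffer.ByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocator;

// Hypothetical sketch of the kind of API change discussed here: Vert.x would
// expose the allocator it actually configured on its channels, so Quarkus code
// paths (e.g. AppendBuffer) could reuse it instead of PooledByteBufAllocator.DEFAULT.
interface HypotheticalVertxAllocatorAccess {

  // Hypothetical accessor: "the allocator Vert.x configured for this server/client".
  ByteBufAllocator configuredByteBufAllocator();
}

class HypotheticalAppendBuffer {

  private final ByteBufAllocator allocator;

  HypotheticalAppendBuffer(HypotheticalVertxAllocatorAccess vertx) {
    // Prefer whatever Vert.x is using; fall back to the Netty default otherwise.
    this.allocator = vertx != null
        ? vertx.configuredByteBufAllocator()
        : PooledByteBufAllocator.DEFAULT;
  }

  ByteBufAllocator allocator() {
    return allocator;
  }

  public static void main(String[] args) {
    // No Vert.x accessor available -> falls back to the Netty default.
    System.out.println(new HypotheticalAppendBuffer(null).allocator());
  }
}
```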

franz1981 avatar Jul 30 '24 12:07 franz1981

Sounds reasonable

geoand avatar Jul 30 '24 12:07 geoand

I've tried a different, less invasive approach to this: see https://github.com/eclipse-vertx/vert.x/pull/5262

franz1981 avatar Jul 30 '24 13:07 franz1981

@zekronium are you a quarkus or vertx user? Just to understand if I can ask for your feedback, in case you want to give adaptive a shot :)

franz1981 avatar Jul 30 '24 14:07 franz1981

@zekronium are you a quarkus or vertx user? Just to understand if I can ask for your feedback, in case you want to give adaptive a shot :)

98% Vert.x user with BoringSSL (tc-native) and JDK 21 (unsafe enabled with opens and useReflection=true) on top of that too. ~Also run virtual threads in some projects, which, based on the known benefits of the adaptive pooler, might be interesting to see if it helps~

In quarkus currently OpenSSL/BoringSSL is not supported per se from what I've seen in the issues, right?

EDIT: Scratch the Virtual Thread part. If I recall correctly, Vert.x even with VTs performs all the IO on the event loops and returns you (Unsafe)HeapBuffers, which are unpooled AFAIK in all cases and which this new allocator does not cover. So I think, because the virtual threads won't actually interact with the allocator, the benefit of potentially better performance with VTs won't be measurable in my case, if I got that right.

zekronium avatar Jul 30 '24 15:07 zekronium

Exactly @zekronium - probably the only benefit you could get on VT is if you do things on your own using the vertx allocator. And clearly in Quarkus, because we use it for a few things (e.g. jackson json encoding)

franz1981 avatar Jul 30 '24 15:07 franz1981

@franz1981 Is Quarkus jackson json encoding different to the Vert.x one?

zekronium avatar Jul 30 '24 15:07 zekronium

Yes @zekronium, and it will get better with time, like wine :) see https://github.com/quarkusio/quarkus/pull/41063

where we bytecode-generate the serializers to avoid reflection ;)

franz1981 avatar Jul 30 '24 15:07 franz1981

Ah, this reminds me of jsoniter-dynamic; they do similar things, but it's old and unmaintained now. Waiting for the backport :)

zekronium avatar Jul 30 '24 15:07 zekronium

@zekronium are you a quarkus or vertx user? Just to understand if I can ask for your feedback, in case you want to give adaptive a shot :)

I have benchmarked our client application. Which method would you suggest primarily looking at as a common hot path for both adaptive and default, one that would allow a clear, direct comparison?

For apples to apples, that would be newDirectBuffer, which I think is a good one to look at since it universally covers IO with our settings:

[profiler screenshots: newDirectBuffer, default vs adaptive allocator]

This test was run as a simple PoC with quite light load for 2 minutes. With it, there seems to be about a ~3x improvement!

These parts seem to consume slightly more CPU time, but that might be by design?

[profiler screenshots: buffer operations consuming slightly more CPU time with adaptive]

zekronium avatar Jul 31 '24 04:07 zekronium

To better understand @zekronium :

  1. what type of benchmark did you run?
  2. what was the throughput/rps for the 2 tests (without adaptive/with adaptive)?
  3. what profiler did you use to get the numbers in https://github.com/eclipse-vertx/vert.x/issues/5168#issuecomment-2259633047? What's the meaning of those numbers/which resource do they measure?
  4. any figure for the RSS/actual direct memory used?

franz1981 avatar Jul 31 '24 07:07 franz1981

This was a simple “feeler” test I ran using our request client for two minutes. I simply attached the IntelliJ profiler on my M1 Max Mac.

It was a realistic load run, about 1k rps out, 1k connections. Rough memory usage was about 2 GB with unsafe.

I will come back with a proper benchmark on proper hardware, but I hope this info is enough for now.

zekronium avatar Jul 31 '24 07:07 zekronium

About your finding on the cost of buffer operations: it is expected, because the adaptive buffers piggyback on the usual Netty ones, but it's still way too much. I will run some micros on Netty itself to verify it and see if we can improve there.
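
Something like the following JMH sketch, assuming a Netty version that ships AdaptiveByteBufAllocator (sizes and modes are illustrative, not the actual benchmark):

```java
import java.util.concurrent.TimeUnit;

import io.netty.buffer.AdaptiveByteBufAllocator;
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Rough micro sketch: allocate and release a direct buffer through the pooled
// and the adaptive allocator, comparing the per-operation cost for a few sizes.
@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class AllocateReleaseBenchmark {

  @Param({"256", "8192", "65536"})
  int size;

  final PooledByteBufAllocator pooled = PooledByteBufAllocator.DEFAULT;
  final AdaptiveByteBufAllocator adaptive = new AdaptiveByteBufAllocator();

  @Benchmark
  public boolean pooledAllocateRelease() {
    ByteBuf buf = pooled.directBuffer(size);
    // Returning the release result prevents dead-code elimination.
    return buf.release();
  }

  @Benchmark
  public boolean adaptiveAllocateRelease() {
    ByteBuf buf = adaptive.directBuffer(size);
    return buf.release();
  }
}
```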

franz1981 avatar Aug 01 '24 06:08 franz1981