Add pooled ByteBuffer allocator with size classes
This is the first step towards a pluggable pooled ByteBuffer allocator. The patch adds PooledByteBufferAllocator (power-of-two size buckets, global pool + per-thread caches) and switches the HTTP/2 FrameFactory and the benchmark to use ByteBufferAllocator. Behaviour is unchanged except for using pooled buffers for small control frames. If this direction looks reasonable, I’ll follow up by threading the allocator through IOSession/SSLIOSession and the async codecs and add minimal metrics; otherwise I’ll keep it local to HTTP/2.
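For reviewers who have not opened the patch yet, the rough shape of the proposal looks like the sketch below (names taken from the description above; the actual signatures, size-class bounds and the per-thread cache layer in the patch differ in detail):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Pluggable allocation strategy; the simple implementation just delegates
// to ByteBuffer.allocate and release is a no-op.
interface ByteBufferAllocator {
    ByteBuffer allocate(int capacity);
    void release(ByteBuffer buffer);
}

// Sketch of a pooled allocator with power-of-two size classes and a shared
// free list per class. The real patch also layers per-thread caches on top.
final class PooledByteBufferAllocator implements ByteBufferAllocator {

    private static final int MIN_CLASS = 256;        // smallest pooled size
    private static final int MAX_CLASS = 64 * 1024;  // larger requests are not pooled

    // one free list per power-of-two size class
    private final ConcurrentLinkedQueue<ByteBuffer>[] classes;

    @SuppressWarnings("unchecked")
    PooledByteBufferAllocator() {
        final int n = Integer.numberOfTrailingZeros(MAX_CLASS / MIN_CLASS) + 1;
        classes = new ConcurrentLinkedQueue[n];
        for (int i = 0; i < n; i++) {
            classes[i] = new ConcurrentLinkedQueue<>();
        }
    }

    // index of the smallest size class that can hold 'capacity' bytes
    private static int classIndex(final int capacity) {
        final int normalized = Math.max(capacity, MIN_CLASS);
        final int rounded = Integer.highestOneBit(normalized - 1) << 1; // next power of two
        return Integer.numberOfTrailingZeros(rounded / MIN_CLASS);
    }

    @Override
    public ByteBuffer allocate(final int capacity) {
        if (capacity > MAX_CLASS) {
            return ByteBuffer.allocate(capacity); // too big to pool
        }
        final int idx = classIndex(capacity);
        final ByteBuffer pooled = classes[idx].poll();
        if (pooled != null) {
            pooled.clear();
            return pooled;
        }
        return ByteBuffer.allocate(MIN_CLASS << idx);
    }

    @Override
    public void release(final ByteBuffer buffer) {
        final int cap = buffer.capacity();
        if (cap >= MIN_CLASS && cap <= MAX_CLASS && Integer.bitCount(cap) == 1) {
            classes[classIndex(cap)].offer(buffer);
        }
        // non-pooled buffers are simply dropped and left to the GC
    }
}
```

Requests are rounded up to the next size class, so callers must not assume the returned buffer has exactly the requested capacity.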
ByteBuffer allocator throughput (JMH)
| Benchmark | bufferSize | iterations | Mode | Cnt | Score (ops/ms) | Error (ops/ms) |
|---|---|---|---|---|---|---|
| pooled_allocator_shared | 1024 | 100 | thrpt | 10 | 1644.982 | 14.006 |
| pooled_allocator_shared | 8192 | 100 | thrpt | 10 | 533.638 | 34.307 |
| pooled_allocator_shared | 65536 | 100 | thrpt | 10 | 59.422 | 0.937 |
| pooled_allocator_thread_local | 1024 | 100 | thrpt | 10 | 539.612 | 5.518 |
| pooled_allocator_thread_local | 8192 | 100 | thrpt | 10 | 201.345 | 4.451 |
| pooled_allocator_thread_local | 65536 | 100 | thrpt | 10 | 19.603 | 0.501 |
| simple_allocator_shared | 1024 | 100 | thrpt | 10 | 172.750 | 4.893 |
| simple_allocator_shared | 8192 | 100 | thrpt | 10 | 23.083 | 0.199 |
| simple_allocator_shared | 65536 | 100 | thrpt | 10 | 2.883 | 0.037 |
| simple_allocator_thread_local | 1024 | 100 | thrpt | 10 | 129.873 | 1.075 |
| simple_allocator_thread_local | 8192 | 100 | thrpt | 10 | 21.075 | 0.088 |
| simple_allocator_thread_local | 65536 | 100 | thrpt | 10 | 2.401 | 0.062 |
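The benchmark is essentially an allocate/touch/release loop over the allocator under test. A rough sketch of the pooled variants, using the interface sketched above (the simple_allocator variants run the same loop against a plain ByteBuffer.allocate-backed allocator, and the actual benchmark class is organised a bit differently):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class ByteBufferAllocatorBenchmark {

    @Param({"1024", "8192", "65536"})
    int bufferSize;

    @Param({"100"})
    int iterations;

    // single allocator instance shared by all benchmark threads
    final ByteBufferAllocator pooled = new PooledByteBufferAllocator();

    // one allocator per thread, to isolate contention effects
    final ThreadLocal<ByteBufferAllocator> pooledLocal =
            ThreadLocal.withInitial(PooledByteBufferAllocator::new);

    @Benchmark
    public long pooled_allocator_shared() {
        return allocateAndRelease(pooled);
    }

    @Benchmark
    public long pooled_allocator_thread_local() {
        return allocateAndRelease(pooledLocal.get());
    }

    private long allocateAndRelease(final ByteBufferAllocator allocator) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            final ByteBuffer buf = allocator.allocate(bufferSize);
            buf.put(0, (byte) i);          // touch the buffer so it is not optimized away
            sum += buf.capacity();
            allocator.release(buf);
        }
        return sum;
    }
}
```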
@ok2c WDYT?
@arturobernalg I may be wrong, but I was under the impression that many (if not all) Java frameworks arrived at the same conclusion: memory pooling became counterproductive as of Java 8, given the efficiency of modern garbage collection algorithms. I will run the micro-benchmark locally and look at the results, but it may take me a while.
Generally I see no problem with providing pluggable allocators as long as the simple one remains the default and you are willing to maintain the more complex ones.
@rschmitt do you happen to have an opinion on this matter?
@ok2c I'm going to ask one or two more qualified people for an opinion and get back to you. My understanding is that object pooling can outperform garbage collection, but it's harder to do than you'd think. (There's also the question of what "outperform" means. What are we measuring, tail latencies? CPU overhead? Heap footprint?) Pooled buffers also come with a lot of risks, like increased potential for memory leaks, or security vulnerabilities such as buffer over-reads.
The Javadoc says that the PooledByteBufferAllocator is inspired by Netty's pooled buffer allocator, but which one? In Netty 4.2, they changed the default allocator from the pooled allocator to the AdaptiveByteBufAllocator. What does that mean, exactly? ¯\_(ツ)_/¯ Evidently it may have something to do with virtual threads.
I guess the main concern I have here is the effectiveness of adding buffer pooling retroactively, compared with the cost in code churn. Typically what I see is frameworks or applications that are designed from the ground up to be garbage-free or zero-copy or what have you. I think this proposal would be more persuasive if I knew what we were measuring and what our performance target is, and what the hotspots currently are for ephemeral garbage. Can they be addressed with a minimum of API churn? (I find it's very difficult to thread new parameters deep into HttpComponents; if we implemented pooling, I'd prefer to make it a purely internal optimization, and an implementation detail. We should be more hesitant to increase our API surface area.)
Finally, I think it's a little late in the development cycle for httpcore 5.4 to be considering such a change. Any usage of pooling in the HTTP/2 or TLS or IOReactor implementation should probably be gated behind a system property and considered experimental.
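Concretely, I am thinking of something along the lines of the sketch below; the property name is made up here for illustration and is not something the patch defines:

```java
import java.nio.ByteBuffer;

// Hedged sketch of an opt-in gate for experimental pooling, reusing the
// allocator interface sketched earlier in this thread. The property name
// "org.apache.hc.core5.pooledAllocator" is hypothetical.
final class Allocators {

    private static final boolean POOLING_ENABLED =
            Boolean.getBoolean("org.apache.hc.core5.pooledAllocator");

    // Default stays the plain allocator; pooling is experimental and opt-in.
    static ByteBufferAllocator defaultAllocator() {
        if (POOLING_ENABLED) {
            return new PooledByteBufferAllocator();
        }
        return new ByteBufferAllocator() {
            @Override
            public ByteBuffer allocate(final int capacity) {
                return ByteBuffer.allocate(capacity);
            }
            @Override
            public void release(final ByteBuffer buffer) {
                // nothing to do; the GC reclaims the buffer
            }
        };
    }
}
```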
@olegk @rschmitt I’ve added a small JMH benchmark that exercises the old and new pool under mixed routes with slow discard / expiry. On my machine the segmented pool removes the cross-route stall and slightly improves tail latency while keeping throughput comparable. Happy to adjust the scenario or parameters if you’d like to capture other access patterns.
To clarify the Netty reference: the allocator is conceptually closest to Netty 4.1's PooledByteBufAllocator (size-class buckets with per-thread caches), not the AdaptiveByteBufAllocator that became the default in Netty 4.2. Updated numbers for heap and direct buffers are below; the second table shows the GC profiler output.
| Allocator | Buffer kind | Buffer size (B) | Throughput (ops/ms) | Error (ops/ms) |
|---|---|---|---|---|
| pooled_allocator_shared | HEAP | 1024 | 517.697 | ±7.829 |
| pooled_allocator_shared | DIRECT | 1024 | 527.269 | ±20.476 |
| pooled_allocator_shared | HEAP | 8192 | 194.948 | ±1.124 |
| pooled_allocator_shared | DIRECT | 8192 | 222.407 | ±2.573 |
| pooled_allocator_shared | HEAP | 65536 | 19.387 | ±0.297 |
| pooled_allocator_shared | DIRECT | 65536 | 18.704 | ±1.621 |
| pooled_allocator_thread_local | HEAP | 1024 | 519.383 | ±9.957 |
| pooled_allocator_thread_local | DIRECT | 1024 | 544.220 | ±11.254 |
| pooled_allocator_thread_local | HEAP | 8192 | 205.072 | ±2.435 |
| pooled_allocator_thread_local | DIRECT | 8192 | 222.178 | ±7.711 |
| pooled_allocator_thread_local | HEAP | 65536 | 18.960 | ±0.172 |
| pooled_allocator_thread_local | DIRECT | 65536 | 18.286 | ±1.217 |
| simple_allocator_shared | HEAP | 1024 | 150.141 | ±6.162 |
| simple_allocator_shared | DIRECT | 1024 | 8.553 | ±5.767 |
| simple_allocator_shared | HEAP | 8192 | 24.545 | ±0.880 |
| simple_allocator_shared | DIRECT | 8192 | 5.835 | ±2.174 |
| simple_allocator_shared | HEAP | 65536 | 2.767 | ±0.162 |
| simple_allocator_shared | DIRECT | 65536 | 2.351 | ±0.244 |
| simple_allocator_thread_local | HEAP | 1024 | 149.243 | ±5.933 |
| simple_allocator_thread_local | DIRECT | 1024 | 8.373 | ±5.096 |
| simple_allocator_thread_local | HEAP | 8192 | 25.226 | ±1.756 |
| simple_allocator_thread_local | DIRECT | 8192 | 5.665 | ±2.431 |
| simple_allocator_thread_local | HEAP | 65536 | 2.700 | ±0.248 |
| simple_allocator_thread_local | DIRECT | 65536 | 2.274 | ±0.182 |
| Allocator | Buffer kind | Buffer size (B) | gc.alloc.rate.norm (B/op) | gc.count | gc.time (ms) |
|---|---|---|---|---|---|
| pooled_allocator_shared | HEAP | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_shared | DIRECT | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_shared | HEAP | 8192 | 0.035 | ≈0 | - |
| pooled_allocator_shared | DIRECT | 8192 | 0.031 | ≈0 | - |
| pooled_allocator_shared | HEAP | 65536 | 0.356 | ≈0 | - |
| pooled_allocator_shared | DIRECT | 65536 | 0.370 | ≈0 | - |
| pooled_allocator_thread_local | HEAP | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_thread_local | DIRECT | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_thread_local | HEAP | 8192 | 0.034 | ≈0 | - |
| pooled_allocator_thread_local | DIRECT | 8192 | 0.031 | ≈0 | - |
| pooled_allocator_thread_local | HEAP | 65536 | 0.364 | ≈0 | - |
| pooled_allocator_thread_local | DIRECT | 65536 | 0.378 | ≈0 | - |
| simple_allocator_shared | HEAP | 1024 | 104000.046 | 100.000 | 94.000 |
| simple_allocator_shared | DIRECT | 1024 | 13600.926 | 8.000 | 4147.000 |
| simple_allocator_shared | HEAP | 8192 | 820800.283 | 89.000 | 84.000 |
| simple_allocator_shared | DIRECT | 8192 | 13601.245 | 20.000 | 2020.000 |
| simple_allocator_shared | HEAP | 65536 | 6555202.508 | 81.000 | 77.000 |
| simple_allocator_shared | DIRECT | 65536 | 13602.957 | 29.000 | 252.000 |
| simple_allocator_thread_local | HEAP | 1024 | 104000.046 | 94.000 | 90.000 |
| simple_allocator_thread_local | DIRECT | 1024 | 13600.875 | 8.000 | 3920.000 |
| simple_allocator_thread_local | HEAP | 8192 | 820800.276 | 86.000 | 91.000 |
| simple_allocator_thread_local | DIRECT | 8192 | 13601.321 | 19.000 | 1827.000 |
| simple_allocator_thread_local | HEAP | 65536 | 6555202.575 | 86.000 | 81.000 |
| simple_allocator_thread_local | DIRECT | 65536 | 13603.057 | 27.000 | 272.000 |
I asked Aleksey Shipilëv for his thoughts:
> Depends. In a pure allocation benchmark, allocation would likely be on par with reuse. But once you get far from that ideal, awkward things start to happen.
>
> 1. When there is any non-trivial live set in the heap, GC would have to at least visit it every so often; that "so often" is driven by GC frequency, which is driven by allocation rate. Pure allocation speed and pure reclamation cost become much less relevant in this scenario -- what else is happening dominates hard. Generational GCs win you some, but they really only prolong the inevitable.
> 2. When objects are allocated, they are nominally zeroed. Under high allocation rate, that is easily the slowest part, think ~10 GB/sec per thread. Reuse often comes with avoiding these cleanups, often at the cost of a weaker security posture (leaking data between reused buffers).
> 3. For smaller objects, the metadata management (headers, all that fluff) dominates the allocation path performance, and is often logically intermixed with the real work. E.g. you rarely allocate 10M objects just because; there is likely some compute in between. But allocating `new byte[BUF_SIZE]` (with `BUF_SIZE = 1M` defined in another file) is very easy. So hitting (1) and (2) is much easier the larger the objects in question get.
> 4. For smaller objects, the pooling overheads become on par with the size of the objects themselves. The calculation for total memory footprint can push the scale in either direction.
> 5. For some awkward classes like DirectByteBuffers that have a separate cleanup schedule, unbounded allocation is a recipe for a meltdown.
>
> So the answer is somewhat along the lines of: Pooling common (small) objects? Nah, too much hassle for too little gain. Pooling large buffers? Yes, that is a common perf optimization. Pooling large buffers with special lifecycle? YES, do not even think about not doing the pooling. For everything in between the answer is somewhere in between.
Here, "special lifecycle" refers to things like finalizers, Cleaners, weak references, etc.; nothing that would apply to a simple byte buffer.
Another interesting point that came up is that if you use heap (non-direct) byte buffers, and if the pool doesn't hold on to byte buffer references while they are leased out, then there is no risk of a memory leak: returning the buffer to the pool is purely an optimization. Since HEAP and DIRECT have near-identical performance, maybe we should just hardcode a pooled heap buffer allocator into key hotspots.
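A minimal sketch of that idea with a single, fixed buffer size (the sizes and capacity here are illustrative only): the pool only ever references free buffers, so a caller that forgets to call release just hands the buffer back to the GC.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

// The pool holds only buffers that are currently free, never references to
// leased buffers, so returning a buffer is purely an optimization and an
// unreturned buffer is reclaimed by the GC like any other object.
final class LeakFreeHeapBufferPool {

    private final ArrayBlockingQueue<ByteBuffer> free;
    private final int bufferSize;

    LeakFreeHeapBufferPool(final int maxPooled, final int bufferSize) {
        this.free = new ArrayBlockingQueue<>(maxPooled);
        this.bufferSize = bufferSize;
    }

    ByteBuffer acquire() {
        final ByteBuffer pooled = free.poll();
        if (pooled != null) {
            pooled.clear();        // reset position/limit; contents are NOT zeroed
            return pooled;
        }
        return ByteBuffer.allocate(bufferSize);   // fresh heap buffer, zeroed by the JVM
    }

    void release(final ByteBuffer buffer) {
        if (buffer.capacity() == bufferSize) {
            free.offer(buffer);    // silently dropped if the pool is already full
        }
    }
}
```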
@rschmitt Thank you so much for such an informative summary. Please convey my gratitude to Aleksey.
One thing that bugs me is: how big is big? How big do byte buffers need to be to justify pooling? If it is a couple of MB, then memory pooling may be useful in our case.
Here, "special lifecycle" refers to things like finalizers, Cleaners, weak references, etc.; nothing that would apply to a simple byte buffer.
I think we have objects with "special lifecycle" in the HttpClient Caching module only but they are backed by files and not byte buffers. There is nothing else I can think of.
However, I imagine the classic-on-async facade may actually qualify as a potential beneficiary of the pooled memory allocator, so I am leaning towards approving this change-set and letting @arturobernalg proceed with further experiments.
What do you think?