Add pooled ByteBuffer allocator with size classes
This is the first step towards a pluggable pooled ByteBuffer allocator. The patch adds PooledByteBufferAllocator (power-of-two size buckets, global pool + per-thread caches) and switches the HTTP/2 FrameFactory and the benchmark to use ByteBufferAllocator. Behaviour is unchanged except for using pooled buffers for small control frames. If this direction looks reasonable, I’ll follow up by threading the allocator through IOSession/SSLIOSession and the async codecs and add minimal metrics; otherwise I’ll keep it local to HTTP/2.
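For reviewers who have not opened the patch yet, the rough shape of the proposal looks like the sketch below (names taken from the description above; the actual signatures, size-class bounds and the per-thread cache layer in the patch differ in detail):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Pluggable allocation strategy; the simple implementation just delegates
// to ByteBuffer.allocate and release is a no-op.
interface ByteBufferAllocator {
    ByteBuffer allocate(int capacity);
    void release(ByteBuffer buffer);
}

// Sketch of a pooled allocator with power-of-two size classes and a shared
// free list per class. The real patch also layers per-thread caches on top.
final class PooledByteBufferAllocator implements ByteBufferAllocator {

    private static final int MIN_CLASS = 256;        // smallest pooled size
    private static final int MAX_CLASS = 64 * 1024;  // larger requests are not pooled

    // one free list per power-of-two size class
    private final ConcurrentLinkedQueue<ByteBuffer>[] classes;

    @SuppressWarnings("unchecked")
    PooledByteBufferAllocator() {
        final int n = Integer.numberOfTrailingZeros(MAX_CLASS / MIN_CLASS) + 1;
        classes = new ConcurrentLinkedQueue[n];
        for (int i = 0; i < n; i++) {
            classes[i] = new ConcurrentLinkedQueue<>();
        }
    }

    // index of the smallest size class that can hold 'capacity' bytes
    private static int classIndex(final int capacity) {
        final int normalized = Math.max(capacity, MIN_CLASS);
        final int rounded = Integer.highestOneBit(normalized - 1) << 1; // next power of two
        return Integer.numberOfTrailingZeros(rounded / MIN_CLASS);
    }

    @Override
    public ByteBuffer allocate(final int capacity) {
        if (capacity > MAX_CLASS) {
            return ByteBuffer.allocate(capacity); // too big to pool
        }
        final int idx = classIndex(capacity);
        final ByteBuffer pooled = classes[idx].poll();
        if (pooled != null) {
            pooled.clear();
            return pooled;
        }
        return ByteBuffer.allocate(MIN_CLASS << idx);
    }

    @Override
    public void release(final ByteBuffer buffer) {
        final int cap = buffer.capacity();
        if (cap >= MIN_CLASS && cap <= MAX_CLASS && Integer.bitCount(cap) == 1) {
            classes[classIndex(cap)].offer(buffer);
        }
        // non-pooled buffers are simply dropped and left to the GC
    }
}
```

Requests are rounded up to the next size class, so callers must not assume the returned buffer has exactly the requested capacity.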
ByteBuffer allocator throughput (JMH)
| Benchmark | bufferSize | iterations | Mode | Cnt | Score (ops/ms) | Error (ops/ms) |
|---|---|---|---|---|---|---|
| pooled_allocator_shared | 1024 | 100 | thrpt | 10 | 1644.982 | 14.006 |
| pooled_allocator_shared | 8192 | 100 | thrpt | 10 | 533.638 | 34.307 |
| pooled_allocator_shared | 65536 | 100 | thrpt | 10 | 59.422 | 0.937 |
| pooled_allocator_thread_local | 1024 | 100 | thrpt | 10 | 539.612 | 5.518 |
| pooled_allocator_thread_local | 8192 | 100 | thrpt | 10 | 201.345 | 4.451 |
| pooled_allocator_thread_local | 65536 | 100 | thrpt | 10 | 19.603 | 0.501 |
| simple_allocator_shared | 1024 | 100 | thrpt | 10 | 172.750 | 4.893 |
| simple_allocator_shared | 8192 | 100 | thrpt | 10 | 23.083 | 0.199 |
| simple_allocator_shared | 65536 | 100 | thrpt | 10 | 2.883 | 0.037 |
| simple_allocator_thread_local | 1024 | 100 | thrpt | 10 | 129.873 | 1.075 |
| simple_allocator_thread_local | 8192 | 100 | thrpt | 10 | 21.075 | 0.088 |
| simple_allocator_thread_local | 65536 | 100 | thrpt | 10 | 2.401 | 0.062 |
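The benchmark is essentially an allocate/touch/release loop over the allocator under test. A rough sketch of the pooled variants, using the interface sketched above (the simple_allocator variants run the same loop against a plain ByteBuffer.allocate-backed allocator, and the actual benchmark class is organised a bit differently):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class ByteBufferAllocatorBenchmark {

    @Param({"1024", "8192", "65536"})
    int bufferSize;

    @Param({"100"})
    int iterations;

    // single allocator instance shared by all benchmark threads
    final ByteBufferAllocator pooled = new PooledByteBufferAllocator();

    // one allocator per thread, to isolate contention effects
    final ThreadLocal<ByteBufferAllocator> pooledLocal =
            ThreadLocal.withInitial(PooledByteBufferAllocator::new);

    @Benchmark
    public long pooled_allocator_shared() {
        return allocateAndRelease(pooled);
    }

    @Benchmark
    public long pooled_allocator_thread_local() {
        return allocateAndRelease(pooledLocal.get());
    }

    private long allocateAndRelease(final ByteBufferAllocator allocator) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            final ByteBuffer buf = allocator.allocate(bufferSize);
            buf.put(0, (byte) i);          // touch the buffer so it is not optimized away
            sum += buf.capacity();
            allocator.release(buf);
        }
        return sum;
    }
}
```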
@ok2c WDYT?
@arturobernalg I may be wrong, but I was under the impression that many (if not all) Java frameworks arrived at the same conclusion: memory pooling became counterproductive as of Java 8, given the efficiency of modern garbage collection algorithms. I will run the micro-benchmark locally and look at the results, but it may take me a while.
Generally I see no problem with providing pluggable allocators as long as the simple one remains the default and you are willing to maintain the more complex ones.
@rschmitt do you happen to have an opinion on this matter?
@ok2c I'm going to ask one or two more qualified people for an opinion and get back to you. My understanding is that object pooling can outperform garbage collection, but it's harder to do than you'd think. (There's also the question of what "outperform" means. What are we measuring, tail latencies? CPU overhead? Heap footprint?) Pooled buffers also come with a lot of risks, like increased potential for memory leaks, or security vulnerabilities such as buffer over-reads.
The Javadoc says that the PooledByteBufferAllocator is inspired by Netty's pooled buffer allocator, but which one? In Netty 4.2, they changed the default allocator from the pooled allocator to the AdaptiveByteBufAllocator. What does that mean, exactly? ¯\_(ツ)_/¯ Evidently it may have something to do with virtual threads.
I guess the main concern I have here is the effectiveness of adding buffer pooling retroactively, compared with the cost in code churn. Typically what I see is frameworks or applications that are designed from the ground up to be garbage-free or zero-copy or what have you. I think this proposal would be more persuasive if I knew what we were measuring and what our performance target is, and what the hotspots currently are for ephemeral garbage. Can they be addressed with a minimum of API churn? (I find it's very difficult to thread new parameters deep into HttpComponents; if we implemented pooling, I'd prefer to make it a purely internal optimization, and an implementation detail. We should be more hesitant to increase our API surface area.)
Finally, I think it's a little late in the development cycle for httpcore 5.4 to be considering such a change. Any usage of pooling in the HTTP/2 or TLS or IOReactor implementation should probably be gated behind a system property and considered experimental.
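Concretely, I am thinking of something along the lines of the sketch below; the property name is made up here for illustration and is not something the patch defines:

```java
import java.nio.ByteBuffer;

// Hedged sketch of an opt-in gate for experimental pooling, reusing the
// allocator interface sketched earlier in this thread. The property name
// "org.apache.hc.core5.pooledAllocator" is hypothetical.
final class Allocators {

    private static final boolean POOLING_ENABLED =
            Boolean.getBoolean("org.apache.hc.core5.pooledAllocator");

    // Default stays the plain allocator; pooling is experimental and opt-in.
    static ByteBufferAllocator defaultAllocator() {
        if (POOLING_ENABLED) {
            return new PooledByteBufferAllocator();
        }
        return new ByteBufferAllocator() {
            @Override
            public ByteBuffer allocate(final int capacity) {
                return ByteBuffer.allocate(capacity);
            }
            @Override
            public void release(final ByteBuffer buffer) {
                // nothing to do; the GC reclaims the buffer
            }
        };
    }
}
```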
@olegk @rschmitt I’ve added a small JMH benchmark that exercises the old and new pool under mixed routes with slow discard / expiry. On my machine the segmented pool removes the cross-route stall and slightly improves tail latency while keeping throughput comparable. Happy to adjust the scenario or parameters if you’d like to capture other access patterns.
To clarify the Netty reference: the allocator is conceptually closest to Netty 4.1's PooledByteBufAllocator (size-class buckets with per-thread caches), not the AdaptiveByteBufAllocator that became the default in Netty 4.2. Updated numbers for heap and direct buffers are below; the second table shows the GC profiler output.
| Allocator | Buffer kind | Buffer size (B) | Throughput (ops/ms) | Error (ops/ms) |
|---|---|---|---|---|
| pooled_allocator_shared | HEAP | 1024 | 517.697 | ±7.829 |
| pooled_allocator_shared | DIRECT | 1024 | 527.269 | ±20.476 |
| pooled_allocator_shared | HEAP | 8192 | 194.948 | ±1.124 |
| pooled_allocator_shared | DIRECT | 8192 | 222.407 | ±2.573 |
| pooled_allocator_shared | HEAP | 65536 | 19.387 | ±0.297 |
| pooled_allocator_shared | DIRECT | 65536 | 18.704 | ±1.621 |
| pooled_allocator_thread_local | HEAP | 1024 | 519.383 | ±9.957 |
| pooled_allocator_thread_local | DIRECT | 1024 | 544.220 | ±11.254 |
| pooled_allocator_thread_local | HEAP | 8192 | 205.072 | ±2.435 |
| pooled_allocator_thread_local | DIRECT | 8192 | 222.178 | ±7.711 |
| pooled_allocator_thread_local | HEAP | 65536 | 18.960 | ±0.172 |
| pooled_allocator_thread_local | DIRECT | 65536 | 18.286 | ±1.217 |
| simple_allocator_shared | HEAP | 1024 | 150.141 | ±6.162 |
| simple_allocator_shared | DIRECT | 1024 | 8.553 | ±5.767 |
| simple_allocator_shared | HEAP | 8192 | 24.545 | ±0.880 |
| simple_allocator_shared | DIRECT | 8192 | 5.835 | ±2.174 |
| simple_allocator_shared | HEAP | 65536 | 2.767 | ±0.162 |
| simple_allocator_shared | DIRECT | 65536 | 2.351 | ±0.244 |
| simple_allocator_thread_local | HEAP | 1024 | 149.243 | ±5.933 |
| simple_allocator_thread_local | DIRECT | 1024 | 8.373 | ±5.096 |
| simple_allocator_thread_local | HEAP | 8192 | 25.226 | ±1.756 |
| simple_allocator_thread_local | DIRECT | 8192 | 5.665 | ±2.431 |
| simple_allocator_thread_local | HEAP | 65536 | 2.700 | ±0.248 |
| simple_allocator_thread_local | DIRECT | 65536 | 2.274 | ±0.182 |
| Allocator | Buffer kind | Buffer size (B) | gc.alloc.rate.norm (B/op) | gc.count | gc.time (ms) |
|---|---|---|---|---|---|
| pooled_allocator_shared | HEAP | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_shared | DIRECT | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_shared | HEAP | 8192 | 0.035 | ≈0 | - |
| pooled_allocator_shared | DIRECT | 8192 | 0.031 | ≈0 | - |
| pooled_allocator_shared | HEAP | 65536 | 0.356 | ≈0 | - |
| pooled_allocator_shared | DIRECT | 65536 | 0.370 | ≈0 | - |
| pooled_allocator_thread_local | HEAP | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_thread_local | DIRECT | 1024 | 0.013 | ≈0 | - |
| pooled_allocator_thread_local | HEAP | 8192 | 0.034 | ≈0 | - |
| pooled_allocator_thread_local | DIRECT | 8192 | 0.031 | ≈0 | - |
| pooled_allocator_thread_local | HEAP | 65536 | 0.364 | ≈0 | - |
| pooled_allocator_thread_local | DIRECT | 65536 | 0.378 | ≈0 | - |
| simple_allocator_shared | HEAP | 1024 | 104000.046 | 100.000 | 94.000 |
| simple_allocator_shared | DIRECT | 1024 | 13600.926 | 8.000 | 4147.000 |
| simple_allocator_shared | HEAP | 8192 | 820800.283 | 89.000 | 84.000 |
| simple_allocator_shared | DIRECT | 8192 | 13601.245 | 20.000 | 2020.000 |
| simple_allocator_shared | HEAP | 65536 | 6555202.508 | 81.000 | 77.000 |
| simple_allocator_shared | DIRECT | 65536 | 13602.957 | 29.000 | 252.000 |
| simple_allocator_thread_local | HEAP | 1024 | 104000.046 | 94.000 | 90.000 |
| simple_allocator_thread_local | DIRECT | 1024 | 13600.875 | 8.000 | 3920.000 |
| simple_allocator_thread_local | HEAP | 8192 | 820800.276 | 86.000 | 91.000 |
| simple_allocator_thread_local | DIRECT | 8192 | 13601.321 | 19.000 | 1827.000 |
| simple_allocator_thread_local | HEAP | 65536 | 6555202.575 | 86.000 | 81.000 |
| simple_allocator_thread_local | DIRECT | 65536 | 13603.057 | 27.000 | 272.000 |
I asked Aleksey Shipilëv for his thoughts:
> Depends. In a pure allocation benchmark, allocation would likely be on par with reuse. But once you get far from that ideal, awkward things start to happen.
>
> 1. When there is any non-trivial live set in the heap, GC would have to at least visit it every so often; that "so often" is driven by GC frequency, which is driven by allocation rate. Pure allocation speed and pure reclamation cost become much less relevant in this scenario -- what else is happening dominates hard. Generational GCs win you some, but they really only prolong the inevitable.
> 2. When objects are allocated, they are nominally zeroed. Under high allocation rate, that is easily the slowest part, think ~10 GB/sec per thread. Reuse often comes with avoiding these cleanups, often at the cost of a weaker security posture (leaking data between reused buffers).
> 3. For smaller objects, the metadata management (headers, all that fluff) dominates the allocation path performance, and is often logically intermixed with the real work. E.g. you rarely allocate 10M objects just because; there is likely some compute in between. But allocating `new byte[BUF_SIZE]` (with `BUF_SIZE = 1M` defined in another file) is very easy. So hitting (1) and (2) is much easier the larger the objects in question get.
> 4. For smaller objects, the pooling overheads become on par with the size of the objects themselves. The calculation for total memory footprint can push the scale in either direction.
> 5. For some awkward classes like DirectByteBuffers that have a separate cleanup schedule, unbounded allocation is a recipe for a meltdown.
>
> So the answer is somewhat along the lines of: Pooling common (small) objects? Nah, too much hassle for too little gain. Pooling large buffers? Yes, that is a common perf optimization. Pooling large buffers with special lifecycle? YES, do not even think about not doing the pooling. For everything in between the answer is somewhere in between.
Here, "special lifecycle" refers to things like finalizers, Cleaners, weak references, etc.; nothing that would apply to a simple byte buffer.
Another interesting point that came up is that if you use heap (non-direct) byte buffers, and if the pool doesn't hold on to byte buffer references while they are leased out, then there is no risk of a memory leak: returning the buffer to the pool is purely an optimization. Since HEAP and DIRECT have near-identical performance, maybe we should just hardcode a pooled heap buffer allocator into key hotspots.
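A minimal sketch of that idea with a single, fixed buffer size (the sizes and capacity here are illustrative only): the pool only ever references free buffers, so a caller that forgets to call release just hands the buffer back to the GC.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;

// The pool holds only buffers that are currently free, never references to
// leased buffers, so returning a buffer is purely an optimization and an
// unreturned buffer is reclaimed by the GC like any other object.
final class LeakFreeHeapBufferPool {

    private final ArrayBlockingQueue<ByteBuffer> free;
    private final int bufferSize;

    LeakFreeHeapBufferPool(final int maxPooled, final int bufferSize) {
        this.free = new ArrayBlockingQueue<>(maxPooled);
        this.bufferSize = bufferSize;
    }

    ByteBuffer acquire() {
        final ByteBuffer pooled = free.poll();
        if (pooled != null) {
            pooled.clear();        // reset position/limit; contents are NOT zeroed
            return pooled;
        }
        return ByteBuffer.allocate(bufferSize);   // fresh heap buffer, zeroed by the JVM
    }

    void release(final ByteBuffer buffer) {
        if (buffer.capacity() == bufferSize) {
            free.offer(buffer);    // silently dropped if the pool is already full
        }
    }
}
```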
@rschmitt Thank you so much for such an informative summary. Please convey my gratitude to Aleksey.
One thing that bugs me is: how big is big? How big do byte buffers need to be to justify pooling? If it is a couple of MB, then memory pooling may be useful in our case.
Here, "special lifecycle" refers to things like finalizers, Cleaners, weak references, etc.; nothing that would apply to a simple byte buffer.
I think we have objects with "special lifecycle" in the HttpClient Caching module only but they are backed by files and not byte buffers. There is nothing else I can think of.
However, I imagine the classic-on-async facade may actually qualify as a potential beneficiary of the pooled memory allocator, so I am leaning towards approving this change-set and letting @arturobernalg proceed with further experiments.
What do you think?