KAFKA-19967: Reduce GC pressure in tiered storage read path using direct memory buffers
This change addresses high GC pressure by allocating tiered storage fetch buffers in direct (off-heap) memory instead of the JVM heap. When direct memory is exhausted, the system gracefully falls back to heap allocation with a warning.
Problem: During tiered storage reads, the large heap-allocated fetch buffers bypass the young generation and are allocated directly in the old generation (humongous allocations). Under high read load, these accumulate rapidly and trigger frequent, expensive G1 old-generation collections, causing significant GC pause times.
Solution:
- Introduced DirectBufferPool that pools direct buffers using WeakReferences, allowing GC to reclaim buffers under memory pressure
- Modified RemoteLogInputStream to use pooled direct buffers instead of per-request heap allocation
- Graceful fallback to heap allocation when direct memory is exhausted (a rough sketch of the pooling and fallback logic follows this list)
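For reviewers skimming the description, here is a minimal sketch of the pooling-plus-fallback idea, assuming a WeakReference-backed free list; the class and method names (DirectBufferPool, acquire, release) and the warning output are illustrative and not the actual code in this PR:

```java
import java.lang.ref.WeakReference;
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of the approach described above; not the PR's actual class.
public class DirectBufferPool {
    // Buffers are held via WeakReferences so the GC can reclaim them (and their
    // native memory) under memory pressure instead of the pool pinning them forever.
    private final ConcurrentLinkedQueue<WeakReference<ByteBuffer>> free = new ConcurrentLinkedQueue<>();

    public ByteBuffer acquire(int capacity) {
        WeakReference<ByteBuffer> ref;
        while ((ref = free.poll()) != null) {
            ByteBuffer buffer = ref.get();
            // Reclaimed or undersized buffers are simply dropped from the pool.
            if (buffer != null && buffer.capacity() >= capacity) {
                buffer.clear();
                return buffer;
            }
        }
        try {
            return ByteBuffer.allocateDirect(capacity);
        } catch (OutOfMemoryError e) {
            // Direct memory exhausted: fall back to a heap buffer (stand-in for a log warning).
            System.err.println("Direct memory exhausted, falling back to heap allocation");
            return ByteBuffer.allocate(capacity);
        }
    }

    public void release(ByteBuffer buffer) {
        // Only direct buffers are pooled; heap buffers from the fallback path are
        // left to the normal GC.
        if (buffer.isDirect()) {
            free.offer(new WeakReference<>(buffer));
        }
    }
}
```

The key property of this shape is that the pool never prevents reclamation: if the GC clears a WeakReference, the direct buffer's native memory is released by its cleaner as usual.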
Thanks for the PR. I can see new metrics have been introduced in this PR; these fall under Monitoring and therefore require a KIP (https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals).
Hi @showuon, thanks for your review! Pooling heap buffers would reduce allocation frequency, but it doesn't eliminate GC pressure. In G1GC, objects larger than half a region size (e.g. anything over 16MB with 32MB regions) are "humongous" and skip the young generation entirely; they go straight to the old gen. Even with pooling, these buffers:
- Get scanned during every GC cycle (even if reused)
- Contribute to heap occupancy that triggers GC
- Can only be collected in expensive mixed/full GCs
In tiered storage, maxBytes can reach 55MB+ based on replica.fetch.max.bytes and replica.fetch.response.max.bytes. With a 4GB heap and IHOP=35% (an old-gen occupancy threshold of roughly 1.4GB), just ~25 concurrent fetches (25 × 55MB ≈ 1.375GB) are enough to trigger an old-gen GC.
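As a back-of-the-envelope check on the numbers above (a sketch only; the region size, heap size, and IHOP value are the ones quoted in this thread, and the class name is made up):

```java
public class GcPressureMath {
    public static void main(String[] args) {
        long regionSize = 32L * 1024 * 1024;               // e.g. -XX:G1HeapRegionSize=32m
        long fetchBuffer = 55L * 1024 * 1024;              // ~55MB tiered storage fetch buffer
        boolean humongous = fetchBuffer >= regionSize / 2; // G1 treats objects of at least half a region as humongous
        System.out.println("humongous allocation: " + humongous); // true

        long heap = 4L * 1024 * 1024 * 1024;               // 4GB heap
        long ihop = (long) (heap * 0.35);                  // -XX:InitiatingHeapOccupancyPercent=35 -> ~1.4GB
        System.out.println("concurrent fetches to reach IHOP: " + ihop / fetchBuffer); // 26, matching the ~25 estimate above
    }
}
```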
Direct buffers move the data entirely off-heap into native memory, where the GC doesn't see it. We also avoid the extra heap-to-native copy on socket writes, since the data is already sitting in native memory.
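To illustrate the socket-write point, here is a hypothetical, self-contained example of streaming a file through a direct buffer into a socket. The file path, host, and port are placeholders and none of this is code from the PR; the relevant detail is that with a heap buffer the JDK would first copy the bytes into a temporary direct buffer before the native write, whereas a direct buffer is handed to the kernel as-is:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectBufferTransfer {
    public static void main(String[] args) throws IOException {
        // 1MB direct buffer; in the PR this would come from the buffer pool instead.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        try (FileChannel segment = FileChannel.open(Path.of("remote-segment.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
            while (segment.read(buffer) > 0) {
                buffer.flip();
                while (buffer.hasRemaining()) {
                    socket.write(buffer); // writes straight from native memory, no heap-to-direct copy
                }
                buffer.clear();
            }
        }
    }
}
```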
To share some test results comparing the two approaches:

| | Heap buffers | Direct buffers |
| --- | --- | --- |
| GC frequency | every ~100ms | every 30-40s |
| Heap used after GC | 1.1-1.3GB | 325MB |
| Humongous regions | 546-689 | ~270 (~50% reduction) |
I will remove the metrics changes from this PR so we can focus on the buffer pool implementation.
I've updated the PR to remove the metrics changes and focus on fixing the issue. Could you please take another look and review? Thank you!