[FEA] Triple Buffering: Bind Async Resource Budget to Physical Memory Allocation
**Is your feature request related to a problem? Please describe.**
The current `ResourceBoundedExecutor` manages asynchronous scanning using a "virtual" budget (Triple Buffer MemManagement) that is only loosely coupled with the actual physical lifecycle of off-heap memory.
- **Disconnected Lifecycles**: We currently rely on complex `DecayReleaseResult` callbacks to manually synchronize the virtual budget release with physical buffer deallocation. This is error-prone and complicates the runner state machine.
- **Inefficient Budgeting**: The initial virtual budget allocated for `MultiThreadFileReader` is often exaggerated because it ignores metadata-based row filters and column pruning. Consequently, we often hold a large, unused budget until the very end of the runner's lifecycle, reducing concurrency.
**Describe the solution you'd like**

We propose binding the virtual budget directly to the physical `HostMemoryBuffer` allocation within the `MemoryBoundedAsyncRunner`.
- **Runner as the Allocator**: The `MemoryBoundedAsyncRunner` will implement `HostMemoryAllocator`. Instead of the runner requesting a budget and then separately allocating memory, the runner itself will serve as the source of truth. All allocations for the task will pass through the runner, allowing it to track `localPool` (budget) against `usedMem` (physical usage) atomically.
- **Precise Deallocation via Event Handlers**: We will leverage the cuDF `MemoryBuffer.eventHandler` mechanism to automate resource release.
  - Mechanism: When a physical `HostMemoryBuffer` is closed (its refCount reaches 0), the event handler triggers immediately.
  - Benefit: This allows us to release the virtual budget tracked by `ResourceBoundedExecutor` at the exact moment of physical deallocation.
  - Outcome: We can completely remove the `DecayReleaseResult` abstraction and explicit decay callbacks, as the memory management becomes self-sustaining.
- **Early Release of Over-Claimed Budget**: By centralizing allocation logic, we can optimize how we handle the discrepancy between estimated and actual memory usage.
  - Optimization: Once the `ParquetPartitionReader` determines the actual memory required (after applying filters and pruning), we can identify the "over-claimed" portion of the initial budget.
  - Action: We will introduce a `tryFree` mechanism to return this unused budget to the global `HostMemoryPool` immediately, rather than waiting for the task to finish. This frees up capacity for other runners much sooner.
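The binding described above can be sketched as follows. This is a minimal, hypothetical illustration: `HostBuffer`, the field names, and the `tryFree` signature are simplified stand-ins, not the actual spark-rapids or cuDF classes. The `Runnable` passed to the buffer plays the role of `MemoryBuffer.eventHandler`, firing at the moment of physical deallocation so the virtual budget is released without any explicit decay callback.

```java
// Hypothetical, simplified sketch of the proposed design; all names are
// illustrative stand-ins for the real spark-rapids / cuDF Java classes.
final class HostBuffer implements AutoCloseable {
    final long size;
    private final Runnable onClosed; // plays the role of MemoryBuffer.eventHandler
    private boolean closed = false;

    HostBuffer(long size, Runnable onClosed) {
        this.size = size;
        this.onClosed = onClosed;
    }

    @Override public void close() {
        if (!closed) {
            closed = true;
            onClosed.run(); // budget released at the exact moment of deallocation
        }
    }
}

final class MemoryBoundedAsyncRunner {
    private long localPool; // virtual budget claimed from the global pool
    private long usedMem;   // physically allocated bytes

    MemoryBoundedAsyncRunner(long initialBudget) { this.localPool = initialBudget; }

    // The runner is the allocator: budget and physical usage move together
    // under one lock, so the two counters can never drift apart.
    synchronized HostBuffer allocate(long size) {
        if (usedMem + size > localPool) {
            throw new IllegalStateException("insufficient budget; would borrowMemory here");
        }
        usedMem += size;
        return new HostBuffer(size, () -> release(size));
    }

    // Invoked by the buffer's event handler when its refCount hits 0; no
    // explicit DecayReleaseResult-style callback is needed.
    private synchronized void release(long size) { usedMem -= size; }

    // tryFree: once the reader knows its actual need, hand the over-claimed
    // slice of the budget back to the global pool immediately.
    synchronized long tryFree(long actualNeeded) {
        long target = Math.max(actualNeeded, usedMem);
        long freed = Math.max(0, localPool - target);
        localPool -= freed;
        return freed; // caller returns this amount to the global HostMemoryPool
    }

    synchronized long budget() { return localPool; }
    synchronized long used()   { return usedMem; }
}
```

For example, a runner that claimed 100 bytes but discovers after pruning that it only needs 40 can immediately return the spare 60 via `tryFree(40)`, while closing a buffer drops `usedMem` through the event handler with no further bookkeeping.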
**Additional context**

**Deadlock Prevention Strategy**

To support this stricter binding without causing deadlocks, we propose specific enhancements to `HostMemoryPool`:
- **Dynamic Borrowing for Underestimated Splits**: The initial memory budget is typically derived from the `PartitionedFile` split length provided by the Spark Driver. This length is often smaller than the actual host memory buffer required, because the reader must access offsets outside the split range (e.g., file footers or adjacent row group metadata).
  - Solution: We introduce `borrowMemory`. When `allocate` detects that the `localPool` is insufficient, it dynamically borrows the deficit from the global pool.
  - Priority: Borrowers are prioritized over new tasks to ensure active runners can complete their work.
- **Strategic Over-Commit**: If `numRunnerInFlight == 0`, the pool will grant resource requests even if they exceed the limit. This guarantees that at least one runner can always proceed, preventing circular wait conditions.
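The two pool enhancements can be sketched together. Again this is a hypothetical simplification: the field names, the `tryAcquire`/`release` signatures, and the borrow policy shown (a sole in-flight runner may borrow past the limit; a full implementation would queue borrowers ahead of new tasks rather than fail) are assumptions for illustration, not the actual `HostMemoryPool` API.

```java
// Hypothetical sketch of the proposed HostMemoryPool enhancements;
// names and signatures are illustrative, not the real spark-rapids API.
final class HostMemoryPool {
    private final long limit;
    private long allocated;        // bytes currently granted
    private int numRunnerInFlight; // runners currently holding a budget

    HostMemoryPool(long limit) { this.limit = limit; }

    // Initial claim for a new runner. Strategic over-commit: if nothing is
    // in flight, grant the request even past the limit, so at least one
    // runner can always proceed (no circular wait).
    synchronized boolean tryAcquire(long bytes) {
        if (numRunnerInFlight == 0 || allocated + bytes <= limit) {
            allocated += bytes;
            numRunnerInFlight++;
            return true;
        }
        return false;
    }

    // borrowMemory: an active runner whose localPool ran short (e.g. footer
    // reads outside the split range) borrows the deficit. A sole in-flight
    // runner may borrow past the limit to guarantee forward progress; a real
    // implementation would block borrowers in a queue prioritized over new
    // tasks instead of returning false.
    synchronized boolean borrowMemory(long deficit) {
        if (allocated + deficit <= limit || numRunnerInFlight == 1) {
            allocated += deficit;
            return true;
        }
        return false;
    }

    synchronized void release(long bytes, boolean runnerDone) {
        allocated -= bytes;
        if (runnerDone) numRunnerInFlight--;
    }

    synchronized long allocatedBytes() { return allocated; }
}
```

Note how the two rules compose: an over-committed sole runner can still borrow more, while new tasks are rejected until it releases, which is exactly the priority ordering the proposal asks for.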