Triple buffering: Bind Virtual Resource Budget to Physical Memory Allocation [databricks]
Closes #13969
Overview
This PR tightly couples the virtual memory budget to the lifecycle of the backing HostMemoryBuffer instances used by the runner, by making MemoryBoundedAsyncRunner serve as both the resource holder and the HostMemoryAllocator. This design eliminates the previous disconnect between budget accounting and actual memory usage, enabling more precise resource management and improved concurrency.
Design
1. LocalPool Initialization
When a MemoryBoundedAsyncRunner is scheduled for execution, it acquires a preliminary memory budget from the global HostMemoryPool (managed by ResourceBoundedExecutor). This budget becomes the runner's LocalPool—a private memory quota that the runner manages independently during its execution lifecycle.
The initial LocalPool size is typically derived from the PartitionedFile split length, representing an upper-bound estimate of the memory required to process the assigned data partition.
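As a rough illustration of this initialization step, here is a minimal sketch in Scala. The names (GlobalPool, acquireBudget, Runner) are hypothetical stand-ins, not the actual API; the real acquisition blocks until budget is available, which is simplified to a plain check here.

```scala
// Hypothetical sketch of a runner claiming its initial LocalPool budget.
class GlobalPool(var remaining: Long) {
  // Simplified: the real implementation blocks until `bytes` can be claimed.
  def acquireBudget(bytes: Long): Long = synchronized {
    require(bytes <= remaining, "would block in the real implementation")
    remaining -= bytes
    bytes
  }
}

class Runner(global: GlobalPool, splitLength: Long) {
  // Initial LocalPool: an upper-bound estimate derived from the
  // PartitionedFile split length.
  var localPool: Long = global.acquireBudget(splitLength)
  var usedMem: Long = 0L
}
```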
2. LocalPool Structure: Used vs Free
The LocalPool is logically divided into two portions:
| Portion | Description |
|---|---|
| Used | Memory currently backing live HostMemoryBuffer instances (tracked by usedMem) |
| Free | Remaining budget available for future allocations (localPool - usedMem) |
This partitioning allows the runner to track exactly how much of its budget is actively in use versus how much remains available—enabling early release of over-claimed budget.
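The Used/Free accounting above can be sketched as follows; this is an illustrative model (class and method names assumed), not the PR's actual code.

```scala
// Illustrative Used/Free accounting for a runner's LocalPool.
class LocalPoolAccounting(var localPool: Long) {
  var usedMem: Long = 0L                 // memory backing live buffers
  def free: Long = localPool - usedMem   // budget still available locally
  def markAllocated(bytes: Long): Unit = { usedMem += bytes }
  def markReleased(bytes: Long): Unit = { usedMem -= bytes }
}
```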
3. Allocation Flow: Local-First with Dynamic Borrowing
When a buffer allocation request arrives, the runner follows a local-first strategy:
- Check LocalPool Free Portion: Attempt to satisfy the request using available free budget
- Borrow if Insufficient: If the free portion cannot cover the request, dynamically borrow the deficit from the global HostMemoryPool
Borrowing Semantics:
- Borrow requests are blocking—the runner waits until sufficient budget becomes available
- Borrowers have higher priority than runners waiting to acquire initial budget, ensuring that active work completes before new work is scheduled
- Forceful borrowing: Under certain deadlock-prone conditions (e.g., all in-flight runners are blocked waiting to borrow), the borrow proceeds immediately regardless of available budget. This may leave the HostMemoryPool with a negative remaining balance, but guarantees forward progress
This dynamic borrowing mechanism handles cases where the initial budget estimate is insufficient—such as when file readers need to access metadata beyond the split boundaries (footers, adjacent row groups, etc.).
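The local-first allocation path with dynamic borrowing might look roughly like the sketch below. The borrowMemory shown here is a toy, non-blocking stand-in for the real priority-aware, blocking call; the forceful flag models the deadlock-avoidance path that can drive the global balance negative. All names are assumptions for illustration.

```scala
// Sketch of local-first allocation with dynamic (possibly forceful) borrowing.
class Global(var remaining: Long) {
  // Toy stand-in: the real borrow blocks and honors borrower priority.
  def borrowMemory(bytes: Long, forceful: Boolean): Boolean = synchronized {
    if (remaining >= bytes || forceful) { remaining -= bytes; true }
    else false
  }
}

class RunnerAlloc(global: Global, var localPool: Long) {
  var usedMem: Long = 0L

  def allocate(bytes: Long, forceful: Boolean = false): Boolean = {
    val freePortion = localPool - usedMem
    if (freePortion < bytes) {
      // Local budget is insufficient: borrow only the deficit.
      val deficit = bytes - freePortion
      if (!global.borrowMemory(deficit, forceful)) return false
      localPool += deficit // grow the local quota by the borrowed amount
    }
    usedMem += bytes
    true
  }
}
```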
4. Deallocation Flow: Event-Driven Budget Return
Buffer release triggers an automatic cascade of budget management:
Step 1: Return to LocalPool
When a HostMemoryBuffer is closed (refCount reaches 0), the attached OnCloseHandler fires and returns the corresponding virtual budget back to the runner's LocalPool (decrements usedMem).
Step 2: Early Release via tryFree
If the runner has completed execution (no longer in Running state), the handler triggers tryFree to immediately return the free portion of LocalPool back to the global HostMemoryPool. This releases over-claimed budget as early as possible, improving pool utilization and allowing other runners to be scheduled sooner.
Step 3: Auto-Close on Full Drain
When LocalPool drops to zero—meaning all physical buffers have been closed and all budget has been returned—the runner can be safely closed automatically. This simplifies lifecycle management by eliminating explicit close coordination.
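The three-step cascade above can be sketched in one place; this is a simplified, single-threaded model with assumed names (onBufferClosed standing in for the OnCloseHandler callback), not the PR's implementation.

```scala
// Sketch of the event-driven release cascade: return to LocalPool,
// early release via tryFree, and auto-close on full drain.
sealed trait State
case object Running extends State
case object Completed extends State

class GlobalPool(var remaining: Long) {
  def give(bytes: Long): Unit = synchronized { remaining += bytes }
}

class RunnerRelease(global: GlobalPool, var localPool: Long, var state: State) {
  var usedMem: Long = 0L
  var closed: Boolean = false

  // Step 1: the close handler returns the buffer's budget to the LocalPool.
  def onBufferClosed(bytes: Long): Unit = {
    usedMem -= bytes
    tryFree()
  }

  // Step 2: once no longer Running, return the free portion to the global pool.
  def tryFree(): Unit = if (state != Running) {
    val freePortion = localPool - usedMem
    if (freePortion > 0) { global.give(freePortion); localPool -= freePortion }
    // Step 3: when the LocalPool is fully drained, the runner auto-closes.
    if (localPool == 0) closed = true
  }
}
```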
Files Changed
| File | Changes |
|---|---|
| AsyncRunners.scala | MemoryBoundedAsyncRunner implements HostMemoryAllocator; adds LocalPool tracking (localPool, usedMem), OnCloseHandler for event-driven release, and blocking onClose |
| ResourcePools.scala | Adds borrowMemory with priority semantics and deadlock prevention; updates release to support tryFree and auto-close |
| ResourceBoundedThreadExecutor.scala | Updated lifecycle handling for new runner close semantics |
| HostAlloc.scala | Adds addEventHandler for composing multiple handlers on a single buffer |