
Triple buffering: Bind Virtual Resource Budget to Physical Memory Allocation [databricks]


Closes #13969

Overview

This PR tightly couples the virtual memory budget to the lifecycle of the actual HostMemoryBuffer instances used in the runner, by making MemoryBoundedAsyncRunner serve as both the resource holder and the HostMemoryAllocator. This design eliminates the previous disconnect between budget accounting and actual memory usage, enabling more precise resource management and improved concurrency.


Design

1. LocalPool Initialization

When a MemoryBoundedAsyncRunner is scheduled for execution, it acquires a preliminary memory budget from the global HostMemoryPool (managed by ResourceBoundedExecutor). This budget becomes the runner's LocalPool—a private memory quota that the runner manages independently during its execution lifecycle.

The initial LocalPool size is typically derived from the PartitionedFile split length, representing an upper-bound estimate of the memory required to process the assigned data partition.
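A rough sketch of this step, using simplified stand-in types (the acquire/release signatures are illustrative only, not the actual HostMemoryPool API in this PR):

```scala
// Simplified stand-ins for illustration only; not the actual spark-rapids types.
class HostMemoryPool(initialBudget: Long) {
  private var remaining: Long = initialBudget

  // Hand out `bytes` of virtual budget, waiting until enough is available.
  def acquire(bytes: Long): Long = synchronized {
    while (remaining < bytes) wait()
    remaining -= bytes
    bytes
  }

  // Return budget to the pool and wake up any waiters.
  def release(bytes: Long): Unit = synchronized {
    remaining += bytes
    notifyAll()
  }
}

class MemoryBoundedAsyncRunner(globalPool: HostMemoryPool, splitLength: Long) {
  // The runner's private LocalPool, sized from the split length at schedule time.
  private var localPool: Long = 0L

  // Called when the runner is scheduled for execution.
  def onScheduled(): Unit = {
    localPool = globalPool.acquire(splitLength)
  }
}
```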

2. LocalPool Structure: Used vs Free

The LocalPool is logically divided into two portions:

| Portion | Description |
| --- | --- |
| Used | Memory currently backing live HostMemoryBuffer instances (tracked by usedMem) |
| Free | Remaining budget available for future allocations (localPool - usedMem) |

This partitioning allows the runner to track exactly how much of its budget is actively in use versus how much remains available—enabling early release of over-claimed budget.
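In terms of the simplified runner above, the partition is just two counters plus a derived value (the field names follow the PR's localPool and usedMem; the rest is illustrative):

```scala
class MemoryBoundedAsyncRunner(splitLength: Long) { // simplified
  private var localPool: Long = splitLength // total budget currently held by the runner
  private var usedMem: Long = 0L            // budget backing live HostMemoryBuffer instances

  // Free portion: budget still available for new allocations without borrowing.
  private def freeMem: Long = localPool - usedMem
}
```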

3. Allocation Flow: Local-First with Dynamic Borrowing

When a buffer allocation request arrives, the runner follows a local-first strategy:

  1. Check LocalPool Free Portion: Attempt to satisfy the request using available free budget
  2. Borrow if Insufficient: If the free portion cannot cover the request, dynamically borrow the deficit from the global HostMemoryPool

Borrowing Semantics:

  • Borrow requests are blocking—the runner waits until sufficient budget becomes available
  • Borrowers have higher priority than runners waiting to acquire initial budget, ensuring that active work completes before new work is scheduled
  • Forceful borrowing: Under certain deadlock-prone conditions (e.g., all in-flight runners are blocked waiting to borrow), the borrow proceeds immediately regardless of available budget. This may leave the HostMemoryPool with a negative remaining balance, but guarantees forward progress

This dynamic borrowing mechanism handles cases where the initial budget estimate is insufficient—such as when file readers need to access metadata beyond the split boundaries (footers, adjacent row groups, etc.).
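The allocation path could look roughly like the sketch below (priority between borrowers and newly scheduled runners is omitted; borrowMemory and the force flag are sketched from the description above, not the exact ResourcePools.scala API):

```scala
// Illustrative sketch of the local-first allocation flow; simplified types.
final case class Buffer(bytes: Long) // stand-in for HostMemoryBuffer

class HostMemoryPool(initialBudget: Long) {
  private var remaining: Long = initialBudget

  // Borrow extra budget for an in-flight runner. Borrowers normally wait for
  // budget, but `force = true` lets the borrow proceed even if the pool goes
  // negative (the deadlock-avoidance path described above).
  def borrowMemory(bytes: Long, force: Boolean): Unit = synchronized {
    if (!force) while (remaining < bytes) wait()
    remaining -= bytes // may drop below zero when forced
  }
}

class MemoryBoundedAsyncRunner(globalPool: HostMemoryPool, var localPool: Long) {
  private var usedMem: Long = 0L
  private def freeMem: Long = localPool - usedMem

  // Local-first allocation: satisfy the request from the LocalPool's free
  // portion, and borrow only the deficit from the global pool.
  def allocate(bytes: Long, forceBorrow: Boolean = false): Buffer = {
    val deficit = bytes - freeMem
    if (deficit > 0) {
      globalPool.borrowMemory(deficit, forceBorrow)
      localPool += deficit // borrowed budget grows the LocalPool
    }
    usedMem += bytes
    Buffer(bytes) // the physical HostMemoryBuffer allocation would happen here
  }
}
```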

4. Deallocation Flow: Event-Driven Budget Return

Buffer release triggers an automatic cascade of budget management:

Step 1: Return to LocalPool
When a HostMemoryBuffer is closed (refCount reaches 0), the attached OnCloseHandler fires and returns the corresponding virtual budget back to the runner's LocalPool (decrements usedMem).

Step 2: Early Release via tryFree
If the runner has completed execution (no longer in Running state), the handler triggers tryFree to immediately return the free portion of LocalPool back to the global HostMemoryPool. This releases over-claimed budget as early as possible, improving pool utilization and allowing other runners to be scheduled sooner.

Step 3: Auto-Close on Full Drain
When LocalPool drops to zero—meaning all physical buffers have been closed and all budget has been returned—the runner can be safely closed automatically. This simplifies lifecycle management by eliminating explicit close coordination.
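Putting the three steps together, the cascade might be sketched as follows; the state machine and handler wiring are simplified, and onBufferClosed / tryFree here stand in for the PR's actual OnCloseHandler logic:

```scala
// Illustrative sketch of the event-driven release cascade; simplified types.
object RunnerState extends Enumeration {
  val Running, Finished, Closed = Value
}

class HostMemoryPool {
  private var remaining: Long = 0L
  // Return budget to the global pool and wake up any waiters.
  def release(bytes: Long): Unit = synchronized {
    remaining += bytes
    notifyAll()
  }
}

class MemoryBoundedAsyncRunner(globalPool: HostMemoryPool, private var localPool: Long) {
  private var usedMem: Long = 0L
  private var state: RunnerState.Value = RunnerState.Running

  def markFinished(): Unit = synchronized { state = RunnerState.Finished }

  // Step 1: invoked by the buffer's on-close handler when its refCount hits 0.
  def onBufferClosed(bytes: Long): Unit = synchronized {
    usedMem -= bytes // return the budget to the LocalPool
    // Step 2: once the runner is no longer running, release early.
    if (state != RunnerState.Running) tryFree()
  }

  // Hand the free portion of the LocalPool back to the global pool.
  private def tryFree(): Unit = {
    val free = localPool - usedMem
    if (free > 0) {
      globalPool.release(free)
      localPool -= free
    }
    // Step 3: all buffers closed and all budget returned -> auto-close.
    if (localPool == 0) {
      state = RunnerState.Closed
    }
  }
}
```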


Files Changed

| File | Changes |
| --- | --- |
| AsyncRunners.scala | MemoryBoundedAsyncRunner implements HostMemoryAllocator; adds LocalPool tracking (localPool, usedMem), OnCloseHandler for event-driven release, and blocking onClose |
| ResourcePools.scala | Adds borrowMemory with priority semantics and deadlock prevention; updates release to support tryFree and auto-close |
| ResourceBoundedThreadExecutor.scala | Updated lifecycle handling for new runner close semantics |
| HostAlloc.scala | Adds addEventHandler for composing multiple handlers on a single buffer |
