
[FEA] Use GPU compression in spill framework for better performance

Open binmahone opened this issue 1 month ago • 2 comments

Summary

When GPU memory is tight, instead of immediately spilling to host, we propose a two-tier approach:

  1. First, compress on GPU - Often frees 60%+ memory, stays on GPU, 10x faster than spill
  2. If still not enough, spill the compressed data - Still 2x faster than spilling uncompressed

Benchmark Results (100M int64 rows, 763 MB, skewed distribution)

Method                 Round-trip   GPU Freed   Host Used
Spill Only (current)   63 ms        100%        763 MB
Compress Only          6 ms         62%         0 MB
Compress + Spill       30 ms        100%        288 MB

Proposed Strategy

When we need to free GPU memory:
  1. Compress data on GPU (nvcomp Cascaded/ANS)
  2. If freed space is sufficient → DONE (10x faster, no host memory used)
  3. If still need more space → Spill compressed data to host (2x faster than raw spill)
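The steps above can be sketched as a simple decision function. Note that `gpu_compress` and `spill_to_host` below are hypothetical stand-ins for the real nvcomp and spill-framework calls, not actual spark-rapids APIs; the 0.38 ratio mirrors the ~62% compression measured in the benchmark.

```python
# Sketch of the two-tier strategy: compress first, spill only if needed.
# gpu_compress / spill_to_host are mocks, NOT real spark-rapids APIs.

def gpu_compress(buf_bytes, ratio=0.38):
    """Mock GPU compression: returns the compressed size in bytes."""
    return int(buf_bytes * ratio)

def spill_to_host(nbytes):
    """Mock spill: pretend to copy nbytes over PCIe; return host bytes used."""
    return nbytes

def free_gpu_memory(buf_bytes, bytes_needed):
    """Tier 1: compress on-GPU. Tier 2: spill the compressed buffer."""
    compressed = gpu_compress(buf_bytes)
    freed = buf_bytes - compressed
    if freed >= bytes_needed:
        # Compression alone freed enough; data stays on GPU, no host memory.
        return {"freed": freed, "host_used": 0}
    # Still short: spill the (smaller) compressed buffer to host,
    # releasing all of the original GPU allocation.
    host_used = spill_to_host(compressed)
    return {"freed": buf_bytes, "host_used": host_used}

print(free_gpu_memory(763_000_000, 300_000_000))  # tier 1 suffices
print(free_gpu_memory(763_000_000, 500_000_000))  # falls through to tier 2
```

In the first call, compression frees ~473 MB, more than the 300 MB requested, so no spill happens; the second call needs 500 MB, so the ~290 MB compressed buffer is spilled instead of the full 763 MB.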

Why This Works

  • GPU compression is fast: ~200 GB/s (GPU internal bandwidth)
  • PCIe is the bottleneck: ~25 GB/s for host transfers
  • Compress first: May avoid spill entirely; if spill needed, transfer 62% less data
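A back-of-envelope calculation from these bandwidth figures lines up well with the measured round-trip numbers (63 ms and 30 ms). This is only an estimate; it assumes transfer and (de)compression are not overlapped.

```python
# Estimate round-trip times from the stated bandwidths:
# PCIe ~25 GB/s, GPU compression ~200 GB/s, ~62% compression ratio.
GB = 1e9
data = 763e6                    # 763 MB uncompressed
compressed = data * 0.38        # ~62% freed -> ~290 MB on the wire

pcie = 25 * GB
# Round-trip = transfer to host + transfer back.
spill_only = 2 * data / pcie
# Compressed path also pays compress + decompress at ~200 GB/s.
compress_spill = 2 * compressed / pcie + 2 * data / (200 * GB)

print(f"spill only     : {spill_only * 1e3:.0f} ms")      # ~61 ms (measured: 63 ms)
print(f"compress+spill : {compress_spill * 1e3:.0f} ms")  # ~31 ms (measured: 30 ms)
```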

Non-pinned host memory shows an even larger benefit (2.3x faster):

Method             Pinned   Non-Pinned
Spill Only         63 ms    92 ms
Compress + Spill   30 ms    41 ms

Test Configuration

  • Data: 100M int64, skewed (63% < 10K, 0.1% > 1B)
  • GPU: RTX 4090
  • Compression: nvcomp Cascaded (RLE + Delta + BitPacking)
  • PCIe: Gen4 x16
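For reference, a distribution with this shape can be synthesized with a few lines of stdlib Python. This is purely illustrative; the thresholds and mid-range fill are assumptions, not the exact generator used in the benchmark.

```python
# Synthesize a skewed int64 distribution: 63% < 10K, ~0.1% > 1B.
# Illustrative only -- not the benchmark's actual data generator.
import random

random.seed(42)

def skewed_value():
    r = random.random()
    if r < 0.63:
        # Dense low range: highly compressible with RLE/delta/bit-packing.
        return random.randrange(10_000)
    if r < 0.999:
        # Bulk of the remainder: mid-range values (assumed fill).
        return random.randrange(10_000, 1_000_000_000)
    # ~0.1% extreme outliers above 1B.
    return random.randrange(1_000_000_000, 2**63)

sample = [skewed_value() for _ in range(100_000)]
small_frac = sum(v < 10_000 for v in sample) / len(sample)
print(f"fraction < 10K: {small_frac:.2f}")
```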

Benefits

  1. Faster: 10x (compress only) or 2x (compress+spill) vs current spill
  2. Less host memory: 0 MB or 38% of original
  3. Adaptive: Compress first, spill only if necessary

Related

  • #11127 discusses Cascaded for shuffle optimization
  • nvcomp already in RAPIDS ecosystem

binmahone avatar Dec 04 '25 07:12 binmahone

@revans2 @abellina please let me know your thoughts on this direction. You may have already discussed this before, but I couldn't find any related issues.

Also, the benchmark so far is nowhere near comprehensive. We may need to try out different types of data, different distributions, etc.

binmahone avatar Dec 04 '25 07:12 binmahone

I like this. My main concerns are

  1. What happens if we don't have enough memory to do the compression? We always need a safe fallback that guarantees success.
  2. How do we know how much scratch memory will be needed to do the compression?
  3. Do we need an alternative approach for Grace-Hopper or DGX Spark, where the connection between host and device is either very cheap (like Grace-Hopper) or effectively free (like DGX Spark)?

The first two we need to solve so that we can have confidence it will always work. The last one is something I want to be sure we think about so that we can get optimal performance in as many cases as possible.

revans2 avatar Dec 08 '25 16:12 revans2