[FEA] Use GPU compression in spill framework for better performance
## Summary
When GPU memory is tight, instead of immediately spilling to host, we propose a two-tier approach:
- First, compress on GPU: this often frees 60%+ of the memory, the data stays on GPU, and it is ~10x faster than spilling
- If that is still not enough, spill the compressed data: still ~2x faster than spilling uncompressed
## Benchmark Results (100M int64 rows, 763 MB, skewed distribution)
| Method | Round-trip time | GPU memory freed | Host memory used |
|---|---|---|---|
| Spill Only (current) | 63 ms | 100% | 763 MB |
| Compress Only | 6 ms | 62% | 0 MB |
| Compress + Spill | 30 ms | 100% | 288 MB |
## Proposed Strategy
When we need to free GPU memory:
1. Compress the data on GPU (nvcomp Cascaded/ANS)
2. If the freed space is sufficient → done (~10x faster, no host memory used)
3. If we still need more space → spill the compressed data to host (~2x faster than a raw spill)
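A minimal sketch of this policy, assuming CuPy for device buffers. The names `gpu_compress` and `free_device_memory` are hypothetical, and the compression call is only simulated (using the ~62% reduction measured above), since a real implementation would go through an nvcomp binding:

```python
import cupy as cp


def gpu_compress(buf: cp.ndarray) -> cp.ndarray:
    # Placeholder: replace with a real nvcomp Cascaded/ANS binding.
    # For this sketch we only simulate the ~62% size reduction measured above.
    return cp.empty(max(1, int(buf.nbytes * 0.38)), dtype=cp.uint8)


def free_device_memory(buf: cp.ndarray, bytes_needed: int):
    """Two-tier policy: compress first, spill (compressed) only if required.

    Returns (device_buffer, host_buffer); exactly one of the two is None.
    The caller must drop its reference to `buf` for the memory to be released.
    """
    try:
        compressed = gpu_compress(buf)  # tier 1: compress on the GPU
    except cp.cuda.memory.OutOfMemoryError:
        compressed = None  # no scratch space to compress: fall back to a plain spill

    if compressed is not None and buf.nbytes - compressed.nbytes >= bytes_needed:
        return compressed, None  # compression alone freed enough; stay on GPU

    # Tier 2: spill whatever we have (compressed if tier 1 worked) to host.
    src = buf if compressed is None else compressed
    return None, cp.asnumpy(src)  # D2H copy moves ~62% less data when compressed
```

Note that the try/except also gives a safe fallback: if there is no scratch memory left for compression, the policy degrades to today's raw spill.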
## Why This Works
- GPU compression is fast: ~200 GB/s (GPU-internal memory bandwidth)
- PCIe is the bottleneck: ~25 GB/s for host transfers
- Compressing first may avoid the spill entirely; if a spill is still needed, we transfer 62% less data
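As a rough sanity check (ignoring kernel-launch overheads and any copy/compute overlap), these bandwidth figures reproduce the measured timings:

$$
t_{\text{spill}} \approx 2 \cdot \frac{763\ \text{MB}}{25\ \text{GB/s}} \approx 61\ \text{ms},
\qquad
t_{\text{compress+spill}} \approx 2 \cdot \frac{763\ \text{MB}}{200\ \text{GB/s}} + 2 \cdot \frac{288\ \text{MB}}{25\ \text{GB/s}} \approx 8 + 23 = 31\ \text{ms},
$$

which is close to the 63 ms and 30 ms round trips in the table above.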
Non-pinned host memory shows an even larger benefit (~2.2x faster):
| Method | Pinned | Non-Pinned |
|---|---|---|
| Spill Only | 63 ms | 92 ms |
| Compress + Spill | 30 ms | 41 ms |
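The pinned/non-pinned gap is a property of the copy path: DMA from pageable memory goes through an extra staging buffer. A minimal CuPy sketch of spilling into page-locked host memory (the function name is hypothetical):

```python
import cupy as cp
import numpy as np


def spill_to_pinned(dev: cp.ndarray) -> np.ndarray:
    # Allocate page-locked host memory; DMA into it avoids the extra
    # staging copy that pageable (plain numpy) memory requires.
    pinned = cp.cuda.alloc_pinned_memory(dev.nbytes)
    host = np.frombuffer(pinned, dtype=dev.dtype, count=dev.size).reshape(dev.shape)
    dev.get(out=host)  # device-to-host copy into the pinned buffer
    return host
```

In practice a pool of reusable pinned buffers is preferable, since page-locked allocation itself is expensive.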
## Test Configuration
- Data: 100M int64, skewed (63% < 10K, 0.1% > 1B)
- GPU: RTX 4090
- Compression: nvcomp Cascaded (RLE + Delta + BitPacking)
- PCIe: Gen4 x16
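For reproducibility, the skewed column can be approximated as follows (the exact generator is an assumption; only the row count and the 63% / 0.1% proportions come from the configuration above):

```python
import cupy as cp

n = 100_000_000  # 100M int64 rows = 763 MiB
r = cp.random.random(n)

# Cascaded (RLE + Delta + BitPacking) thrives on this shape: most values
# are small and repetitive, and only 0.1% are large outliers.
small = cp.random.randint(0, 10_000, n)            # drawn for the 63% of rows < 10K
mid = cp.random.randint(10_000, 1_000_000_000, n)  # remaining ~37% of rows
large = cp.random.randint(1_000_000_000, 2**62, n) # drawn for the 0.1% of rows > 1B
data = cp.where(r < 0.63, small, cp.where(r >= 0.999, large, mid)).astype(cp.int64)
```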
## Benefits
- Faster: 10x (compress only) or 2x (compress+spill) vs current spill
- Less host memory: 0 MB or 38% of original
- Adaptive: Compress first, spill only if necessary
## Related
- #11127 discusses Cascaded for shuffle optimization
- nvcomp is already in the RAPIDS ecosystem
@revans2 @abellina please let me know your thoughts on this direction. Maybe you have already discussed this before, but I didn't find any related issues.
Also, the benchmark so far is nowhere near comprehensive. We may need to try different data types, different distributions, etc.
I like this. My main concerns are:
- What happens if we don't have enough memory to do the compression? We always need a safe fallback that guarantees success.
- How do we know how much scratch memory will be needed to do the compression?
- Do we need an alternative approach for Grace-Hopper or DGX Spark, where the connection between host and device is either very cheap (like Grace-Hopper) or effectively free (like DGX Spark)?
The first two we need to solve so that we can have confidence it will always work. The last one is something I want to be sure we think about so that we can get optimal performance in as many cases as possible.