[FEA] Use GPU compression in spill framework for better performance
## Summary
When GPU memory is tight, instead of immediately spilling to host, we propose a two-tier approach:
- First, compress on GPU: this often frees 60%+ of the memory, the data stays on GPU, and it is ~10x faster than spilling
- If that is still not enough, spill the compressed data: still ~2x faster than spilling uncompressed
## Benchmark Results (100M int64 rows, 763 MB, skewed distribution)
| Method | Round-trip time | GPU memory freed | Host memory used |
|---|---|---|---|
| Spill Only (current) | 63 ms | 100% | 763 MB |
| Compress Only | 6 ms | 62% | 0 MB |
| Compress + Spill | 30 ms | 100% | 288 MB |
## Proposed Strategy
When we need to free GPU memory:
1. Compress the data on GPU (nvcomp Cascaded/ANS)
2. If the freed space is sufficient → done (~10x faster, no host memory used)
3. If we still need more space → spill the compressed data to host (~2x faster than a raw spill)
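A minimal sketch of this policy, assuming CuPy for device buffers. The names `gpu_compress` and `free_device_memory` are hypothetical, and the compression call is only simulated (using the ~62% reduction measured above), since a real implementation would go through an nvcomp binding:

```python
import cupy as cp


def gpu_compress(buf: cp.ndarray) -> cp.ndarray:
    # Placeholder: replace with a real nvcomp Cascaded/ANS binding.
    # For this sketch we only simulate the ~62% size reduction measured above.
    return cp.empty(max(1, int(buf.nbytes * 0.38)), dtype=cp.uint8)


def free_device_memory(buf: cp.ndarray, bytes_needed: int):
    """Two-tier policy: compress first, spill (compressed) only if required.

    Returns (device_buffer, host_buffer); exactly one of the two is None.
    The caller must drop its reference to `buf` for the memory to be released.
    """
    try:
        compressed = gpu_compress(buf)  # tier 1: compress on the GPU
    except cp.cuda.memory.OutOfMemoryError:
        compressed = None  # no scratch space to compress: fall back to a plain spill

    if compressed is not None and buf.nbytes - compressed.nbytes >= bytes_needed:
        return compressed, None  # compression alone freed enough; stay on GPU

    # Tier 2: spill whatever we have (compressed if tier 1 worked) to host.
    src = buf if compressed is None else compressed
    return None, cp.asnumpy(src)  # D2H copy moves ~62% less data when compressed
```

Note that the try/except also gives a safe fallback: if there is no scratch memory left for compression, the policy degrades to today's raw spill.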
## Why This Works
- GPU compression is fast: ~200 GB/s (GPU-internal memory bandwidth)
- PCIe is the bottleneck: ~25 GB/s for host transfers
- Compressing first may avoid the spill entirely; if a spill is still needed, we transfer 62% less data
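As a rough sanity check (ignoring kernel-launch overheads and any copy/compute overlap), these bandwidth figures reproduce the measured timings:

$$
t_{\text{spill}} \approx 2 \cdot \frac{763\ \text{MB}}{25\ \text{GB/s}} \approx 61\ \text{ms},
\qquad
t_{\text{compress+spill}} \approx 2 \cdot \frac{763\ \text{MB}}{200\ \text{GB/s}} + 2 \cdot \frac{288\ \text{MB}}{25\ \text{GB/s}} \approx 8 + 23 = 31\ \text{ms},
$$

which is close to the 63 ms and 30 ms round trips in the table above.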
Non-pinned host memory shows an even larger benefit (~2.2x faster):
| Method | Pinned | Non-Pinned |
|---|---|---|
| Spill Only | 63 ms | 92 ms |
| Compress + Spill | 30 ms | 41 ms |
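The pinned/non-pinned gap is a property of the copy path: DMA from pageable memory goes through an extra staging buffer. A minimal CuPy sketch of spilling into page-locked host memory (the function name is hypothetical):

```python
import cupy as cp
import numpy as np


def spill_to_pinned(dev: cp.ndarray) -> np.ndarray:
    # Allocate page-locked host memory; DMA into it avoids the extra
    # staging copy that pageable (plain numpy) memory requires.
    pinned = cp.cuda.alloc_pinned_memory(dev.nbytes)
    host = np.frombuffer(pinned, dtype=dev.dtype, count=dev.size).reshape(dev.shape)
    dev.get(out=host)  # device-to-host copy into the pinned buffer
    return host
```

In practice a pool of reusable pinned buffers is preferable, since page-locked allocation itself is expensive.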
## Test Configuration
- Data: 100M int64, skewed (63% < 10K, 0.1% > 1B)
- GPU: RTX 4090
- Compression: nvcomp Cascaded (RLE + Delta + BitPacking)
- PCIe: Gen4 x16
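For reproducibility, the skewed column can be approximated as follows (the exact generator is an assumption; only the row count and the 63% / 0.1% proportions come from the configuration above):

```python
import cupy as cp

n = 100_000_000  # 100M int64 rows = 763 MiB
r = cp.random.random(n)

# Cascaded (RLE + Delta + BitPacking) thrives on this shape: most values
# are small and repetitive, and only 0.1% are large outliers.
small = cp.random.randint(0, 10_000, n)            # drawn for the 63% of rows < 10K
mid = cp.random.randint(10_000, 1_000_000_000, n)  # remaining ~37% of rows
large = cp.random.randint(1_000_000_000, 2**62, n) # drawn for the 0.1% of rows > 1B
data = cp.where(r < 0.63, small, cp.where(r >= 0.999, large, mid)).astype(cp.int64)
```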
## Benefits
- Faster: 10x (compress only) or 2x (compress+spill) vs current spill
- Less host memory: 0 MB or 38% of original
- Adaptive: Compress first, spill only if necessary
## Related
- #11127 discusses Cascaded for shuffle optimization
- nvcomp is already in the RAPIDS ecosystem
@revans2 @abellina please let me know your thoughts on this direction. Maybe you have already discussed this before, but I didn't find any related issues.
Also, the benchmark so far is nowhere near comprehensive. We may need to try different data types, different distributions, etc.
I like this. My main concerns are:
- What happens if we don't have enough memory to do the compression? We always need a safe fallback that guarantees success.
- How do we know how much scratch memory will be needed to do the compression?
- Do we need an alternative approach for Grace-Hopper or DGX Spark, where the connection between host and device is either very cheap (like Grace-Hopper) or effectively free (like DGX Spark)?
The first two we need to solve so that we can have confidence it will always work. The last one is something I want to be sure we think about so that we can get optimal performance in as many cases as possible.