[FEA] Column-wise columnar batch concatenation

Open sperlingxx opened this issue 3 weeks ago • 6 comments

Is your feature request related to a problem? Please describe.

Based on insights from #13884, we observed severe OOM retries and semaphore waits during batch concatenation.

The current implementation of concatBatches in spark-rapids materializes all SpillableColumnarBatch instances simultaneously and performs table-level concatenation. A typical usage is concatenateBatchesWithRetry in GpuAggregateIterator. Because every batch is materialized at once, entire tables stay in memory for the whole concatenation, which can lead to extremely high peak memory usage. As a result, a lot of time may be spent on OOM retries.

Describe the solution you'd like

Implement column-wise concatenation that processes one column at a time instead of concatenating entire tables.

Key improvements:

- Column-by-column materialization: extract and concatenate one column at a time from the spillable batches.
- Immediate deallocation: free the memory of the source columns as soon as each result column is produced.
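
To make the expected saving concrete, here is a toy model in plain Python that compares peak device memory for the two strategies. It is only a sketch, not spark-rapids or cuDF code: `DeviceTracker`, `table_wise_peak` and `column_wise_peak` are hypothetical names, a batch is modeled as a list of column sizes in bytes, and it assumes that columns which are not currently being processed can stay spilled off the device.

```python
from typing import List

# A batch is modeled as a list of per-column sizes in bytes (an assumption
# made for illustration; real batches hold device buffers).
Batch = List[int]


class DeviceTracker:
    """Hypothetical stand-in for device memory accounting."""

    def __init__(self) -> None:
        self.live = 0  # bytes currently allocated
        self.peak = 0  # highest value `live` has reached

    def alloc(self, nbytes: int) -> None:
        self.live += nbytes
        self.peak = max(self.peak, self.live)

    def free(self, nbytes: int) -> None:
        self.live -= nbytes


def table_wise_peak(batches: List[Batch]) -> int:
    """Materialize every batch, then concatenate whole tables."""
    dev = DeviceTracker()
    for batch in batches:
        for size in batch:
            dev.alloc(size)  # all source columns are live at once
    for col in range(len(batches[0])):
        dev.alloc(sum(b[col] for b in batches))  # output column
    for batch in batches:
        for size in batch:
            dev.free(size)  # inputs are released only at the end
    return dev.peak


def column_wise_peak(batches: List[Batch]) -> int:
    """Materialize and concatenate one column at a time."""
    dev = DeviceTracker()
    for col in range(len(batches[0])):
        sizes = [b[col] for b in batches]
        for size in sizes:
            dev.alloc(size)  # only this column's source slices
        dev.alloc(sum(sizes))  # concatenated result column
        for size in sizes:
            dev.free(size)  # free sources as soon as the result exists
    return dev.peak


batches = [[100] * 8 for _ in range(4)]  # 4 batches, 8 columns of 100 bytes
print(table_wise_peak(batches), column_wise_peak(batches))  # prints "6400 3600"
```

In this model the table-level path peaks at twice the total data size, while the column-wise path peaks at the total output size plus one column's worth of inputs. The real gain depends on whether a spillable batch can be materialized at column granularity, which the model simply assumes.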

Additional context

Furthermore, we could also spill the concatenated columns, reducing peak device memory by keeping only a minimal number of columns in memory.
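
As a rough illustration of that idea, the following Python sketch estimates the device peak when each concatenated column is moved off the device before the next one is processed. The function name `device_peak_with_result_spill` and the byte sizes are hypothetical, and the model ignores the cost and timing of the spill itself.

```python
from typing import List


def device_peak_with_result_spill(batches: List[List[int]]) -> int:
    """Peak device bytes when every result column is spilled right away.

    Each batch is a list of per-column sizes in bytes (illustrative model).
    """
    live = 0
    peak = 0
    for col in range(len(batches[0])):
        column_bytes = sum(b[col] for b in batches)
        live += column_bytes  # materialize this column's source slices
        live += column_bytes  # allocate the concatenated result column
        peak = max(peak, live)
        live -= column_bytes  # free the source slices
        live -= column_bytes  # spill the result column off the device
    return peak


batches = [[100] * 8 for _ in range(4)]  # 4 batches, 8 columns of 100 bytes
print(device_peak_with_result_spill(batches))  # prints "800"
```

Under these assumptions the device peak is bounded by about twice the largest concatenated column, independent of the number of columns, at the price of extra device-to-host traffic.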

Dec 02 '25 07:12 sperlingxx