
Investigate treating scalars in a column batch as dictionary columns

Open jlowe opened this issue 2 years ago • 3 comments

There are a number of cases where we end up wanting to treat a scalar as a column in a columnar batch or cudf table, which forces us to replicate the scalar for every row in the batch/table so it can be a column. This can be very wasteful when the scalar is of significant size (e.g. a long string or a complex type).

One potential way to improve this situation would be to use dictionary columns: create a dictionary with a single entry (the scalar) and have every row's index point to that one entry. If we can preserve that dictionary through operations (e.g. a gather on the column results in another dictionary column rather than exploding the dictionary column out into a standard column), we could achieve some memory and memory-bandwidth savings.

This could be particularly useful for batches containing the input filename or other partition-like values that are large and currently exploded out for every row.
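The idea above can be sketched with a minimal, hypothetical model (not the cudf or spark-rapids API; class and method names here are invented for illustration). The scalar is stored once in a single-entry dictionary, every row index points at entry 0, and a gather only touches the cheap indices:

```python
# Illustrative sketch only: a single-entry dictionary column standing in for
# a scalar, instead of replicating the scalar value for every row.
class DictColumn:
    def __init__(self, dictionary, indices):
        self.dictionary = dictionary  # unique values; here just the one scalar
        self.indices = indices        # per-row index into the dictionary

    @classmethod
    def from_scalar(cls, scalar, num_rows):
        # One dictionary entry; every row shares index 0. The scalar is
        # stored once rather than num_rows times.
        return cls([scalar], [0] * num_rows)

    def gather(self, row_map):
        # Gather preserves the dictionary: only the small indices are
        # gathered, the (potentially large) dictionary entry is untouched.
        return DictColumn(self.dictionary, [self.indices[i] for i in row_map])

    def materialize(self):
        # The "exploded" standard-column form the issue wants to avoid.
        return [self.dictionary[i] for i in self.indices]

# Example: a large per-batch value such as an input filename.
col = DictColumn.from_scalar("/data/part=2023/some_long_file_name.parquet", 1000)
subset = col.gather([0, 10, 999])
assert len(subset.dictionary) == 1  # still one stored copy of the scalar
assert subset.materialize() == ["/data/part=2023/some_long_file_name.parquet"] * 3
```

In this model the per-row cost is one small index instead of a full copy of the value, which is where the memory and bandwidth savings would come from.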

jlowe avatar Feb 03 '23 19:02 jlowe

Would need to add dictionary support.

sameerz avatar Feb 07 '23 22:02 sameerz

This could be a huge win for memory in cases like https://github.com/NVIDIA/spark-rapids/issues/10561, where we insert a lot of null columns as placeholders, knowing that they will never be used and will likely be replaced by more columns of only nulls. Perhaps we could even ask CUDF for a scalar column as a further optimization.

revans2 avatar Mar 07 '24 15:03 revans2

I filed https://github.com/rapidsai/cudf/issues/15308 for this in CUDF. We will see what happens.

revans2 avatar Mar 14 '24 21:03 revans2