cudf [FEA] Refactor to eliminate redundant device aggregation logic

[FEA] Refactor to eliminate redundant device aggregation logic

Open PointKernel opened this issue 4 months ago • 0 comments

Is your feature request related to a problem? Please describe. Once #17031 is merged, three copies of similar device aggregator logic will exist in libcudf, and we need to address this issue.

Currently, we cannot share the same code path because the existing device aggregator only accepts column_device_view as input, and libcudf is unable to construct a column_device_view from shared memory at this point.

Describe the solution you'd like Based on our offline discussions, adding an overload for the column_device_view ctor may not be the best approach. Instead, we should consider introducing a new object similar to device_column_view specifically designed for use in shared memory.

Ultimately, we aim to have a single aggregator that manages all types of aggregations: global-global, shared-global, and global-shared.

Additional context

The new gmem_element_aggregator and shmem_element_aggregator cannot be used to deal with dictionary columns due to two reasons:
- CUDA error: on V100, they will cause a cudaErrorInvalidValue when querying the available dynamic shared memory size with cudaOccupancyAvailableDynamicSMemPerBlock. This issue is likely caused by the dictionary template instantiation of the aggregator, which triggers a nested invocation of the type dispatcher. Note this error is seen on V100 but not on RTX8000.
- Performance: Simply adding dictionary instantiations without actually using them, i.e., without dictionary columns involved in the calculation, can make the group performance up to 5x slower.

Oct 09 '24 20:10 PointKernel

cudf cudf copied to clipboard

[FEA] Refactor to eliminate redundant device aggregation logic

cudf
cudf copied to clipboard