Refactor `distinct` using `static_map` `insert_or_apply`
Description
This PR refactors `distinct` using `static_map::insert_or_apply` for the `keep_first`, `keep_last`, and `keep_none` options.
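For context, a rough sketch of the `keep_first` idea, with hypothetical keys, capacity, and sentinels. It assumes cuco's bulk `insert_or_apply(first, last, op)` host API, where the apply op receives a `cuda::atomic_ref` to the stored payload; the PR's actual code differs in its row operators and types. Each key's payload is reduced to its minimum row index (`keep_last` would use `fetch_max`):

```cpp
// Sketch only; compile with nvcc --extended-lambda. Names and sentinels are hypothetical.
#include <cuco/static_map.cuh>

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

#include <cuda/atomic>

// Apply op for keep_first: keep the smallest row index observed per key.
// (keep_last would use fetch_max instead.)
struct min_index_op {
  __device__ void operator()(cuda::atomic_ref<int, cuda::thread_scope_device> payload,
                             int row_index) const
  {
    payload.fetch_min(row_index, cuda::memory_order_relaxed);
  }
};

int main()
{
  thrust::device_vector<int> keys{3, 1, 3, 2, 1};  // hypothetical key column

  cuco::static_map<int, int> map{cuco::extent<std::size_t>{10},  // capacity
                                 cuco::empty_key{-1},            // key sentinel
                                 cuco::empty_value{-1}};         // payload sentinel

  // Pair each key with its row index. The first insert of a key claims the slot;
  // every later insert of the same key only applies the op to the stored payload.
  auto pairs = thrust::make_transform_iterator(
    thrust::counting_iterator<int>{0},
    [keys = keys.data().get()] __device__(int i) { return cuco::pair{keys[i], i}; });

  map.insert_or_apply(pairs, pairs + keys.size(), min_index_op{});
  // Map now holds {3: 0, 1: 1, 2: 3}, i.e. the first-occurrence index per key.
  return 0;
}
```

Folding the reduction into the insert like this is what removes the separate `reduce_by_row` pass.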
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
@bdice @PointKernel, I pushed all the cleanups and minor optimizations. The PR is ready for review.
Also, I don't think we need extra optimization for `keep_none`. As @bdice mentioned, a transform counting iterator with `copy_if` should be efficient enough.
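For reference, a minimal sketch of that approach, assuming a hypothetical `counts` buffer holding the per-row key occurrence counts produced by the reduction step:

```cpp
// Sketch only; compile with nvcc --extended-lambda.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

// keep_none retains only rows whose key occurs exactly once. Given per-row
// occurrence counts, the surviving indices fall out of a single copy_if
// over a counting iterator.
thrust::device_vector<int> keep_none_indices(thrust::device_vector<int> const& counts)
{
  thrust::device_vector<int> indices(counts.size());
  auto const* counts_ptr = counts.data().get();
  auto end = thrust::copy_if(thrust::counting_iterator<int>{0},
                             thrust::counting_iterator<int>{static_cast<int>(counts.size())},
                             indices.begin(),
                             [counts_ptr] __device__(int i) { return counts_ptr[i] == 1; });
  indices.resize(end - indices.begin());
  return indices;
}
```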
@srinivasyadav18 can you please share the performance comparisons before and after this PR?
Based on this gist, here are the most important performance signals:
I'm comparing the 1B-row case, removing `any` because it's not impacted by this PR, and removing `last` because it is identical to `first`.
- drop at 1B for `keep=first`
- drop at medium-low cardinality (1-100K)
- improvement at low cardinality (100) for I32
- improvement at 1-100M
Overall, I believe this performance signature is a wash. Perhaps the shifting throughputs are due to cache locality differences. It's worth profiling the 1B rows, 1B cardinality, keep first case and the 1B rows, 100K cardinality, keep first case.
Thanks @GregoryKimball for providing the plot and important performance signals.
I have done some profiling on 1 billion keys and 1 billion cardinality, and this is what I found.
Overall, the distinct algorithm has two major steps:
- distinct_indices
  - initialize the data structure
  - perform the reductions (reduce_by_row)
  - return the output_indices (all the distinct indices, based on the keep option)
- gather (a standalone sketch of this step follows the list)
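To make the two steps concrete, a minimal self-contained sketch of the second step in isolation, with hypothetical values (cudf's real gather operates on whole tables rather than a single vector):

```cpp
#include <thrust/device_vector.h>
#include <thrust/gather.h>

int main()
{
  // Hypothetical input column and the distinct indices produced by step 1
  // (keep_first semantics: the first occurrence of each key survives).
  thrust::device_vector<int> values{10, 20, 10, 30, 20};
  thrust::device_vector<int> distinct_idx{0, 1, 3};

  // Step 2: gather the surviving rows into a compact output column.
  thrust::device_vector<int> out(distinct_idx.size());
  thrust::gather(distinct_idx.begin(), distinct_idx.end(), values.begin(), out.begin());
  // out == {10, 20, 30}
  return 0;
}
```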
Time taken for each step with base branch-24.10:
- distinct_indices
  - initialize the data structure (cuco::static_set, reduction_results device_uvector): TIME = 5 ms
  - perform the reductions (reduce_by_row): TIME = 65.589 ms
  - return the output_indices (all the distinct indices, based on the keep option): TIME = 4 ms
- gather: **TIME = 15 ms**
Below is the profiling of base branch-24.10 (selected region in green shade shows gather runtime)
Time taken for each step with insert_or_apply:
- distinct_indices
  - initialize the data structure: TIME = 7.8 ms
  - perform the reductions (insert_or_apply via thrust::for_each with set_ref): TIME = 55.7 ms
  - return the output_indices (retrieve_all): TIME = 11.8 ms
- gather: **TIME = 53.364 ms**
Below is the profiling with insert_or_apply (selected region in green shade shows gather runtime)
In summary, the gather algorithm is the main bottleneck. It looks like gather is efficient only when the row indices are already sorted (this makes sense, because there is less data movement).
I modified the insert_or_apply implementation to sort the indices before returning them from the distinct algorithm, and performance improved significantly, as gather now takes very little time.
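In thrust terms, the workaround amounts to something like this sketch (hypothetical buffers and function name; the real code sorts the indices inside the distinct implementation):

```cpp
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/sort.h>

// Sorting the unordered distinct indices restores a mostly-coalesced read
// pattern for the subsequent gather, at the cost of the sort itself.
void sorted_gather(thrust::device_vector<int>& distinct_idx,
                   thrust::device_vector<int> const& values,
                   thrust::device_vector<int>& out)
{
  thrust::sort(distinct_idx.begin(), distinct_idx.end());  // the red region in the profile
  thrust::gather(distinct_idx.begin(), distinct_idx.end(), values.begin(), out.begin());
}
```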
Below is the profiling with insert_or_apply with sorting the distinct indices: (selected region in green shade shows gather runtime) (red shows the sorting overhead)
In summary, the branch-24.10 implementation is already efficient, as it performs the reduction while also maintaining the ordering of the reduction values (since they are row indices).
The insert_or_apply kernel improves performance by more than 15%, but because it outputs unordered indices, the gather post-processing step regresses performance by more than 50%, causing an overall regression in the pipeline.
Sorting helps in the large-input cases, but causes a regression in the small-input cases (see the gist for numbers).
Thank you @srinivasyadav18 for studying the 1B cardinality case. The gist you posted shows the benefit.
It would be worth profiling and checking the 100K cardinality case, but I don't think the moderate-cardinality performance is a blocker.
Instead of sorting, how about using a scatter-gather approach?
- Initialize a gather map with all 0s.
- Scatter the output indices into the gather map, marking the position of these indices as 1.
- Gather using that gather map (one possible reading is sketched below).
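A hypothetical thrust rendering of that suggestion, treating the 0/1 gather map as a boolean mask and compacting with it (cudf would presumably route the final step through its own gather / apply_boolean_mask machinery; `scatter_then_gather` and the buffers are illustrative names):

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/scatter.h>

// Predicate that simply forwards the stencil value.
struct is_marked {
  __device__ bool operator()(bool b) const { return b; }
};

// Turn the unordered output indices into an in-order boolean mask, then
// compact the input with it; the output stays in row order without a sort.
thrust::device_vector<int> scatter_then_gather(thrust::device_vector<int> const& out_indices,
                                               thrust::device_vector<int> const& values)
{
  thrust::device_vector<bool> mask(values.size(), false);  // gather map of all 0s
  thrust::scatter(thrust::constant_iterator<bool>(true),   // mark these positions as 1
                  thrust::constant_iterator<bool>(true) + out_indices.size(),
                  out_indices.begin(),
                  mask.begin());
  thrust::device_vector<int> out(out_indices.size());
  thrust::copy_if(values.begin(), values.end(), mask.begin(), out.begin(), is_marked{});
  return out;
}
```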
I'm starting to review this PR. Sorry for jumping in late.
We hope that https://github.com/rapidsai/cudf/pull/15700 will improve the overall performance outlook of static_map in distinct.
Update: waiting for #15700 to merge to determine if the current PR can improve performance after resolving the register issues.
We could re-benchmark this feature using a primitive row operator as in https://github.com/rapidsai/cudf/pull/18896