cudf [FEA] Improve `cudf::distinct` with cuco reduction map

Is your feature request related to a problem? Please describe. #11052 introduces the keep control option into cudf::distinct and makes it possible for users to perform a more efficient hash-based drop_duplicates. The PR uses a single hash map together with thrust algorithms to mimic the behavior of a reduction map. This whole process can be largely simplified once https://github.com/NVIDIA/cuCollections/pull/98 is ready. TODO:

[ ] Replace static_map + thrust algos with cuco::static_reduction_map + cudf::sort
[ ] Update Python bindings to use the hash-based algorithm
[ ] Investigate the performance impact with various map occupancy and sort-based algo v.s. hash-based algo
[ ] Minimize memory footprint

Describe the solution you'd like Uses a cuco::static_reduction_map where the key is the row index and the value is the min/max index of equivalent rows (depending on the keep option).

Describe alternatives you've considered We could also take a pair of row hash value and row index as the key which performs the expensive row hash computation only once for better runtime performance. This requires more memory footprint though. To be evaluated.

Additional context #11656 may not be required by the new reduction map implementation.

Apr 17 '23 20:04 PointKernel

cc @sleeepyjack

Apr 17 '23 20:04 PointKernel

@PointKernel Let's discuss this -- I'm targeting #11656 for 23.06 and it also affects downstream work like #13244. I am surprised that the new cuco work would eliminate the need for a stable/unstable path here. The stability is with respect to key collection, not key insertion (that's controlled by "keep" behavior). Can you give more context about the ordering of the keys that are used to gather distinct results?

May 19 '23 02:05 bdice

@PointKernel Let's discuss this -- I'm targeting #11656 for 23.06 and it also affects downstream work like #13244. I am surprised that the new cuco work would eliminate the need for a stable/unstable path here. The stability is with respect to key collection, not key insertion (that's controlled by "keep" behavior). Can you give more context about the ordering of the keys that are used to gather distinct results?

The current stable_distinct implementation would still be needed I think. This approach using the proposed cuco insert_or_apply interface in https://github.com/NVIDIA/cuCollections/pull/384 (which supersedes https://github.com/NVIDIA/cuCollections/pull/98) would avoid the need for the thrust-based post-processing of the result that is currently done in the keep != ANY case.

Nov 17 '23 10:11 wence-

Just to note, I would like any putative API that exposes the insert_or_apply interface in libcudf to allow user-provided binops. Any value type (within the usual cuco constraints) with monoid structure overlaid should be allowable. This would allow, for example, implementation of value_counts (count multiplicity of unique rows) which is needed in cudf's Index.union implementation (or will be soon), by using (0, +). Right now, we must compute a groupby over the rows and then compute groupby.size() on a dummy column to deduce multiplicity

As noted, the distinct flavours use (MAX_INT, min) (for keep=first), (MIN_INT, max) (for keep=last).

Dec 04 '23 11:12 wence-