Refactor `distinct` using `static_map` `insert_or_apply`
Description
This PR refactors `distinct` using `static_map::insert_or_apply` for the `keep_first`, `keep_last`, and `keep_none` options.
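For context, a rough sketch of the `keep_first` idea, with hypothetical keys, capacity, and sentinels. It assumes cuco's bulk `insert_or_apply(first, last, op)` host API, where the apply op receives a `cuda::atomic_ref` to the stored payload; the PR's actual code differs in its row operators and types. Each key's payload is reduced to its minimum row index (`keep_last` would use `fetch_max`):

```cpp
// Sketch only; compile with nvcc --extended-lambda. Names and sentinels are hypothetical.
#include <cuco/static_map.cuh>

#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>

#include <cuda/atomic>

// Apply op for keep_first: keep the smallest row index observed per key.
// (keep_last would use fetch_max instead.)
struct min_index_op {
  __device__ void operator()(cuda::atomic_ref<int, cuda::thread_scope_device> payload,
                             int row_index) const
  {
    payload.fetch_min(row_index, cuda::memory_order_relaxed);
  }
};

int main()
{
  thrust::device_vector<int> keys{3, 1, 3, 2, 1};  // hypothetical key column

  cuco::static_map<int, int> map{cuco::extent<std::size_t>{10},  // capacity
                                 cuco::empty_key{-1},            // key sentinel
                                 cuco::empty_value{-1}};         // payload sentinel

  // Pair each key with its row index. The first insert of a key claims the slot;
  // every later insert of the same key only applies the op to the stored payload.
  auto pairs = thrust::make_transform_iterator(
    thrust::counting_iterator<int>{0},
    [keys = keys.data().get()] __device__(int i) { return cuco::pair{keys[i], i}; });

  map.insert_or_apply(pairs, pairs + keys.size(), min_index_op{});
  // Map now holds {3: 0, 1: 1, 2: 3}, i.e. the first-occurrence index per key.
  return 0;
}
```

Folding the reduction into the insert like this is what removes the separate `reduce_by_row` pass.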
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [ ] New or existing tests cover these changes.
- [ ] The documentation is up to date with these changes.
@bdice @PointKernel, I pushed all the cleanups and minor optimizations. The PR is ready for review.
Also, I don't think we need extra optimization for `keep_none`. As @bdice mentioned, a transform counting iterator with `copy_if` should be efficient enough.
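For reference, a minimal sketch of that approach, assuming a hypothetical `counts` buffer holding the per-row key occurrence counts produced by the reduction step:

```cpp
// Sketch only; compile with nvcc --extended-lambda.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>

// keep_none retains only rows whose key occurs exactly once. Given per-row
// occurrence counts, the surviving indices fall out of a single copy_if
// over a counting iterator.
thrust::device_vector<int> keep_none_indices(thrust::device_vector<int> const& counts)
{
  thrust::device_vector<int> indices(counts.size());
  auto const* counts_ptr = counts.data().get();
  auto end = thrust::copy_if(thrust::counting_iterator<int>{0},
                             thrust::counting_iterator<int>{static_cast<int>(counts.size())},
                             indices.begin(),
                             [counts_ptr] __device__(int i) { return counts_ptr[i] == 1; });
  indices.resize(end - indices.begin());
  return indices;
}
```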
@srinivasyadav18 can you please share the performance comparisons before and after this PR?
Based on this gist, here are the most important performance signals:
I'm comparing the 1B-row case, removing `any` because it's not impacted by this PR, and removing `last` because it is identical to `first`.
- drop at 1B for `keep=first`
- drop at medium-low cardinality (1-100K)
- improvement at low cardinality (100) for I32
- improvement at 1-100M
Overall, I believe this performance signature is a wash. Perhaps the shifting throughputs are due to cache locality differences. It's worth profiling the 1B rows, 1B cardinality, keep first case and the 1B rows, 100K cardinality, keep first case.
Thanks @GregoryKimball for providing the plot and important performance signals.
I have done some profiling on 1 billion keys and 1 billion cardinality, and this is what I found.
Overall, the distinct algorithm has two major steps:
- distinct_indices
  - initialize the data structure
  - perform the reductions (reduce_by_row)
  - return the output_indices (all the distinct indices, based on the keep option)
- gather (a standalone sketch of this step follows the list)
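To make the two steps concrete, a minimal self-contained sketch of the second step in isolation, with hypothetical values (cudf's real gather operates on whole tables rather than a single vector):

```cpp
#include <thrust/device_vector.h>
#include <thrust/gather.h>

int main()
{
  // Hypothetical input column and the distinct indices produced by step 1
  // (keep_first semantics: the first occurrence of each key survives).
  thrust::device_vector<int> values{10, 20, 10, 30, 20};
  thrust::device_vector<int> distinct_idx{0, 1, 3};

  // Step 2: gather the surviving rows into a compact output column.
  thrust::device_vector<int> out(distinct_idx.size());
  thrust::gather(distinct_idx.begin(), distinct_idx.end(), values.begin(), out.begin());
  // out == {10, 20, 30}
  return 0;
}
```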
Time taken for each step with base branch-24.10:
- distinct_indices
  - initialize the data structure (cuco::static_set, reduction_results device_uvector): TIME = 5 ms
  - perform the reductions (reduce_by_row): TIME = 65.589 ms
  - return the output_indices (all the distinct indices, based on the keep option): TIME = 4 ms
- gather: **TIME = 15 ms**
Below is the profiling of base branch-24.10 (selected region in green shade shows gather runtime)
Time taken for each step with insert_or_apply:
- distinct_indices
  - initialize the data structure: TIME = 7.8 ms
  - perform the reductions (insert_or_apply via thrust::for_each with set_ref): TIME = 55.7 ms
  - return the output_indices (retrieve_all): TIME = 11.8 ms
- gather: **TIME = 53.364 ms**
Below is the profiling with insert_or_apply (selected region in green shade shows gather runtime)
In summary, the gather algorithm is the main bottleneck. It looks like gather is efficient only when the row indices are already sorted (this makes sense, because there is less data movement).
I modified the insert_or_apply implementation to sort the indices before returning them from the distinct algorithm, and performance improved significantly, as gather now takes very little time.
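In thrust terms, the workaround amounts to something like this sketch (hypothetical buffers and function name; the real code sorts the indices inside the distinct implementation):

```cpp
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/sort.h>

// Sorting the unordered distinct indices restores a mostly-coalesced read
// pattern for the subsequent gather, at the cost of the sort itself.
void sorted_gather(thrust::device_vector<int>& distinct_idx,
                   thrust::device_vector<int> const& values,
                   thrust::device_vector<int>& out)
{
  thrust::sort(distinct_idx.begin(), distinct_idx.end());  // the red region in the profile
  thrust::gather(distinct_idx.begin(), distinct_idx.end(), values.begin(), out.begin());
}
```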
Below is the profiling with insert_or_apply with sorting the distinct indices: (selected region in green shade shows gather runtime) (red shows the sorting overhead)
In summary, the branch-24.10 implementation is already efficient, as it performs the reduction while also maintaining the ordering of the reduction values (since they are row indices).
The insert_or_apply kernel improves performance by more than 15%, but because it outputs unordered indices, the gather post-processing step regresses performance by more than 50%, causing an overall regression in the pipeline.
Sorting helps in the large-input cases, but causes a regression in the small-input cases (see the gist for numbers).
Thank you @srinivasyadav18 for studying the 1B cardinality case. The gist you posted shows the benefit.
It would be worth profiling and checking the 100K cardinality case, but I don't think the moderate-cardinality performance is a blocker.
Instead of sorting, how about using a scatter-gather approach?
- Initialize a gather map with all 0s.
- Scatter the output indices into the gather map, marking the position of these indices as 1.
- Gather using that gather map (one possible reading is sketched below).
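A hypothetical thrust rendering of that suggestion, treating the 0/1 gather map as a boolean mask and compacting with it (cudf would presumably route the final step through its own gather / apply_boolean_mask machinery; `scatter_then_gather` and the buffers are illustrative names):

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/scatter.h>

// Predicate that simply forwards the stencil value.
struct is_marked {
  __device__ bool operator()(bool b) const { return b; }
};

// Turn the unordered output indices into an in-order boolean mask, then
// compact the input with it; the output stays in row order without a sort.
thrust::device_vector<int> scatter_then_gather(thrust::device_vector<int> const& out_indices,
                                               thrust::device_vector<int> const& values)
{
  thrust::device_vector<bool> mask(values.size(), false);  // gather map of all 0s
  thrust::scatter(thrust::constant_iterator<bool>(true),   // mark these positions as 1
                  thrust::constant_iterator<bool>(true) + out_indices.size(),
                  out_indices.begin(),
                  mask.begin());
  thrust::device_vector<int> out(out_indices.size());
  thrust::copy_if(values.begin(), values.end(), mask.begin(), out.begin(), is_marked{});
  return out;
}
```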
I'm starting to review this PR. Sorry for jumping in late.
We hope that https://github.com/rapidsai/cudf/pull/15700 will improve the overall performance outlook of static_map in distinct.
Update: waiting for #15700 to merge to determine if the current PR can improve performance after resolving the register issues.
We could re-benchmark this feature using a primitive row operator as in https://github.com/rapidsai/cudf/pull/18896