cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[FEA] Refactor hash-based algorithms with new cuco data structures

Open PointKernel opened this issue 2 years ago • 0 comments

Is your feature request related to a problem? Please describe. cuco::static_map and cuco::static_multimap are used to perform hash-based operations in libcudf. Depending on https://github.com/NVIDIA/cuCollections/issues/110, a lot of existing use cases could be replaced with cuco::static_set or cuco:static_multiset since the payload part in the hash map is not used.

Moreover, some libcudf implementations are still using concurrent_unordered_map or unordered_multiset and they should be refactored with cuco::static_set/multiset as well.

Describe the solution you'd like Replace existing implementations with new data structures like cuco::static_set, cuco::static_multiset or cuco::static_map.

There should also be benchmarks showing the performance changes after replacement for most works listed below. Note: under "Related APIs", the ref:: prefix refers to the cuco device-side API.

Part 1: Updates ready to be addressed in libcudf

Algorithm Desired Data Structure Related APIs PRs Notes
distinct_count cuco::static_set insert, insert_if ✅#13343
orc dictionary encoding (issue #10495) cuco::static_map (legacy) ref::insert, ref::find ✅#13580 Similar to parquet dictionary encoding but the current implementation is still using a custom dictionary
byte_pair_encoding cuco::static_map insert_async, ref::find ✅#13807 uses heterogeneous lookup
json_tree cuco::static_set insert_if_async, ref::find ✅#13928
search/contains_table cuco::static_set insert_async, insert_if_async, contains_async, contains_if_async ✅#14064 Needs migration from cudf::detail::unordered_multiset
search/contains_column cuco::static_set insert_async, insert_if_async, contains_async ✅#14238 Needs migration from cudf::detail::unordered_multiset. unordered_multiset can be removed once this is done.
hash-based groupby (issue #10401) cuco::static_set ref::insert_and_find, retrieve_all ✅#14813 Needs migration from concurrent_unordered_map
legacy json reader cuco::static_map (legacy) insert_async, ref::find ✅ #15813 ~Needs migration from concurrent_unordered_map. Looks like a rational use of hash map. To be replaced by cuco::static_map~ Deleted in #15813
distinct cuco::static_set insert_async retrieve_all, ref::insert_and_find ✅ #15984 We have work in flight to move this to cuco::static_map, but it is blocked on performance concerns ⚠️ #16484
parquet dictionary encoding cuco::static_map (experimental) ref::insert, ref::find ✅ #16541 Needs migration from cuco::static_map (legacy)
mixed_semi_joins cuco::static_set insert, insert_if, ref::contains ✅#16230 Needs migration from cuco::static_map (legacy)
semi/anti joins Refactoring not needed anymore, as they use contains
histogram cuco::static_set insert_async retrieve_all, ref::insert_and_find ✅ #16485
orc dictionary encoding cuco::static_map (experimental) ref::insert, ref::find ✅ #17049 Needs migration from cuco::static_map (legacy)

Part 2: Updates depending on upstream cuco changes

Algorithm Desired Data Structure Related APIs PRs Notes
joins cuco::static_multiset count, count_outer, retrieve, retrieve_outer Blocked by retrieve issues https://github.com/NVIDIA/cuCollections/issues/489 https://github.com/NVIDIA/cuCollections/issues/465 Needs migration from cuco::static_multimap (legacy) Related issues: #11176, #13116,
mixed joins cuco::static_multiset Blocked by retrieve issues https://github.com/NVIDIA/cuCollections/issues/489 https://github.com/NVIDIA/cuCollections/issues/465 Needs migration from cuco::static_multimap (legacy)

PointKernel avatar Nov 29 '22 23:11 PointKernel