[ENHANCEMENT]: Add hasher/comparator adaptor to better support mapping table use case
Is your feature request related to a problem? Please describe.
Many hash-based implementations rely on the mapping table method to handle large keys/values that are expensive to copy or move around. Every time users need to write the mapping table hasher/key_equal adaptors on their own, they shouldn't have to.
Describe the solution you'd like
As proposed by @esoha-nvidia, to better serve all users in this scenario, cuco could provide two handy utilities to the user:
mapping_table_hashwrapper constructed from a span (a pointer to the beginning of the original data and a size) and the original hashermapping_table_key_eqwrapper constructed from a span and the original key comparator
TODO:
- [ ] Add two utilities (probably in
include/cuco/utility/mapping_table.cuh) - [ ] Update the example accordingly
- [ ] Update godbolt link in README
Describe alternatives you've considered
Seeking help for better naming
- Is the
mapping_table_prefix descriptive enough? Any other suggestions? - STL also use
pred/predicateto refer to key comparators but I foundmapping_table_key_pred/mapping_table_predcan be misleading since we support conditional operations likeinsert_ifandcontains_ifwhere the condition is also named aspredicate mapping_table_equalgood enough?key_eqis the STL canonical but the_eqabbreviation is ambiguous.
The complication with this adaptor is that the key_equal function depends on the which table is in the input.
For example, say we are doing a bulk_insert followed by a bulk_lookup. And imagine, like in the case of a database, the insert is primary keys from one table and the lookup is foreign keys from another table.
When you do the insert, the key_equal function must perform primary_keys[index_to_insert] == primary_keys[slots_index]. But when we do the lookup, we must perform lookup_keys[index_to_lookup] == primary_keys[slots_index]. Above we have:
mapping_table_key_eqwrapper constructed from a pointer to the beginning of the original data, a size and the original key comparator
So which will be the "pointer to the beginning of the original data"? &primary_keys[0] or &lookup_keys[0]? It's impossible.
One solution would be, for example, to store both inside the equality functor and then make the index a tuple that indicates which table we should use for lookup. This is really wasteful of memory and also quite slow. The problem here is that when we are doing an insert versus a lookup, we know which table should be used for comparison, but that knowledge is thrown away because the equality operator is unable to use it.
I don't have a great solution for this, which is a major reason why I was unable to use cuCo. One possibility is to perhaps make an adapter around an existing hash table that would let you swap the equality operator. For example:
my_hash_table.bulk_insert(keys_to_insert); // Insert using default equality operator
mapping_table_key_eq lookup_table_eq(lookup_table); // Create a new equality operator.
my_hash_table.template with_eq<lookup_table_eq>(lookup_table_eq).bulk_lookup(lookup_table);
The with_eq function is creating a "view" of the original slots but with a new equality operator. The implementation is basically just a reinterpret_cast that modifies a template parameter, so it has no runtime cost. I can't think of a more elegant solution that still adheres to the STL hashing paradigm.
Here are some use cases
https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/src/join/join_common_utils.cuh#L47-L66 https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/include/cudf/detail/distinct_hash_join.cuh#L65-L77 https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/src/search/contains_table.cu#L44-L62 https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/src/text/bpe/byte_pair_encoding.cuh#L48-L73
As mentioned in slack, I would like to work on this issue