cuCollections icon indicating copy to clipboard operation
cuCollections copied to clipboard

[ENHANCEMENT]: Add hasher/comparator adaptor to better support mapping table use case

Open PointKernel opened this issue 1 year ago • 3 comments

Is your feature request related to a problem? Please describe.

Many hash-based implementations rely on the mapping table method to handle large keys/values that are expensive to copy or move around. Every time users need to write the mapping table hasher/key_equal adaptors on their own, they shouldn't have to.

Describe the solution you'd like

As proposed by @esoha-nvidia, to better serve all users in this scenario, cuco could provide two handy utilities to the user:

  1. mapping_table_hash wrapper constructed from a span (a pointer to the beginning of the original data and a size) and the original hasher
  2. mapping_table_key_eq wrapper constructed from a span and the original key comparator

TODO:

  • [ ] Add two utilities (probably in include/cuco/utility/mapping_table.cuh)
  • [ ] Update the example accordingly
  • [ ] Update godbolt link in README

Describe alternatives you've considered

Seeking help for better naming

  • Is the mapping_table_ prefix descriptive enough? Any other suggestions?
  • STL also use pred/predicate to refer to key comparators but I found mapping_table_key_pred/mapping_table_pred can be misleading since we support conditional operations like insert_if and contains_if where the condition is also named as predicate
  • mapping_table_equal good enough? key_eq is the STL canonical but the _eq abbreviation is ambiguous.

PointKernel avatar May 07 '24 01:05 PointKernel

The complication with this adaptor is that the key_equal function depends on the which table is in the input.

For example, say we are doing a bulk_insert followed by a bulk_lookup. And imagine, like in the case of a database, the insert is primary keys from one table and the lookup is foreign keys from another table.

When you do the insert, the key_equal function must perform primary_keys[index_to_insert] == primary_keys[slots_index]. But when we do the lookup, we must perform lookup_keys[index_to_lookup] == primary_keys[slots_index]. Above we have:

mapping_table_key_eq wrapper constructed from a pointer to the beginning of the original data, a size and the original key comparator

So which will be the "pointer to the beginning of the original data"? &primary_keys[0] or &lookup_keys[0]? It's impossible.

One solution would be, for example, to store both inside the equality functor and then make the index a tuple that indicates which table we should use for lookup. This is really wasteful of memory and also quite slow. The problem here is that when we are doing an insert versus a lookup, we know which table should be used for comparison, but that knowledge is thrown away because the equality operator is unable to use it.

I don't have a great solution for this, which is a major reason why I was unable to use cuCo. One possibility is to perhaps make an adapter around an existing hash table that would let you swap the equality operator. For example:

my_hash_table.bulk_insert(keys_to_insert); // Insert using default equality operator
mapping_table_key_eq lookup_table_eq(lookup_table); // Create a new equality operator.
my_hash_table.template with_eq<lookup_table_eq>(lookup_table_eq).bulk_lookup(lookup_table);

The with_eq function is creating a "view" of the original slots but with a new equality operator. The implementation is basically just a reinterpret_cast that modifies a template parameter, so it has no runtime cost. I can't think of a more elegant solution that still adheres to the STL hashing paradigm.

esoha-nvidia avatar May 07 '24 15:05 esoha-nvidia

Here are some use cases

https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/src/join/join_common_utils.cuh#L47-L66 https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/include/cudf/detail/distinct_hash_join.cuh#L65-L77 https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/src/search/contains_table.cu#L44-L62 https://github.com/rapidsai/cudf/blob/60d5717ba5b9a51cb031b506885a656e50199d22/cpp/src/text/bpe/byte_pair_encoding.cuh#L48-L73

esoha-nvidia avatar May 21 '24 17:05 esoha-nvidia

As mentioned in slack, I would like to work on this issue

njones93531 avatar Jun 03 '24 14:06 njones93531