cuco::bloom_filter (new)
Supersedes #101
Implementation of a GPU "Blocked Bloom Filter".
This PR is an updated/optimized version of #101 and features the following improvements:
- Incorporate the new library design
- Improve performance by computing the key's bit pattern based on a single hash value instead of using a double hashing derivative
TODOs:
- [x] Add docs
- [x] Add unit tests
- [x] Move implementation to .inl and maybe apply PIMPL principle (TBD once we settle on the final class signature, because it would be a PITA to apply changes after we split the definition from the declaration)
- [x] Evaluate and optimize performance on H100
- ~[ ] Implement test_and_add, i.e., a function that returns whether the key was already present before inserting it.~ not in this PR
- ~[ ] Add more ctor variants (based on desired FPR, etc.)~ not in this PR
- ~[ ] Add support for hashers that return a tuple- or array-like hash value~ not in this PR
CC @kkraus14
Thanks for throwing this up @sleeepyjack! I've pinged some folks on my side to take a pass at reviewing the host and device APIs as well as general functionality here in order to provide some high level feedback as a starting point.
Thanks for working on this. I know that we are really excited at the prospect of being able to use this to accelerate some of our workloads. I want to describe some of the ways we were hoping to use this.
- Test if sets are disjoint in a distributed context.
We would like to be able to serialize and deserialize the underlying data of the bloom filter so we can send it across the network, preferably in a way that doesn't require a copy of the structure, since we use libraries like UCX which can send directly from the GPU. Not being able to do this would probably make this unusable for us because of our distributed use case.
Another thing that might be considered, in particular for testing if sets are disjoint, is using partitioned bloom filters as described in these papers: "Understanding Bloom Filter Intersection for Lazy Address-Set Disambiguation" and "A Case for Partitioned Bloom Filters".
You might also consider a contains API that doesn't write output to an iterator but rather to a boolean indicating whether or not the bloom filter contains an element from the input being offered, and an API that can take two bloom filters and test if the sets that produced them are disjoint, so something like bloom_filter.contains(other_bloom_filter).
- Use bloom filters on large distributed joins to filter out rows from the tables in the join
Say you want to join two tables: one is 1TB, the other is 10GB. We would like to be able to make bloom filters from subsets of the 10GB table and then use those bloom filters to test if subsets of the 1TB table are disjoint with those subsets of the 10GB table.
There are two features we could benefit from to help enable this work.
One is that we build bloom filters for all subsets of the tables (these might be distributed across many nodes) and then shuffle these around to test which subset combinations can be ruled out for joining. So in this case it's the same API I mentioned above, where you can see if two bloom filters' inputs were disjoint.
The other is that when they are possibly not disjoint we could then apply the bloom filter row by row with APIs that seem to be already available in this PR.
- Be able to make bloom filters from multiple columns
Say someone is joining two tables on two columns, e.g. a.x = b.x and a.y = b.y
The bloom filter we would want to make in this case would be one where the input is a combination of x and y, rather than having to make two bloom filters, one for each column. This would be better at lowering the false positive rate than testing each column separately.
Thanks for the valuable insights, @felipeblazing !
I can address some of the points right away:
We would like to be able to serialize and deserialize the underlying data of the bloom filter so we can send it across the network, preferably in a way that doesn't require a copy of the structure
This should be doable. We did a similar thing for our distinct_count_estimator aka HyperLogLog++ implementation. tl;dr we provide access to the underlying memory span so it can be exported and serialized. Deserialization can be done by constructing a new _ref object over the raw byte vector. Here is an example of how this works with our HLL implementation.
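To sketch the pattern concretely (mirroring the HLL example): the member names data(), block_extent() and words_per_block, as well as the word type used below, are assumptions for illustration and not the final bloom_filter API.

#include <cuda_runtime.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Export: copy the filter's underlying words to the host for serialization, or
// hand the device pointer straight to UCX to avoid the extra copy.
template <class FilterType>  // FilterType: hypothetical placeholder for the filter class
std::vector<uint32_t> serialize(FilterType const& filter)
{
  size_t const num_words = filter.block_extent() * FilterType::words_per_block;
  std::vector<uint32_t> words(num_words);
  cudaMemcpy(words.data(), filter.data(), num_words * sizeof(uint32_t),
             cudaMemcpyDeviceToHost);
  return words;
}

// Import: copy the received words back into device memory and construct a
// non-owning ref over them; no rebuild from the original keys is required.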
Another thing that might be considered is that in particular for testing if sets are disjoint is considering using partitioned bloom filters
Thanks for sharing those papers. I don't have access permissions, so I requested them. From my rough understanding, a partitioned Bloom filter stores the signature for each key in k different memory locations that are likely to be far apart from each other. I have some concerns about whether this scales well on the GPU. With the current blocked Bloom filter approach we aim to minimize the number of sector loads/stores as much as possible (in the best case the entire fingerprint of a key falls into a single sector or even a single word), since the data structure is strictly bottlenecked by the random-access performance of the memory system. I will dive through the papers to see if there's something we can do to allow for distributed workloads.
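For intuition, here is a minimal host-side sketch (illustration only, not cuco code; all names are hypothetical) of the access-pattern difference between a classic/partitioned filter and the blocked filter in this PR:

#include <array>
#include <cstdint>
#include <vector>

constexpr int k = 8;  // bits set per key ("pattern bits")

// Classic filter: each of the k bits may land in a different, far-apart word,
// so a single add/query can cost up to k random memory transactions.
void classic_add(std::vector<uint32_t>& words, std::array<uint64_t, k> const& hashes)
{
  for (int i = 0; i < k; ++i) {
    auto const bit = hashes[i] % (words.size() * 32);
    words[bit / 32] |= 1u << (bit % 32);
  }
}

// Blocked filter: one hash selects a small block and all k bits are set inside
// it, so in the best case the key's whole fingerprint lives in one 32-byte sector.
void blocked_add(std::vector<std::array<uint32_t, 8>>& blocks, uint64_t hash)
{
  auto& block = blocks[hash % blocks.size()];
  for (int i = 0; i < k; ++i) {
    auto const bit = (hash >> (8 * i)) % 256;  // naive bit selection, for illustration only
    block[bit / 32] |= 1u << (bit % 32);
  }
}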
You might also consider a contains API that doesn't write output to an iterator but rather to a boolean indicating whether or not the bloom filter contains an element from the input being offered, and an API that can take two bloom filters and test if the sets that produced them are disjoint, so something like bloom_filter.contains(other_bloom_filter).
Yep, this one should be easy to implement. Naming-wise I would go with something like contains_any(…) but that's just an idea.
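Until such an API exists, contains_any can be emulated on top of the bulk contains call with a Thrust reduction. A rough sketch; the filter.contains signature below is assumed from this PR and may differ:

#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>

template <class Filter, class KeyIt>
bool contains_any(Filter const& filter, KeyIt first, KeyIt last)
{
  auto const n = static_cast<size_t>(thrust::distance(first, last));
  thrust::device_vector<bool> flags(n);
  filter.contains(first, last, flags.begin());  // bulk query, one bool per key (assumed API)
  // Reduce the per-key booleans to a single "is any key present?" flag.
  return thrust::reduce(flags.begin(), flags.end(), false, thrust::logical_or<bool>{});
}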
Be able to make bloom filters from multiple columns
I think this would already work by customizing the hash function similar to what cudf does with their row_hasher. You could combine the hash values for each individual column like so: h_a = hash(a.x)^hash(a.y). If a.x = b.x and a.y = b.y then h_a = h_b. Although I think cudf's row hasher does something smarter than that.
Although I think cudf's row hasher does something smarter than that.
cuDF has 32-bit and 64-bit hash_combine implementations. Usually, the crucial thing here is that you need a non-commutative function so that combine(left, right) and combine(right, left) give different results.
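For illustration, a Boost-style hash_combine captures the idea; this is a generic sketch, not cuDF's exact implementation:

#include <cstdint>

// The asymmetric mixing of `seed` is what breaks commutativity:
// hash_combine(h1, h2) != hash_combine(h2, h1) in general.
__host__ __device__ constexpr uint32_t hash_combine(uint32_t seed, uint32_t h)
{
  return seed ^ (h + 0x9e3779b9u + (seed << 6) + (seed >> 2));
}

// Multi-column usage from the discussion above:
//   auto const h_row = hash_combine(hash(a.x), hash(a.y));
// Two rows agree on h_row whenever they agree on (x, y) in the same column order.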
Another thing that might be considered, in particular for testing if sets are disjoint, is using partitioned bloom filters as described in these papers: "Understanding Bloom Filter Intersection for Lazy Address-Set Disambiguation" and "A Case for Partitioned Bloom Filters"
Hi @felipeblazing! This looks to me, based on the code as well as the PR message, like it is based on Apache Impala's Bloom filters, which in turn infiltrated Apache Arrow, Apache Parquet, and so on - at least as long as pattern_bits_ is no more than window_size. I mention this because the Impala filters actually are partitioned Bloom filters - but partitioned within a small contiguous block, with each partition being the size of a single word. (See also Section 3 of Performance-Optimal Filtering: Bloom Overtakes Cuckoo at High Throughput, which calls this partitioning within a block "sectorization").
On the other hand, when pattern_bits_ is greater than window_size (which looks like the default, since window_size defaults to 1), this is only partially sectorized/partitioned - multiple bits will be set in some or all partitions. For a window_size of 1, this effectively means no partitioning.
For SIMD on CPU, sectorized/partitioned with pattern_bits_ == window_size == 8 was the right choice for Impala's performance, as it then accessed only 256 bits (thus fitting within one cache line), but I'm guessing that window_size == 1 has better performance for Nvidia GPUs, based on much smaller (32 bit?) cache lines (or the GPU equivalent)? I don't know enough about GPU hardware yet to know that. However, if more cache locality is available, the extent to which the filter is sectorized/partitioned can be increased by increasing window_size while keeping pattern_bits_ the same. This would also reduce the false positive probability without increasing the total filter size.
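To make "sectorized" concrete: with pattern_bits_ == window_size == 8, each 32-bit word of the block receives exactly one bit. A minimal sketch (hypothetical code, not this PR's implementation):

#include <array>
#include <cstdint>

// Fully sectorized pattern: word i of the 8-word (256-bit) block gets exactly
// one bit, chosen by a 5-bit slice of the hash value.
void sectorized_add(std::array<uint32_t, 8>& block, uint64_t hash)
{
  for (int i = 0; i < 8; ++i) {
    auto const bit = (hash >> (5 * i)) & 31u;  // 5 bits select a position within word i
    block[i] |= 1u << bit;
  }
}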
Impala filters actually are partitioned Bloom filters - but partitioned within a small contiguous block, with each partition being the size of a single word.
Very interesting point! Yes, you could apply the same approach to the blocked (or sectorized) Bloom filter in this PR.
For SIMD on CPU, sectorized/partitioned with pattern_bits_ == window_size == 8 was the right choice for Impala's performance, as it then accessed only 256 bits (thus fitting within one cache line), but I'm guessing that window_size == 1 has better performance for Nvidia GPUs, based on much smaller (32 bit?) cache lines (or the GPU equivalent)?
GPUs follow a similar principle where the GPU's cache line size is either 32 bytes (a sector) or 4*32 bytes aka an L2 slice, depending on how you look at it. In 9332c9a I was able to fix some performance issues for when the window_size > 1. I owe you folks some new benchmarks, but my initial observation was that with this fix performance stays more or less constant (close to SOL GUPS throughput) up to the point where the window_size exceeds the size of a single sector. So yeah, having a partitioned/sectorized filter with good performance is totally feasible with the current design.
…performance stays more or less constant (close to SOL GUPS throughput) up to the point where the window_size exceeds the size of a single sector.
Would it make sense, then, for window_size to be min(8, pattern_bits_) by default as long as pattern_bits_ is a power of 2 (since sizeof(word_type) == 4 and a sector is 8 * 4 = 32 bytes?), rather than 1 by default?
With some other small changes, this could also support using the Bloom filters from Parquet, KVRocks, and Impala (and forks Doris and StarRocks) without having to read the input keys and re-encode a new Bloom filter. I can imagine this might be of use mostly in the Parquet case.
Hey folks,
I ran some new benchmarks on H100 and wanted to share the results: bloom_filter_h100.csv
I ran the outer product of all config dims, so the result file is quite large. It would be great if someone more data-scientist-esque than me could figure out how to make sense of/visualize this pile of data. For now I'm going with a default filter block extent of 32B (4*8B), but the benchmark tests different setups as well.
A few more notes regarding the benchmark data:
- Input size (aka number of keys) is fixed and might be slightly too large to give meaningful information about the FPR. However, if I make the input smaller, the kernels run in <1ms on H100, which adds a ton of noise.
- I added some CUPTI measurements for bandwidth utilization. Sometimes the value exceeds 100%, which is concerning, but my first guess is that this happens when the filter fits into L2$, which has a significantly higher bandwidth compared to DRAM.
I just looked into how we could support functional parity with parquet/impala/etc.
The key differences between all these implementations are:
- how hash values are generated aka which hasher is used
- how to determine which block a key is added into based on the hash value
- how the bit pattern for each word in the block is generated from the hash value
These last two points are currently not accessible to the user as customization points.
We could introduce a new class tparam that takes a policy struct which could look like this:
struct parquet_filter_policy {
template <class HashValue>
__device__ auto block_idx(HashValue hash) const { /* use parquet impl */ }
template <class HashValue>
__device__ auto pattern_word(HashValue hash, uint32_t pattern_bits, uint32_t word_idx) { /* generates the bit pattern for block[word_idx] */ }
};
There are a few problems with this "lightweight" policy approach:
- Both Parquet and Impala use a hardcoded hash function. Since the Hash function parameter is not part of the policy, it's the user's responsibility to select the correct hasher for this policy.
- Since the policy doesn't know about the extent_type::size_type, it has to assume the worst case and return a size_t, which defeats the purpose of the extent concept.
We could instead implement a "heavy" policy type that basically wraps most of the existing tparams of the bloom_filter class.
For the Parquet policy this would look like this (notice the hardcoded values):
template <class Key>
struct parquet_filter_policy {
using key_type = Key;
using hasher = cuco::xxhash_64<key_type>; // hardcoded
using hash_value_type = uint64_t; // hardcoded
using word_type = uint32_t; // hardcoded
static constexpr auto words_per_block = 8; // hardcoded
__device__ hash_value_type hash(Key const& key) const { /* produce a hash value */ }
template <class Extent>
__device__ auto block_idx(hash_value_type hash, Extent num_blocks) const { /* use parquet impl */ }
__device__ word_type pattern_word(hash_value_type hash, uint32_t pattern_bits, uint32_t word_idx) const { /* generates the bit pattern for block[word_idx] */ }
hasher hash_;
};
Our current setup allows for more customization points and the policy could look like this:
template <class Key,
class Hash = cuco::default_hash_function<Key>,
class Block = cuda::std::array<uint32_t, 8>>
struct default_filter_policy {
using key_type = Key;
using hasher = Hash;
using hash_value_type = typename hasher::result_type;
using word_type = typename Block::value_type;
static constexpr auto words_per_block = cuda::std::tuple_size_v<Block>;
__device__ default_filter_policy(Hash hash = {}) : hash_{hash} {}
__device__ hash_value_type hash(Key const& key) const { return this->hash_(key); }
template <class Extent>
__device__ auto block_idx(hash_value_type hash, Extent num_blocks) const { return hash % num_blocks; }
__device__ word_type pattern_word(hash_value_type hash, uint32_t word_idx) const { /* ... */ }
hasher hash_;
};
The filter class template would then look like this:
template <class Key,
class Extent = cuco::extent<size_t>,
class Policy = default_filter_policy<Key>,
class Allocator = cuco::cuda_allocator<byte>>
class bloom_filter { ... };
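For illustration, instantiating the sketched design might then look like this (hypothetical usage; the constructor argument is an assumption, not part of the sketch above):

size_t const num_blocks = 1'000'000;

// Drop-in Parquet/Impala-compatible layout via the hardcoded policy:
using parquet_filter =
  cuco::bloom_filter<int64_t, cuco::extent<size_t>, parquet_filter_policy<int64_t>>;
parquet_filter filter{num_blocks};

// Default configuration; customization happens via the policy's template parameters:
using default_filter = cuco::bloom_filter<int64_t>;  // uses default_filter_policy<int64_t>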
Is this design valid or can it be improved?
edit: removed Extent from policy
Is this design valid or can it be improved?
One change I might make: To put extent_ in a policy seems unusual to me. I'd prefer it as a parameter to some_filter_policy::block_idx.
Another question: what is the pattern_bits parameter in parquet_filter_policy::pattern_word()?
One change I might make: To put extent_ in a policy seems unusual to me. I'd prefer it as a parameter to some_filter_policy::block_idx
Good point. The extent type is orthogonal to the policy. I have updated the example accordingly.
Another question: what is the pattern_bits parameter in parquet_filter_policy::pattern_word()?
That's the bit cardinality of a key's signature aka the number of bits set in a block. If we have N words per block then each word will have pattern_bits / N bits set for this key.
Another question: what is the pattern_bits parameter in parquet_filter_policy::pattern_word()?
That's the bit cardinality of a key's signature aka the number of bits set in a block. If we have N words per block then each word will have pattern_bits / N bits set for this key.
So pattern_bits is equal to words_per_block which is equal to 8 in Parquet/Impala/etc, right? Maybe in the example that parameter should be on default_filter_policy::pattern_word() instead of parquet_filter_policy::pattern_word()?
So pattern_bits is equal to words_per_block which is equal to 8 in Parquet/Impala/etc, right? Maybe in the example that parameter should be on default_filter_policy::pattern_word() instead of parquet_filter_policy::pattern_word()?
Whups, yep, that was me being incapable of copy-pasting the parameter list. Sorry about that.
My initial idea was that the pattern_bits param is not part of the policy but rather the bloom_filter ctor. However, your comment made me think. For Parquet/Impala this param is fixed. Allowing it to be set to a different value than what the original implementation allows would be a proper footgun. So maybe it should be part of the policy instead. Here is the updated example:
template <class Key>
struct parquet_filter_policy {
using key_type = Key;
using hasher = cuco::xxhash_64<key_type>; // hardcoded
using hash_value_type = uint64_t; // hardcoded
using word_type = uint32_t; // hardcoded
static constexpr auto words_per_block = 8; // hardcoded
/* static constexpr auto pattern_bits = words_per_block; // implicitly hardcoded */
__device__ hash_value_type hash(Key const& key) const { /* produce a hash value */ }
template <class Extent>
__device__ auto block_idx(hash_value_type hash, Extent num_blocks) const { /* use parquet impl */ }
__device__ word_type pattern_word(hash_value_type hash, uint32_t word_idx) const { /* generates the bit pattern for block[word_idx] */ }
hasher hash_;
};
template <class Key,
class Hash = cuco::default_hash_function<Key>,
class Block = cuda::std::array<uint32_t, 8>>
struct default_filter_policy {
using key_type = Key;
using hasher = Hash;
using hash_value_type = typename hasher::result_type;
using word_type = typename Block::value_type;
static constexpr auto words_per_block = cuda::std::tuple_size_v<Block>;
__device__ default_filter_policy(uint32_t pattern_bits = words_per_block, Hash hash = {}) : pattern_bits_{pattern_bits}, hash_{hash} {}
__device__ hash_value_type hash(Key const& key) const { return this->hash_(key); }
template <class Extent>
__device__ auto block_idx(hash_value_type hash, Extent num_blocks) const { return hash % num_blocks; }
__device__ word_type pattern_word(hash_value_type hash, uint32_t word_idx) const { /* ... */ }
uint32_t pattern_bits_;
hasher hash_;
};
LGTM! :+1:
The PR is ready for another round of reviews (thanks in advance!). I addressed most of the previous suggestions and implemented the discussed policy concept.
Forgot to post some final benchmark results:
# Benchmark Results
## bloom_filter_add_unique_size
### [0] NVIDIA H100 80GB HBM3
| Key | Hash | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 1 | 1 | 3.69% | 100.00% | 0.00% | 0.00% | 86.45% | 3x | 20x | 26.092 ms | 0.09% | 26.086 ms | 0.09% | 15.334G | 21x | 26.085 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 2 | 1 | 3.69% | 100.00% | 0.00% | 0.00% | 86.45% | 3x | 20x | 26.090 ms | 0.05% | 26.085 ms | 0.04% | 15.335G | 21x | 26.085 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 4 | 1 | 3.69% | 100.00% | 0.00% | 0.00% | 86.45% | 3x | 20x | 26.097 ms | 0.07% | 26.091 ms | 0.06% | 15.331G | 21x | 26.078 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 8 | 1 | 3.70% | 100.00% | 0.00% | 0.00% | 86.43% | 3x | 20x | 26.097 ms | 0.09% | 26.091 ms | 0.09% | 15.331G | 21x | 26.079 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 16 | 1 | 3.71% | 100.00% | 0.00% | 0.00% | 86.41% | 3x | 20x | 26.103 ms | 0.07% | 26.097 ms | 0.07% | 15.327G | 21x | 26.079 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 32 | 0.99997 | 12.81% | 100.00% | 0.00% | 0.00% | 85.57% | 3x | 19x | 26.896 ms | 0.05% | 26.890 ms | 0.05% | 14.875G | 20x | 26.876 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 64 | 0.99228 | 25.18% | 100.00% | 0.00% | 0.00% | 55.47% | 3x | 17x | 29.564 ms | 0.07% | 29.558 ms | 0.07% | 13.533G | 18x | 29.548 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 128 | 0.85593 | 29.53% | 100.00% | 0.00% | 0.00% | 41.13% | 3x | 16x | 31.391 ms | 0.04% | 31.385 ms | 0.04% | 12.745G | 17x | 31.375 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 256 | 0.47041 | 30.19% | 100.00% | 0.00% | 0.00% | 35.93% | 3x | 16x | 32.932 ms | 0.02% | 32.927 ms | 0.01% | 12.148G | 17x | 32.921 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 512 | 0.15322 | 30.11% | 100.00% | 0.00% | 0.00% | 33.68% | 3x | 15x | 33.999 ms | 0.03% | 33.994 ms | 0.02% | 11.767G | 16x | 33.986 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 1024 | 0.035256 | 29.98% | 100.00% | 0.00% | 0.00% | 32.62% | 3x | 15x | 34.610 ms | 0.02% | 34.604 ms | 0.02% | 11.559G | 16x | 34.601 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 2048 | 0.012553 | 29.88% | 100.00% | 0.00% | 0.00% | 32.11% | 3x | 15x | 34.930 ms | 0.02% | 34.924 ms | 0.01% | 11.453G | 16x | 34.919 ms |
## bloom_filter_add_unique_hash
### [0] NVIDIA H100 80GB HBM3
| Key | Hash | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | Samples | Batch GPU |
|-----|---------------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | identity_hash | U32 | 8 | UNIQUE | 400000000 | 2000 | 0 | 30.84% | 100.00% | 0.00% | 0.00% | 29.35% | 3x | 15x | 33.856 ms | 0.02% | 33.851 ms | 0.01% | 11.817G | 16x | 33.847 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 2000 | 0.0017157 | 29.89% | 100.00% | 0.00% | 0.00% | 32.12% | 3x | 15x | 34.927 ms | 0.02% | 34.921 ms | 0.02% | 11.454G | 16x | 34.915 ms |
## bloom_filter_add_unique_block_dim
### [0] NVIDIA H100 80GB HBM3
| Key | Hash | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 | U32 | 1 | UNIQUE | 400000000 | 2000 | 0.53373 | 33.84% | 50.00% | 0.00% | 16.59% | 27.51% | 3x | 17x | 30.846 ms | 0.02% | 30.841 ms | 0.01% | 12.970G | 18x | 30.837 ms |
| I64 | xxhash_64 | U32 | 2 | UNIQUE | 400000000 | 2000 | 0.78258 | 33.78% | 100.00% | 0.00% | 0.00% | 27.39% | 3x | 17x | 30.886 ms | 0.02% | 30.881 ms | 0.01% | 12.953G | 18x | 30.880 ms |
| I64 | xxhash_64 | U32 | 4 | UNIQUE | 400000000 | 2000 | 0.95274 | 33.51% | 100.00% | 0.00% | 0.00% | 27.23% | 3x | 17x | 31.140 ms | 0.02% | 31.135 ms | 0.01% | 12.847G | 18x | 31.134 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 2000 | 0.0017157 | 29.89% | 100.00% | 0.00% | 0.00% | 32.13% | 3x | 15x | 34.920 ms | 0.02% | 34.914 ms | 0.02% | 11.457G | 16x | 34.910 ms |
| I64 | xxhash_64 | U64 | 1 | UNIQUE | 400000000 | 2000 | 0.78258 | 33.83% | 50.00% | 0.00% | 16.58% | 27.51% | 3x | 17x | 30.843 ms | 0.02% | 30.838 ms | 0.01% | 12.971G | 18x | 30.834 ms |
| I64 | xxhash_64 | U64 | 2 | UNIQUE | 400000000 | 2000 | 0.95274 | 33.79% | 100.00% | 0.00% | 0.00% | 27.39% | 3x | 17x | 30.891 ms | 0.02% | 30.885 ms | 0.01% | 12.951G | 18x | 30.884 ms |
| I64 | xxhash_64 | U64 | 4 | UNIQUE | 400000000 | 2000 | 0.17364 | 33.52% | 100.00% | 0.00% | 0.00% | 27.22% | 3x | 17x | 31.138 ms | 0.02% | 31.132 ms | 0.01% | 12.848G | 18x | 31.131 ms |
| I64 | xxhash_64 | U64 | 8 | UNIQUE | 400000000 | 2000 | 0.00027879 | 41.92% | 100.00% | 0.00% | 0.00% | 32.92% | 3x | 14x | 38.398 ms | 0.02% | 38.392 ms | 0.01% | 10.419G | 15x | 38.387 ms |
## bloom_filter_contains_unique_size
### [0] NVIDIA H100 80GB HBM3
| Key | Hash | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 1 | 1 | 21.05% | 55.56% | 100.00% | 47.39% | 72.33% | 3x | 98x | 5.130 ms | 0.19% | 5.125 ms | 0.15% | 78.056G | 102x | 5.120 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 2 | 1 | 21.05% | 55.56% | 100.00% | 45.97% | 73.02% | 3x | 98x | 5.129 ms | 0.12% | 5.123 ms | 0.06% | 78.073G | 102x | 5.120 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 4 | 1 | 21.04% | 55.56% | 100.00% | 45.33% | 73.32% | 3x | 98x | 5.130 ms | 0.12% | 5.124 ms | 0.05% | 78.058G | 102x | 5.121 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 8 | 1 | 21.08% | 55.56% | 100.00% | 45.03% | 73.45% | 3x | 98x | 5.132 ms | 0.13% | 5.126 ms | 0.07% | 78.033G | 102x | 5.120 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 16 | 1 | 22.74% | 55.56% | 100.00% | 44.88% | 72.28% | 3x | 98x | 5.141 ms | 0.13% | 5.135 ms | 0.07% | 77.890G | 102x | 5.127 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 32 | 0.99997 | 65.42% | 55.56% | 100.00% | 44.81% | 42.71% | 3x | 91x | 5.514 ms | 0.11% | 5.509 ms | 0.05% | 72.610G | 95x | 5.555 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 64 | 0.99228 | 70.30% | 55.56% | 100.00% | 44.72% | 18.46% | 3x | 55x | 9.176 ms | 0.07% | 9.170 ms | 0.03% | 43.620G | 57x | 9.159 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 128 | 0.85593 | 66.46% | 55.56% | 100.00% | 44.69% | 9.75% | 3x | 44x | 11.554 ms | 0.05% | 11.548 ms | 0.02% | 34.638G | 45x | 11.537 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 256 | 0.47041 | 64.21% | 55.56% | 100.00% | 44.67% | 9.20% | 3x | 40x | 12.800 ms | 0.05% | 12.794 ms | 0.01% | 31.264G | 41x | 12.785 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 512 | 0.15322 | 63.04% | 55.56% | 100.00% | 44.66% | 10.32% | 3x | 38x | 13.436 ms | 0.04% | 13.430 ms | 0.02% | 29.784G | 39x | 13.421 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 1024 | 0.035256 | 62.46% | 55.56% | 100.00% | 44.66% | 11.13% | 3x | 37x | 13.756 ms | 0.05% | 13.750 ms | 0.02% | 29.091G | 38x | 13.741 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 2048 | 0.012553 | 62.19% | 55.56% | 100.00% | 44.66% | 11.59% | 3x | 36x | 13.916 ms | 0.04% | 13.911 ms | 0.02% | 28.755G | 37x | 13.903 ms |
## bloom_filter_contains_unique_hash
### [0] NVIDIA H100 80GB HBM3
| Key | Hash | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | Samples | Batch GPU |
|-----|---------------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | identity_hash | U32 | 8 | UNIQUE | 400000000 | 2000 | 0 | 62.18% | 55.56% | 100.00% | 44.64% | 11.53% | 3x | 36x | 13.920 ms | 0.04% | 13.914 ms | 0.02% | 28.747G | 37x | 13.905 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 2000 | 0.0017157 | 62.15% | 55.56% | 100.00% | 44.64% | 11.51% | 3x | 36x | 13.913 ms | 0.04% | 13.907 ms | 0.02% | 28.762G | 37x | 13.898 ms |
## bloom_filter_contains_unique_block_dim
### [0] NVIDIA H100 80GB HBM3
| Key | Hash | Word | WordsPerBlock | Distribution | NumInputs | FilterSizeMB | FalsePositiveRate | HBWPeak | LoadEff | StoreEff | L1HitRate | L2HitRate | Samples | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | Samples | Batch GPU |
|-----|-----------|------|---------------|--------------|-----------|--------------|-------------------|---------|---------|----------|-----------|-----------|---------|---------|-----------|-------|-----------|-------|---------|---------|-----------|
| I64 | xxhash_64 | U32 | 1 | UNIQUE | 400000000 | 2000 | 0.53373 | 62.17% | 30.00% | 100.00% | 1.73% | 11.62% | 3x | 36x | 13.916 ms | 0.05% | 13.911 ms | 0.02% | 28.755G | 37x | 13.902 ms |
| I64 | xxhash_64 | U32 | 2 | UNIQUE | 400000000 | 2000 | 0.78258 | 62.17% | 40.00% | 100.00% | 1.74% | 11.61% | 3x | 36x | 13.913 ms | 0.05% | 13.908 ms | 0.02% | 28.761G | 37x | 13.899 ms |
| I64 | xxhash_64 | U32 | 4 | UNIQUE | 400000000 | 2000 | 0.95274 | 62.14% | 60.00% | 100.00% | 1.74% | 11.54% | 3x | 36x | 13.915 ms | 0.05% | 13.909 ms | 0.02% | 28.758G | 37x | 13.900 ms |
| I64 | xxhash_64 | U32 | 8 | UNIQUE | 400000000 | 2000 | 0.0017157 | 62.17% | 55.56% | 100.00% | 44.65% | 11.52% | 3x | 36x | 13.912 ms | 0.04% | 13.906 ms | 0.01% | 28.764G | 37x | 13.898 ms |
| I64 | xxhash_64 | U64 | 1 | UNIQUE | 400000000 | 2000 | 0.78258 | 62.18% | 40.00% | 100.00% | 1.74% | 11.62% | 3x | 36x | 13.916 ms | 0.05% | 13.910 ms | 0.03% | 28.756G | 37x | 13.903 ms |
| I64 | xxhash_64 | U64 | 2 | UNIQUE | 400000000 | 2000 | 0.95274 | 62.20% | 60.00% | 100.00% | 1.75% | 11.66% | 3x | 36x | 13.913 ms | 0.04% | 13.908 ms | 0.02% | 28.761G | 37x | 13.897 ms |
| I64 | xxhash_64 | U64 | 4 | UNIQUE | 400000000 | 2000 | 0.17364 | 62.14% | 55.56% | 100.00% | 44.63% | 11.64% | 3x | 36x | 13.913 ms | 0.04% | 13.907 ms | 0.01% | 28.762G | 37x | 13.898 ms |
| I64 | xxhash_64 | U64 | 8 | UNIQUE | 400000000 | 2000 | 0.00027879 | 60.65% | 52.94% | 50.00% | 47.42% | 8.38% | 3x | 36x | 14.264 ms | 0.04% | 14.258 ms | 0.01% | 28.054G | 37x | 14.249 ms |
(includes changes from #609)