cuCollections
cuCollections copied to clipboard
cuco::bloom_filter
Adds a new class called cuco::bloom_filter
for approximate set membership queries.
It is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed; the more items added, the larger the probability of false positives.
The type of implementation used here is known as a "partitioned" or "pattern-blocked" bloom filter.
This PR comes with examples, benchmarks, as well as unit tests.
Can one of the admins verify this patch?
ok to test
Meh, forgot that I don't have the permissions to fire up the CI.
This PR is ready to test and ready for review.
add to whitelist
okay to test
ok to test
add to whitelist
rerun tests
rerun tests
@sleeepyjack can you resolve conflicts?
@sleeepyjack can you resolve conflicts?
on it
Now revisiting this PR and giving it some love so we can get it merged.
what whaaat? The ubuntu18.04 runner is complaining. Investigating...
/workspace/benchmarks/bloom_filter/bloom_filter_bench.cu:107:1: note: candidate: template<class Key, class Slot, filter_op Op, filter_scope FilterScope, data_scope DataScope> void nvbench_cuco_bloom_filter(nvbench::state&, nvbench::type_list<Key, Slot, nvbench::enum_type<Op, decltype (Op)>, nvbench::enum_type<FilterScope, decltype (FilterScope)>, nvbench::enum_type<DataScope, decltype (DataScope)> >)
void nvbench_cuco_bloom_filter(nvbench::state& state,
^~~~~~~~~~~~~~~~~~~~~~~~~
/workspace/benchmarks/bloom_filter/bloom_filter_bench.cu:107:1: note: template argument deduction/substitution failed:
/workspace/benchmarks/bloom_filter/bloom_filter_bench.cu:277:163: note: mismatched types ‘data_scope’ and ‘filter_op’
NVBENCH_BENCH_TYPES(nvbench_cuco_bloom_filter,
I tinkered a bit more with different bit-pattern generators with the goal of improving the FPR and performance (there's obviously a trade-off between the two).
Here are the results:
-
plain double hashing
h1(k) + i * h2(k)
cost: evaluate two hash functions + N times the second term FPR: 0.0264 -
triple hashing
h1(k) + i * h2(k) + i*i * h3(k)
cost: evaluate three hash functions + N times the last two terms FPR: 0.0204 -
extended double hashing
h1(k) + i * h2(k) + ((i*i*i - i)/6)
cost: evaluate two hash functions + N times the last two terms FPR: 0.0236 -
best FPR approach
h1(k) + h2(k + i)
cost: evaluate h1 once and h2 N times (very expensive for long keys) FPR: 0.0194 restriction: operator+ for the term k+i must be available
Given these results, I think I will go with the extended double hashing or triple hashing approach then.
Wider Slot
types result in a better FPR. However, since cuda::atomic<__int128_t>::is_lock_free() == false
, the query throughput drops drastically.
I'm dropping the cuda::annotated_ptr
/cuda::apply_access_policy
strategy as the access policy is apparently not applied correctly (virtual no performance difference between L2-persistent and non-persistent filters). Thus, I'm rolling back to the old strategy, i.e., using the CUDA driver API.
Here are some benchmark results on A100 80GB L2-resident vs. non-resident filter:
KeyType | SlotType | FilterOperation | FilterScope | DataScope | NumInputs | NumBits | NumHashes | nv/filter/fpr | nv/filter/size/mb | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil | Samples | Batch GPU |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I32 | U64 | INSERT | GMEM | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 773x | 656.308 us | 1.37% | 647.584 us | 0.22% | 15.442G | 123.536 GB/s | 6.38% | 814x | 641.942 us |
I32 | U64 | INSERT | GMEM | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 81x | 6.194 ms | 1.50% | 6.185 ms | 1.49% | 16.168G | 129.345 GB/s | 6.68% | 85x | 6.163 ms |
I32 | U64 | INSERT | GMEM | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1289x | 396.613 us | 2.47% | 387.908 us | 1.02% | 25.779G | 206.235 GB/s | 10.66% | 1374x | 380.201 us |
I32 | U64 | INSERT | GMEM | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 171x | 2.940 ms | 0.87% | 2.932 ms | 0.81% | 34.111G | 272.888 GB/s | 14.10% | 180x | 2.940 ms |
I32 | U64 | INSERT | L2 | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1819x | 283.593 us | 3.21% | 274.877 us | 0.51% | 36.380G | 291.039 GB/s | 15.04% | 1896x | 269.990 us |
I32 | U64 | INSERT | L2 | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 201x | 2.505 ms | 0.74% | 2.496 ms | 0.64% | 40.060G | 320.483 GB/s | 16.56% | 202x | 2.519 ms |
I32 | U64 | INSERT | L2 | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1951x | 265.059 us | 3.47% | 256.315 us | 0.62% | 39.014G | 312.115 GB/s | 16.13% | 2031x | 251.270 us |
I32 | U64 | INSERT | L2 | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 217x | 2.316 ms | 0.39% | 2.307 ms | 0.04% | 43.341G | 346.728 GB/s | 17.92% | 227x | 2.302 ms |
I32 | U64 | CONTAINS | GMEM | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1793x | 287.897 us | 3.22% | 278.999 us | 0.46% | 35.842G | 286.740 GB/s | 14.82% | 1906x | 262.407 us |
I32 | U64 | CONTAINS | GMEM | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 192x | 2.621 ms | 0.64% | 2.612 ms | 0.54% | 38.282G | 306.254 GB/s | 15.82% | 199x | 2.617 ms |
I32 | U64 | CONTAINS | GMEM | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1831x | 282.127 us | 3.39% | 273.182 us | 0.85% | 36.606G | 292.845 GB/s | 15.13% | 1946x | 258.885 us |
I32 | U64 | CONTAINS | GMEM | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 197x | 2.552 ms | 0.38% | 2.543 ms | 0.13% | 39.323G | 314.587 GB/s | 16.25% | 207x | 2.528 ms |
I32 | U64 | CONTAINS | L2 | GMEM | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1873x | 276.017 us | 3.42% | 266.967 us | 0.43% | 37.458G | 299.662 GB/s | 15.48% | 1961x | 260.320 us |
I32 | U64 | CONTAINS | L2 | GMEM | 100000000 | 300000000 | 2 | 0.24194 | 37 | 194x | 2.593 ms | 0.72% | 2.584 ms | 0.63% | 38.706G | 309.651 GB/s | 16.00% | 196x | 2.599 ms |
I32 | U64 | CONTAINS | L2 | REGS | 10000000 | 300000000 | 2 | 0.0059597 | 37 | 1891x | 273.571 us | 3.51% | 264.540 us | 0.81% | 37.802G | 302.412 GB/s | 15.63% | 1939x | 258.965 us |
I32 | U64 | CONTAINS | L2 | REGS | 100000000 | 300000000 | 2 | 0.24194 | 37 | 198x | 2.543 ms | 0.36% | 2.534 ms | 0.06% | 39.458G | 315.667 GB/s | 16.31% | 204x | 2.529 ms |