cuCollections icon indicating copy to clipboard operation
cuCollections copied to clipboard

cuco::bloom_filter

Open sleeepyjack opened this issue 2 years ago • 16 comments

Adds a new class called cuco::bloom_filter for approximate set membership queries.

It is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed; the more items added, the larger the probability of false positives.

The type of implementation used here is known as a "partitioned" or "pattern-blocked" bloom filter.

This PR comes with examples, benchmarks, as well as unit tests.

sleeepyjack avatar Aug 09 '21 01:08 sleeepyjack

Can one of the admins verify this patch?

GPUtester avatar Aug 09 '21 01:08 GPUtester

ok to test

sleeepyjack avatar Aug 18 '21 19:08 sleeepyjack

Meh, forgot that I don't have the permissions to fire up the CI.

This PR is ready to test and ready for review.

sleeepyjack avatar Aug 18 '21 19:08 sleeepyjack

add to whitelist

jrhemstad avatar Aug 19 '21 17:08 jrhemstad

okay to test

jrhemstad avatar Aug 19 '21 17:08 jrhemstad

ok to test

jrhemstad avatar Aug 19 '21 17:08 jrhemstad

add to whitelist

dillon-cullinan avatar Aug 19 '21 17:08 dillon-cullinan

rerun tests

sleeepyjack avatar Aug 19 '21 19:08 sleeepyjack

rerun tests

jrhemstad avatar Aug 24 '21 19:08 jrhemstad

@sleeepyjack can you resolve conflicts?

jrhemstad avatar Nov 01 '21 16:11 jrhemstad

@sleeepyjack can you resolve conflicts?

on it

sleeepyjack avatar Nov 14 '21 13:11 sleeepyjack

Now revisiting this PR and giving it some love so we can get it merged.

sleeepyjack avatar Jul 13 '22 14:07 sleeepyjack

what whaaat? The ubuntu18.04 runner is complaining. Investigating...


/workspace/benchmarks/bloom_filter/bloom_filter_bench.cu:107:1: note: candidate: template<class Key, class Slot, filter_op Op, filter_scope FilterScope, data_scope DataScope> void nvbench_cuco_bloom_filter(nvbench::state&, nvbench::type_list<Key, Slot, nvbench::enum_type<Op, decltype (Op)>, nvbench::enum_type<FilterScope, decltype (FilterScope)>, nvbench::enum_type<DataScope, decltype (DataScope)> >)
 void nvbench_cuco_bloom_filter(nvbench::state& state,
 ^~~~~~~~~~~~~~~~~~~~~~~~~
/workspace/benchmarks/bloom_filter/bloom_filter_bench.cu:107:1: note:   template argument deduction/substitution failed:
/workspace/benchmarks/bloom_filter/bloom_filter_bench.cu:277:163: note:   mismatched types ‘data_scope’ and ‘filter_op’
 NVBENCH_BENCH_TYPES(nvbench_cuco_bloom_filter,

sleeepyjack avatar Jul 13 '22 15:07 sleeepyjack

I tinkered a bit more with different bit-pattern generators with the goal of improving the FPR and performance (there's obviously a trade-off between the two).

Here are the results:

  • plain double hashing h1(k) + i * h2(k) cost: evaluate two hash functions + N times the second term FPR: 0.0264

  • triple hashing h1(k) + i * h2(k) + i*i * h3(k) cost: evaluate three hash functions + N times the last two terms FPR: 0.0204

  • extended double hashing h1(k) + i * h2(k) + ((i*i*i - i)/6) cost: evaluate two hash functions + N times the last two terms FPR: 0.0236

  • best FPR approach h1(k) + h2(k + i) cost: evaluate h1 once and h2 N times (very expensive for long keys) FPR: 0.0194 restriction: operator+ for the term k+i must be available

Given these results, I think I will go with the extended double hashing or triple hashing approach then.

sleeepyjack avatar Jul 18 '22 12:07 sleeepyjack

Bildschirmfoto 2022-07-29 um 14 21 22

Wider Slot types result in a better FPR. However, since cuda::atomic<__int128_t>::is_lock_free() == false, the query throughput drops drastically.

sleeepyjack avatar Jul 29 '22 13:07 sleeepyjack

I'm dropping the cuda::annotated_ptr/cuda::apply_access_policy strategy as the access policy is apparently not applied correctly (virtual no performance difference between L2-persistent and non-persistent filters). Thus, I'm rolling back to the old strategy, i.e., using the CUDA driver API.

Here are some benchmark results on A100 80GB L2-resident vs. non-resident filter:

KeyType SlotType FilterOperation FilterScope DataScope NumInputs NumBits NumHashes nv/filter/fpr nv/filter/size/mb Samples CPU Time Noise GPU Time Noise Elem/s GlobalMem BW BWUtil Samples Batch GPU
I32 U64 INSERT GMEM GMEM 10000000 300000000 2 0.0059597 37 773x 656.308 us 1.37% 647.584 us 0.22% 15.442G 123.536 GB/s 6.38% 814x 641.942 us
I32 U64 INSERT GMEM GMEM 100000000 300000000 2 0.24194 37 81x 6.194 ms 1.50% 6.185 ms 1.49% 16.168G 129.345 GB/s 6.68% 85x 6.163 ms
I32 U64 INSERT GMEM REGS 10000000 300000000 2 0.0059597 37 1289x 396.613 us 2.47% 387.908 us 1.02% 25.779G 206.235 GB/s 10.66% 1374x 380.201 us
I32 U64 INSERT GMEM REGS 100000000 300000000 2 0.24194 37 171x 2.940 ms 0.87% 2.932 ms 0.81% 34.111G 272.888 GB/s 14.10% 180x 2.940 ms
I32 U64 INSERT L2 GMEM 10000000 300000000 2 0.0059597 37 1819x 283.593 us 3.21% 274.877 us 0.51% 36.380G 291.039 GB/s 15.04% 1896x 269.990 us
I32 U64 INSERT L2 GMEM 100000000 300000000 2 0.24194 37 201x 2.505 ms 0.74% 2.496 ms 0.64% 40.060G 320.483 GB/s 16.56% 202x 2.519 ms
I32 U64 INSERT L2 REGS 10000000 300000000 2 0.0059597 37 1951x 265.059 us 3.47% 256.315 us 0.62% 39.014G 312.115 GB/s 16.13% 2031x 251.270 us
I32 U64 INSERT L2 REGS 100000000 300000000 2 0.24194 37 217x 2.316 ms 0.39% 2.307 ms 0.04% 43.341G 346.728 GB/s 17.92% 227x 2.302 ms
I32 U64 CONTAINS GMEM GMEM 10000000 300000000 2 0.0059597 37 1793x 287.897 us 3.22% 278.999 us 0.46% 35.842G 286.740 GB/s 14.82% 1906x 262.407 us
I32 U64 CONTAINS GMEM GMEM 100000000 300000000 2 0.24194 37 192x 2.621 ms 0.64% 2.612 ms 0.54% 38.282G 306.254 GB/s 15.82% 199x 2.617 ms
I32 U64 CONTAINS GMEM REGS 10000000 300000000 2 0.0059597 37 1831x 282.127 us 3.39% 273.182 us 0.85% 36.606G 292.845 GB/s 15.13% 1946x 258.885 us
I32 U64 CONTAINS GMEM REGS 100000000 300000000 2 0.24194 37 197x 2.552 ms 0.38% 2.543 ms 0.13% 39.323G 314.587 GB/s 16.25% 207x 2.528 ms
I32 U64 CONTAINS L2 GMEM 10000000 300000000 2 0.0059597 37 1873x 276.017 us 3.42% 266.967 us 0.43% 37.458G 299.662 GB/s 15.48% 1961x 260.320 us
I32 U64 CONTAINS L2 GMEM 100000000 300000000 2 0.24194 37 194x 2.593 ms 0.72% 2.584 ms 0.63% 38.706G 309.651 GB/s 16.00% 196x 2.599 ms
I32 U64 CONTAINS L2 REGS 10000000 300000000 2 0.0059597 37 1891x 273.571 us 3.51% 264.540 us 0.81% 37.802G 302.412 GB/s 15.63% 1939x 258.965 us
I32 U64 CONTAINS L2 REGS 100000000 300000000 2 0.24194 37 198x 2.543 ms 0.36% 2.534 ms 0.06% 39.458G 315.667 GB/s 16.31% 204x 2.529 ms

sleeepyjack avatar Aug 01 '22 17:08 sleeepyjack