cuCollections cuco::bloom

cuco::bloom_filter

Open sleeepyjack opened this issue 2 years ago • 16 comments

Adds a new class called cuco::bloom_filter for approximate set membership queries.

It is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed; the more items added, the larger the probability of false positives.

The type of implementation used here is known as a "partitioned" or "pattern-blocked" bloom filter.

This PR comes with examples, benchmarks, as well as unit tests.

Aug 09 '21 01:08 sleeepyjack

Can one of the admins verify this patch?

Aug 09 '21 01:08 GPUtester

ok to test

Aug 18 '21 19:08 sleeepyjack

Meh, forgot that I don't have the permissions to fire up the CI.

This PR is ready to test and ready for review.

Aug 18 '21 19:08 sleeepyjack

add to whitelist

Aug 19 '21 17:08 jrhemstad

okay to test

Aug 19 '21 17:08 jrhemstad

ok to test

Aug 19 '21 17:08 jrhemstad

add to whitelist

Aug 19 '21 17:08 dillon-cullinan

rerun tests

Aug 19 '21 19:08 sleeepyjack

rerun tests

Aug 24 '21 19:08 jrhemstad

@sleeepyjack can you resolve conflicts?

Nov 01 '21 16:11 jrhemstad

@sleeepyjack can you resolve conflicts?

on it

Nov 14 '21 13:11 sleeepyjack

Now revisiting this PR and giving it some love so we can get it merged.

Jul 13 '22 14:07 sleeepyjack

what whaaat? The ubuntu18.04 runner is complaining. Investigating...


/workspace/benchmarks/bloom_filter/bloom_filter_bench.cu:107:1: note: candidate: template<class Key, class Slot, filter_op Op, filter_scope FilterScope, data_scope DataScope> void nvbench_cuco_bloom_filter(nvbench::state&, nvbench::type_list<Key, Slot, nvbench::enum_type<Op, decltype (Op)>, nvbench::enum_type<FilterScope, decltype (FilterScope)>, nvbench::enum_type<DataScope, decltype (DataScope)> >)
 void nvbench_cuco_bloom_filter(nvbench::state& state,
 ^~~~~~~~~~~~~~~~~~~~~~~~~
/workspace/benchmarks/bloom_filter/bloom_filter_bench.cu:107:1: note:   template argument deduction/substitution failed:
/workspace/benchmarks/bloom_filter/bloom_filter_bench.cu:277:163: note:   mismatched types ‘data_scope’ and ‘filter_op’
 NVBENCH_BENCH_TYPES(nvbench_cuco_bloom_filter,

Jul 13 '22 15:07 sleeepyjack

I tinkered a bit more with different bit-pattern generators with the goal of improving the FPR and performance (there's obviously a trade-off between the two).

Here are the results:

plain double hashing h1(k) + i * h2(k) cost: evaluate two hash functions + N times the second term FPR: 0.0264
triple hashing h1(k) + i * h2(k) + i*i * h3(k) cost: evaluate three hash functions + N times the last two terms FPR: 0.0204
extended double hashing h1(k) + i * h2(k) + ((i*i*i - i)/6) cost: evaluate two hash functions + N times the last two terms FPR: 0.0236
best FPR approach h1(k) + h2(k + i) cost: evaluate h1 once and h2 N times (very expensive for long keys) FPR: 0.0194 restriction: operator+ for the term k+i must be available

Given these results, I think I will go with the extended double hashing or triple hashing approach then.

Jul 18 '22 12:07 sleeepyjack

Bildschirmfoto 2022-07-29 um 14 21 22

Wider Slot types result in a better FPR. However, since cuda::atomic<__int128_t>::is_lock_free() == false, the query throughput drops drastically.

Jul 29 '22 13:07 sleeepyjack

I'm dropping the cuda::annotated_ptr/cuda::apply_access_policy strategy as the access policy is apparently not applied correctly (virtual no performance difference between L2-persistent and non-persistent filters). Thus, I'm rolling back to the old strategy, i.e., using the CUDA driver API.

Here are some benchmark results on A100 80GB L2-resident vs. non-resident filter:

KeyType	SlotType	FilterOperation	FilterScope	DataScope	NumInputs	NumBits	NumHashes	nv/filter/fpr	nv/filter/size/mb	Samples	CPU Time	Noise	GPU Time	Noise	Elem/s	GlobalMem BW	BWUtil	Samples	Batch GPU
I32	U64	INSERT	GMEM	GMEM	10000000	300000000	2	0.0059597	37	773x	656.308 us	1.37%	647.584 us	0.22%	15.442G	123.536 GB/s	6.38%	814x	641.942 us
I32	U64	INSERT	GMEM	GMEM	100000000	300000000	2	0.24194	37	81x	6.194 ms	1.50%	6.185 ms	1.49%	16.168G	129.345 GB/s	6.68%	85x	6.163 ms
I32	U64	INSERT	GMEM	REGS	10000000	300000000	2	0.0059597	37	1289x	396.613 us	2.47%	387.908 us	1.02%	25.779G	206.235 GB/s	10.66%	1374x	380.201 us
I32	U64	INSERT	GMEM	REGS	100000000	300000000	2	0.24194	37	171x	2.940 ms	0.87%	2.932 ms	0.81%	34.111G	272.888 GB/s	14.10%	180x	2.940 ms
I32	U64	INSERT	L2	GMEM	10000000	300000000	2	0.0059597	37	1819x	283.593 us	3.21%	274.877 us	0.51%	36.380G	291.039 GB/s	15.04%	1896x	269.990 us
I32	U64	INSERT	L2	GMEM	100000000	300000000	2	0.24194	37	201x	2.505 ms	0.74%	2.496 ms	0.64%	40.060G	320.483 GB/s	16.56%	202x	2.519 ms
I32	U64	INSERT	L2	REGS	10000000	300000000	2	0.0059597	37	1951x	265.059 us	3.47%	256.315 us	0.62%	39.014G	312.115 GB/s	16.13%	2031x	251.270 us
I32	U64	INSERT	L2	REGS	100000000	300000000	2	0.24194	37	217x	2.316 ms	0.39%	2.307 ms	0.04%	43.341G	346.728 GB/s	17.92%	227x	2.302 ms
I32	U64	CONTAINS	GMEM	GMEM	10000000	300000000	2	0.0059597	37	1793x	287.897 us	3.22%	278.999 us	0.46%	35.842G	286.740 GB/s	14.82%	1906x	262.407 us
I32	U64	CONTAINS	GMEM	GMEM	100000000	300000000	2	0.24194	37	192x	2.621 ms	0.64%	2.612 ms	0.54%	38.282G	306.254 GB/s	15.82%	199x	2.617 ms
I32	U64	CONTAINS	GMEM	REGS	10000000	300000000	2	0.0059597	37	1831x	282.127 us	3.39%	273.182 us	0.85%	36.606G	292.845 GB/s	15.13%	1946x	258.885 us
I32	U64	CONTAINS	GMEM	REGS	100000000	300000000	2	0.24194	37	197x	2.552 ms	0.38%	2.543 ms	0.13%	39.323G	314.587 GB/s	16.25%	207x	2.528 ms
I32	U64	CONTAINS	L2	GMEM	10000000	300000000	2	0.0059597	37	1873x	276.017 us	3.42%	266.967 us	0.43%	37.458G	299.662 GB/s	15.48%	1961x	260.320 us
I32	U64	CONTAINS	L2	GMEM	100000000	300000000	2	0.24194	37	194x	2.593 ms	0.72%	2.584 ms	0.63%	38.706G	309.651 GB/s	16.00%	196x	2.599 ms
I32	U64	CONTAINS	L2	REGS	10000000	300000000	2	0.0059597	37	1891x	273.571 us	3.51%	264.540 us	0.81%	37.802G	302.412 GB/s	15.63%	1939x	258.965 us
I32	U64	CONTAINS	L2	REGS	100000000	300000000	2	0.24194	37	198x	2.543 ms	0.36%	2.534 ms	0.06%	39.458G	315.667 GB/s	16.31%	204x	2.529 ms

Aug 01 '22 17:08 sleeepyjack

cuCollections cuCollections copied to clipboard

cuco::bloom_filter

cuCollections
cuCollections copied to clipboard