
Enable cythonized range contains

Open cemoody opened this issue 7 years ago • 2 comments

The standard contains function in pybloof checks a single element at a time. Checking a range of elements means looping in Python, which is slow, so this PR adds a cythonized "range contains".

cemoody • Mar 30 '18 15:03

Curious as to what’s the use case here.

jhgg • Mar 31 '18 23:03

Currently, you can only check one item at a time. This PR effectively decompresses the bloom filter over a given range and turns it into a set of the values that test positive.

Specifically for me, it means I can filter a Pandas DataFrame using its internal isin operation.

For example:

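# b64 is the base64-serialized filter; df is a DataFrame with an integer id column
f = UIntBloomFilter.from_base64(b64.encode('ascii'))
# Collect the values in the range that the filter reports as present
_, uniques = f.uniques_in_range(0, df.id.max())
# Vectorized membership test against the collected values
flags = df.id.isin(uniques)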

...which is 10x faster than doing things item-by-item:

f = UIntBloomFilter.from_base64(b64.encode('ascii'))
flags = [idx in f for idx in df.id]
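To make the "decompress into a set" idea concrete, here is a minimal pure-Python sketch. It does not use pybloof: `ToyUIntBloomFilter` and `members_in_range` are hypothetical names invented for illustration, the hashing scheme is an arbitrary choice, and the inclusive upper bound is an assumption. It only shows that scanning a range once and then doing set lookups gives the same flags as testing each row against the filter.

```python
import hashlib


class ToyUIntBloomFilter:
    """Pure-Python stand-in for illustration; not pybloof's implementation."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)  # one byte per bit, for simplicity

    def _positions(self, value):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("ascii")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def __contains__(self, value):
        return all(self.bits[pos] for pos in self._positions(value))

    def members_in_range(self, start, stop):
        # Hypothetical helper: test every integer in [start, stop]
        # (inclusive, by assumption) and keep those that test positive.
        # The result may include false positives, as with any bloom filter.
        return {v for v in range(start, stop + 1) if v in self}


f = ToyUIntBloomFilter()
for item in (3, 17, 42):
    f.add(item)

ids = [3, 5, 17, 42, 99, 3]

# Range approach: one scan of the range, then cheap set lookups per row.
candidates = f.members_in_range(0, max(ids))
flags = [i in candidates for i in ids]

# Item-by-item approach: one filter lookup per row.
slow_flags = [i in f for i in ids]
```

Both approaches produce identical flags. The range scan only pays off when the range is not much larger than the number of rows, since it tests every integer in the range whether or not it appears in the data.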

Not sure if this actually merits merging upstream. If it doesn't, no worries; I figured I'd contribute it since I had it around anyway.

cemoody • Apr 01 '18 02:04