pybloof
Enable cythonized range contains
The standard contains check in pybloof tests a single element at a time. Checking a whole range of elements means looping in Python, which is slow, so this PR adds a cythonized "range contains".
Curious as to what’s the use case here.
Currently, you can only check one item at a time. This PR effectively decompresses the Bloom filter over a given range and turns it into a set of the values the filter reports as present.
Specifically for me, it means I can filter a Pandas DataFrame using its internal isin operation.
For example:
f = UIntBloomFilter.from_base64(b64.encode('ascii'))
_, uniques = f.uniques_in_range(0, df.id.max())
flags = df.id.isin(uniques)
...which is 10x faster than doing things item-by-item:
f = UIntBloomFilter.from_base64(b64.encode('ascii'))
flags = [idx in f for idx in df.id]
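To make the equivalence of the two approaches concrete, here is a minimal pure-Python sketch. `ToyBloomFilter` is a hypothetical stand-in, not pybloof's implementation, and the inclusive range and the `(count, members)` return shape of its `uniques_in_range` are assumptions made for illustration. It uses a plain set in place of Pandas' `isin`, and it only shows that both paths give the same flags; it says nothing about speed.

```python
import hashlib


class ToyBloomFilter:
    """Hypothetical stand-in for a Bloom filter over unsigned ints (not pybloof)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("ascii")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    def uniques_in_range(self, start, stop):
        # Assumed semantics: inclusive range, returns (count, members).
        members = [i for i in range(start, stop + 1) if i in self]
        return len(members), members


f = ToyBloomFilter()
for value in (3, 17, 42):
    f.add(value)

ids = [1, 3, 17, 20, 42, 42, 99]

# Batch approach: enumerate the filter's members once, then do set lookups.
_, uniques = f.uniques_in_range(0, max(ids))
unique_set = set(uniques)
flags_batch = [i in unique_set for i in ids]

# Item-by-item approach: one filter query per element.
flags_single = [i in f for i in ids]
```

Both lists agree, including on any false positives the filter produces, because the batch path asks the filter exactly the same questions, just up front and once per distinct value in the range.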
Not sure if this actually merits merging upstream. If not, no worries; I figured I'd contribute it since I had it around anyway.