python-bloomfilter
python-bloomfilter copied to clipboard
Added nstar and nstar_intersection functions as well as tests.
I've implemented two functions nstar
and nstar_intersection
that estimate the number of elements in a Bloom filter as well as in an intersection of two Bloom filters (a separate function is required for the latter case since using nstar()
on the intersection directly would yield a wrong result). This is very useful in many circumstances, e.g. when performing similarity searches on hierarchical data structures.
The implementation follows the Wikipedia article:
http://en.wikipedia.org/wiki/Bloom_filter
I've added tests and a short explanation as well, please let me know if you need anything else in order to merge this.
btw @jaybaird thanks for this project, it's really useful :)
Andreas,
Maybe I'm misunderstanding what this does, but why would you use this over using len? Or in the case of the intersection, doing the intersection and taking the len() there?
I do see now actually that we don't update the count when we union/intersect filters, which is a bug and might better be solved by this. Can you point me to the relevant section in the Wikipedia article you're referencing so I can catch up?
Thanks!
Nevermind, I found it :)
Hey @jaybaird, the purpose of nstar
and nstar_intersection
is to provide an estimate of the number of elements in the filter, which is useful if you reconstruct the Bloom filter from the bit array alone (and thus not have access to the count
attribute). Also, obtaining a count for an intersection or union of two Bloom filters is not possible using the count
attribute, so we need to estimate it.
@jaybaird is there anything you need from my side to merge this PR or come to a decision on whether to integrate this into the code base?
I plan on taking a look at this this weekend. I'm still not 100% sure it's utility and there's a larger bug that I realized needs to be fixed but this may be the solution for. I'll keep you posted.
hey @jaybaird , still planning to merge this?