bloomfilter-rb
Unclear how to prevent overflow?
I am trying to load 471,774 items into a native bloomfilter as such:
bloom_filter = BloomFilter::Native.new(size: lines.count, raise: true)
But after about 34,193 items it raises a "bucket got filled up" error. I tried setting the size much bigger (200x lines.count) but without any change.
I also tried setting the hashes, seed, and bucket options, but without much luck. What am I doing wrong?
I loaded 10 million items into a 500 million item filter today without issue. Your filter needs to be much larger than the number of items you want to put in it, otherwise there will be too many collisions. Try creating a filter whose size is one or even two orders of magnitude above the number of lines in the file.
Tried that. I made it:
bloom_filter = BloomFilter::Native.new(size: lines.count*200, raise: true)
and got the same error.
Well, there's my problem. Didn't try it with raise. Sorry. I'll do that and get back to you.
Take a look at this post for estimating the size of the filter: http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
As a rule of thumb: for 500K items, you need about 5M bits for a 1% error rate.
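That rule of thumb follows from the standard bloom filter sizing formulas. Here's a quick sketch in plain Ruby (no gem required) that estimates the bit count and hash count for a target error rate; the method and variable names are my own, not part of the library:

```ruby
# Standard optimal bloom filter sizing for n items at error rate p:
#   m = -n * ln(p) / (ln 2)^2   total bits in the filter
#   k = (m / n) * ln 2          number of hash functions
def bloom_params(n, p)
  m = (-n * Math.log(p) / Math.log(2)**2).ceil
  k = (m.to_f / n * Math.log(2)).round
  [m, k]
end

bits, hashes = bloom_params(500_000, 0.01)
# bits comes out around 4.8M -- i.e. the "5M bits for 1% error" rule of thumb
```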
So what does the size option mean for the native bloom filter? The number of items to be added?
e.g. this still doesn't work:
bloom_filter = BloomFilter::Native.new size: lines.count*15, raise: true, hashes: 11
which is loosely based on the estimating article
It also doesn't really matter how big I make the size; it dies at item 34,195.
https://github.com/igrigorik/bloomfilter-rb/blob/master/lib/bloomfilter/native.rb#L8-20
So yeah, in that case that last bit of code I shared should work
FYI this is the file I'm trying to load into a bloom filter: http://dl.dropbox.com/u/108772/dictionary_words.list . It's pretty much OS X's /usr/share/dict/words with plurals added.
I'm also having similar issues. Trying to do a non-counting bloom filter of RockYou and other compromised passwords. 14 million items. If I set buckets > 1, I get both false negatives and false positives (bug?). If I set buckets = 1, the buckets overflow no matter how big I set the filter size.
@sporkmonger I think we have some confusion here. The "number of buckets" is the number of bits you'll allocate for the entire filter - setting that to one bit won't produce a very useful filter. :) ... On the other hand, the number of bits allocated to each key is the "bucket size".
- For ~1% error rate the bucket size should be ~10.
- The total size (number of buckets) should be 14M*10 bits, or ~17 MB.
For a more detailed explanation, see the "math" section here: http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/
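Putting that together for the 14M-password case, the options would look roughly like this. This is a sketch only: the hashes value of 7 comes from the standard formula (10 bits/item * ln 2), and the option names are taken from the initializer defaults quoted later in this thread:

```ruby
n = 14_000_000        # items (passwords) to insert
bits_per_item = 10    # ~10 bits per item for a ~1% error rate

opts = {
  size:   n * bits_per_item,  # 140M one-bit buckets, ~17 MB total
  hashes: 7,                  # round(10 * ln 2) hash functions
  bucket: 1,                  # 1 bit per bucket = plain, non-counting filter
  raise:  true                # surface overflows instead of failing silently
}
# filter = BloomFilter::Native.new(opts)
```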
Ah, OK. I was assuming a counting filter would be the result of buckets > 1.
@igrigorik I am still confused by your description.
You say
For ~1% error rate the bucket size should be ~10
but from the code: link, a bucket size greater than 8 is not allowed
"number of buckets" is the number of bits
So which is which in the initializer?! Could you please annotate each of the options below with an example?
@opts = {
  :size   => 100,            # total number of buckets (= bits when :bucket => 1)
  :hashes => 4,              # number of hash functions applied per key
  :seed   => Time.now.to_i,  # seed for the hash functions
  :bucket => 3,              # bits per bucket (> 1 gives a counting filter; max 8)
  :raise  => false           # raise on bucket overflow instead of failing silently
}