bloomfilter-rb icon indicating copy to clipboard operation
bloomfilter-rb copied to clipboard

Unclear how to prevent overflow?

Open cbetta opened this issue 12 years ago • 14 comments

I am trying to load 471,774 items into a native bloomfilter as such:

bloom_filter = BloomFilter::Native.new(size: lines.count, raise: true)

But after about 34,193 items it gives me a bucket got filled up. I tried setting the size bigger (200x lines.count) but without any change.

I also tried to se the hashes, seed, and bucket options but without much luck. What am I doing wrong?

cbetta avatar Apr 27 '12 17:04 cbetta

I loaded 10 million items into a 500 million item filter today without issue. Your filter needs to be >> than the amount of items you want to put in it, otherwise there will be too many collisions. Try creating a filter that has a size of 1 or even 2 orders of magnitude above the # of lines in the file.

WattsInABox avatar Apr 27 '12 21:04 WattsInABox

Tried that. I made it:

bloom_filter = BloomFilter::Native.new(size: lines.count*200, raise: true)

and got the same error

cbetta avatar Apr 27 '12 21:04 cbetta

Well, there's my problem. Didn't try it with raise. Sorry. I'll do that and get back to you.

WattsInABox avatar Apr 27 '12 21:04 WattsInABox

Take a look at this post for estimating the size of the filter: http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/

Rule of thumb, for 500K items, you need 5M bits for 1% error rate.

igrigorik avatar Apr 27 '12 22:04 igrigorik

So what does size mean for the native bloomfilter? Number of items to be added?

cbetta avatar Apr 27 '12 22:04 cbetta

e.g. this still doesn't work:

bloom_filter = BloomFilter::Native.new size: lines.count*15, raise: true, hashes: 11

which is loosely based on the estimating article

cbetta avatar Apr 27 '12 22:04 cbetta

It also doesn't really matter how big I make the size, it dies at item 34195

cbetta avatar Apr 27 '12 22:04 cbetta

https://github.com/igrigorik/bloomfilter-rb/blob/master/lib/bloomfilter/native.rb#L8-20

igrigorik avatar Apr 27 '12 22:04 igrigorik

So yeah, in that case that last bit of code I shared should work

cbetta avatar Apr 27 '12 22:04 cbetta

FYI this is the file I'm trying to load into a bloomfilter: http://dl.dropbox.com/u/108772/dictionary_words.list . It's pretty much OS X's /usr/share/dict/words with plurals added

cbetta avatar Apr 27 '12 22:04 cbetta

I'm also having similar issues. Trying to do a non-counting bloom filter of RockYou and other compromised passwords. 14 million items. If I set buckets > 1, I get both false negatives and false positives (bug?). If I set buckets = 1, the buckets overflow no matter how big I set the filter size.

sporkmonger avatar Nov 07 '13 07:11 sporkmonger

@sporkmonger I think we have some confusion here. The "number of buckets" is the number of bits you'll allocate for the entire filter - setting that to one bit won't produce a very useful filter. :) ... On the other hand, the number of bits allocated to each key is the "bucket size".

  • For ~1% error rate the bucket size should be ~10.
  • The total size (number of buckets) should be 14M*10 bits, or ~17Mb.

For a more detailed explanation, see the "math" section here: http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/

igrigorik avatar Nov 09 '13 19:11 igrigorik

Ah, OK. I was assuming a counting filter would be the result of buckets > 1.

sporkmonger avatar Nov 14 '13 13:11 sporkmonger

@igrigorik I am still confused with your description.

You say

For ~1% error rate the bucket size should be ~10

but from the code: link, bucket size is not allowed > 8

"number of buckets" is the number of bits

so which is which for the initializer!? Could you please annotate each line for the options below with an example?

@opts = {
  :size    => 100,
  :hashes  => 4,
  :seed    => Time.now.to_i,
  :bucket  => 3,
  :raise   => false
 }

hackhowtofaq avatar Dec 11 '19 13:12 hackhowtofaq