pybloomfiltermmap icon indicating copy to clipboard operation
pybloomfiltermmap copied to clipboard

Hash function cuasing false postives. (Nasty)

Open TheRook opened this issue 14 years ago • 6 comments

I noticed some bad behavior with the pybloomfiler to I wrote a simple set of tests. A ~22% failure rate is really bad. As a quick fix i am using the hash() function prior to passing the data to pyboomfiler. Usually the false positive rate with this function is 0, sometimes its really close to 0.

My simple test: http://pastebin.com/C8KwWFxR

The output: ('false positive', 0.2251722) ('false negitive', 0.0) with hash() ('false positive', 1e-07) ('false negitive', 0.0) with md5() ('false positive', 0.0) ('false negitive', 0.0) with md5() hexdigigest ('false positive', 0.0) ('false negitive', 0.0)

TheRook avatar Dec 23 '11 20:12 TheRook

Hey, now that the holidays are over I'm looking into this. Note that false positives are expected with bloom filters --- false negatives are the true nasty behavior. If you say 0.0001 for the error rate and you are below the capacity (which you are), you should expect less than 0.0001 error rate. I am looking into why it's broken.

axiak avatar Jan 03 '12 15:01 axiak

Just to confirm, I've had serious problems with both the default SHA512 hash (a large memory leak) and a really bad false positive rate using the (i think) djb hash. I've just tried the superfast hash though and that works very well. Thanks.

(python 2.6 on ubuntu 64bit)

xlfe avatar May 03 '12 23:05 xlfe

I just submitted a fix for the memory leaks referred to in this issue

pbutler avatar Jul 07 '12 03:07 pbutler

hi axiak, i want to know how is this issue going on? i think "using the hash() function prior to passing the data to pyboomfiler" may increase the false positive rate when the number of input items is large, and the test also proved it.

zzc3266825 avatar Aug 06 '12 02:08 zzc3266825

I think this is a duplicate of Issue #46

dbishop avatar Feb 24 '14 22:02 dbishop

Seems fixed in latest release 0.3.14:

without hash()
('false positive', 1.04e-05)
('false negitive', 0.0)
with hash()
('false positive', 1.02e-05)
('false negitive', 0.0)
with md5()
('false positive', 1.03e-05)
('false negitive', 0.0)
with md5() hexdigigest
('false positive', 8.7e-06)
('false negitive', 0.0)

What really matters is the first result:

without hash()
('false positive', 1.04e-05)
('false negitive', 0.0)

IMHO this ticket should be closed

andresriancho avatar May 13 '15 20:05 andresriancho