php-bloom-filter
php-bloom-filter copied to clipboard
fast(est?) bloom filter in pure PHP
blog
php-bloom-filter
- This is a fast (possibly the fastest?) single threaded bloom filter implementation in pure PHP.
- There are no dependencies on external modules unlike many other implementations.
- It uses a binary string to store the bit vector and manipulates based on byte indexes of the string.
- Apache 2.0 License.
There are 3 implementations with slightly different performance characteristics:
- a 1GB vector implementation with (8-40 bit addressing) with any hash.
- a 512MB vector implementation (32 bit addressing) with MD5.
- a 64KB vector implementation (16 bit addressing) with MD5.
math
-
m vector bits
-
k hash functions
-
n elements
-
p probability of false positive
-
(1-(1-1/m)^(mln(2)))^(mln(2)/n)=p
-
k = m*ln(2)/n;
performance
On a 4.4GHz G3258 system, the single threaded insert/lookup throughput in elements per second respectively with six bit sets per key (k = 6):
On PHP 5.5
- a 1GB vector implementation with (8-40 bit addressing) with any hash: 209K / 329K
- a 512MB vector implementation (32 bit addressing) with MD5: 255K / 407K
- a 64KB vector implementation (16 bit addressing) with MD5: 347K / 602K
On HHVM:
- a 1GB vector implementation with (8-40 bit addressing) with any hash: 386K / 608K
- a 512MB vector implementation (32 bit addressing) with MD5: 510K / 874K
- a 64KB vector implementation (16 bit addressing) with MD5: 429K / 675K
It is faster than the following implementations by a significant margin:
- https://github.com/mrspartak/php.bloom.filter
- https://code.google.com/p/php-bloom-filter/
- https://packagist.org/packages/pleonasm/bloom-filter
MD5 hashing is one of the more expensive operations and PHP does not have a native implementation of xxHash or Murmurhash3.
cautionary tales
- PHP Limitations
- Strings are limited to byte addressing of signed 32 bit integers. The maximum string is only 2GB - 1B (2^31-1 Bytes).
- The bit vector only supports powers of 2 bits in this implementation. Thus the largest vector size is 1GB.
- Workaround with multiple strings could allow for implementations greater than 1GB.
- PHP 5.4+
- PHP lacks calloc or malloc so str_repeat is used to allocate the bit array.
- PHP cannot directly use the output of str_repeat and primitive assignment will require double the memory of the vector size due to the copy.