bloom-filters icon indicating copy to clipboard operation
bloom-filters copied to clipboard

Todo list

Open folkvir opened this issue 6 years ago • 7 comments

  • [x] HyperLogLog
  • [x] Top-K or Mine Top-K: Using Bloom Filters for Mining Top-k Frequent Itemsets in Data Streams, Younghee Kim et Al.
  • [x] Min Hash Wikipedia source, need to find the paper
  • [x] Scalable Bloom Filter, as the partitioned bloom filter is fixed now the Scalable bloom filter can be easily implemented.
  • [x] XorFilter #29

Feel free to modify this list or add suggestions in comments.

folkvir avatar Sep 12 '19 07:09 folkvir

I don't need it. I didn't try it. But for the sake of completeness: https://lemire.me/blog/2019/12/19/xor-filters-faster-and-smaller-than-bloom-filters/ Links to the paper and implementations (no js ones) are available there.

( Hello btw! :3 )

Chat-Wane avatar Dec 20 '19 08:12 Chat-Wane

Some update on this topic:

  • The HyperLogLog has been implemented in release v1.1.0.
  • The TopK has been implemented in release v1.2.0. It uses a Count Min Sketch for frequency estimation and a Minheap for implementing a sliding window of values.

Callidon avatar Apr 09 '20 08:04 Callidon

Some update: The MinHash has been released in v1.3.0.

Callidon avatar Apr 22 '20 07:04 Callidon

Thanks for your jobs.

ShisiJu avatar Nov 24 '20 01:11 ShisiJu

Update: XOR filters has been released in the 2.0.0 version of the npm package, following #52 merge.

Callidon avatar Feb 18 '22 09:02 Callidon

Update: the Scalable Bloom Filter with optimizations and bug fixes have been released in version 3.0.0

folkvir avatar Mar 31 '22 07:03 folkvir

Hi @Callidon and @folkvir - thanks for a really useful library!

I had a suggestion for an additional data structure, as well as a question/suggestion for improvement on the CountingBloomFilter.

I have a need for a Time-To-Live, or "expiring", CountingBloomFilter. That is, when entries are added to the CountingBloomFilter, they will automatically be removed after some number of milliseconds. In addition, upon creation of the CountingBloomFilter, the creator can optionally specify a "dispose" callback function that will get called when the entry expires. The idea is generally to merge the concepts of a TTL Cache (e.g. https://github.com/isaacs/ttlcache) with a CountingBloomFilter. Obviously this could just be built as a separate data structure on top of CountingBloomFilter that manages the timeouts, but I think there is a lot of utility to having it in one single data structure. I am going to implement this for a project I need, but please let me know if you would like me to contribute this back to your repository.

My other question relates to the implementation of the _filter member of the CountingBloomFilter class. This Array stores a tuple pair at each index that is [bit 0 or 1, count of items at this index]. But that first bit is always set to 1 if count of items is > 0, and set to 0 if count of items === 0. So I might be missing something but why not just make _filter an array of numbers where each entry is the count of items at that index, and then the has method would just look at whether that value is > 0? Seems like having each value in the _filter member storing an Array object with 2 numbers is a lot of unnecessary memory.

adevine avatar Aug 21 '22 05:08 adevine