java-LSH icon indicating copy to clipboard operation
java-LSH copied to clipboard

Use Bitset instead of Array<boolean>

Open rocketraman opened this issue 5 years ago • 1 comments

The current implementation uses a boolean[] as an input. Use of a BitSet (https://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html) would be a lot more efficient.

For example, if dictionary size is Integer.MAX_INT, as it would be with the "hashing shingles" approach given in 3.2.3 of Ullman et al, I need to allocate 2GB of memory to store an array of booleans. With BitSet, I can store that in approximately 8 times less space.

rocketraman avatar Apr 04 '19 18:04 rocketraman

Hi,

Would be interesting, but only for very large dictionaries (probably 100 million entries or more): https://stackoverflow.com/a/605451/4770918

tdebatty avatar Apr 09 '19 10:04 tdebatty