java-LSH
Use BitSet instead of boolean[]
The current implementation uses a boolean[] as an input. Use of a BitSet (https://docs.oracle.com/javase/7/docs/api/java/util/BitSet.html) would be a lot more efficient.
For example, if the dictionary size is Integer.MAX_VALUE, as it would be with the "hashing shingles" approach given in 3.2.3 of Ullman et al, I need to allocate 2GB of memory to store an array of booleans. With a BitSet, I can store that in approximately one eighth of the space (about 256MB).
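To make the arithmetic concrete, here is a rough sketch of the comparison. It assumes the JVM stores one byte per boolean[] element (typical for HotSpot, but not guaranteed by the language spec) and ignores object headers. The class and method names (BitSetFootprint, booleanArrayBytes, bitSetBytes, toBitSet) are made up for illustration and are not part of java-LSH.

```java
import java.util.BitSet;

// Hypothetical helper, not part of java-LSH: estimates payload sizes
// and shows how an existing boolean[] could be converted to a BitSet.
class BitSetFootprint {

    // Assumption: one byte per element, as HotSpot typically does.
    static long booleanArrayBytes(long entries) {
        return entries;
    }

    // BitSet packs 64 bits into each long word (8 bytes per word).
    static long bitSetBytes(long entries) {
        long words = (entries + 63) / 64; // round up to whole words
        return words * 8;
    }

    // Copies the set positions of a boolean[] into a BitSet.
    static BitSet toBitSet(boolean[] vector) {
        BitSet bits = new BitSet(vector.length);
        for (int i = 0; i < vector.length; i++) {
            if (vector[i]) {
                bits.set(i);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        long entries = Integer.MAX_VALUE;
        System.out.println("boolean[] bytes: " + booleanArrayBytes(entries));
        System.out.println("BitSet bytes:    " + bitSetBytes(entries));

        BitSet bits = toBitSet(new boolean[] {true, false, true});
        System.out.println("set positions:   " + bits);
    }
}
```

For Integer.MAX_VALUE entries this gives roughly 2GB for the array against 256MB for the BitSet, which is where the factor of 8 comes from.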
Hi,
Would be interesting, but only for very large dictionaries (probably 100 million entries or more): https://stackoverflow.com/a/605451/4770918