PeepOpen-Issues
PeepOpen-Issues copied to clipboard
Optimization using Bloom Filters
Hi,
I think there is a relatively simple optimization that could boost the performance with a very little penalty in memory usage. The current caching strategy seems to be to keep the latests results available so that if the user keeps writing they are used as the dictionary instead of the whole list. This works great for realtime filtering but does not help when starting a new search.
By using bloom filters, which are a very space efficient structure, the entries that match any given abbreviation could be kept in memory, drastically reducing the need for a full examination of the dictionary every time a search is started. The algorithm would look something like:
- Get a bloom filter associated to an abbreviation
- If none found try finding one that shares a prefix (abbr -> abb -> ab -> a)
- Also create a new bloom filter for the abbreviation
- Iterate over the whole dictionary
- If a bloom filter was found and it says that the value is not present, skip processing it
- Calculate the score for the item
- If it's a match add it to the new bloom filter (if applies)
Since the structures are very compact, weighting just over a kilobyte for 1000 items, the filters can be kept in memory almost indefinitely.
Another great feature is that the filters/caches do not need to be destroyed when the dictionary is refreshed. Removed items do not need to update them and for added ones a simple algorithm can be implemented to update the filters:
- For each added item
- For each filter
- Apply the scoring algorithm against the abbreviation associated to the filter
- If it matches add it to the filter
- For each filter
A further optimization would be to store the hash used for the bloom filter with the item, so that it's not needed to be computed time and time again, just when adding the item to the list, making the check operation a simple bits test. The filters might even be dynamically sized based on the total numbers of elements and the length of the abbreviation, but this is probably too much work to save a few bytes.
I think that this strategy can provide very good results even if it slows down a little bit the dictionary refreshing operation or consumes a little bit more memory. It could make working with large projects a painless process :)