stream-lib
stream-lib copied to clipboard
Add a CountMinSketch with short values for lesser memory usage
We have to store a lot CountMinSketches in heap and we don't need the long
values in there. Therefore we created a new one that just uses short
values. It's already used in production.
Thanks for the PR. Firstly, license header is missing in both files. More importantly, there is a large amount of duplicated code between the original CMS and the new ShortCMS classes.
Here's a rough idea for improvement:
- Extract all codes that deal with the
table
field, which contain the real difference, to separate classes. For example, you can create an interface called CountMinSketchStorage, and two implementations: LongCountMinSketchStorage and ShortCountMinSketchStorage. The methods on the CMSStorage interface should be very similar, if not identical, to those on IFrequency. - Create a new class called something like GenericCountMinSketch. It should be constructed with a given CMSStorage object. Its internal code should delegate to the storage object where appropriate.
- Refactor CountMinSketch and ShortCountMinSketch to extend GenericCountMinSketch, just to provide the right storage object on construction.
- Static methods on the two CMS classes also delegate to static methods on the corresponding CMSStorage classes.
This way, people can easily add Int-based storage and what have you. Hope that makes sense.