stream-lib icon indicating copy to clipboard operation
stream-lib copied to clipboard

Add a CountMinSketch with short values for lesser memory usage

Open tkaessmann opened this issue 8 years ago • 1 comments

We have to store a lot CountMinSketches in heap and we don't need the long values in there. Therefore we created a new one that just uses short values. It's already used in production.

tkaessmann avatar Nov 23 '16 09:11 tkaessmann

Thanks for the PR. Firstly, license header is missing in both files. More importantly, there is a large amount of duplicated code between the original CMS and the new ShortCMS classes.

Here's a rough idea for improvement:

  • Extract all codes that deal with the table field, which contain the real difference, to separate classes. For example, you can create an interface called CountMinSketchStorage, and two implementations: LongCountMinSketchStorage and ShortCountMinSketchStorage. The methods on the CMSStorage interface should be very similar, if not identical, to those on IFrequency.
  • Create a new class called something like GenericCountMinSketch. It should be constructed with a given CMSStorage object. Its internal code should delegate to the storage object where appropriate.
  • Refactor CountMinSketch and ShortCountMinSketch to extend GenericCountMinSketch, just to provide the right storage object on construction.
  • Static methods on the two CMS classes also delegate to static methods on the corresponding CMSStorage classes.

This way, people can easily add Int-based storage and what have you. Hope that makes sense.

yuesong avatar Dec 01 '16 21:12 yuesong