stream-lib icon indicating copy to clipboard operation
stream-lib copied to clipboard

Count-Min Sketch Constant for Calculating Width

Open abraithwaite opened this issue 8 years ago • 5 comments

The width of the sketch according to the paper should be set to ceil(e/epsilon) where e is Euler's number. However, I noticed in the current code that this is just set to 2.0.

I'm curious as to why this is. Is it good enough in practice for it to not make a difference?

abraithwaite avatar Oct 05 '15 17:10 abraithwaite

it looks like it was written that way from the first version submitted by @jkff. To clarify for the lazy, the difference is between ceil(e/epsilon) and the current ceil(2/epsilon), where e is the usual 2.7.... I can imagine them being similar enough in practice, but I couldn't say why it was initially chosen.

tea-dragon avatar Oct 05 '15 18:10 tea-dragon

Changing the 2 to e everywhere results in the tests still passing. I imagine that the current implementation would just less accurate than it would be if using Math.E since it would cause the sketches to be smaller. However it's unlikely that you'd be able to change this without breaking whomever is using it unfortunately.

abraithwaite avatar Oct 05 '15 19:10 abraithwaite

An intern reported this issue to me last month. Being confused, I quickly re-read the papers and noticed that e has been used in the initial paper but 2 is used in the latest paper from the same author, see Approximate Date with the Count-Min Data Structure page 4.

According to git log, @jkff wrote the implementation after this second publication. It could explain why. (I don't have time to do the maths but it would be interesting to investigate this change)

cykl avatar Oct 05 '15 19:10 cykl

The paper referenced in the javadoc uses e (as per the wayback machine anyway), but in any event, since there is no serialization issue, we could probably change it if there is cause. The latest paper using 2 gives me pause though.

tea-dragon avatar Oct 05 '15 19:10 tea-dragon

since there is no serialization issue

I think using the same arguments to the constructors would result in different width and depth parameters which wouldn't be ideal, but the epsilon would adjust itself on deserialization at least. In any case I think the safer bet is just leaving it be.

It is curious that there are two versions of this paper without an errata in the later one discussing the changes but at least this mystery is solved.

Thanks for the quick responses!

abraithwaite avatar Oct 05 '15 21:10 abraithwaite