algebird icon indicating copy to clipboard operation
algebird copied to clipboard

CountMinSketch[K] assumes working equals method on K.

Open johnynek opened this issue 8 years ago • 11 comments

For small count-min sketches, you create CMSItem:

https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L467

But to get the frequency of K we use ==: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L479

if you have CMSItem[Array[Byte]] as happens in scalding for sketches of tiny key spaces (which might come up in unit tests or toy examples), this means we always reurn a frequency of 0 for CMSItem[Array[Byte]].frequency unless you happen to pass the same exact array in.

trying to bite my tongue about the design flaw in Java that equals is eq for Arrays.

johnynek avatar Nov 01 '15 20:11 johnynek

A possible solution here is to add an def equiv(a: K, b: K): Boolean method to CMSHasher[K] and then implement this for Array[Byte].

PS: @non this is why I think any Hashing typeclass should have equivalence: you often need it, and it seems impossible to state any law on the typeclass without it (well, you could make probabilisitic statements about collisions).

johnynek avatar Nov 02 '15 02:11 johnynek

Also, the SparseCMS does not seem fixable at all with Array[T] keys. Since we are actively using it for Array[Byte] in scalding, and I guess CMSHasherByteArray: is broken if we allow the SparseCMS. https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L1353

We could fix this by making the table have the key which is the hash, and then check collisions using the equivalence in the CMSHasher, or we could remove the sparse business.

johnynek avatar Nov 02 '15 02:11 johnynek

We internally use structures often where having a graceful degradation from exact down to approximate is nice. (TSAR uses these heavily), so it sort of would be nice to keep it if we could. Requiring the Hasher type class to provide the equality also i think would be most preferable if not too painful. (We'd need to be careful of course about using any built in maps or other scala stuff... which is annoying)

ianoc avatar Nov 12 '15 22:11 ianoc

This is old, but, here is the ugly wrapper class I created in scalding to make arrays have efficient orderings and hash codes:

https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/typed/HashEqualsArrayWrapper.scala

The problem is you have to make sure to wrap your arrays in these wrappers.

isnotinvain avatar Jan 20 '16 21:01 isnotinvain

that may be worth adding. The core issue is unsolved: we are using equality without communicating in the types that we are doing so. We need something like https://github.com/non/algebra/blob/master/core/src/main/scala/algebra/Eq.scala#L12

johnynek avatar Jan 21 '16 17:01 johnynek

When/if should we take the algebra dep into algebird I wonder, that was the original intent right? Or should we start going for more Equiv typeclass usage and less =='s anywhere?

ianoc avatar Jan 21 '16 17:01 ianoc

Good question. I hoped soon, then this cats/algebra conflict came up:

https://github.com/non/algebra/issues/142

The problem with Equiv is that there is a default for all types falling back to ==. So it is not much safer.

johnynek avatar Jan 22 '16 01:01 johnynek

Yeah unless we get away from use of == then users (and library implementors) just have to take care to wrap types when they have a bad ==, no way around that really.

Using an Equiv makes sense to me, maybe the migration path could be to require an Equiv wherever we use ==, and if one isn't provided, fall back to ==, then deprecate the fall back path... etc

isnotinvain avatar Jan 22 '16 01:01 isnotinvain

Scala provides a fallback Equiv (==) in the companion object already so that approach won't be much safer.

On Thu, Jan 21, 2016 at 3:38 PM, Alex Levenson [email protected] wrote:

Yeah unless we get away from use of == then users (and library implementors) just have to take care to wrap types when they have a bad ==, no way around that really.

Using an Equiv makes sense to me, maybe the migration path could be to require an Equiv wherever we use ==, and if one isn't provided, fall back to ==, then deprecate the fall back path... etc

— Reply to this email directly or view it on GitHub https://github.com/twitter/algebird/issues/497#issuecomment-173772130.

P. Oscar Boykin, Ph.D. | http://twitter.com/posco | http://pobox.com/~boykin

johnynek avatar Jan 22 '16 17:01 johnynek

Ok, well, what I said but replace scala's Equiv with something similar that has no fallback on == then?

isnotinvain avatar Jan 22 '16 21:01 isnotinvain

@isnotinvain we can definitely pick this up now that we have the algebra dep!

sritchie avatar Dec 10 '16 05:12 sritchie