Categorical binning improvement

Open gbordyugov opened this issue 9 years ago • 7 comments

To bin a list x into N bins, one could simply go for the bin index given by

binIndex = hash(x[i]) % N

gbordyugov avatar Apr 03 '17 15:04 gbordyugov

@gbordyugov Sounds interesting, can you provide a reproducible example?

jbao avatar Apr 07 '17 19:04 jbao

objectsToBin = ['those', 'strings', 'should', 'be', 'binned', 'in', 'three', 'bins']

nBins = 3

bins = [hash(o) % nBins for o in objectsToBin]
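A minimal runnable sketch of the same idea, wrapped in a hypothetical helper `bin_by_hash` (the name is an assumption for illustration, not part of any library). The exact bin indices depend on the interpreter's string hash, so no expected output is shown.

```python
from collections import defaultdict


def bin_by_hash(objects, n_bins):
    """Group hashable objects into at most n_bins buckets by hash(obj) % n_bins.

    Hypothetical helper for illustration only.
    """
    bins = defaultdict(list)
    for obj in objects:
        # Python's % with a positive modulus is never negative,
        # so the index is always in range(n_bins).
        bins[hash(obj) % n_bins].append(obj)
    return dict(bins)


objects_to_bin = ['those', 'strings', 'should', 'be', 'binned', 'in', 'three', 'bins']
binned = bin_by_hash(objects_to_bin, 3)
for index in sorted(binned):
    print(index, binned[index])
```

Within one interpreter process, equal values always land in the same bin. Distinct values may share a bin as well, since only `n_bins` indices are available.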

gbordyugov avatar Apr 08 '17 06:04 gbordyugov

OK, but how does this relate to categorical binning, where the use case is usually not random assignment, e.g. grouping ['a','a','b','b','b'] into 2 groups?

jbao avatar Apr 10 '17 18:04 jbao

Hashing is not random: hash('a') always returns the same integer, if I'm not mistaken.

gbordyugov avatar Apr 10 '17 19:04 gbordyugov

That's what I thought too, but in my example, it returns [0,0,0,0,0], or am I missing something here?

jbao avatar Apr 10 '17 19:04 jbao

In [1]: hash('a') % 2
Out[1]: 1

In [2]: hash('b') % 2
Out[2]: 0

Hash collisions are, of course, possible but extremely rare. Whether hash('a') % 2 and hash('b') % 2 come out as the same number really seems to depend on the particular Python installation.
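One possible explanation for the differing results, offered as an assumption about the setups involved: since Python 3.3, the built-in hash() of str values is salted per interpreter process unless the PYTHONHASHSEED environment variable is set, so hash('a') % 2 can change from one run to the next. With only two bins, two distinct categories will also share a bin roughly half the time, even though full hash values rarely collide. A minimal sketch of a reproducible alternative, using a hypothetical helper `stable_bin_index` built on hashlib.sha256 (the name and the choice of digest are assumptions, not something proposed in this thread):

```python
import hashlib


def stable_bin_index(value, n_bins):
    """Return a bin index that does not change between interpreter runs.

    Hypothetical helper: hashes the UTF-8 text form of `value` with SHA-256
    and reduces the first 8 bytes of the digest modulo n_bins.
    """
    digest = hashlib.sha256(str(value).encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % n_bins


categories = ['a', 'a', 'b', 'b', 'b']
indices = [stable_bin_index(c, 2) for c in categories]
print(indices)
```

Equal categories always get equal indices here, on any machine. Distinct categories can still share a bin, so this fixes reproducibility, not collisions.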

gbordyugov avatar Apr 11 '17 07:04 gbordyugov

Yes, I still don't quite get it (was not able to reproduce the results), will have to research further.

jbao avatar Apr 11 '17 09:04 jbao