Categorical binning improvement
To bin the elements of a list x into N bins, one could simply use the bin index given by
binIndex = hash(x[i]) % N
@gbordyugov Sounds interesting, can you provide a reproducible example?
objectsToBin = ['those', 'strings', 'should', 'be', 'binned', 'in', 'three', 'bins']
nBins = 3
bins = [hash(o) % nBins for o in objectsToBin]
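To turn the indices above into actual groups, a minimal sketch could collect the objects per bin index. The helper name group_by_hash is hypothetical and not part of any library. It uses the built-in hash() exactly as in the snippet above, so which object lands in which bin may differ between Python processes.

```python
from collections import defaultdict


def group_by_hash(objects, n_bins):
    """Group objects by hash(o) % n_bins (hypothetical helper, for illustration)."""
    groups = defaultdict(list)
    for o in objects:
        # Equal objects have equal hashes within one process,
        # so they always end up in the same bin.
        groups[hash(o) % n_bins].append(o)
    return dict(groups)


groups = group_by_hash(['a', 'a', 'b', 'b', 'b'], 2)
```

Identical values always share a bin, but nothing guarantees that 'a' and 'b' land in different bins: with only two bins, both can map to the same index.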
OK, but how does this link to categorical binning, where the use case is usually not random assignment, e.g. grouping ['a','a','b','b','b'] into 2 groups?
Hashing is not random: hash('a') always returns the same int, if I'm not mistaken.
That's what I thought too, but in my example, it returns [0,0,0,0,0], or am I missing something here?
In [1]: hash('a') % 2
Out[2]: 1
In [3]: hash('b') % 2
Out[4]: 0
Hash collisions are, of course, possible, but extremely rare. It really seems to depend on the particular Python installation whether hash('a') % 2 and hash('b') % 2 are the same number.
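One possible explanation, offered as a general fact about CPython rather than something confirmed in this thread: since Python 3.3 the built-in hash() of a str is salted per process by default (unless PYTHONHASHSEED is set), so hash('a') % 2 can change from one run to the next. A minimal sketch of a reproducible alternative, assuming a digest from hashlib is acceptable for binning; stable_bin is a hypothetical helper name:

```python
import hashlib


def stable_bin(value, n_bins):
    """Return a bin index for a string that is identical in every Python process."""
    # SHA-256 of the UTF-8 bytes does not depend on PYTHONHASHSEED.
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n_bins


objects_to_bin = ['a', 'a', 'b', 'b', 'b']
stable_bins = [stable_bin(o, 2) for o in objects_to_bin]
```

This only makes the assignment reproducible. Two distinct categories can still share a bin, which becomes likely when the number of bins is small.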
Yes, I still don't quite get it (was not able to reproduce the results), will have to research further.