Discretizers.jl
Uniform Count Discretization using Dynamic Programming
Uniform Count Discretization breaks a set of values into $k$ bins that each hold roughly the same number of entries. This works well for most continuous data, but it has corner cases when many values are repeated.
I have a problem with "a roughly equal number of entries" and would like to more rigorously define an optimal discretization scheme.
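We ideally want $M/k$ entries per bin, where $M$ is the number of data points and $k$ is the number of bins.
If we use an L2 loss, the score of a particular discretization is simply $\sum_{i=1}^{k} (b_i - M/k)^2$, where $b_i$ is the number of entries in bin $i$. Lower is better, and a score of zero means every bin holds exactly $M/k$ entries.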
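As a quick illustration of the score, here is a minimal sketch that evaluates the L2 loss from a list of bin sizes. The function name `l2_score` is hypothetical and is not part of the Discretizers.jl API.

```julia
# Hypothetical helper (not part of Discretizers.jl):
# L2 score of a discretization, given the number of entries in each bin.
function l2_score(bin_sizes::AbstractVector{<:Integer})
    M = sum(bin_sizes)       # total number of data points
    k = length(bin_sizes)    # number of bins
    target = M / k           # ideal number of entries per bin
    return sum((b - target)^2 for b in bin_sizes)
end

l2_score([2, 2, 2])  # perfectly balanced bins: 0.0
l2_score([4, 1, 1])  # one heavily repeated value forces an uneven split: 6.0
```

The second call shows the corner case: with 6 points and 3 bins the target is 2 per bin, but a value repeated 4 times cannot be split, so the score is $(4-2)^2 + (1-2)^2 + (1-2)^2 = 6$.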
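Because the bins are contiguous ranges of the sorted data and the score is a sum of independent per-bin terms, choosing the bin edges that minimize it is a dynamic programming problem.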
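One way the dynamic program could look is sketched below. It assumes that bin edges may only fall between distinct values, so all copies of a repeated value land in the same bin, and that $k$ does not exceed the number of distinct values. The function name `uniform_count_bins` and its return format are hypothetical and are not the Discretizers.jl API.

```julia
# Sketch (hypothetical names): split the sorted distinct values into k
# contiguous bins so that sum((bin_size - M/k)^2) is minimal.
function uniform_count_bins(data::AbstractVector{<:Real}, k::Integer)
    vals = sort(unique(data))
    n = length(vals)
    1 <= k <= n || throw(ArgumentError("need 1 <= k <= number of distinct values"))
    counts = [count(==(v), data) for v in vals]
    prefix = [0; cumsum(counts)]   # prefix[i + 1] = entries among the first i distinct values
    target = length(data) / k      # ideal bin size M/k

    # cost[j, i + 1]: best score when the first i distinct values form j bins
    # back[j, i + 1]: how many distinct values the first j - 1 bins cover
    cost = fill(Inf, k, n + 1)
    back = zeros(Int, k, n + 1)
    for i in 1:n
        cost[1, i + 1] = (prefix[i + 1] - target)^2
    end
    for j in 2:k, i in j:n
        for p in (j - 1):(i - 1)
            c = cost[j - 1, p + 1] + (prefix[i + 1] - prefix[p + 1] - target)^2
            if c < cost[j, i + 1]
                cost[j, i + 1] = c
                back[j, i + 1] = p
            end
        end
    end

    # Walk the back-pointers from the last bin to the first to recover bin sizes.
    sizes = zeros(Int, k)
    hi = n
    for j in k:-1:1
        p = back[j, hi + 1]
        sizes[j] = prefix[hi + 1] - prefix[p + 1]
        hi = p
    end
    return sizes, cost[k, n + 1]
end

uniform_count_bins([1, 1, 1, 1, 2, 3, 4, 5], 2)  # ([4, 4], 0.0)
uniform_count_bins([1, 1, 1, 1, 2, 3], 3)        # ([4, 1, 1], 6.0)
```

With $n$ distinct values this sketch runs in $O(k n^2)$ time. When several discretizations tie for the best score, it returns whichever one the loop order finds first.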