Discretizers.jl

Uniform Count Discretization using Dynamic Programming

Open tawheeler opened this issue 8 years ago • 0 comments

Uniform Count Discretization splits a set of values into $k$ bins that each hold roughly the same number of entries. This works well for most continuous data, but it has corner cases when many values are repeated.

I have a problem with "a roughly equal number of entries" because it is imprecise, and I would like to define an optimal discretization scheme more rigorously.

Ideally, we want $M/k$ entries per bin, where $M$ is the number of data points and $k$ is the number of bins.

If we use an L2 loss, the score of a particular discretization is simply $\sum_{i=1}^{k} (b_i - M/k)^2$, where $b_i$ is the number of entries in bin $i$.
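To make the score concrete, here is a minimal sketch in Python (used only for illustration; this is not Discretizers.jl code). The function name `l2_score` is hypothetical, and the example assumes that equal values must share a bin, which is how repeated values produce lopsided bins.

```python
def l2_score(bin_sizes):
    """Sum of squared deviations of each bin size from the ideal M/k.

    bin_sizes: the number of entries b in each of the k bins.
    """
    k = len(bin_sizes)
    M = sum(bin_sizes)   # total number of data points
    target = M / k       # ideal number of entries per bin
    return sum((b - target) ** 2 for b in bin_sizes)


# Data [1, 1, 1, 1, 2, 3] with k = 2 (ideal bin size 3), assuming the
# four 1s cannot be split across bins:
print(l2_score([4, 2]))  # bins {1,1,1,1} and {2,3}   -> 2.0
print(l2_score([5, 1]))  # bins {1,1,1,1,2} and {3}   -> 8.0
```

Under this score the first split is preferred, which matches the intuition of "roughly equal" while giving a number that can be minimized.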

Minimizing this score over the possible bin boundaries results in a dynamic programming problem.
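One way the dynamic program could look is sketched below in Python, as an illustration rather than a proposed implementation for the package. It assumes that bin edges may only fall between distinct sorted values, so repeated values always land in the same bin, and that $k$ does not exceed the number of distinct values. The name `uniform_count_dp` and the table names `best` and `back` are hypothetical.

```python
from collections import Counter


def uniform_count_dp(values, k):
    """Return (score, bin_sizes) minimizing sum((b - M/k)^2) over k bins.

    Assumption: equal values cannot be split across bins, so a bin is a
    contiguous run of distinct sorted values.
    """
    counts = sorted(Counter(values).items())  # (distinct value, multiplicity)
    n = len(counts)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of distinct values")

    # prefix[i] = number of data points among the first i distinct values
    prefix = [0]
    for _, c in counts:
        prefix.append(prefix[-1] + c)
    target = prefix[-1] / k  # ideal bin size M/k

    INF = float("inf")
    # best[j][i] = minimal score for the first i distinct values in j bins
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for p in range(j - 1, i):  # last bin covers distinct values p+1..i
                cost = best[j - 1][p] + (prefix[i] - prefix[p] - target) ** 2
                if cost < best[j][i]:
                    best[j][i] = cost
                    back[j][i] = p

    # Walk the back-pointers from the last bin to the first to get bin sizes.
    sizes = []
    i = n
    for j in range(k, 0, -1):
        p = back[j][i]
        sizes.append(prefix[i] - prefix[p])
        i = p
    sizes.reverse()
    return best[k][n], sizes


print(uniform_count_dp([1, 1, 1, 1, 2, 3], 2))  # (2.0, [4, 2])
```

This sketch runs in O(k n^2) time for n distinct values. It returns bin sizes rather than edge locations; the edges could be recovered from the same back-pointers by reading off the distinct values at each split.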

tawheeler · Mar 28 '16 02:03