cobra icon indicating copy to clipboard operation
cobra copied to clipboard

Unexpected bin limit rounding

Open sandervh14 opened this issue 4 years ago • 1 comments

While reviewing a pull request, I tested the below method with some extra things. I noticed this:

@pytest.mark.parametrize("n_bins, auto_adapt_bins, data, expected",
                                          [
                                              (2, 
                                               False,
                                               # Variable contains floats, nans and inf (e.g. "available resources" variable): WORKS
                                               pd.DataFrame({"variable": [7.5, 8.2, 14.9, np.inf]}),
                                               [(7.0, 12.0), (12.0, np.inf)])
                                           ],
                                           ids=["unexpected bin limit"])
def test_fit_column(self, n_bins, auto_adapt_bins, data, expected):
        discretizer = KBinsDiscretizer(n_bins=n_bins,
                                                        auto_adapt_bins=auto_adapt_bins)

        actual = discretizer._fit_column(data, column_name="variable")

        assert actual == expected

Result:

[(8.0, 12.0), (12.0, inf)] != [(7.0, 12.0), (12.0, inf)]

Expected :[(7.0, 12.0), (12.0, inf)] Actual :[(8.0, 12.0), (12.0, inf)]

The column minimum is 7.5, which is rounded up (!) in KBinsDiscretizer._compute_bins_from_edges() to format the lower (!) bin limit, although it should have been rounded down: column value 7.5 falls in a bin that starts at 7.0, not 8.0.

I thought of just changing

        for a, b in zip(bin_edges, bin_edges[1:]):
            fmt_a = round(a, precision)
            fmt_b = round(b, precision)

to just

        for a, b in zip(bin_edges, bin_edges[1:]):
            fmt_a = floor(a, precision)  # <====
            fmt_b = round(b, precision)

but that will break the precision, which can be even negative apparently:

compute the minimal precision of the bin_edges. This can be a negative number, which then rounds numbers to the nearest 10, 100, ...

So minor bug, but floor() is not a good enough solution - I won't fix this right away with a commit in the pull request (which is actually not enough related too), so will log it here and continue with that pull request.

sandervh14 avatar Mar 31 '21 14:03 sandervh14

I add this here as a reminder: someone found the binning of continuous variables counterintuitive, as in the edges being [a, b] for the lowest category and ]a, b] for the others (his intuition is: [a, b[ everywhere). We should check or at least document.

sborms avatar May 02 '22 14:05 sborms