
BUG: pd.cut(df, bins=N) current behavior is incorrect, returns float where int is the limit

Open FlorinAndrei opened this issue 3 years ago • 3 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

https://github.com/FlorinAndrei/misc/blob/master/HeartDisease.csv

import pandas as pd
hd = pd.read_csv("HeartDisease.csv")
pd.cut(hd["Age"], bins=3, include_lowest=True)

Issue Description

The lowest of the three bins created is: (28.951, 45.0]. This is incorrect in several ways.

First off, I expect a left-inclusive bin there. That bin is not left-inclusive.

Secondly, the minimum value in that column is 29. It is not 28.951 - that float is an artifact of the library and does not exist in the data.

One workaround you can find online is this:

_, edges = pd.cut(hd["Age"], bins=3, include_lowest=True, retbins=True)
edges_r = [round(x) for x in edges]
# include_lowest=True keeps the minimum (29) in the first bin after rounding
pd.cut(hd["Age"], bins=edges_r, include_lowest=True)

But this is pointless and annoying. The library should simply return the true minimum value.

Expected Behavior

The bin I expect is [29.0, 45.0]. I would also settle for [29, 45].
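
A minimal sketch of one way to get edges that start at the true minimum: build the equal-width edges by hand and pass them to pd.cut instead of an integer. The ages below are hypothetical stand-ins for the Age column (only the minimum of 29 and an assumed maximum of 77 matter), so this illustrates the idea rather than reproducing the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for hd["Age"]; min is 29, max is assumed to be 77.
ages = pd.Series([29, 40, 45, 50, 61, 77])

# Build the equal-width edges ourselves so they start at the true minimum.
edges = np.linspace(ages.min(), ages.max(), 4)

# With explicit edges, include_lowest=True keeps the minimum in the first bin.
binned, used_edges = pd.cut(ages, bins=edges, include_lowest=True, retbins=True)

print(used_edges)                       # the edges exactly as passed in
print(binned.value_counts(sort=False))  # every age lands in some bin
```

The returned edges are the ones passed in, so nothing below 29 appears in them. The printed label of the first interval may still show a value slightly below 29, because all intervals stay right-closed.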

Installed Versions

pd.show_versions()
Traceback (most recent call last):
  File "", line 1, in
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\util_print_versions.py", line 109, in show_versions
    deps = _get_dependency_info()
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\util_print_versions.py", line 88, in get_dependency_info
    mod = import_optional_dependency(modname, errors="ignore")
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages\pandas\compat_optional.py", line 138, in import_optional_dependency
    module = importlib.import_module(name)
  File "C:\Program Files\Python310\lib\importlib_init.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "", line 1050, in gcd_import
  File "", line 1027, in find_and_load
  File "", line 1002, in find_and_load_unlocked
  File "", line 945, in find_spec
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages_distutils_hack_init.py", line 79, in find_spec
    return method()
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages_distutils_hack_init.py", line 100, in spec_for_pip
    if self.pip_imported_during_build():
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages_distutils_hack_init.py", line 111, in pip_imported_during_build
    return any(
  File "C:\Users\flori\AppData\Roaming\Python\Python310\site-packages_distutils_hack_init.py", line 112, in
    frame.f_globals['file'].endswith('setup.py')
KeyError: 'file'

I do have the latest Pandas installed (1.4.3) but there's another bug now with show_versions() that prevents me from printing that info.

Python 3.10.6

Numpy 1.22.4

Windows 10

Jupyter Notebook

FlorinAndrei avatar Aug 06 '22 22:08 FlorinAndrei

Every bin is expected to be closed on the same side. So when the bins are closed on the right, the lowest edge is currently set to min - 0.1% * (max - min) so that the first bin envelops the minimal value.
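
As a rough illustration of that rule, the sketch below recomputes the edges by hand and compares them with what pd.cut returns via retbins. The data is hypothetical (chosen so that min is 29 and max is 77), and the manual computation is a reading of the documented behavior, not a copy of the pandas source.

```python
import numpy as np
import pandas as pd

# Hypothetical data with min 29 and max 77.
ages = pd.Series([29, 40, 50, 61, 77])
n_bins = 3

lo, hi = ages.min(), ages.max()

# Equal-width edges over [min, max] ...
manual = np.linspace(lo, hi, n_bins + 1)
# ... with the lowest edge pushed down by 0.1% of the range (right-closed bins).
manual[0] -= 0.001 * (hi - lo)

# Edges that pd.cut actually used for the same request.
_, edges = pd.cut(ages, bins=n_bins, retbins=True)

print(manual)
print(edges)
```

With a range of 48, the shift is 0.048, so the first edge sits just below 29 even though no such value exists in the data.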

GYHHAHA avatar Aug 06 '22 23:08 GYHHAHA

I agree with you. Although the docs suggest the following for bins:

int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

I think this is unnecessary and, as you put it, annoying.

Also I think this function is probably quite useful (for categorising continuous data in an ML context) and likely to see increasing usage. It could do with a revamp. It was also added to #40245 without yet being addressed.

For numpy.histogram I think the behaviour is much better, i.e. they get [1,2), [2,3), [3,4]

All but the last (right-most) bin are half-open.

Additionally, numpy.histogram has a very useful arg, density, which allows for a different algo to calculate bins, including with weights.

For example

>>> import numpy as np
>>> import pandas as pd
>>> s = np.random.rand(100)
>>> s[0], s[99] = 0, 1
>>> df = pd.DataFrame({"samples": s})
>>> pd.cut(df["samples"], bins=10, retbins=True)
... 
Categories (10, interval[float64, right]): [(-0.001, 0.1] < (0.1, 0.2] < (0.2, 0.3] < (0.3, 0.4] ... (0.6, 0.7] < (0.7, 0.8] < (0.8, 0.9] < (0.9, 1.0]]

>>> np.histogram(df["samples"])
(array([ 7,  4, 11, 14, 11,  9,  8, 14, 11, 11]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]))

If doing: pd.cut(df["samples"], bins=np.histogram(df["samples"])[1], retbins=True), then the first datapoint at 0 is excluded. If include_lowest is set to True it just reduces the leftmost specified bin by an arbitrary 0.1% (instead of actually making it inclusive), which is horrible, IMO.
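
A small deterministic sketch of that difference, using made-up values instead of random samples. It assumes np.histogram_bin_edges produces the same edges np.histogram would use for an integer bins argument.

```python
import numpy as np
import pandas as pd

# Made-up values that include both ends of the range.
values = pd.Series([0.0, 0.05, 0.25, 0.5, 0.75, 1.0])

# Edges as numpy would compute them for 4 equal-width bins.
edges = np.histogram_bin_edges(values.to_numpy(), bins=4)

# Default pd.cut: right-closed bins, so the minimum falls outside the first bin.
default_cut = pd.cut(values, bins=edges)

# include_lowest=True brings the minimum back into the first bin.
lowest_cut = pd.cut(values, bins=edges, include_lowest=True)

# numpy counts per bin: half-open on the right, except the last bin.
np_counts, _ = np.histogram(values.to_numpy(), bins=edges)

print(default_cut.isna().sum())   # values that fell in no bin
print(lowest_cut.isna().sum())
print(np_counts)
print(lowest_cut.value_counts(sort=False).to_numpy())
```

The per-bin counts also differ between the two libraries for values sitting exactly on an interior edge, since numpy assigns them to the bin on the right and pd.cut (with its default right=True) to the bin on the left.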

attack68 avatar Aug 07 '22 06:08 attack68

For numpy.histogram I think the behaviour is much better, i.e. they get [1,2), [2,3), [3,4]

Indeed, numpy.histogram() does the right thing if you request that both ends are included. The R function discretize() has the same behavior as Numpy. This is what people expect when they use a function like this.

If include_lowest is set to True it just reduces the leftmost specified bin by an arbitrary 0.1% (instead of actually making it inclusive), which is horrible, IMO.

That is precisely my main objection. The current behavior of this function makes stuff up. It's just a very, very poor hack.

FlorinAndrei avatar Aug 07 '22 23:08 FlorinAndrei

I am facing the same issue right now: pd.cut() creates intervals with the left bound excluded and the right bound included. I totally support @FlorinAndrei. The default behavior should be that the left bound is included and the right one is excluded. I did not expect any other variant. However, I figured out that it suffices to set right = False. So at least there is a way to set up the function to create intervals as I prefer.

import numpy as np
import pandas as pd
s = pd.Series([1,1,2,2,3,4,5])
pd.cut(s, bins = [-np.inf, 2, np.inf], right = False)

returns

0    [-inf, 2.0)
1    [-inf, 2.0)
2     [2.0, inf)
3     [2.0, inf)
4     [2.0, inf)
5     [2.0, inf)
6     [2.0, inf)
Length: 7, dtype: category
Categories (2, interval[float64, left]): [[-inf, 2.0) < [2.0, inf)]
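
One caveat worth sketching: with right = False and an integer bins argument, the 0.1% adjustment appears to move to the other end of the range, so the highest edge is pushed up instead of the lowest edge being pushed down. The series below is made up for illustration.

```python
import pandas as pd

# Made-up data with min 1 and max 4.
s = pd.Series([1, 2, 3, 4])

# Left-closed bins requested by count rather than by explicit edges.
binned, edges = pd.cut(s, bins=3, right=False, retbins=True)

print(edges)                  # lowest edge is the true minimum
print(binned.cat.categories)  # last interval extends slightly past the maximum
```

So the lowest edge now matches the data, but the top edge no longer does. Passing explicit edges, as in the example above with -np.inf and np.inf, avoids the adjustment entirely.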

andrey-orderby avatar Feb 20 '24 14:02 andrey-orderby

PRs are welcome. It is always encouraged to support software that is free and built by unpaid volunteers.

attack68 avatar Feb 20 '24 15:02 attack68