DataFrameGroupBy.agg with NaN results in inf
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.__version__
# '2.2.0'
df = pd.DataFrame({'A':['A','B'], 'B':[0,0]}).astype({'A':'string','B':'Float32'})
df['C'] = df['B']/df['B']
df.groupby('A')['C'].agg(['max','min'])
#
# max min
# A
# A -inf inf
# B -inf inf
df.groupby('A')['C'].max() # the same with .agg(max)
# A
# A -inf
# B -inf
# Name: C, dtype: Float32
df.groupby('A')['C'].min() # the same with .agg(min)
# A
# A inf
# B inf
# Name: C, dtype: Float32
Issue Description
DataFrameGroupBy.agg handles NaN poorly.
Unfortunately, it sometimes happens that nullable fields contain NaN; cf. https://github.com/pandas-dev/pandas/issues/32265.
This case runs into unexpected behavior in conjunction with groupby.
In a nutshell: having NaN in a Float field makes the groupby() min/max computation wrong.
Expected Behavior
From my perspective, NaN must propagate: an aggregation of NaN must again produce NaN.
Semantically: "an invalid value cannot be computed with, so a transformation of it should again result in an invalid value".
An aggregation (via groupby) of NaN should therefore result in NaN, as sketched below.
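For illustration, under this expectation the reproducible example above would give something like the following (a sketch of the reporter's intent; whether the marker is NaN or <NA> is secondary):
df.groupby('A')['C'].agg(['max','min'])
#
#     max  min
# A
# A   NaN  NaN
# B   NaN  NaN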
Installed Versions
INSTALLED VERSIONS
commit : fd3f57170aa1af588ba877e8e28c158a20a4886d
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-186-generic
Version : #206-Ubuntu SMP Fri Apr 26 12:31:10 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.0
numpy : 1.23.5
pytz : 2024.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.1
Cython : 0.29.37
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.1
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.6.0
gcsfs : 2024.6.0
matplotlib : 3.7.5
numba : 0.59.1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
I believe the code you are complaining about is in these lines (but it's worth double-checking): https://github.com/pandas-dev/pandas/blob/bef88efe999809b775ed88a02f0fc2fd6d2d08a2/pandas/core/numba/kernels/min_max.py#L106 https://github.com/pandas-dev/pandas/blob/bef88efe999809b775ed88a02f0fc2fd6d2d08a2/pandas/core/numba/kernels/min_max.py#L49-L52
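For intuition, here is a hypothetical, heavily simplified sketch (not the actual numba kernel) of how seeding an accumulator with -inf and relying on ordinary comparisons would produce exactly the -inf seen in the report, because no NaN ever wins a comparison:
import numpy as np

def max_kernel_sketch(values):
    # hypothetical simplification: seed the accumulator with -inf and
    # replace it only when an element compares strictly greater
    result = -np.inf
    for v in values:
        if v > result:  # False for every NaN (IEEE 754)
            result = v
    return result

max_kernel_sketch(np.array([np.nan, np.nan]))  # -inf, as in the report
max_kernel_sketch(np.array([np.nan, 1.0]))     # 1.0: the NaN is silently skipped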
~~Let me explain how I think your issue report could be modified and why.~~
Scratch that
> From my perspective, NaN must propagate: an aggregation of NaN must again produce NaN.
> Semantically: "an invalid value cannot be computed with, so a transformation of it should again result in an invalid value". An aggregation (via groupby) of NaN should result in NaN.
We need to be precise about the quantification, all vs any, when dealing with aggregation or reduction.
NumPy follows "any nan implies nan":
np.array([0, 1, 2]).max() # 2 (no nan => no nan)
np.array([0, 1, np.inf]).max() # inf (no nan => no nan)
np.array([np.nan, 0, 1, np.inf]).max() # nan (some nan => nan)
(np.array([0, 1, np.inf]) / np.array([0, 1, np.inf])).max() # nan (some nan => nan)
Pandas follows "all NA implies NA":
pd.Series([0, 1, 2], dtype='Float64').max() # 2 (no NA => no NA)
pd.Series([0, 1, np.inf], dtype='Float64').max() # inf (no NA => no NA)
pd.Series([np.nan, 0, 1, np.inf], dtype='Float64').max() # inf (some NA =/=> NA)
pd.Series([np.nan, np.nan, np.nan], dtype='Float64').max() # <NA> (all NA => NA)
To complicate the matter, Pandas treats NA and np.nan sometimes differently and sometimes not. What exactly the semantics of NA and np.nan should be in Pandas is still being decided by the maintainers in #32265 (which you referenced). The consensus tends to be that NA is a missing value, while np.nan is a bad value. In most cases, missing values can simply be ignored, unlike bad values. This explains why a single bad value np.nan ruins the computation in NumPy, while a single missing value pd.NA does not do the same in Pandas.
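This difference is visible directly on Series.max via its skipna parameter (which defaults to True); a minimal check:
s = pd.Series([pd.NA, 0, 1], dtype="Float64")
s.max()              # 1.0: the missing value is skipped by default
s.max(skipna=False)  # <NA>: opting out of skipping gives "any NA implies NA"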
Now, to complicate the matter even further, Pandas transforms np.nan into pd.NA:
s = pd.Series([np.nan, 0, 1], dtype="Float64")
s.max() # 1.0, because max([<NA>, 0, 1]) is 1
(s / s).max() # <NA>, because max([<NA>, np.nan, 1]) is np.nan which becomes <NA>
- In the first line, Pandas transforms np.nan into pd.NA (np.nan historically denoted a missing value in Pandas, before nullable arrays were introduced). So the Series is [<NA>, 0, 1].
- In the second line, as expected and as we saw before, a missing value is simply ignored: max([<NA>, 0, 1]) gives 1.
- In the third line, s / s becomes [<NA>, np.nan, 1], where 0 / 0, i.e. np.nan, is a bad value, which must derail the aggregation, so max([<NA>, np.nan, 1]) gives np.nan. But this is not the end. For some reason, Pandas converts np.nan again to <NA>.
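To make both conversions visible (a minimal check, assuming the pandas 2.2 behavior described above):
s = pd.Series([np.nan, 0, 1], dtype="Float64")
s.iat[0] is pd.NA  # True: np.nan became <NA> already at construction time
t = s / s          # [<NA>, np.nan, 1.0]: 0 / 0 produces a genuine np.nan
t.max()            # <NA>: the resulting np.nan is boxed back into <NA>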
~~The expected behaviour you propose, @glaucouri, would equate pd.NA to np.nan. I don't think the council of maintainers would support this. Therefore I suggest to reframe your issue differently:~~
I misunderstood your suggestion initially. You indeed insist on treating np.nan as an invalid value consistently in aggregation functions. I personally care more about consistency, so here is another example of the supposed bug:
s = pd.Series([np.nan, 0, 1], dtype="Float64")
(s / s).max() # <NA>
(s / s).groupby([9, 9, 9]).max().iat[0] # 1.0
The last two lines were expected to give the same result (whatever it should be).
~~Thanks for the report, it looks like Series.max is not adhering to the default value skipna=True.~~
Edit: I believe I forgot that this was groupby. There is no skipna argument.
Further investigations and PRs to fix are welcome!
take
I have a feeling all such problems stem from the fact that we use NA/None and NaN interchangeably, whereas they are not interchangeable. Missing value and Not a Number are different concepts.
A few corner cases here:
- when all elements are NaN, the output should be NaN; there is no need to compare elements with each other
- when at least one NaN is present, we follow IEEE 754 to compare the elements of the collection; whether the output is NaN is implementation dependent (it also depends on the order of the elements). A typical min/max search algorithm starts with the first element of the collection as the candidate and compares it to the others until a winner emerges, so to speak (see the sketch below), but:

> Every NaN shall compare unordered with everything, including itself.
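A minimal sketch of such a search in plain Python, showing how a single NaN makes the result depend on element order:
import numpy as np

def naive_max(values):
    # start with the first element as the candidate and replace it
    # whenever a later element compares strictly greater
    candidate = values[0]
    for v in values[1:]:
        if v > candidate:  # every comparison involving NaN is False
            candidate = v
    return candidate

naive_max([np.nan, 0.0, 1.0])  # nan: a NaN candidate is never replaced
naive_max([0.0, 1.0, np.nan])  # 1.0: a NaN contender never wins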