DataFrameGroupBy.agg with NaN results in inf
Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.__version__
# '2.2.0'
df = pd.DataFrame({'A':['A','B'], 'B':[0,0]}).astype({'A':'string','B':'Float32'})
df['C'] = df['B']/df['B']
df.groupby('A')['C'].agg(['max','min'])
#
# max min
# A
# A -inf inf
# B -inf inf
df.groupby('A')['C'].max() # the same with .agg(max)
# A
# A -inf
# B -inf
# Name: C, dtype: Float32
df.groupby('A')['C'].min() # the same with .agg(min)
# A
# A inf
# B inf
# Name: C, dtype: Float32
Issue Description
DataFrameGroupBy.agg handles NaN poorly.
Unfortunately, it sometimes happens that nullable fields contain NaN; cf. https://github.com/pandas-dev/pandas/issues/32265.
This case runs into unexpected behavior in conjunction with groupby.
In a nutshell: having NaN in a Float field makes the groupby() min/max computation wrong.
Expected Behavior
From my perspective, NaN must propagate: an aggregation of NaN must again produce NaN.
Semantically: "an invalid value cannot be computed with, so a transformation of it should again result in an invalid value".
An aggregation (via groupby) of NaN should therefore result in NaN, as sketched below.
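For illustration, under this expectation the reproducible example above would give something like the following (a sketch of the reporter's intent; whether the marker is NaN or <NA> is secondary):
df.groupby('A')['C'].agg(['max','min'])
#
#     max  min
# A
# A   NaN  NaN
# B   NaN  NaN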
Installed Versions
INSTALLED VERSIONS
commit : fd3f57170aa1af588ba877e8e28c158a20a4886d
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-186-generic
Version : #206-Ubuntu SMP Fri Apr 26 12:31:10 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.0
numpy : 1.23.5
pytz : 2024.1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 24.1
Cython : 0.29.37
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.22.1
pandas_datareader : None
adbc-driver-postgresql : None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : 2024.2.0
fsspec : 2024.6.0
gcsfs : 2024.6.0
matplotlib : 3.7.5
numba : 0.59.1
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 12.0.1
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
sqlalchemy : 2.0.28
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
I believe the code you are complaining about is in these lines (but it's worth double-checking): https://github.com/pandas-dev/pandas/blob/bef88efe999809b775ed88a02f0fc2fd6d2d08a2/pandas/core/numba/kernels/min_max.py#L106 https://github.com/pandas-dev/pandas/blob/bef88efe999809b775ed88a02f0fc2fd6d2d08a2/pandas/core/numba/kernels/min_max.py#L49-L52
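For intuition, here is a hypothetical, heavily simplified sketch (not the actual numba kernel) of how seeding an accumulator with -inf and relying on ordinary comparisons would produce exactly the -inf seen in the report, because no NaN ever wins a comparison:
import numpy as np

def max_kernel_sketch(values):
    # hypothetical simplification: seed the accumulator with -inf and
    # replace it only when an element compares strictly greater
    result = -np.inf
    for v in values:
        if v > result:  # False for every NaN (IEEE 754)
            result = v
    return result

max_kernel_sketch(np.array([np.nan, np.nan]))  # -inf, as in the report
max_kernel_sketch(np.array([np.nan, 1.0]))     # 1.0: the NaN is silently skipped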
~~Let me explain how I think your issue report could be modified and why.~~
Scratch that
> From my perspective, NaN must propagate: an aggregation of NaN must again produce NaN.
> Semantically: "an invalid value cannot be computed with, so a transformation of it should again result in an invalid value". An aggregation (via groupby) of NaN should result in NaN.
We need to be precise about the quantification, all vs any, when dealing with aggregation or reduction.
NumPy follows "any nan implies nan":
np.array([0, 1, 2]).max() # 2 (no nan => no nan)
np.array([0, 1, np.inf]).max() # inf (no nan => no nan)
np.array([np.nan, 0, 1, np.inf]).max() # nan (some nan => nan)
(np.array([0, 1, np.inf]) / np.array([0, 1, np.inf])).max() # nan (some nan => nan)
Pandas follows "all NA implies NA":
pd.Series([0, 1, 2], dtype='Float64').max() # 2 (no NA => no NA)
pd.Series([0, 1, np.inf], dtype='Float64').max() # inf (no NA => no NA)
pd.Series([np.nan, 0, 1, np.inf], dtype='Float64').max() # inf (some NA =/=> NA)
pd.Series([np.nan, np.nan, np.nan], dtype='Float64').max() # <NA> (all NA => NA)
To complicate the matter, Pandas treats NA and np.nan sometimes differently and sometimes not. What exactly the semantics of NA and np.nan should be in Pandas is still being decided by the maintainers in #32265 (which you referenced). The consensus tends to be that NA is a missing value, while np.nan is a bad value. In most cases, missing values can simply be ignored, unlike bad values. This explains why a single bad value np.nan ruins the computation in NumPy, while a single missing value pd.NA does not do the same in Pandas.
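This difference is visible directly on Series.max via its skipna parameter (which defaults to True); a minimal check:
s = pd.Series([pd.NA, 0, 1], dtype="Float64")
s.max()              # 1.0: the missing value is skipped by default
s.max(skipna=False)  # <NA>: opting out of skipping gives "any NA implies NA"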
Now, to complicate the matter even further, Pandas transforms np.nan into pd.NA:
s = pd.Series([np.nan, 0, 1], dtype="Float64")
s.max() # 1.0, because max([<NA>, 0, 1]) is 1
(s / s).max() # <NA>, because max([<NA>, np.nan, 1]) is np.nan which becomes <NA>
- In the first line, Pandas transforms np.nan into pd.NA (np.nan historically denoted a missing value in Pandas, before nullable arrays were introduced). So the Series is [<NA>, 0, 1].
- In the second line, as expected and as we saw before, a missing value is simply ignored: max([<NA>, 0, 1]) gives 1.
- In the third line, s / s becomes [<NA>, np.nan, 1], where 0 / 0, i.e. np.nan, is a bad value, which must derail the aggregation, so max([<NA>, np.nan, 1]) gives np.nan. But this is not the end. For some reason, Pandas converts np.nan again to <NA>.
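To make both conversions visible (a minimal check, assuming the pandas 2.2 behavior described above):
s = pd.Series([np.nan, 0, 1], dtype="Float64")
s.iat[0] is pd.NA  # True: np.nan became <NA> already at construction time
t = s / s          # [<NA>, np.nan, 1.0]: 0 / 0 produces a genuine np.nan
t.max()            # <NA>: the resulting np.nan is boxed back into <NA>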
~~The expected behaviour you propose, @glaucouri, would equate pd.NA to np.nan. I don't think the council of maintainers would support this. Therefore I suggest to reframe your issue differently:~~
I misunderstood your suggestion initially. You indeed insist on treating np.nan as an invalid value consistently in aggregation functions. I personally care more about consistency, so here is another example of the supposed bug:
s = pd.Series([np.nan, 0, 1], dtype="Float64")
(s / s).max() # <NA>
(s / s).groupby([9, 9, 9]).max().iat[0] # 1.0
The last two lines were expected to give the same result (whatever it should be).
~~Thanks for the report, it looks like Series.max is not adhering to the default value skipna=True.~~
Edit: I believe I forgot that this was groupby. There is no skipna argument.
Further investigations and PRs to fix are welcome!
take
I have a feeling all such problems stem from the fact that we use NA/None and NaN interchangeably, whereas they are not interchangeable. Missing value and Not a Number are different concepts.
A few corner cases here:
- when all elements are NaN, the output should be NaN; there is no need to compare elements with each other
- when at least one NaN is present, we follow IEEE 754 to compare the elements of the collection; whether the output is NaN is implementation dependent (it also depends on the order of the elements). A typical min/max search algorithm starts with the first element of the collection as the candidate and compares it to the others until a winner emerges, so to speak (see the sketch below), but:

> Every NaN shall compare unordered with everything, including itself.
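A minimal sketch of such a search in plain Python, showing how a single NaN makes the result depend on element order:
import numpy as np

def naive_max(values):
    # start with the first element as the candidate and replace it
    # whenever a later element compares strictly greater
    candidate = values[0]
    for v in values[1:]:
        if v > candidate:  # every comparison involving NaN is False
            candidate = v
    return candidate

naive_max([np.nan, 0.0, 1.0])  # nan: a NaN candidate is never replaced
naive_max([0.0, 1.0, np.nan])  # 1.0: a NaN contender never wins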