BUG: `dropna` affects `observed` in `DataFrame.groupby()` since v1.5

Open pwwang opened this issue 2 years ago • 2 comments

Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [ ] I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# With pandas 1.5

from pandas import DataFrame, Categorical

df = DataFrame({"x": Categorical([1, 2], categories=[1, 2, 3]), "y": [3, 4]})

df.groupby("x", observed=False).grouper.result_index
# CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=False, dtype='category', name='x')

df.groupby("x", observed=False, dropna=False).grouper.result_index
# CategoricalIndex([1, 2], categories=[1, 2, 3], ordered=False, dtype='category', name='x')
# ------------------------------------------------------------------------------------------
# Unexpected result ↑

df.groupby("x", observed=False, dropna=True).grouper.result_index
# CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=False, dtype='category', name='x')


# With pandas 1.4.4 and prior

df.groupby("x", observed=False).grouper.result_index
# CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=False, dtype='category', name='x')

df.groupby("x", observed=False, dropna=False).grouper.result_index
# CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=False, dtype='category', name='x')

df.groupby("x", observed=False, dropna=True).grouper.result_index
# CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=False, dtype='category', name='x')

Issue Description

dropna=False in DataFrame.groupby() should not affect the results when observed=False.

Expected Behavior

Expected the behavior with pandas 1.4.4 and prior.
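
A minimal sketch of how the expectation could be checked without relying on the `grouper` attribute. The helper name `missing_categories` is hypothetical and not part of pandas; it lists the categories that are absent from the groupby result index. With `observed=False`, the expected result is an empty list regardless of `dropna`.

```python
import pandas as pd


def missing_categories(df, key, **groupby_kwargs):
    """Return the categories of df[key] that are absent from the groupby result index.

    Hypothetical helper for illustration only; not a pandas API.
    """
    result_index = df.groupby(key, **groupby_kwargs).size().index
    present = set(result_index.tolist())
    return [c for c in df[key].cat.categories if c not in present]


df = pd.DataFrame({"x": pd.Categorical([1, 2], categories=[1, 2, 3]), "y": [3, 4]})

# Expected: an empty list in all three cases, since observed=False should
# keep every category. On an affected version the dropna=False call is the
# one that reports category 3 as missing.
print(missing_categories(df, "x", observed=False))
print(missing_categories(df, "x", observed=False, dropna=False))
print(missing_categories(df, "x", observed=False, dropna=True))
```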

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e38bafe7300a6003a1d18bd80f3f77c763
python : 3.9.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.57.1-microsoft-standard-WSL2
Version : #1 SMP Wed Jul 27 02:20:31 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.1
dateutil : 2.8.2
setuptools : 58.0.0
pip : 22.2.2
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.1.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
brotli :
fastparquet : None
fsspec : 2022.02.0
gcsfs : None
matplotlib : 3.5.1
numba : 0.53.1
numexpr : None
odfpy : None
openpyxl : 3.0.8
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.28
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

pwwang avatar Sep 19 '22 21:09 pwwang

cc @rhshadrach

Confirmed:

56a71845decdc308d4210d74c7a59e42f0762c31 is the first bad commit
commit 56a71845decdc308d4210d74c7a59e42f0762c31
Author: Richard Shadrach <[email protected]>
Date:   Thu Aug 18 12:09:09 2022 -0400

    BUG: algorithms.factorize moves null values when sort=False (#46601)

#46601

phofl avatar Sep 19 '22 21:09 phofl

#46601 fixed an issue with dropna and categorical, namely that dropna=False with a categorical key still dropped null values. On 1.4.x:

import pandas as pd

values, dtype = (["y", None, "x", "y"], "category")
key = pd.Series(values, dtype=dtype)
df = pd.DataFrame({"key": key, "a": [1, 2, 3, 4]})
gb = df.groupby("key", dropna=False)
print(gb.sum())

#      a
# key   
# x    3
# y    5

The null value is included in the result on 1.5.0. As identified, the patch did not correctly implement the case where observed=False.

I've looked into this, and it appears to me that our current implementations of categorical with nulls and of dropna are incompatible in groupby. Namely, categorical encodes values as nonnegative integers, with nulls represented by -1, while groupby with dropna=False requires nulls to be encoded as nonnegative integers.
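
To make the encoding mismatch concrete, here is a small sketch using the same values as the example above. The remapping step is an illustration only and an assumption on my part, not how pandas implements it: it just shows what "nulls encoded by a nonnegative integer" could look like.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["y", None, "x", "y"])

print(cat.categories.tolist())  # ['x', 'y']
print(cat.codes.tolist())       # [1, -1, 0, 1]  (the null is encoded as -1)

# Illustrative remapping (not the pandas implementation): give nulls their
# own nonnegative code, one past the last real category.
codes = np.where(cat.codes == -1, len(cat.categories), cat.codes)
print(codes.tolist())           # [1, 2, 0, 1]
```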

We could maybe hack in a patch where we add the null value(s?) to the categories, only to remove them upon returning the result. This seems like it would be too significant a change for a patch release, as well as fragile and prone to bugs. I am wondering if a better direction would be to reimplement groupby so that negative codes are only dropped when dropna=True. This may have some drawbacks and would need some experimenting, but again, it is too large a change for a patch version in my opinion.
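
For illustration, the "add the null to the categories, then remove it" idea can be sketched at the user level as follows. This is not the internal patch discussed here, only a rough analogue; `NULL_LABEL` is a hypothetical placeholder that is assumed not to clash with any real category.

```python
import pandas as pd

NULL_LABEL = "__null__"  # hypothetical placeholder, assumed unused by real data

key = pd.Series(["y", None, "x", "y"], dtype="category")
df = pd.DataFrame({"key": key, "a": [1, 2, 3, 4]})

# Add a temporary category for nulls and fill the missing keys with it.
patched_key = df["key"].cat.add_categories([NULL_LABEL]).fillna(NULL_LABEL)
patched = df.assign(key=patched_key)

# Group on the patched key; every row now has a nonnegative code.
result = patched.groupby("key", observed=False)["a"].sum()

# Map the placeholder back to a missing value in the result index.
result.index = pd.Index(
    [None if k == NULL_LABEL else k for k in result.index], name="key"
)
print(result)
```

The fragility mentioned above shows up even in this sketch: the placeholder has to be chosen so it cannot collide with a real category, and the result index loses its categorical dtype along the way.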

With this, my recommendation is to undo the offending line from #46601, i.e. change

https://github.com/pandas-dev/pandas/blob/73d15a7632e1b555defcc7942e5f629161626a4c/pandas/core/groupby/grouper.py#L663

to become `if self._passed_categorical:`. This would mean that dropna=False once again does not work with categorical, but it would fix this regression. I will put up a PR for this, but wanted to see if others have any thoughts first.

cc @jbrockmendel @mroeschke @jreback @phofl @jorisvandenbossche

rhshadrach avatar Sep 20 '22 21:09 rhshadrach