Allow to keep empty categorical variables combinations when grouping

Open juba opened this issue 1 year ago • 2 comments

Problem Description

Hi,

I'm quite new to polars, so I hope I'm not missing something obvious, but there seems to be a difference in behavior between pandas and polars when grouping by several categorical variables: in pandas, empty group combinations are kept in the output, whereas in polars they are discarded.

For example, starting with this small dataset:

import polars as pl
import pandas as pd

data_dict = {
    "cat1": ["a",   "a",   "b"],
    "cat2": ["foo", "bar", "foo"]
}

d_pandas = pd.DataFrame(data_dict, dtype='category')

d_polars = pl.DataFrame(
    data_dict, 
    columns=[("cat1", pl.Categorical), ("cat2", pl.Categorical)]
)

In pandas, if I group by the two categorical variables and then count, the empty combination (b, bar) appears in the output:

d_pandas.groupby(["cat1", "cat2"]).value_counts()

# cat1  cat2
# a     bar     1
#       foo     1
# b     bar     0
#       foo     1
# dtype: int64

But it is discarded in polars:

d_polars.groupby(["cat1", "cat2"]).count()

# shape: (3, 3)
# ┌──────┬──────┬───────┐
# │ cat1 ┆ cat2 ┆ count │
# │ ---  ┆ ---  ┆ ---   │
# │ cat  ┆ cat  ┆ u32   │
# ╞══════╪══════╪═══════╡
# │ a    ┆ foo  ┆ 1     │
# ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ a    ┆ bar  ┆ 1     │
# ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ b    ┆ foo  ┆ 1     │
# └──────┴──────┴───────┘

The behavior is the same with other aggregation functions: pandas returns the empty combination with NaN as the result, whereas polars just drops it. It seems to me that the pandas behavior can be useful in some situations (I ran into the problem when trying to create a stacked bar chart with matplotlib).
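
To illustrate the NaN case, here is a minimal pandas sketch. The `value` column and its numbers are made up for illustration and are not part of the original dataset. `observed=False` is passed explicitly because the default depends on the pandas version.

```python
import math

import pandas as pd

df = pd.DataFrame(
    {
        "cat1": ["a", "a", "b"],
        "cat2": ["foo", "bar", "foo"],
        "value": [1.0, 2.0, 3.0],  # hypothetical numeric column
    }
).astype({"cat1": "category", "cat2": "category"})

# observed=False keeps every combination of category levels,
# including (b, bar), which has no rows.
means = df.groupby(["cat1", "cat2"], observed=False)["value"].mean()

# The empty combination is present, and its mean is NaN.
empty_mean = means.loc[("b", "bar")]
print(means)
print(math.isnan(empty_mean))
```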

In dplyr the behavior is the same as in polars, but group_by() and count() take a .drop argument that lets you keep empty level combinations. Maybe it would be useful to add such an option to polars?

Many thanks for all your work on this wonderful library.

juba, Sep 01 '22 12:09

FYI, in pandas it's recommended to pass the observed=True argument to groupby when using categoricals; otherwise you can sometimes get huge memory blowups due to combinatorial explosion. This also results in the same behavior as seen in polars:

d_pandas.groupby(["cat1", "cat2"], observed=True).value_counts()

# cat1  cat2
# a     bar     1
#       foo     1
# b     foo     1
# dtype: int64

I prefer the default behavior mimicking pandas' observed=True, but I agree that an optional argument to include all combinations would be nice to have.
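
As a rough illustration of the difference, the sketch below runs the same grouping with both settings and uses `.size()` as the counting aggregation. The variable names are only for illustration. With `observed=False` the result has one row per combination of category levels, which is where the combinatorial growth comes from.

```python
import pandas as pd

df = pd.DataFrame(
    {"cat1": ["a", "a", "b"], "cat2": ["foo", "bar", "foo"]},
    dtype="category",
)

# All level combinations, including empty ones.
all_sizes = df.groupby(["cat1", "cat2"], observed=False).size()

# Only combinations that occur in the data.
observed_sizes = df.groupby(["cat1", "cat2"], observed=True).size()

# The full result grows with the product of the level counts.
n_levels = [len(df[col].cat.categories) for col in ("cat1", "cat2")]

print(len(all_sizes), len(observed_sizes), n_levels)
```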

mcrumiller, Sep 01 '22 15:09

Ah thanks, I didn't know about observed, that's good to know. And I agree that the default behavior is fine and that an option would be sufficient (and it would be the same behavior as dplyr).

Many thanks for taking the time to answer.

juba, Sep 02 '22 08:09