polars
Allow to keep empty categorical variables combinations when grouping
Problem Description
Hi,
I'm quite new to polars, so I hope I'm not missing something obvious, but it seems to me that there is a difference in behavior between pandas and polars when grouping by several categorical variables: in pandas, empty group combinations are kept in the output, whereas they are discarded in polars.
For example, starting with this small dataset:
import polars as pl
import pandas as pd
data_dict = {
    "cat1": ["a", "a", "b"],
    "cat2": ["foo", "bar", "foo"],
}
d_pandas = pd.DataFrame(data_dict, dtype="category")
d_polars = pl.DataFrame(
    data_dict,
    columns=[("cat1", pl.Categorical), ("cat2", pl.Categorical)],
)
In pandas, if I group by the two categorical variables and then count, the empty combination (b, bar) appears in the output:
d_pandas.groupby(["cat1", "cat2"]).value_counts()
# cat1  cat2
# a     bar     1
#       foo     1
# b     bar     0
#       foo     1
# dtype: int64
But it is discarded in polars:
d_polars.groupby(["cat1", "cat2"]).count()
# shape: (3, 3)
# ┌──────┬──────┬───────┐
# │ cat1 ┆ cat2 ┆ count │
# │ --- ┆ --- ┆ --- │
# │ cat ┆ cat ┆ u32 │
# ╞══════╪══════╪═══════╡
# │ a ┆ foo ┆ 1 │
# ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ a ┆ bar ┆ 1 │
# ├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ b ┆ foo ┆ 1 │
# └──────┴──────┴───────┘
The behavior is the same with other aggregation functions: pandas returns the empty combination with NaN as the result, whereas polars just drops it. The pandas behavior can be useful in some situations (I ran into this when trying to create a stacked bar chart with matplotlib).
In dplyr the behavior is the same as in polars, but there is a .drop argument to group_by() and count() that lets you choose to keep empty level combinations. Maybe it would be useful to add such an option to polars?
Many thanks for all your work on this wonderful library.
FYI, in pandas it's recommended to pass the observed=True argument to groupby when using categoricals; otherwise you can sometimes get huge memory blowups due to combinatorial explosion. This also produces the same behavior as seen in polars:
d_pandas.groupby(["cat1", "cat2"], observed=True).value_counts()
# cat1  cat2
# a     bar     1
#       foo     1
# b     foo     1
# dtype: int64
I prefer the default behavior mimicking pandas' observed=True, but I agree that an optional argument to include all combinations would be nice to have.
Ah, thanks, I didn't know about observed; that's good to know. And I agree that the default behavior is fine and that an option would be sufficient (it would also match dplyr's behavior).
Many thanks for taking the time to answer.