polars
Nulls in categorical col cause mislabelled value counts
Polars version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Issue description
When you call value_counts on a categorical column through the expression API, the result is missing the null value: the null group keeps its correct count but is labelled with another category, so you end up with something like this:
┌──────────────┐
│ job          │
│ ---          │
│ struct[2]    │
╞══════════════╡
│ {"waiter",1} │
│ {"doctor",1} │
│ {"doctor",3} │
└──────────────┘
Interestingly, if you call value_counts on the Series it works fine.
Reproducible example
import polars as pl
s = pl.Series(
    "job", ["doctor", "waiter", None, None, None], pl.Categorical
)
df = pl.DataFrame([s])
print(
    df
    .select(pl.col("job").value_counts())
)
Expected behavior
┌──────────────┐
│ job          │
│ ---          │
│ struct[2]    │
╞══════════════╡
│ {"waiter",1} │
│ {"doctor",1} │
│ {null,3}     │
└──────────────┘
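For reference, the expected counts can be cross-checked without Polars at all. The following is only a minimal sketch using the standard library's collections.Counter on the same input values; the variable names are illustrative and it says nothing about how Polars computes the result internally.

```python
from collections import Counter

# Same values as the reproducible example; None stands in for null.
jobs = ["doctor", "waiter", None, None, None]

# Counter treats None as an ordinary key, so the null group keeps its own label.
expected = Counter(jobs)

for value, count in expected.most_common():
    print(value, count)
```

The null group should appear under its own key with a count of 3, rather than being folded into or labelled as "doctor".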
Installed versions
---Version info---
Polars: 0.15.11
Index type: UInt32
Platform: Linux-5.15.85-1-MANJARO-x86_64-with-glibc2.36
Python: 3.11.0 | packaged by conda-forge | (main, Oct 25 2022, 06:24:40) [GCC 10.4.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.23.5
fsspec: 2022.11.0
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: 3.6.2
Interestingly though, selecting the column and converting it to a Series before calling value_counts gives the correct result:
import polars as pl
s = pl.Series("job", ["doctor", "waiter", None, None, None], pl.Categorical)
df = pl.DataFrame([s])
# Expression API: the null group is mislabelled as "doctor"
print(df.select(pl.col("job").value_counts()))
# Series API: the null group is reported correctly
print(df.select(pl.col("job")).to_series().value_counts())
Output:
shape: (3, 1)
┌──────────────┐
│ job          │
│ ---          │
│ struct[2]    │
╞══════════════╡
│ {"doctor",1} │
│ {"doctor",3} │
│ {"waiter",1} │
└──────────────┘
shape: (3, 2)
┌────────┬────────┐
│ job    ┆ counts │
│ ---    ┆ ---    │
│ cat    ┆ u32    │
╞════════╪════════╡
│ doctor ┆ 1      │
│ null   ┆ 3      │
│ waiter ┆ 1      │
└────────┴────────┘