polars
Add `value_counts` to DataFrame (+ `Expr.normalize`)
Problem description
`Expr.value_counts` and `Series.value_counts` are already supported. Is there a reason why this is not supported on a DataFrame? (Is this to reduce the API surface?)
import polars as pl
import pandas as pd
import numpy as np
data = np.random.randint(0, 100, size=(2, 100_000))
df_pl = pl.DataFrame({"a": data[0], "b": data[1]})
df_pd = pd.DataFrame({"a": data[0], "b": data[1]})
# polars
(
    df_pl
    .groupby(
        by=["a", "b"],
    )
    .agg(
        pl.col("a").count().alias("count")
    )
    .sort("count", reverse=True)
)
# pandas
df_pd.value_counts()
# pandas + options
df_pd.value_counts(
    subset=["a", "b"],
    sort=True,
    ascending=True,
    normalize=True,
)
IMO it would be nice to have this on a DataFrame as well, because it is rather verbose otherwise. It does not necessarily need a `subset` or `sort` argument (unless that would make it much faster/more efficient), since you can use polars building blocks like `select` and `sort` to achieve the same result. It would be cool, however, to add a `normalize` method on `Expr` to get another useful building block.
We could add a utility method, but I am not too enthusiastic about hoisting many expressions into a DataFrame method. Here we could argue that it is a common enough utility for EDA, though.
df.select([
    pl.struct(["a", "b"]).value_counts(sort=True).alias("counts")
]).unnest("counts").unnest("a")
Agreed, the DataFrame method should be as small as possible (maybe add a `sort` argument, because this is already implemented for `Expr` and `Series`?). An implementation would definitely make sense because, as you say, it is used a lot for EDA. Using `value_counts` in combination with available building blocks leads to nice and concise code (see below).
df = pl.DataFrame({
    "a": [1, 1, 1, 2, 1, 2, 4, 4, 4, 1],
    "b": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "c": "nothing important",
})
(
    df
    .select(["a", "b"])
    .value_counts()  # <<<<< NEW
    .sort("counts", reverse=True)
    .with_column(
        (pl.col("counts") / pl.col("counts").sum()).alias("percentage")
        # pl.col("counts").normalize().alias("percentage")  # <<<<< this would be nice, too =)
    )
)
┌─────┬─────┬────────┬────────────┐
│ a ┆ b ┆ counts ┆ percentage │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 ┆ f64 │
╞═════╪═════╪════════╪════════════╡
│ 1 ┆ A ┆ 3 ┆ 0.3 │
│ 4 ┆ C ┆ 3 ┆ 0.3 │
│ 2 ┆ B ┆ 2 ┆ 0.2 │
│ 1 ┆ C ┆ 1 ┆ 0.1 │
│ 1 ┆ B ┆ 1 ┆ 0.1 │
└─────┴─────┴────────┴────────────┘
I'm a big fan of adding `value_counts` to the expression realm, similar to how #5165 added `n_unique`. This would be very nice:
df.select(
    pl.col(['a', 'b', 'c']).value_counts()
)
Edit: here's a sort of long-winded way to do this:
df = pl.DataFrame({
    "a": [1, 1, 1, 2, 1, 2, 4, 4, 4, 1],
    "b": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "c": "nothing important",
})
df = df.select([
    pl.struct(['a', 'b', 'c']).value_counts().struct.field('a').struct.field('a'),
    pl.struct(['a', 'b', 'c']).value_counts().struct.field('a').struct.field('b'),
    pl.struct(['a', 'b', 'c']).value_counts().struct.field('a').struct.field('c'),
    pl.struct(['a', 'b', 'c']).value_counts().struct.field("counts"),
])
Ok, to those who want a utility function on both a `pl.DataFrame` and in expression context:
def value_counts_pl(cols, **kwargs):
    """Implement value counts for a polars frame or a list of column names."""
    if isinstance(cols, (pl.DataFrame, pl.LazyFrame)):
        return (
            cols.select(
                pl.struct(cols.columns).value_counts(**kwargs).alias("counts")
            )
            .unnest("counts")
            .unnest(cols.columns[0])
        )
    elif isinstance(cols, list):
        s = pl.struct(cols).value_counts(**kwargs).struct
        return [
            *[s.field(cols[0]).struct.field(col) for col in cols],
            s.field("counts"),
        ]
df = pl.DataFrame({
    "a": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "b": ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
    "c": "nothing important",
})
# both of these return the same result
value_counts_pl(df, sort=True)
df.select(value_counts_pl(['a', 'b', 'c'], sort=True))
Is this really necessary? I feel the original comment overcomplicates it:
(
    df_pl
    .groupby(
        by=["a", "b"],
    )
    .agg(
        pl.col("a").count().alias("count")
    )
    .sort("count", reverse=True)
)
This could just be:
df_pl.group_by("*").count()
Or if you really want it sorted:
df_pl.group_by("*").count().sort("count", reverse=True)
I don't think `value_counts` is really needed when it's already this simple.
@orlp that is quite simple (and in retrospect, duh), but I suppose that it's not very obvious to most users who are looking for a value counts function on their frame (it wasn't to me at least).
Performing a value counts on a frame is common for exploratory analysis and I'd still vote for it to be included.