
Add `value_counts` to DataFrame (+ `Expr.normalize`)

Open · Julian-J-S opened this issue 2 years ago

Problem description

Expr.value_counts and Series.value_counts are already supported.

Is there a reason why this is not supported on a DataFrame? (Is this to reduce the API surface?)

import polars as pl
import pandas as pd
import numpy as np

data = np.random.randint(0, 100, size=(2, 100_000))

df_pl = pl.DataFrame({"a": data[0], "b": data[1]})
df_pd = pd.DataFrame({"a": data[0], "b": data[1]})

# polars
(
    df_pl
    .groupby(
        by=["a", "b"],
    )
    .agg(
        pl.col("a").count().alias("count")
    )
    .sort("count", reverse=True)
)

# pandas
df_pd.value_counts()

# pandas + options
df_pd.value_counts(
    subset=["a", "b"],
    sort=True,
    ascending=True,
    normalize=True,
)

IMO it would be nice to have this on a DataFrame as well, because it is rather verbose otherwise.

It does not necessarily need a subset or sort argument (unless that would make it much faster/more efficient), since you can use polars building blocks like select and sort to achieve the same result.

It would be cool, however, to add a normalize method on Expr as another useful building block.

Julian-J-S avatar Jan 09 '23 11:01 Julian-J-S

We could add a utility method, but I am not too enthusiastic about hoisting many expressions into DataFrame methods. Here, though, one could argue it is a common enough utility for EDA.

df.select([
    pl.struct(["a", "b"]).value_counts(sort=True).alias("counts")
]).unnest("counts").unnest("a")

ritchie46 avatar Jan 09 '23 12:01 ritchie46

Agreed, the DataFrame method should be as small as possible (though maybe add a sort argument, since that is already implemented for Expr and Series?).

An implementation would definitely make sense because, as you say, it is used a lot for EDA.

Using the value_counts in combination with available building blocks leads to nice and concise code (see below).

df = pl.DataFrame({
    "a": [1, 1, 1, 2, 1, 2, 4, 4, 4, 1],
    "b": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "c": "nothing important",
})

(
    df
    .select(["a", "b"])
    .value_counts()  # <<<<< NEW
    .sort("counts", reverse=True)
    .with_column(
        (pl.col("counts") / pl.col("counts").sum()).alias("percentage")
        # pl.col("counts").normalize().alias("percentage")  # <<<<< this would be nice, too =)
    )
)
┌─────┬─────┬────────┬────────────┐
│ a   ┆ b   ┆ counts ┆ percentage │
│ --- ┆ --- ┆ ---    ┆ ---        │
│ i64 ┆ str ┆ u32    ┆ f64        │
╞═════╪═════╪════════╪════════════╡
│ 1   ┆ A   ┆ 3      ┆ 0.3        │
│ 4   ┆ C   ┆ 3      ┆ 0.3        │
│ 2   ┆ B   ┆ 2      ┆ 0.2        │
│ 1   ┆ C   ┆ 1      ┆ 0.1        │
│ 1   ┆ B   ┆ 1      ┆ 0.1        │
└─────┴─────┴────────┴────────────┘

Julian-J-S avatar Jan 09 '23 14:01 Julian-J-S

I'm a big fan of adding value_counts to the expression realm, similar to how #5165 added n_unique. This would be very nice:

df.select(
    pl.col(['a', 'b', 'c']).value_counts()
)

Edit: here's a sort of long-winded way to do this:

df = pl.DataFrame({
    "a": [1, 1, 1, 2, 1, 2, 4, 4, 4, 1],
    "b": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
    "c": "nothing important",
})

df = df.select([
    pl.struct(['a', 'b', 'c']).value_counts().struct.field('a').struct.field('a'),
    pl.struct(['a', 'b', 'c']).value_counts().struct.field('a').struct.field('b'),
    pl.struct(['a', 'b', 'c']).value_counts().struct.field('a').struct.field('c'),
    pl.struct(['a', 'b', 'c']).value_counts().struct.field("counts"),
])

mcrumiller avatar Mar 06 '23 17:03 mcrumiller

Ok, to those who want a utility function on both a pl.DataFrame and in expression context:

def value_counts_pl(cols, **kwargs):
    """Value counts for a polars DataFrame/LazyFrame or a list of column names."""
    if isinstance(cols, (pl.DataFrame, pl.LazyFrame)):
        return (
            cols.select(pl.struct(cols.columns).value_counts(**kwargs).alias("counts"))
            .unnest("counts")
            .unnest(cols.columns[0])
        )
    elif isinstance(cols, list):
        s = pl.struct(cols).value_counts(**kwargs).struct
        return [
            *[s.field(cols[0]).struct.field(col) for col in cols],
            s.field("counts"),
        ]
    raise TypeError("expected a DataFrame, LazyFrame, or list of column names")


df = pl.DataFrame({
    "a": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "b": ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd'],
    "c": "nothing important",
})

# both of these return the same result
value_counts_pl(df, sort=True)
df.select(value_counts_pl(['a', 'b', 'c'], sort=True))

mcrumiller avatar Mar 06 '23 17:03 mcrumiller

Is this really necessary? I feel the original comment overcomplicates it:

(
    df_pl
    .groupby(
        by=["a", "b"],
    )
    .agg(
        pl.col("a").count().alias("count")
    )
    .sort("count", reverse=True)
)

This could just be:

df_pl.group_by("*").count()

Or if you really want it sorted:

df_pl.group_by("*").count().sort("count", reverse=True)

I don't think value_counts is really needed when it's already this simple.

orlp avatar Jul 12 '24 12:07 orlp

@orlp that is quite simple (and in retrospect, duh), but I suppose that it's not very obvious to most users who are looking for a value counts function on their frame (it wasn't to me at least).

Performing a value counts on a frame is common for exploratory analysis and I'd still vote for it to be included.

mcrumiller avatar Jul 12 '24 13:07 mcrumiller