
Add `normalize` option for `value_counts`

Open uditrana opened this issue 1 year ago • 1 comments

Problem description

pandas has this option (`value_counts(normalize=True)`), and it should be straightforward to bring over here as well. When working with extremely large datasets, the proportion of data points for each value is often much more descriptive than the raw count.
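For reference, here is a minimal sketch of the pandas behaviour being requested. The sample data is made up for illustration only.

```python
import pandas as pd

# Made-up sample data; the None entry stands in for a missing value.
s = pd.Series(["a", "b", "a", "a", None])

# normalize=True returns each value's share of the non-null rows
# instead of its raw count (nulls are dropped by default, dropna=True).
shares = s.value_counts(normalize=True)
print(shares)
```

Here "a" accounts for three of the four non-null rows and "b" for one, so the shares sum to 1.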

Example Solution

For now, I have implemented it myself as a utility function:

import polars as pl

def value_counts(df, col, drop_nulls=False, sort=True, normalize=False):
    if drop_nulls:
        df = df.drop_nulls(col)
    vcs = df[col].value_counts(sort=sort)
    if normalize:
        # The count column is 'count' in recent Polars ('counts' in older releases), so look it up.
        cnt = vcs.columns[-1]
        vcs = vcs.with_columns((pl.col(cnt) / pl.col(cnt).sum()).round(4))
    return vcs

uditrana avatar Jul 28 '23 02:07 uditrana

Very much needed, and simpler now that count() does not include nulls.

Note that this is possible but tricky to implement for `pl.element().value_counts()`, since it requires unpacking and re-packing the structs.

mkleinbort-ic avatar Jan 05 '24 18:01 mkleinbort-ic

Any update about this? Thx

AmgadHasan avatar Jun 12 '24 10:06 AmgadHasan