Add `normalize` option for `value_counts`

Open uditrana opened this issue 1 year ago • 1 comments

Problem description

pandas has a `normalize` option on `value_counts`, and it should be trivial to bring over here as well. When working with extremely large datasets, the percentage of datapoints for each value is often much more descriptive than the raw count.

Example Solution

For now I implemented it myself as a utility function, as follows:

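import polars as pl

def value_counts(df, col, drop_nulls=False, sort=True, normalize=False):
    # Optionally exclude null values before counting.
    if drop_nulls:
        df = df.drop_nulls(col)
    vcs = df[col].value_counts(sort=sort)
    if normalize:
        # Assumes the count column is named 'counts'; the name may differ across Polars versions.
        # Replace raw counts with fractions of the total, rounded to 4 decimal places.
        vcs = vcs.with_columns((pl.col('counts') / pl.col('counts').sum()).round(4).keep_name())
    return vcs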

Jul 28 '23 02:07 uditrana

Very much needed, and simpler now that count() does not include nulls.

Note that this is possible but tricky to implement when working on pl.element().value_counts() as it requires unpacking and re-packing the structs.

Jan 05 '24 18:01 mkleinbort-ic

Any update about this? Thx

Jun 12 '24 10:06 AmgadHasan
