polars
Add `normalize` option for `value_counts`
Problem description
pandas supports `normalize=True` in its `value_counts`, and it should be straightforward to bring over here as well. When working with extremely large datasets, the proportion of data points for each value is often much more descriptive than the raw count.
Example Solution
For now, I implemented it myself as a utility function:
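import polars as pl

def value_counts(df, col, drop_nulls=False, sort=True, normalize=False):
    # Optionally exclude null values before counting.
    if drop_nulls:
        df = df.drop_nulls(col)
    vcs = df[col].value_counts(sort=sort)
    if normalize:
        # Convert raw counts to proportions of the total, rounded to 4 decimals.
        vcs = vcs.with_columns((pl.col('counts') / pl.col('counts').sum()).round(4).keep_name())
    return vcs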
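As a rough illustration of what the `normalize` branch computes, the sketch below reproduces the same arithmetic using only the Python standard library. It does not use the Polars API; the function name `normalized_counts` and the sample values are hypothetical.

```python
from collections import Counter

def normalized_counts(values, drop_nulls=False, ndigits=4):
    """Map each distinct value to its share of all rows, most frequent first."""
    if drop_nulls:
        # Mirror drop_nulls=True: discard missing values before counting.
        values = [v for v in values if v is not None]
    counts = Counter(values)
    total = sum(counts.values())
    # Divide each count by the total and round, as the utility above does.
    return {value: round(n / total, ndigits) for value, n in counts.most_common()}

print(normalized_counts(["a", "b", "a", None], drop_nulls=True))
```

With nulls dropped, the shares are taken over the three remaining rows, so they sum to 1 up to rounding.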
Very much needed, and simpler now that count() does not include nulls.
Note that this is possible but tricky to implement when working on `pl.element().value_counts()`, as it requires unpacking and re-packing the structs.
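The unpack/re-pack step can be pictured with plain Python data. This is only a model of the idea, not Polars code: each struct produced by `value_counts` is represented as a dict, and the field names `value` and `count` as well as the helper `normalize_structs` are assumptions made for illustration.

```python
def normalize_structs(structs):
    """Replace the 'count' field of each {value, count} struct with its share."""
    # Unpack: pull the count field out of every struct.
    counts = [s["count"] for s in structs]
    total = sum(counts)
    # Re-pack: rebuild each struct with the original value and the normalized count.
    return [{"value": s["value"], "count": c / total} for s, c in zip(structs, counts)]

# One list element's value counts, modeled as a list of structs.
per_element = [{"value": "a", "count": 3}, {"value": "b", "count": 1}]
print(normalize_structs(per_element))
```

The point of the model is that the normalization needs the sum over all structs in the same list, so the count field has to be extracted before the structs can be rebuilt.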
Any update on this? Thanks.