polars Improve `DataFrame.describe` by adding information about missing values

Open danielgafni opened this issue 2 years ago • 5 comments

Describe your feature request

The share of missing values per column is an important piece of information. I have had multiple cases where suddenly discovering a lot of missing values in a DataFrame pointed to a bug in the upstream pipeline. It would be nice to have this in the describe method, where it fits naturally.

My proposal is:

implement DataFrame.missing - an average aggregation of is_null:

df.select([pl.col(c).is_null().cast(pl.Float64).mean().alias(c) for c in df.columns])

use it for DataFrame.describe

The questions here are:

should it be added to the .describe output by default? Probably not, since it may break some existing code (I'm not sure how often this would actually happen)
should it be activated by an optional argument with a default value of False? For Python this makes sense, but Rust doesn't have default values for arguments. Adding an argument would require refactoring every place where .describe is called.

What's the best solution here? I would also like to implement this myself in a PR, as a small Rust exercise.

Jul 18 '22 08:07 danielgafni

Not really worth a method IMO:

import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3, None],
    "b": [1, 2, 3, 4]
})

# fraction of nulls per column: null count divided by row count
df.select([
    pl.all().null_count() / pl.count()
])

shape: (1, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ f64  ┆ f64 │
╞══════╪═════╡
│ 0.25 ┆ 0.0 │
└──────┴─────┘

I do agree that it might be a nice addition to describe.

Jul 18 '22 08:07 ritchie46

Sure, probably we don't need a separate method. Anyway, what would be your advice about the .describe behavior / signature?

Jul 18 '22 08:07 danielgafni

I have raised #4820, which implements count and null_count, among other things.

Sep 11 '22 06:09 zundertj

Do you have plans to add the null_count to the Rust version of describe?

Dec 04 '22 04:12 philss

@philss: that seems sensible to me, but I'm not a Rust programmer, so someone else will have to do it. Feel free to raise a PR.

Dec 10 '22 12:12 zundertj
