polars
Improve `DataFrame.describe` by adding information about missing values
Describe your feature request
The share of missing values is an important piece of information. I have had multiple cases where suddenly discovering a lot of missing values in a DataFrame pointed to a bug in the upstream pipeline. It would be nice to have this in the `describe` method; it really fits there.
My proposal is:
- implement `DataFrame.missing`, an average aggregation of `is_null`: `df.select([pl.col(c).is_null().cast(pl.Float64).mean().alias(c) for c in df.columns])`
- use it for `DataFrame.describe`
The questions here are:
- should it be added to the `.describe` output by default? Probably not, since it may break some code (I'm not sure how often this will actually happen)
- should it be activated by an optional argument with a default value of `False`? For Python this makes sense, but Rust doesn't have default values for arguments. Adding an argument would require refactoring every place where `.describe` is called.
What's the best solution here? I would also like to implement this in a PR myself as a small Rust exercise.
Not really worth a method IMO:
import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3, None],
    "b": [1, 2, 3, 4],
})
df.select([
    pl.all().null_count() / pl.count()
])
shape: (1, 2)
┌──────┬─────┐
│ a    ┆ b   │
│ ---  ┆ --- │
│ f64  ┆ f64 │
╞══════╪═════╡
│ 0.25 ┆ 0.0 │
└──────┴─────┘
I do agree that it might be a nice addition to `describe`.
Sure, we probably don't need a separate method. Anyway, what would be your advice about the `.describe` behavior / signature?
I have raised #4820, which implements, among other things, `count` and `null_count`.
Do you have plans to add `null_count` to the Rust version of `describe`?
@philss: that seems sensible to me, but I'm not a Rust programmer, so that will have to be someone else. Feel free to raise a PR.