null_count in describe() does not show zero when there is no null value in the array
Describe the bug
Hello, I'm using the input file datafusion/core/tests/data/cars.csv and running the describe() function. Below is the output of the describe() function:
+------------+-------+-------------------+---------------------+
| describe | car | speed | time |
+------------+-------+-------------------+---------------------+
| count | 25 | 25.0 | 25 |
| null_count | 25 | 25.0 | 25 |
| mean | null | 11.84 | null |
| std | null | 6.456456716600317 | null |
| min | green | 0.0 | 1996-04-12T12:05:03 |
| max | red | 21.5 | 1996-04-12T12:05:15 |
| median | null | 11.0 | null |
+------------+-------+-------------------+---------------------+
I'm wondering, shouldn't the null_count be 0 if there are no null values in the array? Could this be considered a bug?
To Reproduce
No response
Expected behavior
No response
Additional context
No response
The previous implementation appears to be incorrect because the variable count will count all Expr rather than just the null ones.
https://github.com/apache/datafusion/blob/b87f210dbdd90e5f65caefac1eeb053b0f0f612e/datafusion/core/src/dataframe/mod.rs#L532-L539
https://github.com/apache/datafusion/blob/b87f210dbdd90e5f65caefac1eeb053b0f0f612e/datafusion/expr/src/expr_fn.rs#L206-L216
Create a new aggregate function using filter could fix this issue
/// Create an expression to represent the count_null() aggregate function
pub fn count_null(expr: Expr) -> Expr {
Expr::AggregateFunction(AggregateFunction::new(
aggregate_function::AggregateFunction::Count,
vec![expr.clone()],
false,
Some(Box::new(expr.is_null())),
None,
None,
))
}
Hey, I've successfully obtained the correct output following your advice. Can we expect any update or fix to the current describe() functionality?
use datafusion::logical_expr::aggregate_function;
use datafusion::logical_expr::expr::AggregateFunction;
fn count_null(expr: Expr) -> Expr {
Expr::AggregateFunction(AggregateFunction::new(
aggregate_function::AggregateFunction::Count,
vec![expr.clone()],
false,
Some(Box::new(expr.is_null())),
None,
None,
))
}
// null_count aggregation
df.clone().aggregate(
vec![],
original_schema_fields
.clone()
.filter(|f| f.data_type().is_numeric())
.map(|f| count_null(col(f.name())).alias(f.name()))
.collect::<Vec<_>>(),
),
+----------------+------+-------------------+------+
| describe | car | speed | time |
+----------------+------+-------------------+------+
| count | null | 25.0 | null |
| null_count | null | 0.0 | null |
| mean | null | 11.84 | null |
| std | null | 6.456456716600317 | null |
| min | null | 0.0 | null |
| max | null | 21.5 | null |
| median | null | 11.0 | null |
+----------------+------+-------------------+------+