datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

null_count in describe() does not show zero when there is no null value in the array

Open chingren-ooi opened this issue 1 year ago • 2 comments

Describe the bug

Hello, I'm using the input file datafusion/core/tests/data/cars.csv and running the describe() function. Below is the output of the describe() function:

+------------+-------+-------------------+---------------------+
| describe   | car   | speed             | time                |
+------------+-------+-------------------+---------------------+
| count      | 25    | 25.0              | 25                  |
| null_count | 25    | 25.0              | 25                  |
| mean       | null  | 11.84             | null                |
| std        | null  | 6.456456716600317 | null                |
| min        | green | 0.0               | 1996-04-12T12:05:03 |
| max        | red   | 21.5              | 1996-04-12T12:05:15 |
| median     | null  | 11.0              | null                |
+------------+-------+-------------------+---------------------+

I'm wondering, shouldn't the null_count be 0 if there are no null values in the array? Could this be considered a bug?

To Reproduce

No response

Expected behavior

No response

Additional context

No response

chingren-ooi avatar Apr 25 '24 12:04 chingren-ooi

The previous implementation appears to be incorrect because the variable count will count all Expr rather than just the null ones.

https://github.com/apache/datafusion/blob/b87f210dbdd90e5f65caefac1eeb053b0f0f612e/datafusion/core/src/dataframe/mod.rs#L532-L539

https://github.com/apache/datafusion/blob/b87f210dbdd90e5f65caefac1eeb053b0f0f612e/datafusion/expr/src/expr_fn.rs#L206-L216

Create a new aggregate function using filter could fix this issue

/// Create an expression to represent the count_null() aggregate function
pub fn count_null(expr: Expr) -> Expr {
    Expr::AggregateFunction(AggregateFunction::new(
        aggregate_function::AggregateFunction::Count,
        vec![expr.clone()],
        false,
        Some(Box::new(expr.is_null())),
        None,
        None,
    ))
}

Weijun-H avatar Apr 25 '24 15:04 Weijun-H

Hey, I've successfully obtained the correct output following your advice. Can we expect any update or fix to the current describe() functionality?

use datafusion::logical_expr::aggregate_function;
use datafusion::logical_expr::expr::AggregateFunction;
fn count_null(expr: Expr) -> Expr {
    Expr::AggregateFunction(AggregateFunction::new(
        aggregate_function::AggregateFunction::Count,
        vec![expr.clone()],
        false,
        Some(Box::new(expr.is_null())),
        None,
        None,
    ))
}
// null_count aggregation
df.clone().aggregate(
    vec![],
    original_schema_fields
        .clone()
        .filter(|f| f.data_type().is_numeric())
        .map(|f| count_null(col(f.name())).alias(f.name()))
        .collect::<Vec<_>>(),
),
+----------------+------+-------------------+------+
| describe       | car  | speed             | time |
+----------------+------+-------------------+------+
| count          | null | 25.0              | null |
| null_count     | null | 0.0               | null |
| mean           | null | 11.84             | null |
| std            | null | 6.456456716600317 | null |
| min            | null | 0.0               | null |
| max            | null | 21.5              | null |
| median         | null | 11.0              | null |
+----------------+------+-------------------+------+

chingren-ooi avatar Apr 26 '24 04:04 chingren-ooi