
expect_column_mean_to_be_between expectation unsuccessful for empty DataFrame

Open mlondschien opened this issue 3 years ago • 2 comments

Describe the bug: The expect_column_mean_to_be_between expectation fails when used together with a row_condition that evaluates to False for all rows. This is inconsistent with other expectations, which succeed in the same situation.

To Reproduce:

In [1]: import great_expectations as ge
   ...: import pandas as pd
   ...: 
   ...: df = pd.DataFrame({"i": [0, 1, 0, 1], "y": ["a", "a", "a", "a"]})
   ...: 
   ...: expectations = [
   ...:     {
   ...:         "expectation_type": "expect_column_values_to_not_be_null",
   ...:         "kwargs": {
   ...:             "column": "i",
   ...:             "row_condition": "y == 'b'",
   ...:             "condition_parser": "pandas",
   ...:         },
   ...:     },
   ...:     {
   ...:         "expectation_type": "expect_column_values_to_be_null",
   ...:         "kwargs": {
   ...:             "column": "i",
   ...:             "row_condition": "y == 'b'",
   ...:             "condition_parser": "pandas",
   ...:         },
   ...:     },
   ...:     {
   ...:         "expectation_type": "expect_column_values_to_be_in_set",
   ...:         "kwargs": {
   ...:             "column": "i",
   ...:             "value_set": [0, 1],
   ...:             "row_condition": "y == 'b'",
   ...:             "condition_parser": "pandas",
   ...:         },
   ...:     },
   ...:     {
   ...:         "expectation_type": "expect_column_mean_to_be_between",
   ...:         "kwargs": {
   ...:             "column": "i",
   ...:             "min_value": 0,
   ...:             "max_value": 1,
   ...:             "row_condition": "y == 'b'",
   ...:             "condition_parser": "pandas",
   ...:         },
   ...:     },
   ...: ]
   ...: 
   ...: expectation_suite = ge.core.ExpectationSuite(
   ...:     "expectation_suite",
   ...:     expectations=[ge.core.ExpectationConfiguration(**e) for e in expectations],
   ...: )
   ...: validation_results = ge.from_pandas(df).validate(expectation_suite)
   ...: 
   ...: [result.success for result in validation_results.results]
Out[1]: [True, True, True, False]

Expected Behaviour: The expect_column_mean_to_be_between expectation should succeed like other expectations.

Environment (please complete the following information):

 ~ $ conda list great-expectations
# packages in environment at /home/mlondschien/anaconda3/envs/quantcore.thek:
#
# Name                    Version                   Build  Channel
great-expectations        0.13.19            pyha770c72_0    conda-forge

Additional context: We apply the same expectation suite to different subsets of a large table. Some expectations do not make sense for a specific subset, so we use the row_condition to filter these out.

Details:

In [2]: validation_results
Out[2]:
{
  "statistics": {
    "evaluated_expectations": 4,
    "successful_expectations": 3,
    "unsuccessful_expectations": 1,
    "success_percent": 75.0
  },
  "results": [
    {
      "expectation_config": {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "column": "i",
          "row_condition": "y == 'b'",
          "condition_parser": "pandas"
        },
        "meta": {}
      },
      "result": {
        "element_count": 0,
        "unexpected_count": 0,
        "unexpected_percent": null,
        "partial_unexpected_list": []
      },
      "success": true,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    },
    {
      "expectation_config": {
        "expectation_type": "expect_column_values_to_be_null",
        "kwargs": {
          "column": "i",
          "row_condition": "y == 'b'",
          "condition_parser": "pandas"
        },
        "meta": {}
      },
      "result": {
        "element_count": 0,
        "unexpected_count": 0,
        "unexpected_percent": null,
        "partial_unexpected_list": []
      },
      "success": true,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    },
    {
      "expectation_config": {
        "expectation_type": "expect_column_values_to_be_in_set",
        "kwargs": {
          "column": "i",
          "value_set": [
            0,
            1
          ],
          "row_condition": "y == 'b'",
          "condition_parser": "pandas"
        },
        "meta": {}
      },
      "result": {
        "element_count": 0,
        "missing_count": 0,
        "missing_percent": null,
        "unexpected_count": 0,
        "unexpected_percent": null,
        "unexpected_percent_nonmissing": null,
        "partial_unexpected_list": []
      },
      "success": true,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    },
    {
      "expectation_config": {
        "expectation_type": "expect_column_mean_to_be_between",
        "kwargs": {
          "column": "i",
          "min_value": 0,
          "max_value": 1,
          "row_condition": "y == 'b'",
          "condition_parser": "pandas"
        },
        "meta": {}
      },
      "result": {
        "observed_value": null,
        "element_count": 0,
        "missing_count": null,
        "missing_percent": null
      },
      "success": false,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    }
  ],
  "evaluation_parameters": {},
  "success": false,
  "meta": {
    "great_expectations_version": "0.13.2",
    "expectation_suite_name": "expectation_suite",
    "run_id": {
      "run_time": "2021-04-26T08:14:57.037509+00:00",
      "run_name": null
    },
    "batch_kwargs": {
      "ge_batch_id": "7cc86ff4-a667-11eb-9e90-482ae30df8e3"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20210426T081457.037357Z"
  }
}

mlondschien, Apr 26 '21 08:04

@mlondschien Thank you for reporting this!

eugmandel, May 03 '21 15:05

I hit this for expect_column_values_to_be_in_type_list.py as well, so I wonder whether the same problem exists for many expectations.

Would it make sense for expectations on column values (or aggregates thereof) to exit early with success on empty dataframes?

shearer12345, Sep 10 '21 08:09
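
The early-exit idea in the comment above can also be approximated on the caller's side, before the suite is built. The following is only a sketch: `drop_aggregates_on_empty_subsets` and the `aggregate_types` argument are hypothetical names, not part of great_expectations, and it assumes each `row_condition` with `condition_parser` set to "pandas" is a string that `DataFrame.query` can evaluate.

```python
import pandas as pd


def drop_aggregates_on_empty_subsets(df, expectations, aggregate_types):
    """Hypothetical helper (not a great_expectations API).

    Returns the expectation dicts to keep: an aggregate expectation is
    dropped when its pandas row_condition selects no rows of df.
    """
    kept = []
    for expectation in expectations:
        kwargs = expectation["kwargs"]
        condition = kwargs.get("row_condition")
        if (
            expectation["expectation_type"] in aggregate_types
            and condition is not None
            and kwargs.get("condition_parser") == "pandas"
            and df.query(condition).empty  # subset has zero rows
        ):
            continue  # skip: an aggregate over zero rows is undefined
        kept.append(expectation)
    return kept


df = pd.DataFrame({"i": [0, 1, 0, 1], "y": ["a", "a", "a", "a"]})
expectations = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {
            "column": "i",
            "row_condition": "y == 'b'",
            "condition_parser": "pandas",
        },
    },
    {
        "expectation_type": "expect_column_mean_to_be_between",
        "kwargs": {
            "column": "i",
            "min_value": 0,
            "max_value": 1,
            "row_condition": "y == 'b'",
            "condition_parser": "pandas",
        },
    },
]

kept = drop_aggregates_on_empty_subsets(
    df, expectations, aggregate_types={"expect_column_mean_to_be_between"}
)
kept_types = [e["expectation_type"] for e in kept]
```

The filtered list can then be turned into ExpectationConfiguration objects as in the reproduction above. Note that this silently skips the check rather than reporting it, which may or may not be what a given pipeline wants.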

Hey @mlondschien! After significant discussion of our philosophy in this area, we believe the behavior you're seeing is the correct one, namely that:

  • Aggregate Expectations on Empty Dataframes → Fail

  • Aggregate Expectations with Row Conditions that return Empty Dataframes → Fail

  • Map Expectations on Empty Dataframes → If there are no rows, the expectation should pass

  • Map Expectations with Row Conditions that Return Empty Dataframes → If there are no rows, the expectation should pass

We believe this should hold true for the majority of extant expectations, and will view behavior outside of this paradigm as unexpected.

austiezr, Nov 07 '22 18:11
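
One way to read the four rules above: a map expectation is a per-row check, so with zero rows there is nothing that can violate it, while an aggregate expectation compares a single statistic that is undefined for zero rows. The toy model below illustrates that distinction in plain Python. It is not the great_expectations implementation, and both function names are made up for illustration.

```python
def map_expectation_passes(values, predicate):
    # Per-row check: all() over an empty iterable is True (vacuous truth),
    # so an empty input cannot contain an unexpected value.
    return all(predicate(v) for v in values)


def mean_between_passes(values, min_value, max_value):
    # Aggregate check: the mean of zero values is undefined, so there is
    # no observed value to compare and the check cannot succeed.
    if not values:
        return False
    observed = sum(values) / len(values)
    return min_value <= observed <= max_value


empty = []  # what remains after a row_condition that matches no rows

map_on_empty = map_expectation_passes(empty, lambda v: v in {0, 1})
aggregate_on_empty = mean_between_passes(empty, 0, 1)
aggregate_on_data = mean_between_passes([0, 1, 0, 1], 0, 1)
```

Under this model the map check passes on the empty input and the mean check fails, matching the [True, True, True, False] result in the original report.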