great_expectations
expect_column_mean_to_be_between expectation unsuccessful for empty DataFrame
Describe the bug
The expect_column_mean_to_be_between expectation fails if used together with a row_condition that evaluates to False for all rows. This is inconsistent with other expectations.
To Reproduce:
In [1]: import great_expectations as ge
...: import pandas as pd
...:
...: df = pd.DataFrame({"i": [0, 1, 0, 1], "y": ["a", "a", "a", "a"]})
...:
...: expectations = [
...:     {
...:         "expectation_type": "expect_column_values_to_not_be_null",
...:         "kwargs": {
...:             "column": "i",
...:             "row_condition": "y == 'b'",
...:             "condition_parser": "pandas",
...:         },
...:     },
...:     {
...:         "expectation_type": "expect_column_values_to_be_null",
...:         "kwargs": {
...:             "column": "i",
...:             "row_condition": "y == 'b'",
...:             "condition_parser": "pandas",
...:         },
...:     },
...:     {
...:         "expectation_type": "expect_column_values_to_be_in_set",
...:         "kwargs": {
...:             "column": "i",
...:             "value_set": [0, 1],
...:             "row_condition": "y == 'b'",
...:             "condition_parser": "pandas",
...:         },
...:     },
...:     {
...:         "expectation_type": "expect_column_mean_to_be_between",
...:         "kwargs": {
...:             "column": "i",
...:             "min_value": 0,
...:             "max_value": 1,
...:             "row_condition": "y == 'b'",
...:             "condition_parser": "pandas",
...:         },
...:     },
...: ]
...:
...: expectation_suite = ge.core.ExpectationSuite(
...:     "expectation_suite",
...:     expectations=[ge.core.ExpectationConfiguration(**e) for e in expectations],
...: )
...: validation_results = ge.from_pandas(df).validate(expectation_suite)
...:
...: [result.success for result in validation_results.results]
Out[1]: [True, True, True, False]
Expected Behaviour:
The expect_column_mean_to_be_between expectation should succeed, like the other expectations.
Environment (please complete the following information):
~ $ conda list great-expectations
# packages in environment at /home/mlondschien/anaconda3/envs/quantcore.thek:
#
# Name Version Build Channel
great-expectations 0.13.19 pyha770c72_0 conda-forge
Additional context
We apply the same expectation suite to different subsets of a large table. Some expectations do not make sense for a specific subset, so we use the row_condition to filter them out.
Details:
In [2]: validation_results
Out[2]:
{
  "statistics": {
    "evaluated_expectations": 4,
    "successful_expectations": 3,
    "unsuccessful_expectations": 1,
    "success_percent": 75.0
  },
  "results": [
    {
      "expectation_config": {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "column": "i",
          "row_condition": "y == 'b'",
          "condition_parser": "pandas"
        },
        "meta": {}
      },
      "result": {
        "element_count": 0,
        "unexpected_count": 0,
        "unexpected_percent": null,
        "partial_unexpected_list": []
      },
      "success": true,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    },
    {
      "expectation_config": {
        "expectation_type": "expect_column_values_to_be_null",
        "kwargs": {
          "column": "i",
          "row_condition": "y == 'b'",
          "condition_parser": "pandas"
        },
        "meta": {}
      },
      "result": {
        "element_count": 0,
        "unexpected_count": 0,
        "unexpected_percent": null,
        "partial_unexpected_list": []
      },
      "success": true,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    },
    {
      "expectation_config": {
        "expectation_type": "expect_column_values_to_be_in_set",
        "kwargs": {
          "column": "i",
          "value_set": [
            0,
            1
          ],
          "row_condition": "y == 'b'",
          "condition_parser": "pandas"
        },
        "meta": {}
      },
      "result": {
        "element_count": 0,
        "missing_count": 0,
        "missing_percent": null,
        "unexpected_count": 0,
        "unexpected_percent": null,
        "unexpected_percent_nonmissing": null,
        "partial_unexpected_list": []
      },
      "success": true,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    },
    {
      "expectation_config": {
        "expectation_type": "expect_column_mean_to_be_between",
        "kwargs": {
          "column": "i",
          "min_value": 0,
          "max_value": 1,
          "row_condition": "y == 'b'",
          "condition_parser": "pandas"
        },
        "meta": {}
      },
      "result": {
        "observed_value": null,
        "element_count": 0,
        "missing_count": null,
        "missing_percent": null
      },
      "success": false,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      }
    }
  ],
  "evaluation_parameters": {},
  "success": false,
  "meta": {
    "great_expectations_version": "0.13.2",
    "expectation_suite_name": "expectation_suite",
    "run_id": {
      "run_time": "2021-04-26T08:14:57.037509+00:00",
      "run_name": null
    },
    "batch_kwargs": {
      "ge_batch_id": "7cc86ff4-a667-11eb-9e90-482ae30df8e3"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20210426T081457.037357Z"
  }
}
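The null observed_value in the last result can be reproduced with plain pandas. The snippet below is only an illustration, not the library's internal code; it assumes that a row_condition with the pandas parser selects rows roughly the way DataFrame.query does. On the empty selection, a per-row check holds vacuously, while the mean is NaN and any range comparison against it is False.

```python
import math

import pandas as pd

df = pd.DataFrame({"i": [0, 1, 0, 1], "y": ["a", "a", "a", "a"]})

# Assumption: this mirrors what the row_condition does; no row has y == 'b'.
subset = df.query("y == 'b'")

# Map-style check: "every value is in {0, 1}" is vacuously true with no rows.
map_ok = bool(subset["i"].isin([0, 1]).all())

# Aggregate-style check: the mean of zero rows is NaN, and NaN fails
# every ordered comparison, so the range check cannot succeed.
mean = subset["i"].mean()
agg_ok = bool(0 <= mean <= 1)

print(len(subset), map_ok, mean, agg_ok)
```

This matches the pattern in the output above: the three per-row expectations succeed and the mean expectation does not.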
@mlondschien Thank you for reporting this!
I hit this for expect_column_values_to_be_in_type_list.py also, so I wonder whether this may exist for many expectations?
Would it make sense for expectations on column values (or aggregates thereof) to exit early with success on empty dataframes?
Hey @mlondschien! After significant discussion of our philosophy in this area, we believe the behavior you're seeing is the correct one, namely that:
- Aggregate Expectations on Empty Dataframes → Fail
- Aggregate Expectations with Row Conditions that return Empty Dataframes → Fail
- Map Expectations on Empty Dataframes → If there are no rows, the expectation should pass
- Map Expectations with Row Conditions that Return Empty Dataframes → If there are no rows, the expectation should pass
We believe this should hold true for the majority of extant expectations, and will view behavior outside of this paradigm as unexpected.
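Given that policy, one possible workaround for the use case in the report is to drop aggregate expectations whose row_condition selects no rows before building the suite. The sketch below is not part of great_expectations: drop_empty_aggregates and AGGREGATE_EXPECTATIONS are hypothetical names, the set of aggregate expectation types has to be maintained by hand, and the conditions are assumed to be valid DataFrame.query expressions.

```python
import pandas as pd

# Assumption: list here whichever aggregate expectations your suite uses.
AGGREGATE_EXPECTATIONS = {"expect_column_mean_to_be_between"}


def drop_empty_aggregates(df, expectations, aggregate_types=AGGREGATE_EXPECTATIONS):
    """Return the expectation dicts, minus aggregates whose condition matches no rows."""
    kept = []
    for expectation in expectations:
        condition = expectation["kwargs"].get("row_condition")
        is_aggregate = expectation["expectation_type"] in aggregate_types
        # Skip only aggregate expectations that would run on an empty selection.
        if is_aggregate and condition is not None and df.query(condition).empty:
            continue
        kept.append(expectation)
    return kept


df = pd.DataFrame({"i": [0, 1, 0, 1], "y": ["a", "a", "a", "a"]})


def make_expectations(condition):
    """Build one map and one aggregate expectation dict for the given condition."""
    return [
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {"column": "i", "value_set": [0, 1], "row_condition": condition},
        },
        {
            "expectation_type": "expect_column_mean_to_be_between",
            "kwargs": {"column": "i", "min_value": 0, "max_value": 1, "row_condition": condition},
        },
    ]


# No row has y == 'b', so the mean expectation is dropped.
kept_empty = [e["expectation_type"] for e in drop_empty_aggregates(df, make_expectations("y == 'b'"))]
# Every row has y == 'a', so both expectations are kept.
kept_full = [e["expectation_type"] for e in drop_empty_aggregates(df, make_expectations("y == 'a'"))]
print(kept_empty, kept_full)
```

The filtered dicts can then be turned into a suite the same way as in the report. Note that this silently skips a check, so it only makes sense where an empty subset is an expected situation rather than a data problem.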