
Invalid behavior for groups of expectations

Open milosz-dm opened this issue 2 years ago • 12 comments

There is strange behavior for groups of expectations since version 0.13.42. It works on 0.13.35 & 0.13.41. If you run the expectations one by one they succeed, but if you run them as a group they may fail.

Code to reproduce:

import pandas as pd
from great_expectations import DataContext
from great_expectations.core.batch import RuntimeBatchRequest

ge_context = DataContext(context_root_dir="YOUR_PATH")
my_df = pd.DataFrame(
    {
        "flaot_as_obj": pd.Series([1.], dtype="object"),
        "flaot_as_obj 80%": pd.Series([1.], dtype="object"),
        "some_float": pd.Series([15.0], dtype="float64"),
    }
)

my_type_list = [
    "FLOAT",
    "FLOAT4",
    "FLOAT8",
    "FLOAT64",
    "DOUBLE",
    "DOUBLE_PRECISION",
    "NUMERIC",
    "FloatType",
    "DoubleType",
    "float_",
    "float16",
    "float32",
    "float64",
    "number",
    "DECIMAL",
    "REAL"
]

batch_request: dict = {
    "datasource_name": "pandas_datasource",
    "data_connector_name": "runtime_data_connector",
    "data_asset_name": "not_relevant_for_pandas_datasource",
    "runtime_parameters": {"batch_data": my_df},
    "batch_identifiers": {"pipeline_step": None, "job_id": None},
}

runtime_batch_request: RuntimeBatchRequest = RuntimeBatchRequest(**batch_request)
empty_suite = ge_context.get_expectation_suite("empty.error")

# Success
ge_validator = ge_context.get_validator(batch_request=runtime_batch_request, expectation_suite=empty_suite)
ge_validator.expect_column_values_to_not_be_null(column="flaot_as_obj 80%", result_format={"result_format": "BOOLEAN_ONLY"})
ge_validator.expect_column_values_to_be_in_type_list(column="flaot_as_obj", type_list=my_type_list)
ge_validator.validate()

# Fails
ge_validator = ge_context.get_validator(batch_request=runtime_batch_request, expectation_suite=empty_suite)
ge_validator.expect_column_values_to_be_in_type_list(column="flaot_as_obj", type_list=my_type_list)
ge_validator.expect_column_values_to_not_be_null(column="flaot_as_obj 80%", result_format={"result_format": "BOOLEAN_ONLY"})
ge_validator.validate()

empty_error.json file:

{
  "data_asset_type": null,
  "expectation_suite_name": "empty.error",
  "expectations": [],
  "ge_cloud_id": null,
  "meta": {
    "great_expectations_version": "0.13.35"
  }
}

Additional data source in great_expectations.yml:

config_version: 3

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/en/latest/reference/core_concepts/datasource.html
datasources:
  pandas_datasource:
    execution_engine:
      module_name: great_expectations.execution_engine
      class_name: PandasExecutionEngine
    module_name: great_expectations.datasource
    class_name: Datasource
    data_connectors:
      runtime_data_connector:
        class_name: RuntimeDataConnector
        batch_identifiers:
          - pipeline_step
          - job_id

I would expect the validation to succeed in both cases. I ran it on Linux with versions 0.13.42 & 0.14.12.

milosz-dm avatar Apr 04 '22 12:04 milosz-dm

Hi @milosz-dm - thanks for the question! What happens when you run these expectations without the result_format set to BOOLEAN_ONLY? Could you share the result of how these Expectations are failing?

talagluck avatar Apr 06 '22 18:04 talagluck

Is the only difference between the success and the failure the order in which you've run the Expectations? Are you running those directly in sequence?

talagluck avatar Apr 06 '22 18:04 talagluck

Hi @milosz-dm - thanks for the question! What happens when you run these expectations without the result_format set to BOOLEAN_ONLY? Could you share the result of how these Expectations are failing?

@talagluck - good catch! The following code (without result_format) works as expected.

ge_validator = ge_context.get_validator(batch_request=runtime_batch_request, expectation_suite=empty_suite)
ge_validator.expect_column_values_to_be_in_type_list(column="flaot_as_obj", type_list=my_type_list)
ge_validator.expect_column_values_to_not_be_null(column="flaot_as_obj 80%")
ge_validator.validate()

Please find below the error I get once result_format is passed:

{
  "success": false,
  "evaluation_parameters": {},
  "results": [
    {
      "result": {
        "element_count": 1,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": [],
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_percent_total": 0.0,
        "unexpected_percent_nonmissing": 0.0
      },
      "success": true,
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      },
      "expectation_config": {
        "expectation_type": "expect_column_values_to_be_in_type_list",
        "kwargs": {
          "column": "flaot_as_obj",
          "type_list": [
            "FLOAT",
            "FLOAT4",
            "FLOAT8",
            "FLOAT64",
            "DOUBLE",
            "DOUBLE_PRECISION",
            "NUMERIC",
            "FloatType",
            "DoubleType",
            "float_",
            "float16",
            "float32",
            "float64",
            "number",
            "DECIMAL",
            "REAL"
          ],
          "batch_id": "54857ec4ad5ebccd81add38b256436f8"
        },
        "meta": {}
      },
      "meta": {}
    },
    {
      "result": {},
      "success": false,
      "exception_info": {
        "exception_traceback": "Traceback (most recent call last):\n  File \"MY_PYTHON_PATH/python3.7/site-packages/great_expectations/validator/validator.py\", line 824, in graph_validate\n    runtime_configuration=runtime_configuration,\n  File \"MY_PYTHON_PATH/python3.7/site-packages/great_expectations/core/expectation_configuration.py\", line 1370, in metrics_validate\n    execution_engine=execution_engine,\n  File \"MY_PYTHON_PATH/python3.7/site-packages/great_expectations/expectations/expectation.py\", line 699, in metrics_validate\n    provided_metrics[name] = metrics[metric_edge_key.id]\nKeyError: ('column_values.nonnull.unexpected_values', '8267b7b1e020d4eca9a7fa70f1aa6f94', \"result_format={'result_format': 'BASIC', 'partial_unexpected_count': 20, 'include_unexpected_rows': False}\")\n",
        "exception_message": "('column_values.nonnull.unexpected_values', '8267b7b1e020d4eca9a7fa70f1aa6f94', \"result_format={'result_format': 'BASIC', 'partial_unexpected_count': 20, 'include_unexpected_rows': False}\")",
        "raised_exception": true
      },
      "expectation_config": {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "column": "flaot_as_obj 80%",
          "result_format": {
            "result_format": "BOOLEAN_ONLY"
          },
          "batch_id": "54857ec4ad5ebccd81add38b256436f8"
        },
        "meta": {}
      },
      "meta": {}
    }
  ],
  "statistics": {
    "evaluated_expectations": 2,
    "successful_expectations": 1,
    "unsuccessful_expectations": 1,
    "success_percent": 50.0
  },
  "meta": {
    "great_expectations_version": "0.14.12",
    "expectation_suite_name": "empty.error",
    "run_id": {
      "run_name": null,
      "run_time": "2022-04-07T09:26:40.256002+00:00"
    },
    "batch_spec": {
      "data_asset_name": "not_relevant_for_pandas_datasource",
      "batch_data": "PandasDataFrame"
    },
    "batch_markers": {
      "ge_load_time": "20220407T092634.522147Z",
      "pandas_data_fingerprint": "824b4e3279fefb57fba55aa887261ff3"
    },
    "active_batch_definition": {
      "datasource_name": "pandas_datasource",
      "data_connector_name": "runtime_data_connector",
      "data_asset_name": "not_relevant_for_pandas_datasource",
      "batch_identifiers": {
        "pipeline_step": null,
        "job_id": null
      }
    },
    "validation_time": "20220407T092640.255743Z"
  }
}

milosz-dm avatar Apr 07 '22 09:04 milosz-dm

Is the only difference between the success and the failure the order in which you've run the Expectations? Are you running those directly in sequence?

IMO the order is the only difference. You can play with the code I added to be 100% sure.

milosz-dm avatar Apr 07 '22 09:04 milosz-dm

@talagluck - any update on this? Do you know when it can be fixed?

milosz-dm avatar May 27 '22 14:05 milosz-dm

Is this issue still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity.

It will be closed if no further activity occurs. Thank you for your contributions 🙇

github-actions[bot] avatar Aug 05 '22 02:08 github-actions[bot]

@talagluck - do you have any update? The issue has been marked as stale.

milosz-dm avatar Aug 05 '22 07:08 milosz-dm

Hi @milosz-dm - apologies, this was marked as stale inadvertently.

I believe we have tracked down the source of this issue internally, but it might be somewhat complex to fix, so I don't yet have an exact date, though it is in our queue. In the meantime, we recommend using Checkpoints for validation instead of relying directly on the validator.validate() method.

talagluck avatar Aug 05 '22 11:08 talagluck

I would love to use Checkpoints, but they do not work well with a pandas DataFrame as a source.

milosz-dm avatar Aug 05 '22 11:08 milosz-dm

@milosz-dm - the functionality is nearly the same between pandas dataframes and other data types. The only difference is that a dataframe cannot be baked into the Checkpoint config, and so must be passed into the call for run_checkpoint. If you are using validator.validate(), which is an in-memory operation, then this should not be a big deal. Is there another issue you are facing? If so, please feel free to open a new GitHub Issue.
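
For illustration, the run-time handoff looks roughly like this (a minimal sketch; the Checkpoint name "my_checkpoint" is an assumption and would need to exist in your project):

# Reusing the RuntimeBatchRequest from the reproduction above, which already
# carries the dataframe via runtime_parameters={"batch_data": my_df}.
result = ge_context.run_checkpoint(
    checkpoint_name="my_checkpoint",  # assumed: an existing Checkpoint in this project
    validations=[
        {
            "batch_request": runtime_batch_request,
            "expectation_suite_name": "empty.error",
        }
    ],
)
print(result.success)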

talagluck avatar Aug 08 '22 12:08 talagluck

OK, we will see when you fix the issue. I remember I could not use Checkpoints because my pandas DataFrames are quite big and Checkpoints were logging the entire dataframe via the logger.debug function somewhere in the GE code. That was quite a long time ago, so I do not remember the exact function.

milosz-dm avatar Aug 08 '22 12:08 milosz-dm

OK - we are looking into this issue, but the Checkpoint issue with pandas dataframes was solved quite some time ago.

talagluck avatar Aug 08 '22 12:08 talagluck

Hi @milosz-dm - thanks again for raising this. As mentioned above, the validator.validate() workflow is not preferred, and so we recommend using Checkpoints to validate your data. We still don't have a date for a fix, since this is not the preferred workflow. I'm going to close this Issue for now - that said, we have logged this feedback and will be in touch if we are able to prioritize it.

talagluck avatar Mar 29 '23 11:03 talagluck