great_expectations

expect_column_pair_values_to_be_in_set throws exception when row has both column values to be paired missing

Open MaxTh0ma1s opened this issue 11 months ago • 1 comments

Describe the bug
A simple expect_column_pair_values_to_be_in_set expectation throws an exception when a row is missing both of the column values to be paired.

To Reproduce
Basic setup, with the following expectation configured ...

{
  "expectation_type": "expect_column_pair_values_to_be_in_set",
  "kwargs": {
    "column_A": "mycolA",
    "column_B": "mycolB",
    "value_pairs_set": [["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]
  }
}

Sample data to reproduce

id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,,,fail
8,melon,melon,fail
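For reference, the pass/fail labels in the `valid` column can be recomputed in plain pandas, without Great Expectations. This is only a sketch of the intended semantics under the assumption that a row with both values missing should count as a failure (as the sample data labels it). It is not how the library evaluates the expectation internally, and the names `csv_text`, `allowed_pairs` and `computed` are illustrative.

```python
import io

import pandas as pd

# Sample data from the report, inlined so the snippet is self-contained.
csv_text = """id,mycolA,mycolB,valid
1,apple,red,pass
2,apple,green,pass
3,apple,yellow,pass
4,peach,peach,fail
5,banana,yellow,pass
6,banana,black,fail
7,,,fail
8,melon,melon,fail
"""

df = pd.read_csv(io.StringIO(csv_text))

# Same pairs as value_pairs_set in the expectation configuration.
allowed_pairs = {
    ("apple", "red"),
    ("apple", "green"),
    ("apple", "yellow"),
    ("banana", "yellow"),
}

# Plain Python membership test per row; a (NaN, NaN) pair is simply not in the set.
df["computed"] = [
    "pass" if pair in allowed_pairs else "fail"
    for pair in zip(df["mycolA"], df["mycolB"])
]

print(df[["id", "mycolA", "mycolB", "valid", "computed"]])
```

Under that assumption the `computed` column agrees with the hand-written `valid` column, including row 7.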

An exception is raised

    "exception_info": {
      "exception_traceback": "Traceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 548, in _process_direct_and_bundled_metric_computation_configurations\n    ] = metric_computation_configuration.metric_fn(  # type: ignore[misc] # F not callable\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/expectations/metrics/map_metric_provider/map_condition_auxilliary_methods.py\", line 201, in _pandas_map_condition_query\n    domain_values_df_filtered = domain_records_df[boolean_mapped_unexpected_values]\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3884, in __getitem__\n    return self._getitem_bool_array(key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/frame.py\", line 3940, in _getitem_bool_array\n    key = check_bool_indexer(self.index, key)\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/pandas/core/indexing.py\", line 2575, in check_bool_indexer\n    raise IndexingError(\npandas.errors.IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/validator/validation_graph.py\", line 285, in _resolve\n    self._execution_engine.resolve_metrics(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 283, in resolve_metrics\n    return self._process_direct_and_bundled_metric_computation_configurations(\n  File \"/opt/anaconda3/envs/my-repo/lib/python3.9/site-packages/great_expectations/execution_engine/execution_engine.py\", line 552, in _process_direct_and_bundled_metric_computation_configurations\n    raise gx_exceptions.MetricResolutionError(\ngreat_expectations.exceptions.exceptions.MetricResolutionError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).\n",
      "exception_message": "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).",
      "raised_exception": true
    }
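The underlying pandas error can be reproduced outside Great Expectations. The sketch below only illustrates the pandas behavior named in the traceback (a DataFrame indexed with a boolean Series whose index is missing some of the DataFrame's labels). It is not a diagnosis of what the library does internally, and the frame and the names `mask`, `error_message` and `aligned` are made up for illustration. It assumes pandas 1.5 or later, where `IndexingError` is importable from `pandas.errors`.

```python
import warnings

import pandas as pd
from pandas.errors import IndexingError

# Small frame with an all-null row in the middle (index label 1).
df = pd.DataFrame(
    {"fruit": ["apple", None, "melon"], "color": ["red", None, "melon"]}
)

# Boolean mask computed on a copy with the all-null row dropped,
# so the mask's index is [0, 2] while df's index is [0, 1, 2].
mask = df.dropna(how="all")["fruit"] != "apple"

try:
    with warnings.catch_warnings():
        # pandas also warns that the key will be reindexed; silence that here.
        warnings.simplefilter("ignore")
        df[mask]
    error_message = None
except IndexingError as exc:
    error_message = str(exc)

print(error_message)

# Indexing succeeds once the frame is restricted to the mask's own index.
aligned = df.loc[mask.index][mask]
print(aligned)
```

Restricting the frame to the mask's index (or reindexing the mask to the frame) avoids the error in this toy case. Whether a similar alignment step is the right fix inside the library is a separate question.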

Expected behavior
I do not expect basic use of this expectation on simple data to throw an exception.

Environment (please complete the following information):

  • Operating System: macOS
  • Great Expectations Version: 0.18.8
  • Data Source: .csv sample data listed above
  • Pandas Version: 2.1.4

Additional context
Note: this issue was also reproduced by Rachel House in the Discourse thread: https://discourse.greatexpectations.io/t/how-to-specify-expect-column-pair-values-to-be-in-set-value-pairs-set-input-arg-via-json/1621/5

MaxTh0ma1s, Mar 05 '24 18:03

Hi @MaxTh0ma1s, thanks for creating this issue. I discussed the error and behavior internally with Engineering today - they're now aware of it, but I don't know when a fix will be prioritized. I added my code to reproduce the issue to aid investigation.

As a workaround for now, I suggest modifying your source dataframe using .fillna() to replace the NaN values with another suitable non-null value (as mentioned in the associated Discourse thread).
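A minimal sketch of that workaround, assuming the `fruit` and `color` column names from the reproduction code and a hypothetical placeholder string (`"__missing__"`) that does not appear in any allowed pair:

```python
import pandas as pd

# Three rows from the reproduction data, including the row with missing values.
df = pd.DataFrame(
    [
        {"idx": 6, "fruit": "banana", "color": "black"},
        {"idx": 7},
        {"idx": 8, "fruit": "melon", "color": "melon"},
    ]
)

# Hypothetical sentinel; pick any value that is not part of value_pairs_set.
PLACEHOLDER = "__missing__"

# fillna with a dict fills per column and returns a new DataFrame.
df_filled = df.fillna({"fruit": PLACEHOLDER, "color": PLACEHOLDER})

print(df_filled)
```

With the placeholder in place, the formerly empty row is an ordinary pair that is not in the allowed set, so it should be reported as an unexpected value rather than be treated as missing. Pass `df_filled` instead of the original dataframe when building the batch request.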

Reproduced using:

great-expectations==0.18.8
pandas==2.1.3

Code to reproduce:

import pandas as pd
import great_expectations as gx

context = gx.get_context()

# Dataset containing row with NaNs as final row.
data_1 = [
    { "idx" : 1, "fruit" : "apple", "color" : "red" },
    { "idx" : 2, "fruit" : "apple", "color" : "green" },
    { "idx" : 3, "fruit" : "apple", "color" : "yellow" },
    { "idx" : 4, "fruit" : "peach", "color" : "red" },
    { "idx" : 5, "fruit" : "banana", "color" : "yellow" },
    { "idx" : 6, "fruit" : "banana", "color" : "black" },
    { "idx" : 7},
]

df_1 = pd.DataFrame(data=data_1)

# Dataset containing row with NaNs, but not as final row.
data_2 = [
    { "idx" : 1, "fruit" : "apple", "color" : "red" },
    { "idx" : 2, "fruit" : "apple", "color" : "green" },
    { "idx" : 3, "fruit" : "apple", "color" : "yellow" },
    { "idx" : 4, "fruit" : "peach", "color" : "peach" },
    { "idx" : 5, "fruit" : "banana", "color" : "yellow" },
    { "idx" : 6, "fruit" : "banana", "color" : "black" },
    { "idx" : 7},
    { "idx" : 8, "fruit" : "melon", "color" : "melon" },
]

df_2 = pd.DataFrame(data=data_2)

DATA_SOURCE_NAME = "pandas-datasource"
DATA_ASSET_NAME = "pandas-dataframe"
EXPECTATION_SUITE_NAME = "expectations"
CHECKPOINT_NAME = "checkpoint"

data_source = context.sources.add_pandas(name=DATA_SOURCE_NAME)
data_asset = data_source.add_dataframe_asset(name=DATA_ASSET_NAME)

# When using df_1, Checkpoint runs successfully.
# Using df_2 causes the Checkpoint to error when running the Expectation.
batch_request = data_asset.build_batch_request(dataframe=df_2)

suite = context.add_or_update_expectation_suite(EXPECTATION_SUITE_NAME)

set_expectation = gx.core.ExpectationConfiguration(
    expectation_type="expect_column_pair_values_to_be_in_set",
    kwargs={
        "column_A": "fruit",
        "column_B": "color",
        "value_pairs_set": [["apple", "red"], ["apple", "green"], ["apple", "yellow"], ["banana", "yellow"]]
    }
)

suite.add_expectation_configurations([set_expectation])

context.update_expectation_suite(expectation_suite=suite)

checkpoint_config = {
    "name": CHECKPOINT_NAME,
    "action_list": [],
    "validations": [{
        "expectation_suite_name": suite.expectation_suite_name,
        "batch_request": {
            "datasource_name": data_source.name,
            "data_asset_name": data_asset.name,
        },
    }],
    "config_version": 1,
    "class_name": "Checkpoint"
}

checkpoint = context.add_or_update_checkpoint(**checkpoint_config)

checkpoint_result = checkpoint.run()

validation_result_name = list(checkpoint_result["run_results"].keys())[0]
checkpoint_result["run_results"][validation_result_name]["validation_result"]

rachhouse, Mar 06 '24 21:03

Hello @MaxTh0ma1s. With the launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.

To get started on your transition to GX Core, check out the GX Core quickstart (click the “Full example code” tab to see a code example).

You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.

Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗

molliemarie, Aug 23 '24 00:08