great_expectations icon indicating copy to clipboard operation
great_expectations copied to clipboard

[Feature] Optionally Deleting Unexpected Rows

Open jdimatteo opened this issue 2 years ago • 2 comments

Is your feature request related to a problem? Please describe.

For some kinds of validation, we want to delete rows that fail to meet expectations. For example, a workflow may be to:

  1. copy a snapshot of the data,
  2. run validations on the copy,
  3. produce a Data Docs report of all the failed expectations,
  4. delete rows from the copy that fail expectations.

Describe the solution you'd like

We'd like each expectation to configurably delete unexpected rows. For each expectation, the Data Docs report should show when data was deleted and how many rows (possibly overlapping with delete counts for other expectations).

Describe alternatives you've considered

We've considered using unexpected_list and/or unexpected_rows in the ExpectationValidationResult, however Great Expectations already computes the SQLAlchemy where ClauseList (e.g. sa.and_(min_value <= column, column <= max_value)) so it seems more natural for Great Expectations to orchestrate the deletes. Another option might be for Great Expetations to expose the ClauseList and/or a call back function to be called using this.

Additional context

For a concrete example to discuss this feature, please see these files. To create a sqlite db used in this example, please first run create_sqlite_db.py.

The example database has a single table with the 4 rows shown by query.py, e.g.

(venv) jdimatteo@erel:~/dev/great_expectations_fork_2/feature_example$ python query.py 
(1, 'James', 100)
(2, 'Mary', 61)
(3, 'Robert', 75)
(4, 'Patricia', 175)

Great Expectations is run with great_expectations_example.py, e.g.

(venv) jdimatteo@erel:~/dev/great_expectations_fork_2/feature_example$ python great_expectations_example.py 
Calculating Metrics: 100%|██████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 356.03it/s]
All expectations succeeded? False
Age result:
	element_count: 4
	unexpected_count: 1
	unexpected_percent: 25.0
	partial_unexpected_list: [175]
	missing_count: 0
	missing_percent: 0.0
	unexpected_percent_total: 25.0
	unexpected_percent_nonmissing: 25.0
	unexpected_rows: [(4, 'Patricia', 175)]
	partial_unexpected_index_list: None
	partial_unexpected_counts: [{'value': 175, 'count': 1}]
	unexpected_list: [175]
	unexpected_index_list: None
(venv) jdimatteo@erel:~/dev/great_expectations_fork_2/feature_example$

One possible implementation of this feature might be to add support for expectations to take a delete_unexpected_rows argument, e.g.

(venv) jdimatteo@erel:~/dev/great_expectations_fork_2/feature_example$ git diff -U6
diff --git a/feature_example/great_expectations_example.py b/feature_example/great_expectations_example.py
index f8caff28c..9a6e90da1 100644
--- a/feature_example/great_expectations_example.py
+++ b/feature_example/great_expectations_example.py
@@ -56,12 +56,13 @@ suite.add_expectation(
     expectation_configuration=ExpectationConfiguration(
         expectation_type="expect_column_values_to_be_between",
         kwargs={
             "column": "age",
             "min_value": 0,
             "max_value": 150,
+            "delete_unexpected_rows": True,
         },
     )
 )
 context.save_expectation_suite(
     expectation_suite=suite, expectation_suite_name=SUITE_AND_ASSET_NAME
 )
(venv) jdimatteo@erel:~/dev/great_expectations_fork_2/feature_example$ 

After running with "delete_unexpected_rows": True, the unexpected row (4, 'Patricia', 175) would be deleted so query.py would show:

(venv) jdimatteo@erel:~/dev/great_expectations_fork_2/feature_example$ python query.py 
(1, 'James', 100)
(2, 'Mary', 61)
(3, 'Robert', 75)

The Data Docs report would show that the unexpected rows have been deleted, e.g. possibly like this:

image

The ExpectationValidationResult would also indicate that rows were deleted.

Some complications include:

  1. While many column expectations naturally translate into deletes some don't, e.g. expect_column_values_to_be_of_type doesn't have clear delete unexpected records behavior and for expect_column_values_to_be_unique it is less clear which rows to delete to resolve the unexpected validation. One possible solution is to only add delete_unexpected_rows support for expectations where it is natural and unambiguous to do so.
  2. Deleting rows may impact the expectation results, e.g. if the same row fails two different expectations then one expectation may be impacted by another's deletion. One possible solution is to deleted the unexpected rows after all expectations are run so that the Data Docs report is not missing any failed expectation info, and the Data Docs / ExpectationValidationResult may or may not distinguish between how many rows were unexpected vs how many rows were deleted for that expectation.
  3. Deleting rows may interact with other expectations such that the expectations no longer fail. For example, expect_column_values_to_be_unique may pass after deleting records that failed for some other expectation reason, which could be confusing. One possible solution is to clearly document this possibility and to consistently report the expectation results before any deletions.
  4. It is generally useful to know exactly what rows were deleted. One possible solution is to include deleted_rows (similar to unexpected_rows) in the ExpectationValidationResult. It might also be helpful to configure a max number of rows to delete (e.g. 100) so that if there are more than that limit nothing is deleted -- this case would need to be clearly identified in the ExpectationValidationResult and could prevent arbitrarily large number of rows being included in the ExpectationValidationResult, and this has interactions with https://github.com/great-expectations/great_expectations/issues/4185.

We intend to use this feature in combination with #4181, #4185, and #4186.

jdimatteo avatar Feb 10 '22 17:02 jdimatteo

Hey @jdimatteo ! Thanks for reaching out with these; a lot of super interesting functionality here. We'll review internally over the next week and continue the conversation.

austiezr avatar Feb 11 '22 16:02 austiezr

Hey @jdimatteo, We have reviewed this item and added it to our feature roadmap. We do not have an estimated time to start work on this, but we will notify you when we do.

kyleaton avatar Aug 29 '22 14:08 kyleaton

Added to roadmap.

rdodev avatar Mar 07 '23 20:03 rdodev