Check.name not used in violation reporting in column "check"

Open • dbokal opened this issue 3 years ago • 1 comment

Describe the bug

Pandera does not use check names when reporting violations. Instead, it generates an unpredictable description of the performed check that is ambiguous and cannot be easily reconstructed from the parameters used to create the check.

For instance, when I use Check.isin(["A","B","C","D"]) I would not expect its designation in the error report to be isin({'D', 'A', 'B', 'C'}), but rather isin({'A', 'B', 'C', 'D'}) or, even better, the name I have specified for the check.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [No, but I believe it does.] (optional) I have confirmed this bug exists on the master branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import unittest

import pandas
import pandera
from pandera import Check, Column, DataFrameSchema


class TestPanderaCheckName(unittest.TestCase):
    def testSetPanderaCheckName(self):
        # Create a dataframe; "E" in col1 is outside the allowed values
        d = {'col1': ["A", "B", "D", "E"], 'col2': [1, 2, 3, 4]}
        df = pandas.DataFrame(data=d, index=[0, 1, 2, 3])

        # Create a check and verify its name can be correctly set
        s = "A,B,C,D".split(",")
        c = pandera.Check.isin(s, name="MyTestName")
        self.assertEqual(c.name, "MyTestName")

        a_schema = DataFrameSchema({
            "col1": Column(str, Check.isin(s, name="MyTestName"))
        })
        try:
            validated_df = a_schema.validate(df, lazy=True)
        except pandera.errors.SchemaErrors as err:
            # The reported check identifier should be the check's name
            print(err.failure_cases.to_string())
            assert "MyTestName" in err.failure_cases["check"].values

Current behavior

pandera.errors.SchemaErrors: A total of 1 schema errors were found.

Error Counts

  • schema_component_check: 1

Schema Error Summary

                                                  failure_cases  n_failure_cases
schema_context column check
Column         col1   isin({'D', 'A', 'B', 'C'})            [E]                1

Expected behavior

pandera.errors.SchemaErrors: A total of 1 schema errors were found.

Error Counts

  • schema_component_check: 1

Schema Error Summary

                                  failure_cases  n_failure_cases
schema_context column check
Column         col1   MyTestName            [E]                1

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: irrelevant
  • Version: irrelevant

Screenshots

Not applicable.

Additional context

I can create a patch if it will be applied to future versions of pandera. Otherwise, I would prefer an official solution.

dbokal • Jun 15 '22 14:06

Built-in checks have a default error argument, which is just a string that identifies the check when a SchemaError occurs. If it is set to None, the pandera error reporter falls back on the check name.

In the case of isin, the user-provided iterable is converted to a frozenset, which has no guaranteed iteration order. That is why the values you provide appear re-ordered unpredictably in the report: https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L807-L814
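To illustrate the ordering point in plain Python, independent of pandera: the snippet below is only a sketch of the general frozenset behaviour, not pandera's actual label-building code, and the names allowed, as_frozenset and stable_label are made up for this example.

```python
# Hypothetical illustration; this is not how pandera builds its labels.
allowed = ["A", "B", "C", "D"]
as_frozenset = frozenset(allowed)

# Membership tests are unaffected by the conversion.
assert "A" in as_frozenset and "E" not in as_frozenset

# Iterating (or printing) a frozenset of strings can yield a different
# order from one interpreter run to the next, so a label derived from
# it is not reproducible.
unstable_label = f"isin({set(as_frozenset)})"

# Sorting first gives a label that is the same on every run.
stable_label = f"isin({sorted(as_frozenset)})"
print(stable_label)  # isin(['A', 'B', 'C', 'D'])
```

Passing an explicit identifier sidesteps the problem entirely, since the report then no longer depends on how the set happens to be printed.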

To customize this, you can pass the error argument yourself:

c = pandera.Check.isin(s, error="MyTestName")

E.g.

import pandas as pd
import pandera as pa

check = pa.Check.gt(0, error="my_check")

schema = pa.DataFrameSchema({"col1": pa.Column(float, check, nullable=True)}, coerce=True)

try:
    schema(pd.DataFrame({"col1": [1, 2, 3, 4, -1]}), lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)

output:

  schema_context column     check  check_number  failure_case  index
0         Column   col1  my_check             0          -1.0      4

cosmicBboy • Jun 16 '22 13:06