pandera
Check.name not used in violation reporting in column "Check"
Describe the bug
Pandera does not use check names when reporting violations. Instead, it generates a description of the performed check that is unpredictable, ambiguous, and cannot easily be reconstructed from the parameters used to create the check.
For instance, when I use Check.isin(["A","B","C","D"]) I would not expect its designation in the error report to be isin({'D', 'A', 'B', 'C'}), but rather isin({'A', 'B', 'C', 'D'}) or, even better, the name I have specified for the check.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [No, but I believe it does.] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
import unittest

import pandas
import pandera
from pandera import Check, Column, DataFrameSchema


class TestPanderaCheckName(unittest.TestCase):
    def testSetPanderaCheckName(self):
        # Create a dataframe; "E" in col1 is not among the allowed values
        d = {'col1': ["A", "B", "D", "E"], 'col2': [1, 2, 3, 4]}
        df = pandas.DataFrame(data=d, index=[0, 1, 2, 3])
        # Create a check and verify its name can be correctly set
        s = "A,B,C,D".split(",")
        c = pandera.Check.isin(s, name="MyTestName")
        self.assertEqual(c.name, "MyTestName")
        a_schema = DataFrameSchema({
            "col1": Column(str, Check.isin(s, name="MyTestName"))
        })
        try:
            validated_df = a_schema.validate(df, lazy=True)
        except pandera.errors.SchemaErrors as err:
            # Print the failure report and check that it uses the check name
            print(err.failure_cases.to_string())
            assert "MyTestName" in err.failure_cases["check"].values
Current behavior
pandera.errors.SchemaErrors: A total of 1 schema errors were found.
Error Counts
- schema_component_check: 1
Schema Error Summary
                                                  failure_cases  n_failure_cases
schema_context column check
Column         col1   isin({'D', 'A', 'B', 'C'})            [E]                1
Expected behavior
pandera.errors.SchemaErrors: A total of 1 schema errors were found.
Error Counts
- schema_component_check: 1
Schema Error Summary
                                  failure_cases  n_failure_cases
schema_context column check
Column         col1   MyTestName            [E]                1
Desktop (please complete the following information):
- OS: Windows 10
- Browser: irrelevant
- Version: irrelevant
Screenshots
Not applicable.
Additional context
I can create a patch if it would be merged into a future version of pandera. Otherwise, I would prefer an official solution.
Built-in checks have a default error argument, which is just a string that identifies the check when a SchemaError occurs. If it is set to None, the pandera error reporter falls back on the check name.
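The fallback described above can be modelled in a few lines. This is only a simplified sketch of the rule as stated here, not pandera's actual implementation; the helper name report_label is hypothetical and exists purely for illustration.

```python
from typing import Optional


def report_label(error: Optional[str], name: Optional[str]) -> Optional[str]:
    """Hypothetical helper: pick the text shown in the "check" column.

    Models the rule described above: the error string wins when it is set,
    otherwise the check name is used.
    """
    return error if error is not None else name


# A built-in check carries a default error string, so the name is ignored.
print(report_label("isin({'D', 'A', 'B', 'C'})", "MyTestName"))

# With error set to None, the check name is reported instead.
print(report_label(None, "MyTestName"))
```

Under this model, a name passed to a built-in check only shows up in the report when the default error string is not in the way, which matches the behavior reported in this issue.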
In the case of isin, the user-provided iterable is converted to a frozenset, which is why the values you provide appear in an unpredictable order in the report:
https://github.com/pandera-dev/pandera/blob/master/pandera/checks.py#L807-L814
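The reordering can be reproduced without pandera. The sketch below uses only plain Python to show why a label built from a frozenset of strings does not follow the order of the original list, and how sorting the members would give a label that is the same on every run. The label format imitates the report shown above and is an assumption for illustration, not code taken from pandera.

```python
# Allowed values in the order the user supplied them.
allowed = "A,B,C,D".split(",")

# A frozenset keeps membership but not insertion order. For strings, the
# iteration order can differ between interpreter runs (hash randomization).
as_set = frozenset(allowed)
unordered_label = f"isin({set(as_set)})"

# Sorting the members gives a label that is identical on every run.
stable_label = "isin({" + ", ".join(repr(v) for v in sorted(as_set)) + "})"

print(unordered_label)  # member order may vary between runs
print(stable_label)
```

Because the first label can change from one run to the next, matching on it in tests is fragile; an explicit string that you control avoids the problem.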
To customize this you can simply do
c = pandera.Check.isin(s, error="MyTestName")
E.g.
import numpy as np
import pandas as pd
import pandera as pa
check = pa.Check.gt(0, error="my_check")
schema = pa.DataFrameSchema({"col1": pa.Column(float, check, nullable=True)}, coerce=True)
try:
    schema(pd.DataFrame({"col1": [1, 2, 3, 4, -1]}), lazy=True)
except pa.errors.SchemaErrors as exc:
    print(exc.failure_cases)
output:
   schema_context  column     check  check_number  failure_case  index
0          Column    col1  my_check             0          -1.0      4