pandera
Ability to report a single failure per row for row-level check functions returning boolean series
Is your feature request related to a problem? Please describe.
As established in conversation with @cosmicBboy on Discord, custom check functions that operate on the whole DataFrame currently appear to report a failure_case for each column of the row in which the check fails. One could argue that it is more natural to report a single failure case per row when the check function operates at row level and returns a boolean Series. One can of course work around this with e.failure_cases.groupby('index')['check'].unique(), but I, and perhaps Niels, feel that the default behaviour could be more intuitive.
In my application, I am parsing failure cases and passing corresponding descriptive errors back to the user for i) dataframe-level errors (e.g. values of column x must be unique) and ii) sample-level errors (e.g. region Bretagne is invalid for country USA).
I made the following MWE showing how a custom check function returns more than one failure_case for a row-level check failure.
https://gist.github.com/bede/c2cd27a12add680648fde39c427ae752
% python mwe.py
    schema_context   column            check  check_number  failure_case            index
0  DataFrameSchema  country  region_is_valid             0           USA  cDNA-VOC-1-v4-1
1  DataFrameSchema   region  region_is_valid             0      Bretagne  cDNA-VOC-1-v4-1
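For reference, here is a minimal sketch of the groupby workaround applied to output like the above. It uses pandas only and rebuilds the two failure rows by hand rather than running pandera, so the frame itself is a stand-in for e.failure_cases. The helper name collapse_row_failures and the failure_values column are hypothetical, not part of pandera.

```python
import pandas as pd

# Hand-built stand-in for e.failure_cases, mirroring the MWE output:
# one row-level check ("region_is_valid") reported once per column.
failure_cases = pd.DataFrame(
    {
        "schema_context": ["DataFrameSchema", "DataFrameSchema"],
        "column": ["country", "region"],
        "check": ["region_is_valid", "region_is_valid"],
        "check_number": [0, 0],
        "failure_case": ["USA", "Bretagne"],
        "index": ["cDNA-VOC-1-v4-1", "cDNA-VOC-1-v4-1"],
    }
)


def collapse_row_failures(cases: pd.DataFrame) -> pd.DataFrame:
    """Return one record per (index, check), with offending values keyed by column.

    Hypothetical helper for illustration; not a pandera function.
    """
    records = []
    for (idx, check), group in cases.groupby(["index", "check"], sort=False):
        records.append(
            {
                "index": idx,
                "check": check,
                # Map each reported column to the value that failed in this row.
                "failure_values": dict(zip(group["column"], group["failure_case"])),
            }
        )
    return pd.DataFrame(records, columns=["index", "check", "failure_values"])


per_row = collapse_row_failures(failure_cases)
print(per_row.to_dict("records"))
```

Grouping on both index and check keeps failures from different checks on the same row separate, which is useful when turning each record into one user-facing message.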
Describe the solution you'd like
Custom check functions returning a boolean Series could generate a single failure case per failed check. This could be achieved in a variety of ways, and I invite discussion of how.
Hi bede, I've been looking into this recently. I found that, for a two-column comparison, grouped checks on columns give you what you'd like: a single row per error. You could extend this to additional columns by applying further groupbys.
However, you will lose index information due to the implementation of dicts for GroupBy objects (specifically, failure_cases is set to None for all GroupBy objects). I'm working on a PR to replace the dict: Dataframe | Series signature with DataFrame | Series so that such a groupby can retain index information. In the meantime, you can inspect the schema_errors.schema_errors variable from the code below, which holds the raw check_output that you could parse to return something more useful for troubleshooting.
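from pandas import DataFrame
from pandera import DataFrameSchema, Column, Check
from pandera.errors import SchemaErrors

df = DataFrame({'first': [1, 2, 3], 'second': ['a', 'b', 'c'], 'third': ['x', 'y', 'z']})

schema = DataFrameSchema(
    columns={
        'first': Column(checks=[
            # Grouped check: 'first' is grouped by 'second' and restricted to group 'a'.
            Check(lambda g: ~g.isin([1]), groupby=["second"], groups=["a"], name="~isin([1]) when second group is a.")
        ]),
        'second': Column(),
        'third': Column()
    },
)

try:
    # lazy=True collects all failures into a single SchemaErrors exception.
    schema(df, lazy=True)
except SchemaErrors as e:
    # Keep a reference so that schema_errors.schema_errors can be inspected afterwards.
    schema_errors = e
    print(schema_errors)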
Thanks for adding your input here @ipear3. There's an issue that tracks a couple of enhancements, including refactoring the groupby behavior: https://github.com/unionai-oss/pandera/issues/488
> I'm working on a PR to replace the dict: Dataframe | Series signature with DataFrame | Series
Hey @ipear3 just FYI, I'm working on a major overhaul of pandera's internals: https://github.com/unionai-oss/pandera/pull/913
It's still a WIP, but moving forward we should make any changes to pandera's behavior off of the core-schema branch. In summary, this PR introduces a pandera.core and pandera.backends module, where core is the pandera schema specification and backends implements the actual validation logic based on the specification.
For example, this method implements the current groupby logic, which we could update to fix #488.
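To make the specification/backend split a bit more concrete, here is a toy sketch of the idea. It is not pandera's actual code: RowCheckSpec, ToyPandasBackend, the valid_regions lookup and the second sample index are all hypothetical names and data made up for illustration. The spec only describes the check, while the backend decides how it runs and how failures are reported (here, one record per failing row).

```python
from dataclasses import dataclass
from typing import Callable, List

import pandas as pd


@dataclass(frozen=True)
class RowCheckSpec:
    """Specification only: what to check, no validation logic (hypothetical)."""

    name: str
    fn: Callable[[pd.DataFrame], pd.Series]  # expected to return one boolean per row


class ToyPandasBackend:
    """Validation logic only: how a spec is applied to a pandas DataFrame (hypothetical)."""

    def run(self, spec: RowCheckSpec, df: pd.DataFrame) -> List[dict]:
        passed = spec.fn(df).astype(bool)
        failed = df.loc[~passed]
        # One failure record per failing row, rather than one per column.
        return [
            {"index": idx, "check": spec.name, "row": row.to_dict()}
            for idx, row in failed.iterrows()
        ]


df = pd.DataFrame(
    {"country": ["USA", "France"], "region": ["Bretagne", "Bretagne"]},
    index=["cDNA-VOC-1-v4-1", "cDNA-VOC-2-v4-1"],
)

# Illustrative lookup of valid regions per country.
valid_regions = {"USA": {"Texas"}, "France": {"Bretagne"}}

spec = RowCheckSpec(
    name="region_is_valid",
    fn=lambda d: d.apply(
        lambda r: r["region"] in valid_regions.get(r["country"], set()), axis=1
    ),
)

failures = ToyPandasBackend().run(spec, df)
print(failures)
```

With a split like this, changing how row-level failures are reported only touches the backend, and existing schema definitions stay as they are.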