pandera Ability to report a single failure per row for row-level check functions returning boolean series

Is your feature request related to a problem? Please describe.

As established in conversation with @cosmicBboy on Discord, custom check functions that operate on the whole DataFrame currently appear to report a failure_case for each column of the row in which the check fails. One could argue that it is more natural to report a single failure case per row when the check function operates at row level and returns a boolean Series. One could of course work around this with e.failure_cases.groupby('index')['check'].unique(), but I, and perhaps Niels, feel that the default behaviour could be more intuitive.

In my application, I am parsing failure cases and passing corresponding descriptive errors back to the user for i) dataframe-level errors (e.g. values of column x must be unique) and ii) sample-level errors (e.g. region Bretagne is invalid for country USA).

I made the following MWE showing how a custom check function returns more than one failure_case for a row-level check failure.

https://gist.github.com/bede/c2cd27a12add680648fde39c427ae752

% python mwe.py
    schema_context   column            check  check_number failure_case            index
0  DataFrameSchema  country  region_is_valid             0          USA  cDNA-VOC-1-v4-1
1  DataFrameSchema   region  region_is_valid             0     Bretagne  cDNA-VOC-1-v4-1
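
As a rough sketch of the groupby workaround mentioned above, the two failure cases printed here can be collapsed into one record per row and check. The frame below is rebuilt by hand from the MWE output rather than taken from a real SchemaErrors exception, and the names per_row, failed_columns and failed_values are made up for illustration; they are not part of pandera.

```python
import pandas as pd

# Failure cases copied by hand from the MWE output above. In a real run this
# frame would come from e.failure_cases on a caught pandera SchemaErrors.
failure_cases = pd.DataFrame(
    {
        "schema_context": ["DataFrameSchema", "DataFrameSchema"],
        "column": ["country", "region"],
        "check": ["region_is_valid", "region_is_valid"],
        "check_number": [0, 0],
        "failure_case": ["USA", "Bretagne"],
        "index": ["cDNA-VOC-1-v4-1", "cDNA-VOC-1-v4-1"],
    }
)

# Collapse to a single record per (row index, check), keeping the offending
# columns and values as lists. The output column names are hypothetical.
per_row = (
    failure_cases
    .groupby(["index", "check"], sort=False, as_index=False)
    .agg(
        failed_columns=("column", list),
        failed_values=("failure_case", list),
    )
)

print(per_row)
```

This keeps the row index and the values involved, which a plain unique() on the check column would drop, so a descriptive per-sample message can still be built from the result.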

Describe the solution you'd like

Custom check functions returning a boolean Series could generate a single failure case per failed check. This could be achieved in a variety of ways, and I invite discussion of how.

Jun 05 '22 16:06 bede

Hi bede, I've been looking into this recently. I found that, for a two-column comparison, a grouped column check gives what you'd like: a single row for the error. You could extend this to additional columns by applying further groupby.

However, you will lose index information due to the implementation of dicts for GroupBy objects. (Specifically, failure_cases is set to None for all GroupBy objects). I'm working on a PR to replace the dict: Dataframe | Series signature with DataFrame | Series so that such a group by can retain index information. In the meantime, you can inspect the schema_errors.schema_errors var from the code below and you actually have the raw check_output that you could parse to return something more useful for troubleshooting.


from pandas import DataFrame
from pandera import DataFrameSchema, Column, Check
from pandera.errors import SchemaErrors

df = DataFrame({'first': [1, 2, 3], 'second': ['a', 'b', 'c'], 'third': ['x', 'y', 'z']})

schema = DataFrameSchema(
    columns={
        'first': Column(checks=[
            Check(lambda g: ~g.isin([1]), groupby=["second"], groups=["a"], name="~isin([1]) when second group is a.")
        ]),
        'second': Column(),
        'third': Column()
    },
)

try:
    schema(df, lazy=True)
except SchemaErrors as e:
    schema_errors = e
    print(schema_errors)

Aug 11 '22 13:08 ipear3

thanks for adding your input here @ipear3, there's this issue that tracks a couple of enhancements, including refactoring the groupby behavior: https://github.com/unionai-oss/pandera/issues/488

Aug 11 '22 14:08 cosmicBboy

"I'm working on a PR to replace the dict: Dataframe | Series signature with DataFrame | Series"

Hey @ipear3 just FYI, I'm working on a major overhaul of pandera's internals: https://github.com/unionai-oss/pandera/pull/913

It's still a WIP, but moving forward we should make any changes to pandera's behavior off of the core-schema branch. In summary, this PR introduces a pandera.core and pandera.backends module, where core is the pandera schema specification and backends implements the actual validation logic based on the specification.

For example, this method implements the current groupby logic, which we could update to fix #488.

Aug 17 '22 15:08 cosmicBboy
