
Doesn't show all errors even though lazy = True + `coercing` twice on a column

Open ng-henry opened this issue 2 years ago • 3 comments

Is your feature request related to a problem? Please describe.

Pandera doesn't show all the errors when validating even though we set the lazy = True flag.

Example:

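import pandas as pd
import pandera as pa

# Coerce the column to float and disallow nulls.
schema = pa.DataFrameSchema({"float": pa.Column("float", coerce=True, nullable=False)})
df = pd.DataFrame({"float": ["not a float", 50, None]})
try:
    # lazy=True should collect every failure instead of raising on the first one.
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)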

Here, err.failure_cases only has 1 failure for the "not a float" cell, but nothing for the None cell even though the column is marked as nullable = False.

Describe the solution you'd like

err.failure_cases should contain two error rows: one for the "not a float" cell and one for the None cell.

Additional context

It looks like we are coercing the DataFrame twice:

  • once at line 670 of pandera/schemas.py. Here, if lazy is True, the error is caught and collected instead of being raised right away.
  • once at line 205 of pandera/schema_components.py. Here, the error is not caught and is raised right away, even though lazy is True.

ng-henry avatar Jul 05 '22 19:07 ng-henry

I believe coercion only happens once... the DataFrameSchema.coerce_dtype calls Column.coerce_dtype here.

The behavior you're seeing is actually intended behavior given pandera's current execution model.

In summary, if a column cannot be coerced to the intended type, in this case float, pandera won't apply any of the downstream Checks to that column (which is why the nullability check is not picked up in failure_cases).

The reasoning is that if the column is not even of the correct type, then it's reasonable to assume that the validation checks (which assume that type) wouldn't work on that column.
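To make the gating behaviour concrete, here is a toy model in plain Python. It is not pandera's implementation: the function name, the `strict_gate` flag, and the check labels are all made up for illustration. It only mimics the idea that a failed coercion short-circuits the remaining checks, and what a relaxed model might report instead.

```python
from typing import Any, List, Tuple

# (row index, check name, offending value); a hypothetical shape, not pandera's.
FailureCase = Tuple[int, str, Any]


def validate_float_column(
    values: List[Any], nullable: bool = False, strict_gate: bool = True
) -> List[FailureCase]:
    """Lazily collect failures for a column that should hold floats."""
    failures: List[FailureCase] = []

    # Step 1: try to coerce every non-missing value to float.
    for i, value in enumerate(values):
        if value is None:
            continue  # missing values are left to the nullability check
        try:
            float(value)
        except (TypeError, ValueError):
            failures.append((i, "coerce_dtype('float')", value))

    # Step 2: with the strict gate, a coercion failure skips all later checks.
    if failures and strict_gate:
        return failures

    # Step 3: the nullability check only runs if we get this far.
    if not nullable:
        for i, value in enumerate(values):
            if value is None:
                failures.append((i, "not_nullable", value))
    return failures


column = ["not a float", 50, None]
strict = validate_float_column(column, strict_gate=True)
relaxed = validate_float_column(column, strict_gate=False)
print(strict)
print(relaxed)
```

With the gate on, only the coercion failure for row 0 is reported, which matches the behaviour described in this issue. With the gate off, the null in row 2 is reported as well, which is the output the original report asks for.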

This is similar to the problem @azhakhan posted about in #874. You can use the custom Float type described here to make sure the string is correctly handled as a null value... after that the nullable = False check should take effect.

As stated in that issue, I'd definitely support extending the pandera Float datatype to accept additional arguments so it would consider un-coercible values as null (see here).
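As a rough illustration of "treat un-coercible values as null", the snippet below uses plain pandas rather than the custom Float type from the linked docs, which is not reproduced here. It is a stand-in under that assumption: `pd.to_numeric` with `errors="coerce"` turns anything it cannot parse into NaN, so a later non-nullable check would see both bad rows.

```python
import pandas as pd

raw = pd.Series(["not a float", 50, None], name="float")

# errors="coerce" replaces values that cannot be parsed as numbers with NaN.
as_float = pd.to_numeric(raw, errors="coerce")

# Rows that a nullable=False check would now flag.
null_rows = as_float[as_float.isna()].index.tolist()
print(null_rows)  # [0, 2]
```

One trade-off of this approach is that an un-parseable string and a genuinely missing value become indistinguishable after coercion, so the failure report can no longer say which rows were bad input and which were empty.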

cosmicBboy avatar Jul 05 '22 21:07 cosmicBboy

> I believe coercion only happens once

okay so it looks like coercion does happen twice 😅 this is definitely a bug... will look into correcting that.

Re: the execution model, I'll also look into relaxing this a little bit and see how it feels

cosmicBboy avatar Jul 05 '22 22:07 cosmicBboy

@cosmicBboy any updates on the double-coercion bug?

I managed to fix it by commenting out line 670 in pandera/schemas.py, so that coercion only happens once, in pandera/schema_components.py. However, this alters the order in which the failure cases are reported (the rows themselves are the same, only their order differs), so it breaks a lot of the existing tests.

ng-henry avatar Jul 20 '22 21:07 ng-henry