pandera
Doesn't show all errors even though lazy = True + `coercing` twice on a column
Is your feature request related to a problem? Please describe.
Pandera doesn't show all the errors when validating, even though we set the lazy = True flag.
Example:
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"float": pa.Column("float", coerce=True, nullable=False)})
df = pd.DataFrame({"float": ["not a float", 50, None]})
try:
    # lazy=True asks pandera to collect all failures before raising
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)
Here, err.failure_cases only has 1 failure for the "not a float" cell, but nothing for the None cell, even though the column is marked as nullable = False.
Describe the solution you'd like
err.failure_cases should contain two error rows: one for the "not a float" cell and one for the None cell.
Additional context
Looks like we are coercing the DF twice:
- once in Line 670 of pandera/schemas.py. Here, it catches the error and collects it instead of raising it right away if lazy is True.
- once in Line 205 of pandera/schema_components.py. Here, it doesn't catch the error and raises it right away, even though lazy is True.
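To make the two code paths described above concrete, here is a minimal pure-Python sketch. It is not pandera's code: CoercionError, coerce_to_float, schema_level_coerce and column_level_coerce are hypothetical names that only mimic the reported behavior, where the first coercion site collects the error under lazy mode and the second one lets it propagate.

```python
class CoercionError(Exception):
    """Hypothetical stand-in for the error raised when coercion fails."""


def coerce_to_float(values):
    # Raise on the first value that cannot be converted to float.
    coerced = []
    for index, value in enumerate(values):
        try:
            coerced.append(float(value))
        except (TypeError, ValueError):
            raise CoercionError(f"cannot coerce {value!r} at index {index}")
    return coerced


def schema_level_coerce(values, lazy, errors):
    # First coercion site (assumed): collects the error when lazy is True.
    try:
        return coerce_to_float(values)
    except CoercionError as exc:
        if not lazy:
            raise
        errors.append(str(exc))
        return values


def column_level_coerce(values):
    # Second coercion site (assumed): no try/except, so the error
    # propagates immediately regardless of the lazy flag.
    return coerce_to_float(values)


data = ["not a float", 50, None]
collected = []
schema_level_coerce(data, lazy=True, errors=collected)

try:
    column_level_coerce(data)
    raised = False
except CoercionError:
    raised = True
```

In this sketch the first call records one error and returns, while the second call raises on the same data, which is the asymmetry the two bullet points describe.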
I believe coercion only happens once... the DataFrameSchema.coerce_dtype calls Column.coerce_dtype here.
The behavior you're seeing is actually intended behavior given pandera's current execution model.
In summary, if a column cannot be coerced to the intended type, in this case float, pandera won't apply any of the downstream Checks to that column (which is why the nullability check is not picked up in failure_cases).
The reasoning is that if the column is not even of the correct type, then it's reasonable to assume that the validation checks (which assume that type) wouldn't work on that column.
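The execution model described here can be sketched in plain Python. This is an illustration under stated assumptions, not pandera's implementation: try_coerce_float, check_not_nullable and validate_column are hypothetical helpers, and None is treated as a null that survives coercion, roughly as it would in a pandas float column.

```python
def try_coerce_float(values):
    """Return (coerced_values, failures); None is kept as a null."""
    coerced, failures = [], []
    for index, value in enumerate(values):
        if value is None:
            coerced.append(None)
            continue
        try:
            coerced.append(float(value))
        except (TypeError, ValueError):
            failures.append((index, "coerce_dtype('float')", value))
    return coerced, failures


def check_not_nullable(values):
    # Downstream check: report every null value.
    return [(i, "not_nullable", v) for i, v in enumerate(values) if v is None]


def validate_column(values):
    coerced, failures = try_coerce_float(values)
    if failures:
        # The column's type could not be established, so the checks
        # that assume that type are skipped entirely.
        return failures
    return check_not_nullable(coerced)


# Coercion fails, so the None in row 2 is never reported.
first_pass = validate_column(["not a float", 50, None])
# Coercion succeeds, so the nullability check runs and reports row 2.
second_pass = validate_column([1.5, 50, None])
```

The first call returns only the coercion failure for row 0, which mirrors the single row seen in failure_cases in the original report.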
This is similar to the problem @azhakhan posted about in #874. You can use the custom Float type described here to make sure the string is correctly handled as a null value... after that the nullable = False check should take effect.
As stated in that issue, I'd definitely support extending the pandera Float datatype to accept additional arguments so it would consider un-coercible values as null (see here).
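As a rough illustration of the "treat un-coercible values as null" idea, the sketch below uses plain pandas rather than the custom Float type from the linked issue, which is not reproduced here. coerce_float_or_null is a hypothetical helper; it relies on pd.to_numeric with errors="coerce", which turns unparseable values into NaN instead of raising.

```python
import pandas as pd


def coerce_float_or_null(series: pd.Series) -> pd.Series:
    # errors="coerce" maps values that cannot be parsed to NaN.
    return pd.to_numeric(series, errors="coerce").astype("float64")


raw = pd.Series(["not a float", 50, None], name="float")
coerced = coerce_float_or_null(raw)

# With coercion no longer failing, a nullability check sees both
# the original None and the un-coercible string as nulls.
null_failures = coerced[coerced.isna()].index.tolist()
```

Once both problem cells are nulls, a nullable = False check has something to report for each of them, which is the outcome the original request asks for.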
I believe coercion only happens once
okay so it looks like coercion does happen twice 😅 this is definitely a bug... will look into correcting that.
Re: the execution model, I'll also look into relaxing this a little bit and see how it feels.
@cosmicBboy any updates on the twice coercion bug?
I managed to fix it by commenting out line 670 in pandera/schemas.py. Thus, coercion only happens once, at pandera/schema_components.py. However, this changes the order in which the failure cases are reported (not the actual rows, just the order the rows are in), so it breaks a lot of the existing tests.