pandera
Missing invalid values under failure_cases column in err.failure_cases after try-except lazy-evaluating DataFrameSchema.validate(DataFrame, lazy=True)
Hi to whom it concerns
Thanks for the great work by the pandera team; it really helps me validate a large number of datasets.
I encountered the following issue.
Context Screenshot:
This is a sample of what I get after running:
try:
    # schema is a pandera.DataFrameSchema, dataset is a pd.DataFrame
    schema.validate(dataset, lazy=True)
except pandera.errors.SchemaErrors as err:
    failure_cases_df = err.failure_cases
The problem is the NaN values under the failure_cases column of failure_cases_df.
When I set up a SeriesSchema and validate that same column of the dataset (a pd.DataFrame), it correctly identifies which values in the column are invalid:
SchemaError: <Schema SeriesSchema(name=None, type=DataType(str))> failed element-wise validator 0:
<Check isin: isin({some_sets_of_values})>
failure cases:
index failure_case
0 582 A
1 583 B
2 584 B
3 585 C
4 607 C
5 608 B
6 609 B
7 610 B
8 613 D
9 614 E
Please assist, thanks.
hi @jymchng, glad you're finding this useful!
Just to clarify, you're expecting to see the actual failure cases ("A", "B", "D", etc.) as in your SeriesSchema validation, but you're not seeing them in the DataFrame validation call?
Would you mind creating a minimally reproducible example of what you're seeing? This will help me debug the issue.
Good day to you cosmicBboy,
After investigating it for hours, I think this bug occurs when one sets the Check(... , ignore_na=False, ...) while setting Column(... , nullable=True, ...).
The following code provides a reproducible example.
import numpy as np
import pandas as pd
import pandera as pa

np.random.seed(42)
# ignore_na=False: null values are passed to the check rather than skipped
FC_CHECK = pa.Check.isin({'A', 'B', 'C'}, ignore_na=False)
FA_CHECK = pa.Check.isin({'X', 'Y', 'Z'}, ignore_na=False)
FC_S = pd.Series([_ for _ in np.random.choice(['A', 'B', 'C', 'E', np.nan], 1000)])
FA_S = pd.Series([_ for _ in np.random.choice(['X', 'Y', 'Z', 'E', np.nan], 1000)])
# Force known invalid values into rows 1 to 5 of each column
FC_S.loc[[1, 2, 3, 4, 5]] = 'E'
FA_S.loc[[1, 2, 3, 4, 5]] = 'W'
DF = pd.concat([FC_S, FA_S], axis=1)
DF.columns = ['COLA', 'COLB']
Above code produces:
    COLA COLB
0      E    E
1      E    W
2      E    W
3      E    W
4      E    W
..   ...  ...
995    B    X
996    A  nan
997    A    X
998    E  nan
999    C    Z
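A side note on the sample data above, since it affects how the "nan" entries should be read: when a list mixes strings with np.nan, NumPy converts it to a string array, so np.random.choice returns the text 'nan' rather than a real missing value. The short check below illustrates this with plain NumPy and pandas. The variable names are only for illustration, and whether this has any bearing on the behavior reported in this issue is an assumption, not something confirmed in the thread.

```python
import numpy as np
import pandas as pd

values = ['A', 'B', 'C', 'E', np.nan]

# A list mixing str and float('nan') becomes a unicode string array,
# so the NaN is stored as the three-character string 'nan'.
as_array = np.array(values)
print(as_array.dtype.kind)

np.random.seed(42)
sampled = pd.Series(list(np.random.choice(values, 1000)))

# Count real missing values versus the literal string 'nan'.
n_real_missing = int(sampled.isna().sum())
n_nan_strings = int((sampled == 'nan').sum())
print(n_real_missing, n_nan_strings)
```

If real missing values are wanted in the test data, one option is to sample from the string values only and then assign np.nan to selected rows afterwards.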
DF_SCHEMA = pa.DataFrameSchema(columns={
    'COLA': pa.Column(str, checks=[FC_CHECK], nullable=True, coerce=True),
    'COLB': pa.Column(str, checks=[FA_CHECK], nullable=False, coerce=True),
})
try:
    DF_SCHEMA.validate(DF, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Has invalid entries.")
    # DataFrame with one row per failing value
    EFC = err.failure_cases
Above code produces:
schema_context | column | check | check_number | failure_case | index |
---|---|---|---|---|---|
Column | COLA | isin({'B', 'A', 'C'}) | 0 | E | 0 |
Column | COLA | isin({'B', 'A', 'C'}) | 0 | E | 1 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | nan | 16 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | nan | 12 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | E | 8 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | W | 5 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | W | 4 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | W | 3 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | W | 2 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | W | 1 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | E | 0 |
Column | COLA | isin({'B', 'A', 'C'}) | 0 | E | 14 |
Column | COLA | isin({'B', 'A', 'C'}) | 0 | nan | 12 |
Column | COLA | isin({'B', 'A', 'C'}) | 0 | E | 10 |
Column | COLA | isin({'B', 'A', 'C'}) | 0 | nan | 9 |
Column | COLA | isin({'B', 'A', 'C'}) | 0 | E | 5 |
Column | COLA | isin({'B', 'A', 'C'}) | 0 | E | 4 |
Column | COLA | isin({'B', 'A', 'C'}) | 0 | E | 3 |
Column | COLA | isin({'B', 'A', 'C'}) | 0 | E | 2 |
Column | COLB | isin({'Y', 'Z', 'X'}) | 0 | nan | 17 |
There are two problems: 1) from the output above, it seems that not all instances of 'E' are being captured, and 2) not all instances of 'nan' are captured, despite setting ignore_na=False and nullable=True.
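One way to check whether failure cases are really missing is to compute the expected set independently with plain pandas and compare its size and index against err.failure_cases. The sketch below rebuilds the same sample data and uses a hypothetical helper, expected_failures, which is not part of pandera. It treats every value outside the allowed set as a failure, which is an assumption about what the isin check should report when ignore_na=False.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
cola = pd.Series(list(np.random.choice(['A', 'B', 'C', 'E', np.nan], 1000)))
colb = pd.Series(list(np.random.choice(['X', 'Y', 'Z', 'E', np.nan], 1000)))
cola.loc[[1, 2, 3, 4, 5]] = 'E'
colb.loc[[1, 2, 3, 4, 5]] = 'W'
df = pd.DataFrame({'COLA': cola, 'COLB': colb})


def expected_failures(series, allowed):
    """Hypothetical helper: values not in `allowed`, with their original index."""
    return series[~series.isin(list(allowed))]


exp_a = expected_failures(df['COLA'], {'A', 'B', 'C'})
exp_b = expected_failures(df['COLB'], {'X', 'Y', 'Z'})

# Number of expected failing rows per column, and a breakdown by value.
print(len(exp_a), len(exp_b))
print(exp_a.value_counts())
print(exp_b.value_counts())
```

Comparing len(exp_a) with the number of rows in EFC where column == 'COLA' (and likewise for COLB) shows whether rows are actually dropped, or whether the table shown is only a partial view of a longer result.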
I resolved my problem by setting ignore_na=True on the checks and nullable=True/False on each Column(...) depending on the configuration, and for now it seems to be working.
One further comment: I'm attempting to generate pandera DataFrameSchema objects dynamically (on the fly) to validate many datasets. I think pandera is useful because it is lightweight and easy to get started with, yet effective.