pandera Missing invalid values under failure_cases column in err.failure_cases after try-except lazy-evaluating DataFrameSchema.validate(DataFrame, lazy=True)

Hi to whom it concerns

Thanks for the great work by the pandera team; it really helps me validate a huge number of datasets.

I encountered the following issue. [Screenshot: pandera_failure_cases_issue]

This is a sample of what I get after running:

try:
    # schema is a pandera.DataFrameSchema; dataset is a pd.DataFrame
    schema.validate(dataset, lazy=True)
except pandera.errors.SchemaErrors as err:
    failure_cases_df = err.failure_cases

The problem is that the failure_case column of failure_cases_df contains NaN values instead of the invalid values.

When I set up a SeriesSchema and validate that same column of the dataset on its own, it correctly identifies which values in the column are invalid:

SchemaError: <Schema SeriesSchema(name=None, type=DataType(str))> failed element-wise validator 0:
<Check isin: isin({some_sets_of_values})>
failure cases:
   index failure_case
0    582            A
1    583            B
2    584            B
3    585            C
4    607            C
5    608            B
6    609            B
7    610            B
8    613            D
9    614            E

Please assist, thanks.

Jul 08 '22 07:07 jymchng

hi @jymchng, glad you're finding this useful!

Just to clarify, you're expecting to see the actual failure cases ("A", "B", "D", etc.) as you show in the SeriesSchema validation, but you don't see them in the DataFrame validation call?

Would you mind creating a minimally reproducible example of what you're seeing? This will help me debug the issue.

Jul 08 '22 13:07 cosmicBboy

Good day to you cosmicBboy,

After investigating it for hours, I think this bug occurs when one sets the Check(... , ignore_na=False, ...) while setting Column(... , nullable=True, ...).

The following code provides a reproducible example.

import numpy as np
import pandas as pd
import pandera as pa

np.random.seed(42)
# Checks configured to also fail on null values (ignore_na=False)
FC_CHECK = pa.Check.isin({'A', 'B', 'C'}, ignore_na=False)
FA_CHECK = pa.Check.isin({'X', 'Y', 'Z'}, ignore_na=False)
# Two random columns mixing valid values, an invalid 'E', and np.nan
FC_S = pd.Series([_ for _ in np.random.choice(['A', 'B', 'C', 'E', np.nan], 1000)])
FA_S = pd.Series([_ for _ in np.random.choice(['X', 'Y', 'Z', 'E', np.nan], 1000)])
# Force known invalid values into rows 1 to 5
FC_S.loc[[1, 2, 3, 4, 5]] = 'E'
FA_S.loc[[1, 2, 3, 4, 5]] = 'W'
DF = pd.concat([FC_S, FA_S], axis=1)
DF.columns = ['COLA', 'COLB']

Above code produces:

    COLA COLB
0      E    E
1      E    W
2      E    W
3      E    W
4      E    W
..   ...  ...
995    B    X
996    A  nan
997    A    X
998    E  nan
999    C    Z
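One detail of the sample data that may be worth checking before interpreting the results: np.random.choice converts its input list to a NumPy array, and an array that mixes strings with np.nan becomes a string array. The sketch below (the variable names are illustrative, not from the original post) shows how to tell whether a column holds true missing values or the literal string "nan".

```python
import numpy as np
import pandas as pd

# Mixing str and float values in one array yields a unicode string array,
# so np.nan is stored as the text "nan" rather than as a missing value.
values = np.array(['A', 'B', 'C', 'E', np.nan])
print(values.dtype.kind)  # 'U' means unicode strings

np.random.seed(42)
sample = pd.Series(np.random.choice(['A', 'B', 'C', 'E', np.nan], 1000))
print(sample.isna().sum())      # count of true missing values
print((sample == 'nan').sum())  # count of literal "nan" strings
```

If the second count is non-zero while the first is zero, the column contains no real nulls, so ignore_na and nullable would not be exercised by those rows.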

DF_SCHEMA = pa.DataFrameSchema(columns={
    'COLA': pa.Column(str, checks=[FC_CHECK], nullable=True, coerce=True),
    'COLB': pa.Column(str, checks=[FA_CHECK], nullable=False, coerce=True)
})

try:
    DF_SCHEMA.validate(DF, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Has invalid entries.")
    # Collect every failure case gathered during lazy validation
    EFC = err.failure_cases

Above code produces:

schema_context	column	check	failure_case	index
Column	COLA	isin({'B', 'A', 'C'})	E	0
Column	COLA	isin({'B', 'A', 'C'})	E	1
Column	COLB	isin({'Y', 'Z', 'X'})	nan	16
Column	COLB	isin({'Y', 'Z', 'X'})	nan	12
Column	COLB	isin({'Y', 'Z', 'X'})	E	8
Column	COLB	isin({'Y', 'Z', 'X'})	W	5
Column	COLB	isin({'Y', 'Z', 'X'})	W	4
Column	COLB	isin({'Y', 'Z', 'X'})	W	3
Column	COLB	isin({'Y', 'Z', 'X'})	W	2
Column	COLB	isin({'Y', 'Z', 'X'})	W	1
Column	COLB	isin({'Y', 'Z', 'X'})	E	0
Column	COLA	isin({'B', 'A', 'C'})	E	14
Column	COLA	isin({'B', 'A', 'C'})	nan	12
Column	COLA	isin({'B', 'A', 'C'})	E	10
Column	COLA	isin({'B', 'A', 'C'})	nan	9
Column	COLA	isin({'B', 'A', 'C'})	E	5
Column	COLA	isin({'B', 'A', 'C'})	E	4
Column	COLA	isin({'B', 'A', 'C'})	E	3
Column	COLA	isin({'B', 'A', 'C'})	E	2
Column	COLB	isin({'Y', 'Z', 'X'})	nan	17

There are two problems: 1) from the output above, it seems that not all instances of 'E' are captured, and 2) not all instances of 'nan' are captured, despite setting ignore_na=False and nullable=True.
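One way to confirm that failure cases are missing is to compute the expected ones directly with pandas and compare the counts per column against err.failure_cases. This is only a sketch: the small frame and the names df, allowed and expected are hypothetical stand-ins for the data above, and it assumes that nulls should count as failures when ignore_na=False.

```python
import pandas as pd

# Small hypothetical frame standing in for DF from the example above.
df = pd.DataFrame({
    'COLA': ['A', 'E', None, 'B', 'E'],
    'COLB': ['X', 'W', 'Y', None, 'E'],
})
allowed = {'COLA': {'A', 'B', 'C'}, 'COLB': {'X', 'Y', 'Z'}}

# For each column, keep the rows whose value is not in the allowed set.
# Series.isin returns False for missing values, so nulls are kept as failures.
expected = pd.concat(
    [
        pd.DataFrame({
            'column': col,
            'failure_case': df.loc[~df[col].isin(values), col],
        })
        for col, values in allowed.items()
    ]
).rename_axis('index').reset_index()

print(expected.groupby('column').size())
```

Comparing these per-column counts with err.failure_cases.groupby('column').size() would show whether any rows were dropped from the report.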

I resolved my problem by setting ignore_na=True on the Check(...) and nullable=True/False on the Column(...) depending on the configuration, and for now it seems to be working.

Jul 08 '22 15:07 jymchng

As a further comment, I'm attempting to generate pandera DataFrameSchema objects dynamically (on the fly) to validate many datasets. I think pandera is useful because it is lightweight and easy to get started with, yet effective.

Jul 08 '22 15:07 jymchng
