Missing invalid values under failure_cases column in err.failure_cases after try-except lazy-evaluating DataFrameSchema.validate(DataFrame, lazy=True)

Open jymchng opened this issue 2 years ago • 3 comments

Hi, to whom it may concern,

Thanks to the great work by the pandera team, it really helps me with validating a huge number of datasets.

I encountered the following issue.

[Context screenshot: pandera_failure_cases_issue]

This is a sample of what I get after running:

try:
    # schema is a pandera.DataFrameSchema, dataset is a pd.DataFrame
    schema.validate(dataset, lazy=True)
except pandera.errors.SchemaErrors as err:
    failure_cases_df = err.failure_cases

The problem is the NaN values under the failure_cases column of the failure_cases_df.

However, when I set up a SeriesSchema and validate that same column from the dataset (a pd.DataFrame), it correctly identifies which values in the column are invalid:

SchemaError: <Schema SeriesSchema(name=None, type=DataType(str))> failed element-wise validator 0:
<Check isin: isin({some_sets_of_values})>
failure cases:
   index failure_case
0    582            A
1    583            B
2    584            B
3    585            C
4    607            C
5    608            B
6    609            B
7    610            B
8    613            D
9    614            E

Please assist, thanks.

jymchng avatar Jul 08 '22 07:07 jymchng

hi @jymchng, glad you're finding this useful!

Just to clarify, you're expecting to see actual failure cases ("A", "B", "D", etc.) as you show in the SeriesSchema validation, but you don't see them in the DataFrame validation call?

Would you mind creating a minimal reproducible example of what you're seeing? This will help me debug the issue.

cosmicBboy avatar Jul 08 '22 13:07 cosmicBboy

Good day to you cosmicBboy,

After investigating it for hours, I think this bug occurs when one sets the Check(... , ignore_na=False, ...) while setting Column(... , nullable=True, ...).

The following code provides a reproducible example.

import numpy as np
import pandas as pd
import pandera as pa

# ignore_na=False as described above (remaining arguments, if any, omitted)
FC_CHECK = pa.Check.isin({'A', 'B', 'C'}, ignore_na=False)
FA_CHECK = pa.Check.isin({'X', 'Y', 'Z'}, ignore_na=False)
FC_S = pd.Series([_ for _ in np.random.choice(['A', 'B', 'C', 'E', np.nan], 1000)])
FA_S = pd.Series([_ for _ in np.random.choice(['X', 'Y', 'Z', 'E', np.nan], 1000)])
FC_S.loc[[1,2,3,4,5]] = 'E'
FA_S.loc[[1,2,3,4,5]] = 'W'
DF = pd.concat([FC_S, FA_S], axis=1)
DF.columns = ['COLA', 'COLB']

Above code produces:

    COLA COLB
0      E    E
1      E    W
2      E    W
3      E    W
4      E    W
..   ...  ...
995    B    X
996    A  nan
997    A    X
998    E  nan
999    C    Z
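A side observation that may be relevant to the nan rows: when a list mixing strings and np.nan is turned into a NumPy array, NumPy picks a single string dtype, so the NaN ends up as the text 'nan' rather than a real missing value. The sketch below only demonstrates that coercion with plain NumPy and pandas. Whether it explains the behaviour reported here is an assumption, not something confirmed in this thread.

```python
import numpy as np
import pandas as pd

pool = ['A', 'B', 'C', 'E', np.nan]

# NumPy picks one dtype for the whole array, so the float NaN becomes text.
coerced = np.asarray(pool)
print(coerced.dtype.kind)    # 'U' means a unicode string array
print(repr(coerced[-1]))     # the last entry is now a string

# Sampling directly from the list goes through the same conversion.
sample = np.random.choice(pool, 1000)

# A string 'nan' is not a missing value as far as pandas is concerned.
print(pd.isna(coerced[-1]))

# One way to keep real missing values: sample from an object array instead.
object_pool = np.array(pool, dtype=object)
object_sample = np.random.choice(object_pool, 1000)
```

If the 'nan' entries are strings, nullable and ignore_na would not treat them as nulls, and they would simply be reported as values outside the allowed set.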

DF_SCHEMA = pa.DataFrameSchema(columns={
    'COLA': pa.Column(str, checks=[FC_CHECK], nullable=True, coerce=True),
    'COLB': pa.Column(str, checks=[FA_CHECK], nullable=False, coerce=True),
})

try:
    DF_SCHEMA.validate(DF, lazy=True)
except pa.errors.SchemaErrors as err:
    print("Has invalid entries.")
    EFC = err.failure_cases

Above code produces:

schema_context  column  check                  check_number  failure_case  index
Column          COLA    isin({'B', 'A', 'C'})  0             E             0
Column          COLA    isin({'B', 'A', 'C'})  0             E             1
Column          COLB    isin({'Y', 'Z', 'X'})  0             nan           16
Column          COLB    isin({'Y', 'Z', 'X'})  0             nan           12
Column          COLB    isin({'Y', 'Z', 'X'})  0             E             8
Column          COLB    isin({'Y', 'Z', 'X'})  0             W             5
Column          COLB    isin({'Y', 'Z', 'X'})  0             W             4
Column          COLB    isin({'Y', 'Z', 'X'})  0             W             3
Column          COLB    isin({'Y', 'Z', 'X'})  0             W             2
Column          COLB    isin({'Y', 'Z', 'X'})  0             W             1
Column          COLB    isin({'Y', 'Z', 'X'})  0             E             0
Column          COLA    isin({'B', 'A', 'C'})  0             E             14
Column          COLA    isin({'B', 'A', 'C'})  0             nan           12
Column          COLA    isin({'B', 'A', 'C'})  0             E             10
Column          COLA    isin({'B', 'A', 'C'})  0             nan           9
Column          COLA    isin({'B', 'A', 'C'})  0             E             5
Column          COLA    isin({'B', 'A', 'C'})  0             E             4
Column          COLA    isin({'B', 'A', 'C'})  0             E             3
Column          COLA    isin({'B', 'A', 'C'})  0             E             2
Column          COLB    isin({'Y', 'Z', 'X'})  0             nan           17

There are two problems: 1) from the output above, it seems that not all instances of 'E' are captured, and 2) not all instances of 'nan' are captured, despite setting ignore_na=False and nullable=True.
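To judge whether failures are really missing, one option is to compute the expected failures independently with plain pandas and compare the counts against err.failure_cases. The helper `expected_failures` below is a hypothetical name of my own. It only approximates what an isin check with or without ignore_na would be expected to reject, and it is not pandera's implementation.

```python
import numpy as np
import pandas as pd

def expected_failures(series: pd.Series, allowed: set, ignore_na: bool) -> pd.Series:
    """Return the values an isin-style check would be expected to reject."""
    invalid = ~series.isin(allowed)
    if ignore_na:
        # Treat missing values as "not checked" rather than as failures
        invalid &= series.notna()
    return series[invalid]

# Small made-up sample containing one real missing value
s = pd.Series(["A", "E", np.nan, "B", "E"])
strict = expected_failures(s, {"A", "B", "C"}, ignore_na=False)
lenient = expected_failures(s, {"A", "B", "C"}, ignore_na=True)

print(len(strict), len(lenient))
```

Comparing `len(strict)` or `len(lenient)` with the number of rows reported for the same column in the failure cases frame would show whether any instances are being dropped.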

I resolved my problem by setting ignore_na=True on the Check(...) and nullable=True/False on the Column(...) depending on the configuration, and for now it seems to be working.

jymchng avatar Jul 08 '22 15:07 jymchng

Anyway, one further comment: I'm attempting to dynamically generate pandera DataFrameSchema objects on the fly to validate many datasets. I think pandera is useful because it is lightweight and easy to get started with, yet effective.

jymchng avatar Jul 08 '22 15:07 jymchng