Set default Check n_failure_cases to None and add documentation for n_failure_cases behavior

Open jgirault-qs opened this issue 3 years ago • 5 comments

Describe the bug

pa.errors.SchemaErrors.failure_cases only returns the first 10 failure cases.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera. 0.6.5
  • [ ] (optional) I have confirmed this bug exists on the master branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa

df = pd.DataFrame({'n': range(20)})
schema = pa.DataFrameSchema({
    "n": pa.Column(pa.Int, pa.Check.greater_than(30))
})
try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)

Expected behavior

err.failure_cases should be 20 lines long

Usage

I'm using pandera for data validation at the start of data processing. When errors appear, I want the clean data to continue through the pipeline, while the corrupted rows are removed from the dataframe and their indexes stored somewhere, so I can fix the issue and plan a recovery pipeline run later. I was trying the following code when I discovered the issue (my dataframes have MultiIndexes).

from ast import literal_eval  # used below to parse the values in failure_cases['index']

import pandas as pd
import pandera as pa

df = pd.DataFrame({'n': range(20), 'a': range(20), 'b': range(20)}).set_index(['a', 'b'])
schema = pa.DataFrameSchema({
    "n": pa.Column(pa.Int, pa.Check.greater_than(30))
})

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    f = err.failure_cases  # dataframe of schema errors
    d = err.data  # invalid dataframe
    # rebuild one column per index level from the 'index' column of the failure cases
    df_failures = pd.DataFrame(err.failure_cases['index'].apply(literal_eval).values.tolist(), columns=err.data.index.names)
    print(df_failures)

df_failures could contain the indexes that I need to remove the corrupted data from the dataframe and store to prepare the recovery run, but it only has the first 10.
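The split described above (clean rows continue, failing rows are set aside together with their index labels) can be sketched with plain pandas once the failing labels are known. This is only a sketch: the `failing` labels below are a hypothetical stand-in for what would be recovered from err.failure_cases, and the names `quarantined` and `clean` are illustrative, not part of pandera.

```python
import pandas as pd

# Same shape as the example above: a frame with a two-level MultiIndex.
df = pd.DataFrame(
    {"n": range(20), "a": range(20), "b": range(20)}
).set_index(["a", "b"])

# Hypothetical stand-in for the index labels taken from err.failure_cases.
failing = pd.MultiIndex.from_tuples(
    [(0, 0), (3, 3), (7, 7)], names=df.index.names
)

# Boolean mask marking the rows whose index label is in the failing set.
is_failing = df.index.isin(failing)

quarantined = df.loc[is_failing]   # rows to store for the recovery run
clean = df.loc[~is_failing]        # rows that can continue in the pipeline

print(len(quarantined), len(clean))
```

Keeping `quarantined` as a dataframe (rather than only the labels) preserves both the index and the offending values for the later recovery run.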

jgirault-qs, Jul 23 '21 09:07

Update: looked a bit deeper into the code of pandera, turns out it's the variable pandera.constants.N_FAILURE_CASES that decides this behaviour. Manually setting the value to None made all the indexes show up as expected.

However, I think this constant should be configurable (turn this issue into an improvement maybe) and its behaviour should be made explicit in the doc https://pandera.readthedocs.io/en/stable/lazy_validation.html. I'm also noting there's a def Field(*, n_failure_cases: int = 10) that does not rely on the constant.

Update 2: Found a better way to do it. Couldn't find any documentation talking about this however.

schema = pa.DataFrameSchema({
    "n": pa.Column(pa.Int, pa.Check.greater_than(30, n_failure_cases=None))
})

jgirault-qs, Jul 23 '21 10:07

hi @jgirault-qs thanks for pointing this out! there are two things here to improve the experience:

  1. add documentation to document the behavior of n_failure_cases
  2. set the default for n_failure_cases to None.

Let me know if you're interested in making a contribution to any one (or both 😃 ) of these items!

cosmicBboy, Jul 24 '21 21:07

I would love to pick this up

Patil2099, Jul 25 '21 11:07

cool, thanks @Patil2099! Head over to the contributing guide to see how to set up your dev environment

cosmicBboy, Jul 25 '21 15:07

Hey @Patil2099 just checking in: are you still able to tackle this issue?

cosmicBboy, Sep 06 '21 16:09

Hi @cosmicBboy, I wanted to pick this up and looked through the code base and I think this is already done. Can we close this issue?

azhar316, Sep 10 '23 06:09