pandera
Set default Check n_failure_cases to None and add documentation for n_failure_cases behavior
Describe the bug
pa.errors.SchemaErrors.failure_cases only returns the first 10 failure cases.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera (0.6.5).
- [ ] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
import pandas as pd
import pandera as pa

df = pd.DataFrame({'n': range(20)})
schema = pa.DataFrameSchema({
    "n": pa.Column(pa.Int, pa.Check.greater_than(30))
})

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)
Expected behavior
err.failure_cases should be 20 lines long
Usage
I'm using pandera for data validation at the start of data processing. When errors appear, I want the clean data to continue through the pipeline, while the corrupted rows are removed from the dataframe and their indexes stored somewhere, so I can fix the issue and plan a recovery pipeline run later. I was trying the following code when I discovered the issue (I've got MultiIndexes):
from ast import literal_eval  # needed to parse the stringified index tuples

import pandas as pd
import pandera as pa

df = pd.DataFrame({'n': range(20), 'a': range(20), 'b': range(20)}).set_index(['a', 'b'])
schema = pa.DataFrameSchema({
    "n": pa.Column(pa.Int, pa.Check.greater_than(30))
})

try:
    schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    f = err.failure_cases  # dataframe of schema errors
    d = err.data  # invalid dataframe
    df_failures = pd.DataFrame(err.failure_cases['index'].apply(literal_eval).values.tolist(), columns=err.data.index.names)
    print(df_failures)
df_failures could contain the indexes that I need to remove the corrupted data from the dataframe and store to prepare the recovery run, but it only has the first 10.
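As a rough sketch of the recovery step described above, the snippet below splits a dataframe into clean and corrupted rows once the failing index tuples are known. It uses pandas only. The helper name split_clean_and_corrupted and the hand-built failure_index_strings list are hypothetical stand-ins for the values read from err.failure_cases['index'], assumed here to be stringified index tuples as in the code above.

```python
from ast import literal_eval

import pandas as pd


def split_clean_and_corrupted(df, failure_index_strings):
    """Return (clean, corrupted) given stringified index tuples of failing rows."""
    # Assumption: each entry looks like "(3, 3)", i.e. the repr of an index tuple.
    failed_tuples = [literal_eval(s) for s in failure_index_strings]
    mask = df.index.isin(failed_tuples)
    return df[~mask], df[mask]


df = pd.DataFrame({"n": range(20), "a": range(20), "b": range(20)}).set_index(["a", "b"])

# Hypothetical stand-in for err.failure_cases["index"]: pretend rows 0..5 failed.
failure_index_strings = [repr((i, i)) for i in range(6)]

clean, corrupted = split_clean_and_corrupted(df, failure_index_strings)
```

Here clean can continue through the pipeline, and corrupted.index can be stored for the later recovery run. This only works if every failure case is reported, which is why the 10-row cap matters.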
Update: I looked a bit deeper into the pandera code, and it turns out the constant pandera.constants.N_FAILURE_CASES decides this behaviour. Manually setting the value to None made all the indexes show up as expected.
However, I think this constant should be configurable (turn this issue into an improvement maybe) and its behaviour should be made explicit in the doc https://pandera.readthedocs.io/en/stable/lazy_validation.html.
I'm also noting there's a def Field(*, n_failure_cases: int = 10) that does not rely on the constant.
Update 2: Found a better way to do it. Couldn't find any documentation talking about this however.
schema = pa.DataFrameSchema({
    "n": pa.Column(pa.Int, pa.Check.greater_than(30, n_failure_cases=None))
})
hi @jgirault-qs thanks for pointing this out! there are two things here to improve the experience:
- add documentation describing the behavior of n_failure_cases
- set the default for n_failure_cases to None.
Let me know if you're interested in making a contribution to any one (or both 😃 ) of these items!
I would love to pick this up
cool, thanks @Patil2099! Head over to the contributing guide to see how to set up your dev environment
Hey @Patil2099 just checking in: are you still able to tackle this issue?
Hi @cosmicBboy, I wanted to pick this up and looked through the code base and I think this is already done. Can we close this issue?