Conditional checks
Is your feature request related to a problem? Please describe. I would like to be able to make a different check on a column depending on the value of another column.
I currently have the following code:
def check_issue(data, schema_pa):
    '''
    This function takes a pandera schema and extracts all the record ids that are problematic
    ----------------------------------------
    Input:
    - data: a dataframe containing the data to check (pd.DataFrame)
    - schema_pa: a pandera schema (pa.DataFrameSchema)
    Output:
    - ids_err: a list containing the ids which do not comply with the schema
    '''
    ids_err = []
    df = data.copy()
    while True:  # Necessary since failure_cases only returns the first 10 errors
        try:
            schema_pa(df, lazy=True)
            break
        except pa.errors.SchemaErrors as err:
            temp_err = err.failure_cases["index"].tolist()  # Retrieve ids with the issue
            ids_err = ids_err + temp_err  # Running list of ids with the issue
            df = df[~df.index.isin(temp_err)]  # Drop them so the next pass surfaces new failing ids
    return ids_err
####################
schema = pa.DataFrameSchema(
    columns={
        "b": pa.Column(int, coerce=True, nullable=False)
    }
)
err_ids = check_issue(data.query("a != 0"), schema)
which forces me to subset my dataframe for different values of column a before performing a check. I have found no way of performing a conditional check (i.e. if column a == 0 then it's ok if column b is null, otherwise it's not), neither in the documentation nor on StackOverflow nor in the issues here. Including it in a check directly would allow a common schema to be used on the entire dataframe rather than on subsets of it.
Describe the solution you'd like
I would like to have a conditional check, which could be a subclass of Check and work similarly to the if_else() function in R:
DataFrameSchema({
    "col": pa.Column("type", checks=[if_else(condition, check_if_true, check_if_false)])
})
This is an example at the column level but it could also be used at other levels.
Describe alternatives you've considered
Subsetting dataframes based on the condition and performing separate checks. I also tried creating a custom Check following the documentation (in particular the section "Wide Checks", which uses several columns) but did not manage to come up with something satisfactory.
I may have overlooked something or misunderstood how to use custom Checks, in which case I apologise and would really like some feedback on how to perform this task.
Thank you very much
Hi @PezAmaury, a built-in way to do conditional checks at the column level would for sure be useful. Let's keep this issue open, because I have a few ideas on how to implement this holistically (we also need to consider feature parity with the SchemaModel API).
One way you can do this today would be to use wide checks, as follows:
import pandas as pd
import pandera as pa
def conditional_check(df: pd.DataFrame) -> pd.Series:
    """Check that if column a == 0 then it's ok if column b is null, otherwise it's not.
    Notice that the output of this function is a boolean Series that's index-aligned with the
    original dataframe. This produces the most informative error message, pinpointing where
    exactly in the dataframe the error occurred.
    """
    # create a boolean Series indicating which values are non-null in column "b"
    b_not_null = df["b"].notna()
    # replace entries in `b_not_null` with `True` wherever column "a" is 0, so that
    # the check always passes when "a" == 0
    return b_not_null.where(df["a"] != 0, True)
schema = pa.DataFrameSchema(
    columns={
        "a": pa.Column(int, pa.Check.isin([0, 1])),
        # using the pandas-native nullable integer dtype
        "b": pa.Column(pd.Int64Dtype(), nullable=True),
    },
    # ignore_na=False is important! it makes sure pandera doesn't skip null entries
    checks=pa.Check(conditional_check, ignore_na=False),
    coerce=True,
)
print(schema(pd.DataFrame({"a": [0, 1], "b": [None, 1]})))
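Independently of the schema machinery, the intended pass/fail logic ("a null in column b is only acceptable when a == 0") can be sanity-checked with a pandas-only sketch; the sample values below are invented to cover all four combinations:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [0, 0, 1, 1],
    "b": [None, 5.0, None, 5.0],
})

# pass wherever "b" is non-null, except that rows with "a" == 0 always pass
result = df["b"].notna().where(df["a"] != 0, True)
print(result.tolist())  # [True, True, False, True] -- only a == 1 with a null "b" fails
```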
conditional_check can even be reduced to an inline lambda if you like:
schema = pa.DataFrameSchema(
    columns={
        "a": pa.Column(int, pa.Check.isin([0, 1])),
        # using the pandas-native nullable integer dtype
        "b": pa.Column(pd.Int64Dtype(), nullable=True),
    },
    checks=pa.Check(lambda df: df["b"].notna().where(df["a"] != 0, True), ignore_na=False),
    coerce=True,
)
Hi @cosmicBboy
Thank you very much for your message and your workaround. I will use it for now and will look forward to the native implementation :)