pandera
Ability to make checks with lambda functions on nullable int columns (Pandas' `Int64`)
Suppose I have a dataframe with a nullable int column (`Int64` in pandas instead of the non-nullable `int64`) and a nullable float column:
import pandas as pd
import numpy as np
from pandera import (
    DataFrameSchema,
    Column,
    Check,
    Index,
    PandasDtype,
)
df = pd.DataFrame({'x1': [2, 3, np.nan], 'x2': [1.0, np.nan, 2.1]})
df['x1'] = df['x1'].astype('Int64')
Which gives
| | x1 | x2 |
|---|---|---|
| 0 | 2 | 1.0 |
| 1 | 3 | NaN |
| 2 | <NA> | 2.1 |
If I try to use a `lambda` function to check the `x1` column
test_schema = DataFrameSchema(
    columns={
        "x1": Column(
            pandas_dtype=PandasDtype.INT64,
            allow_duplicates=False,
            nullable=True,
            checks=Check(lambda x: 0 <= x <= 3, element_wise=True)
        ),
        "x2": Column(
            pandas_dtype=PandasDtype.Float64,
            nullable=True,
            checks=Check(lambda x: 0.0 <= x <= 3, element_wise=True)
        )},
    coerce=False,
    strict=False,
    name="test_schema")
test_schema.validate(df)
then I get a not so informative error message. In particular, I don't get informed which column or row was the offender:
SchemaError                               Traceback (most recent call last)
<ipython-input-41-7dbd5edc716e> in <module>
30 name="test_schema")
31
---> 32 test_schema.validate(df)
/opt/conda/lib/python3.7/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
590 check_results.append(isinstance(result, pd.DataFrame))
591 except errors.SchemaError as err:
--> 592 error_handler.collect_error("schema_component_check", err)
593 except errors.SchemaErrors as err:
594 for schema_error_dict in err.schema_errors:
/opt/conda/lib/python3.7/site-packages/pandera/error_handlers.py in collect_error(self, reason_code, schema_error, original_exc)
30 """
31 if not self._lazy:
---> 32 raise schema_error from original_exc
33
34 # delete data of validated object from SchemaError object to prevent
/opt/conda/lib/python3.7/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
586 lazy=lazy if schema_component.has_subcomponents else None,
587 # don't make a copy of the data
--> 588 inplace=True,
589 )
590 check_results.append(isinstance(result, pd.DataFrame))
/opt/conda/lib/python3.7/site-packages/pandera/schemas.py in __call__(self, check_obj, head, tail, sample, random_state, lazy, inplace)
1884 """Alias for ``validate`` method."""
1885 return self.validate(
-> 1886 check_obj, head, tail, sample, random_state, lazy, inplace
1887 )
1888
/opt/conda/lib/python3.7/site-packages/pandera/schema_components.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
209 )
210 else:
--> 211 validate_column(check_obj, column_name)
212
213 return check_obj
/opt/conda/lib/python3.7/site-packages/pandera/schema_components.py in validate_column(check_obj, column_name)
189 random_state,
190 lazy,
--> 191 inplace=inplace,
192 )
193
/opt/conda/lib/python3.7/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
1861 check_index=check_index,
1862 ),
-> 1863 original_exc=err,
1864 )
1865
/opt/conda/lib/python3.7/site-packages/pandera/error_handlers.py in collect_error(self, reason_code, schema_error, original_exc)
30 """
31 if not self._lazy:
---> 32 raise schema_error from original_exc
33
34 # delete data of validated object from SchemaError object to prevent
SchemaError: Error while executing check function: TypeError("boolean value of NA is ambiguous")
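The `TypeError` in the last line can be reproduced without pandera. The following is a minimal sketch, assuming pandas 1.0 or later (where `pd.NA` exists); `in_range` is a hypothetical name for the same comparison used in the lambda. A chained comparison is evaluated as `(0 <= x) and (x <= 3)`, and the `and` needs the truth value of the first half. With a float `NaN` that half is simply `False`, but with `pd.NA` it is `pd.NA`, which refuses to be converted to a boolean.

```python
import numpy as np
import pandas as pd


def in_range(x):
    # Same comparison as the lambda in the schema above.
    return 0 <= x <= 3


# An ordinary value behaves as expected.
print(in_range(2))

# Float NaN: `0 <= nan` is False, so the chain short-circuits without error.
print(in_range(np.nan))

# pd.NA: `0 <= pd.NA` is pd.NA, and `and` then asks for bool(pd.NA).
try:
    in_range(pd.NA)
except TypeError as exc:
    print(f"TypeError: {exc}")
```

This is likely why the float column appears to "work" while the `Int64` column fails: missing values in a float column are `NaN`, while missing values in an `Int64` column are `pd.NA`.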
Note that it DOES work if I check the `Int64` column with the equivalent `Check.in_range(0, 3)` instead.
Describe the solution you'd like
The ability to check an `Int64` (nullable integer) column with a lambda function, just like `Float64` can be checked.
Describe alternatives you've considered
If this is not possible, then the error message could be improved, ideally to say that `Int64` can't be checked with lambda functions. Otherwise, include this gotcha in the documentation (maybe it's already there, but I couldn't find it 😬).
hey @cdagnino, I'd recommend using built-in checks or explicitly handling the NA case in the lambda function when `element_wise=True`. The issue you're coming up against is happening because the element-wise function doesn't know how to operate on null values.
Some solutions are:
Check(lambda x: pd.notna(x) and (0 <= x <= 3), element_wise=True) # if you want to use element-wise
Check(lambda s: s.between(0, 3)) # vectorized check
Check.in_range(0, 3) # built-in pandera check
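To see what the element-wise variant does with a null, here is a small sketch in plain pandas (no pandera involved). The function names are hypothetical. Note that `pd.notna(x) and ...` evaluates to `False` for a null, so the null counts as a failing element; if nulls should count as passing instead, `pd.isna(x) or ...` is one alternative. How pandera itself treats nulls before running a check may depend on the version and on `nullable=True`, so this only illustrates the functions themselves.

```python
import pandas as pd

# Same values as the x1 column in the original example.
x1 = pd.Series([2, 3, None], dtype="Int64")


def in_range_nulls_fail(x):
    # pd.notna short-circuits, so the comparison never sees pd.NA.
    return bool(pd.notna(x) and (0 <= x <= 3))


def in_range_nulls_pass(x):
    # pd.isna short-circuits the other way: a null is accepted outright.
    return bool(pd.isna(x) or (0 <= x <= 3))


nulls_fail = [in_range_nulls_fail(x) for x in x1]
nulls_pass = [in_range_nulls_pass(x) for x in x1]

print(nulls_fail)  # [True, True, False]
print(nulls_pass)  # [True, True, True]
```

Both functions always return a plain `bool`, which avoids the ambiguous-NA error.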
hi @cosmicBboy, thanks for the solutions! What do you think should be the result of this issue though? Should the answer be that there won't be a change, but maybe the error message should be improved? Or maybe just a mention in the docs?
I have a similar problem. I am trying to check a float column. However, the column can contain strings and nulls. Nulls are accepted, so they shouldn't give an error. So my question is: how can I write a Check that accepts numbers between -90 and 90, also accepts nulls (it is OK to have empty fields), but rejects strings?
I tried writing a helper function, but the issue is that if the function fails, Pandera doesn't return any error. The clearest example is `float(x)`. If x is empty, the Python error is "ValueError: could not convert string to float: ''". However, Pandera doesn't crash; it continues to run.
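One way to write such a helper without calling `float(x)` is sketched below in plain pandas. The name `number_in_range_or_null` is hypothetical, and treating an empty or whitespace-only string as an empty field is an assumption about the data, not something stated above. The function returns a plain `bool` for every input rather than raising, so it could be passed to an element-wise check.

```python
import numbers

import pandas as pd


def number_in_range_or_null(x, low=-90, high=90):
    if isinstance(x, str):
        # Assumption: an empty string is an empty field; other strings fail.
        return x.strip() == ""
    if pd.isna(x):
        # None, NaN and pd.NA are all accepted as nulls.
        return True
    # Anything else must be a real number inside the closed range.
    return isinstance(x, numbers.Real) and bool(low <= x <= high)


values = pd.Series([45.5, None, "abc", -120, ""], dtype=object)
results = [number_in_range_or_null(v) for v in values]
print(results)  # [True, True, False, False, True]
```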
hi @parayamelo can you provide a minimally reproducible example of the behavior you're seeing?
@cdagnino yes I think a better error message would help here. The output of an element-wise `Check` function must always be a boolean, so outputting an NA value should not be allowed.
Basically, right here there should only be `False` or `True` values, otherwise a `SchemaDefinitionError` should be raised.
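As a rough illustration of that idea (not pandera's actual implementation), the sketch below applies a function element by element and rejects any non-boolean output, reporting the offending index and value. `apply_element_wise` and `NonBooleanCheckOutput` are hypothetical names; the exception stands in for the `SchemaDefinitionError` mentioned above.

```python
import numpy as np
import pandas as pd


class NonBooleanCheckOutput(TypeError):
    """Hypothetical stand-in for the proposed SchemaDefinitionError."""


def apply_element_wise(check_fn, series):
    """Apply check_fn to each element and require a boolean result."""
    results = []
    for idx, value in series.items():
        out = check_fn(value)
        if not isinstance(out, (bool, np.bool_)):
            raise NonBooleanCheckOutput(
                f"check returned {out!r} for value {value!r} at index {idx!r}; "
                "element-wise checks must return True or False"
            )
        results.append(bool(out))
    return pd.Series(results, index=series.index, dtype=bool)


x1 = pd.Series([2, 3, None], dtype="Int64")
try:
    # `pd.NA <= 3` evaluates to pd.NA, which is not a boolean.
    apply_element_wise(lambda x: x <= 3, x1)
except NonBooleanCheckOutput as exc:
    print(exc)
```

Unlike the original traceback, the message here names the row that produced the non-boolean value, which is the information the issue asks for.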