Ability to make checks with lambda functions on nullable int columns (Pandas' `Int64`)

Open cdagnino opened this issue 3 years ago • 5 comments

Suppose I have a dataframe with a nullable int column (Int64 in pandas instead of the non-nullable int64) and a nullable float column:

import pandas as pd
import numpy as np
from pandera import (
    DataFrameSchema,
    Column,
    Check,
    Index,
    PandasDtype,
)

df = pd.DataFrame({'x1': [2, 3, np.nan], 'x2': [1.0, np.nan, 2.1]})
df['x1'] = df['x1'].astype('Int64')

Which gives

     x1   x2
0     2  1.0
1     3  NaN
2  <NA>  2.1

If I try to use a lambda function to check the x1 column

test_schema = DataFrameSchema(
    columns={
        "x1": Column(
            pandas_dtype=PandasDtype.INT64,
            allow_duplicates=False,
            nullable=True,
            checks=Check(lambda x: 0 <= x <= 3, element_wise=True)
        ),
        "x2": Column(
            pandas_dtype=PandasDtype.Float64,
            nullable=True,
            checks=Check(lambda x: 0.0 <= x <= 3, element_wise=True)
        ),
    },
    coerce=False,
    strict=False,
    name="test_schema")

test_schema.validate(df)

then I get a not-so-informative error message. In particular, it doesn't tell me which column or row was the offender:

SchemaError                               Traceback (most recent call last)
<ipython-input-41-7dbd5edc716e> in <module>
     30     name="test_schema")
     31 
---> 32 test_schema.validate(df)

/opt/conda/lib/python3.7/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    590                 check_results.append(isinstance(result, pd.DataFrame))
    591             except errors.SchemaError as err:
--> 592                 error_handler.collect_error("schema_component_check", err)
    593             except errors.SchemaErrors as err:
    594                 for schema_error_dict in err.schema_errors:

/opt/conda/lib/python3.7/site-packages/pandera/error_handlers.py in collect_error(self, reason_code, schema_error, original_exc)
     30         """
     31         if not self._lazy:
---> 32             raise schema_error from original_exc
     33 
     34         # delete data of validated object from SchemaError object to prevent

/opt/conda/lib/python3.7/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    586                     lazy=lazy if schema_component.has_subcomponents else None,
    587                     # don't make a copy of the data
--> 588                     inplace=True,
    589                 )
    590                 check_results.append(isinstance(result, pd.DataFrame))

/opt/conda/lib/python3.7/site-packages/pandera/schemas.py in __call__(self, check_obj, head, tail, sample, random_state, lazy, inplace)
   1884         """Alias for ``validate`` method."""
   1885         return self.validate(
-> 1886             check_obj, head, tail, sample, random_state, lazy, inplace
   1887         )
   1888 

/opt/conda/lib/python3.7/site-packages/pandera/schema_components.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
    209                     )
    210             else:
--> 211                 validate_column(check_obj, column_name)
    212 
    213         return check_obj

/opt/conda/lib/python3.7/site-packages/pandera/schema_components.py in validate_column(check_obj, column_name)
    189                 random_state,
    190                 lazy,
--> 191                 inplace=inplace,
    192             )
    193 

/opt/conda/lib/python3.7/site-packages/pandera/schemas.py in validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
   1861                         check_index=check_index,
   1862                     ),
-> 1863                     original_exc=err,
   1864                 )
   1865 

/opt/conda/lib/python3.7/site-packages/pandera/error_handlers.py in collect_error(self, reason_code, schema_error, original_exc)
     30         """
     31         if not self._lazy:
---> 32             raise schema_error from original_exc
     33 
     34         # delete data of validated object from SchemaError object to prevent

SchemaError: Error while executing check function: TypeError("boolean value of NA is ambiguous")

Note that it DOES work if I check the Int64 column with the equivalent Check.in_range(0, 3) instead.

Describe the solution you'd like

The ability to check an Int64 (nullable integer column) with a lambda function, just like Float64 can be checked.

Describe alternatives you've considered

If this is not possible, then the error message could be improved, hopefully to include that Int64 can't be checked with lambda functions. Otherwise include this gotcha in the documentation (Maybe it's already there, but I couldn't find it 😬)

cdagnino avatar Jun 16 '21 14:06 cdagnino

hey @cdagnino, I'd recommend using built-in checks or explicitly handling the NA case in the lambda function when element_wise=True. The issue you're running into happens because the element-wise function doesn't know how to operate on null values.

Some solutions are:

Check(lambda x: pd.notna(x) and (0 <= x <= 3), element_wise=True)  # if you want to use element-wise
Check(lambda s: s.between(0, 3))  # vectorized check
Check.in_range(0, 3)  # built-in pandera check
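
To see why the original lambda fails and what the guard changes, here is a minimal sketch in plain pandas, without pandera. The helper names in_range and in_range_na_safe are made up for illustration; the sketch only shows how the comparisons behave on pd.NA, not how pandera dispatches checks internally.

```python
import pandas as pd

# Nullable integer column like x1 in the report above.
s = pd.Series([2, 3, None], dtype="Int64")

def in_range(x):
    # The chained comparison calls bool() on `0 <= x`, which is pd.NA for nulls.
    return 0 <= x <= 3

def in_range_na_safe(x):
    # Short-circuits on nulls, so the range comparison never sees pd.NA.
    return bool(pd.notna(x) and (0 <= x <= 3))

# Calling the unguarded function on a null raises TypeError.
try:
    in_range(pd.NA)
    unguarded_error = None
except TypeError as exc:
    unguarded_error = str(exc)

# Iterating the Series yields pd.NA for the missing entry.
guarded = [in_range_na_safe(x) for x in s]

# The vectorized comparison propagates missing values instead of raising.
vectorized = s.between(0, 3)

print(unguarded_error)
print(guarded)
print(vectorized.isna().tolist())
```

Note that the guarded version maps a null to False. Whether that row then counts as a failure depends on how pandera treats nulls for the column, which this sketch does not model.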

cosmicBboy avatar Jun 20 '21 20:06 cosmicBboy

hi @cosmicBboy, thanks for the solutions! What do you think the outcome of this issue should be, though? Will there be no change in behavior, but perhaps an improved error message? Or maybe just a mention in the docs?

cdagnino avatar Jun 22 '21 16:06 cdagnino

I have a similar problem. I am trying to check a float column that can contain strings and nulls. Nulls are acceptable and shouldn't produce an error. So my question is: how can I write a Check that accepts numbers between -90 and 90, also accepts nulls (empty fields are fine), but rejects strings?

I tried writing a helper function, but if the function fails, Pandera doesn't report any error. The clearest example is float(x): if x is empty, Python raises "ValueError: could not convert string to float: ''". However, Pandera doesn't crash; it continues to run.

parayamelo avatar Sep 01 '21 09:09 parayamelo

hi @parayamelo, can you provide a minimal reproducible example of the behavior you're seeing?

cosmicBboy avatar Sep 01 '21 14:09 cosmicBboy

@cdagnino yes, I think a better error message would help here. The output of an element-wise Check function must always be a boolean, so outputting an NA value should not be allowed.

Basically, right here there should only be False or True values, otherwise a SchemaDefinitionError should be raised.
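
As a rough illustration of that rule (not pandera's actual implementation), the sketch below applies a function element by element and rejects any output that is not strictly True or False. The names apply_element_wise and CheckOutputError are hypothetical; CheckOutputError only stands in for the SchemaDefinitionError mentioned above.

```python
import numpy as np
import pandas as pd

class CheckOutputError(TypeError):
    """Hypothetical stand-in for the SchemaDefinitionError mentioned above."""

def apply_element_wise(series, check_fn):
    """Apply check_fn to each value and require a strict boolean result."""
    results = []
    for idx, value in series.items():
        out = check_fn(value)
        # pd.NA (or any other non-boolean) is rejected with the offending index.
        if not isinstance(out, (bool, np.bool_)):
            raise CheckOutputError(
                f"check returned {out!r} for index {idx!r}; expected True or False"
            )
        results.append(bool(out))
    return pd.Series(results, index=series.index, dtype=bool)

s = pd.Series([2, 3, None], dtype="Int64")

# A single comparison returns pd.NA for the null row, which is rejected.
try:
    apply_element_wise(s, lambda x: x <= 3)
    message = None
except CheckOutputError as exc:
    message = str(exc)

# A null-aware function always returns a boolean and passes through.
ok = apply_element_wise(s, lambda x: pd.notna(x) and x <= 3)

print(message)
print(ok.tolist())
```

The point of raising at this stage is that the message can name the index that produced the non-boolean value, which is the information missing from the traceback in the original report.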

cosmicBboy avatar Sep 01 '21 16:09 cosmicBboy