pandera

Columns containing `bool` and `None` values do not validate correctly

Open xvr-hlt opened this issue 1 year ago • 9 comments

Describe the bug

If I create a pa.DataFrameSchema with a pa.Column(bool, nullable=True), I expect something of the form [None, True] to pass validation, but it does not.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [ ] (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import pandas as pd
import pandera as pa

pa.DataFrameSchema({'x': pa.Column(bool, nullable=True)})(pd.DataFrame({'x': [True, None]}))
>>> ...
>>> SchemaError: expected series 'x' to have type bool, got object

Expected behavior

This should pass validation.

xvr-hlt avatar Sep 13 '24 15:09 xvr-hlt

Hi @xvr-hlt, this is not a bug. Python's None is not a bool, so pandas converts that series to object dtype, which causes the schema validation to fail. Instead, you may want to use the pandas nullable boolean dtype:

pa.DataFrameSchema({"x": pa.Column(pd.BooleanDtype, nullable=True)})(
    pd.DataFrame({"x": [True, pd.NA]}, dtype="boolean")
)

Edit: You can also replace pd.NA with None, because the dtype is given explicitly here and pandas converts None to pd.NA for you.
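The dtype inference described above can be checked with pandas alone, without pandera. A minimal sketch (the variable names are illustrative and not from the thread):

```python
import pandas as pd

# Without an explicit dtype, pandas cannot store None in a NumPy bool array,
# so it falls back to the generic object dtype.
inferred = pd.Series([True, None])
print(inferred.dtype)

# With the nullable "boolean" extension dtype, None is stored as pd.NA.
nullable = pd.Series([True, None], dtype="boolean")
print(nullable.dtype)
print(nullable[1] is pd.NA)
```

The first series is what a `bool` column check sees when the data contains None, which is why the dtype check fails before nullability is even considered.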

Nick-Seinsche avatar Sep 15 '24 17:09 Nick-Seinsche

I understand that None is not a bool, what was confusing to me is that None was invalid for a field with nullable = True.

Additionally, this behaviour is inconsistent: with a str field, None is valid input where nullable=True:

import pandas as pd
import pandera as pa

pa.DataFrameSchema({'x': pa.Column(str, nullable=True)})(pd.DataFrame({'x': ["abc", None]}))

Passes without fail.

xvr-hlt avatar Sep 15 '24 21:09 xvr-hlt

Regarding your second point: Yes, this again is due to pandas. The series in your DataFrame has dtype object. Both "abc" and None are objects, and since nullable=True, None is allowed, so the validation passes.

Regarding your first point: The validation doesn't fail because nullable=True "doesn't work". It fails because you specify the column to be dtype bool, but the DataFrame you pass into the schema check has a column of dtype object.

Nick-Seinsche avatar Sep 16 '24 22:09 Nick-Seinsche

@xvr-hlt dealing with nulls in the default numpy types is a pain; I'd recommend using the pandas-native nullable dtype:

  • https://pandas.pydata.org/docs/user_guide/boolean.html
  • https://pandas.pydata.org/docs/reference/api/pandas.BooleanDtype.html#pandas.BooleanDtype

pandera's design choice is to delegate behavior to the underlying dataframe library. In this case it inherits the datatype behavior of pandas: a column holding both a boolean and a None value is interpreted by pandas as having an object dtype.

cosmicBboy avatar Sep 29 '24 18:09 cosmicBboy

@cosmicBboy I am facing a similar issue using pd.NA. It appears that coercion does not work properly in this case:

    mothballed: Series[bool] = Field(nullable=True)
    "Inactive but not fully retired"

This can contain either True, False or pd.NA.

Without coercion, I get:

pandera.errors.SchemaError: expected series 'ccs_installed' to have type bool, got object

Enabling coercion results in this error:

pandera.errors.SchemaError: Error while coercing 'ccs_installed' to type bool: Could not coerce <class 'pandas.core.series.Series'> data_container into type bool:
     index failure_case
0      145         <NA>
1      146         <NA>
2      148         <NA>
3      149         <NA>
4      225         <NA>
..     ...          ...
765  13917         <NA>
766  13938         <NA>
767  13939         <NA>
768  14055         <NA>
769  14140         <NA>

[770 rows x 2 columns]

Is this expected?
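A sketch of what happens at the pandas level when a column containing pd.NA is cast to the NumPy bool dtype. It assumes that coercing to `bool` amounts to something like `Series.astype(bool)`; that is an assumption made for illustration, not a statement about pandera's internals.

```python
import pandas as pd

# True, False and pd.NA together are inferred as object dtype.
s = pd.Series([True, False, pd.NA])

# The NumPy bool dtype has no way to represent a missing value,
# so the cast is expected to raise.
try:
    s.astype(bool)
    numpy_cast_ok = True
except (TypeError, ValueError) as exc:
    numpy_cast_ok = False
    print(f"astype(bool) failed: {exc}")

# The nullable extension dtype keeps the missing value as pd.NA.
coerced = s.astype("boolean")
print(coerced.dtype)
print(coerced.isna().tolist())
```

If the cast to the nullable dtype is what you want, the schema annotation would need to name that dtype rather than `bool`.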

irm-codebase avatar Apr 11 '25 10:04 irm-codebase

In particular I want to highlight the difference in behavior between int64s and booleans:

import pandas as pd
import pandera as pa

df = pd.DataFrame({'col': [1, pd.NA]}).astype({'col': pd.Int64Dtype()})
class Schema(pa.DataFrameModel):
    col: int = pa.Field(nullable=True)
Schema.validate(df)  # OK!

df = pd.DataFrame({'col': [True, pd.NA]}).astype({'col': pd.BooleanDtype()})
class Schema(pa.DataFrameModel):
    col: bool = pa.Field(nullable=True)
Schema.validate(df)  # Bang!

The difference seems to be due to the check implementations in dtypes.py.

Int.check() returns True because isinstance(pandas_engine.INT64(), dtypes.Int) is true.

Bool.check() falls through to _Number.check(). numpy_engine.Bool is not _Number, so that goes into DataType.check(), which returns False since numpy_engine.Bool != pandas_engine.BOOL.
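The asymmetry described above can be illustrated with a toy model. Every class below is a hypothetical stand-in named after the roles in the description; none of it is pandera's actual code, and the real dispatch goes through more layers than shown here.

```python
class DataType:
    """Hypothetical base dtype: check() passes only for an equal dtype."""

    def check(self, other: "DataType") -> bool:
        return self == other

    def __eq__(self, other: object) -> bool:
        return type(self) is type(other)

    def __hash__(self) -> int:
        return hash(type(self))


class Int(DataType):
    """Hypothetical integer family: any Int-derived dtype is accepted."""

    def check(self, other: DataType) -> bool:
        return isinstance(other, Int) or super().check(other)


class Bool(DataType):
    """Hypothetical boolean family: no override, so equality decides."""


class NumpyInt64(Int):
    """Stand-in for the dtype an `int` annotation resolves to."""


class PandasInt64(Int):
    """Stand-in for the nullable pd.Int64Dtype."""


class NumpyBool(Bool):
    """Stand-in for the dtype a `bool` annotation resolves to."""


class PandasBoolean(Bool):
    """Stand-in for the nullable pd.BooleanDtype."""


int_annotation_ok = NumpyInt64().check(PandasInt64())    # isinstance rule
bool_annotation_ok = NumpyBool().check(PandasBoolean())  # equality rule
print(int_annotation_ok, bool_annotation_ok)  # True False
```

In this model the integer check is lenient because it is subclass-based, while the boolean check is strict because it compares for equality, which mirrors the difference in outcomes reported above.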

axyb avatar May 06 '25 17:05 axyb

If you want nullable bools, use the pd.BooleanDtype type or "boolean" string alias. https://pandas.pydata.org/docs/reference/api/pandas.BooleanDtype.html#pandas.BooleanDtype

Pandera adopts the dtype semantics of the underlying dataframe library, so using bool uses the numpy boolean dtype, which cannot contain None or any null-like value.

cosmicBboy avatar May 06 '25 17:05 cosmicBboy

> Pandera adopts the dtype semantics of the underlying dataframe library, so using bool uses the numpy boolean dtype, which cannot contain None or any null-like value.

I'm not sure I understand. The following both raise an exception:

df = pd.DataFrame({'col': [True, pd.NA]}).astype({'col': bool})
df = pd.DataFrame({'col': [1, pd.NA]}).astype({'col': int})

So to me it seems:

  • For a column with nulls, Pandas does not accept bool or int; you must use the Pandas nullable dtypes
  • For a pd.Int64Dtype column, Pandera accepts an int annotation
  • For a pd.BooleanDtype column, Pandera does not accept a bool annotation

To me it seems that bools and ints are handled consistently by Pandas and inconsistently by Pandera, but maybe I'm missing something.

axyb avatar May 06 '25 17:05 axyb

> For a pd.Int64Dtype column, Pandera accepts an int annotation

This is a bug. This should raise an error.

> For a pd.BooleanDtype column, Pandera does not accept a bool annotation

This seems to be the right behavior:

df = pd.DataFrame({'col': [True, pd.NA]}).astype({'col': pd.BooleanDtype()})

class Schema(pa.DataFrameModel):
    col: bool = pa.Field(nullable=True)  # this should be of type pd.BooleanDtype

Schema.validate(df) # Bang!

cosmicBboy avatar May 06 '25 17:05 cosmicBboy