pandera
Columns containing `bool` and `None` values do not validate correctly
Describe the bug
If I create a pa.DataFrameSchema with a pa.Column(bool, nullable=True), I expect something of the form [None, True] to pass validation, but it does not.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
import pandas as pd
import pandera as pa
pa.DataFrameSchema({'x': pa.Column(bool, nullable=True)})(pd.DataFrame({'x': [True, None]}))
>>> ...
>>> SchemaError: expected series 'x' to have type bool, got object
Expected behavior
This should pass validation.
Hi @xvr-hlt, this is not a bug. Python's None is not a bool, so pandas converts that series to object dtype, which causes the schema validation to fail. Instead, you may want to use the pandas nullable boolean dtype:
pa.DataFrameSchema({"x": pa.Column(pd.BooleanDtype, nullable=True)})(
pd.DataFrame({"x": [True, pd.NA]}, dtype="boolean")
)
Edit: you can also replace pd.NA with None, because the dtype is given explicitly here and pandas converts None to pd.NA for you.
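The dtype inference described above can be checked with pandas alone. This is a minimal sketch, assuming a pandas version that ships the nullable "boolean" extension dtype; the variable names are illustrative only.

```python
import pandas as pd

# Without an explicit dtype, a mix of True and None is stored as Python objects.
inferred = pd.Series([True, None])
print(inferred.dtype)  # object

# With the nullable boolean dtype, None is converted to pd.NA.
nullable = pd.Series([True, None], dtype="boolean")
print(nullable.dtype)  # boolean
print(nullable[1] is pd.NA)  # True
```

The first series is what the original code sample passes to the schema, which is why the reported dtype is object rather than bool.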
I understand that None is not a bool, what was confusing to me is that None was invalid for a field with nullable = True.
Additionally, this behaviour is inconsistent: with a str field, None is valid input where nullable=True:
import pandas as pd
import pandera as pa
pa.DataFrameSchema({'x': pa.Column(str, nullable=True)})(pd.DataFrame({'x': ["abc", None]}))
Passes without fail.
Regarding your second point: yes, this again is due to pandas. The series in your DataFrame has dtype object. Both "abc" and None are objects, and since nullable=True, None is allowed, so the test passes.
Regarding your first point: the test doesn't fail because nullable=True "doesn't work". It fails because you specify the column to be dtype bool, but the dataframe you pass into the schema check has a column of dtype object, so the dtype check fails.
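To make the distinction concrete, here is a simplified illustration of a dtype check and a nullability check as two separate conditions. The `check_column` helper is hypothetical and is not pandera's implementation; the string case is built with an explicit `dtype=object` so the result does not depend on how a given pandas version infers string columns.

```python
import pandas as pd


def check_column(series, expected_dtype, nullable):
    """Hypothetical helper: dtype and nullability are checked separately."""
    # Dtype check: the stored dtype must match, whatever nullable says.
    if series.dtype != pd.api.types.pandas_dtype(expected_dtype):
        return f"dtype mismatch: expected {expected_dtype}, got {series.dtype}"
    # Nullability check: only relevant once the dtype is acceptable.
    if not nullable and series.isna().any():
        return "null values not allowed"
    return "ok"


# [True, None] is stored as object, so the dtype condition fails.
print(check_column(pd.Series([True, None]), "bool", nullable=True))
# An object column holding "abc" and None satisfies both conditions.
print(check_column(pd.Series(["abc", None], dtype=object), "object", nullable=True))
# The nullable boolean dtype satisfies both conditions as well.
print(check_column(pd.Series([True, None], dtype="boolean"), "boolean", nullable=True))
```

In this sketch nullable=True never rescues a column whose dtype is wrong, which mirrors the behavior reported in the issue.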
@xvr-hlt dealing with nulls with the default numpy types is a pain. I'd recommend using the pandas-native nullable dtype:
- https://pandas.pydata.org/docs/user_guide/boolean.html
- https://pandas.pydata.org/docs/reference/api/pandas.BooleanDtype.html#pandas.BooleanDtype
pandera's design choice is to delegate behavior to the underlying dataframe library. In this case it inherits the datatype behavior of pandas: a column containing both a boolean and a None value will be interpreted by pandas as having an object dtype.
@cosmicBboy I am facing a similar issue using pd.NA.
It appears that coercion does not work properly in this case:
mothballed: Series[bool] = Field(nullable=True)
"Innactive but not fully retired"
This can contain either True, False or pd.NA.
Without coercion, I get:
pandera.errors.SchemaError: expected series 'ccs_installed' to have type bool, got object
Enabling coercion results in this error:
pandera.errors.SchemaError: Error while coercing 'ccs_installed' to type bool: Could not coerce <class 'pandas.core.series.Series'> data_container into type bool:
index failure_case
0 145 <NA>
1 146 <NA>
2 148 <NA>
3 149 <NA>
4 225 <NA>
.. ... ...
765 13917 <NA>
766 13938 <NA>
767 13939 <NA>
768 14055 <NA>
769 14140 <NA>
[770 rows x 2 columns]
Is this expected?
In particular I want to highlight the difference in behavior between int64s and booleans:
df = pd.DataFrame({'col': [1, pd.NA]}).astype({'col': pd.Int64Dtype()})
class Schema(pa.DataFrameModel):
    col: int = pa.Field(nullable=True)
Schema.validate(df) # OK!
df = pd.DataFrame({'col': [True, pd.NA]}).astype({'col': pd.BooleanDtype()})
class Schema(pa.DataFrameModel):
    col: bool = pa.Field(nullable=True)
Schema.validate(df) # Bang!
The difference seems to be due to the check implementations in dtypes.py.
Int.check() returns True because isinstance(pandas_engine.INT64(), dtypes.Int) is true.
Bool.check() falls through to _Number.check(). numpy_engine.Bool is not _Number, so that goes into DataType.check(), which returns False since numpy_engine.Bool != pandas_engine.BOOL.
If you want nullable bools, use the pd.BooleanDtype type or "boolean" string alias. https://pandas.pydata.org/docs/reference/api/pandas.BooleanDtype.html#pandas.BooleanDtype
Pandera adopts the dtype semantics of the underlying dataframe library, so using bool uses the numpy boolean dtype, which cannot contain None or any null-like value.
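The difference between the two spellings can be seen in how pandas itself resolves them. This is a small sketch using the public `pd.api.types.pandas_dtype` helper; it shows pandas and NumPy behavior only and makes no claim about pandera internals.

```python
import numpy as np
import pandas as pd

# The Python type bool resolves to the numpy boolean dtype.
numpy_bool = pd.api.types.pandas_dtype(bool)
# The "boolean" string alias resolves to the pandas extension dtype.
nullable_bool = pd.api.types.pandas_dtype("boolean")

print(numpy_bool)  # bool
print(isinstance(nullable_bool, pd.BooleanDtype))  # True

# A numpy bool array cannot represent a missing value, so NumPy falls back to object.
print(np.array([True, None]).dtype)  # object
```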
Pandera adopts the dtype semantics of the underlying dataframe library, so using bool uses the numpy boolean dtype, which cannot contain None or any null-like value.
I'm not sure I understand. The following both raise an exception:
df = pd.DataFrame({'col': [True, pd.NA]}).astype({'col': bool})
df = pd.DataFrame({'col': [1, pd.NA]}).astype({'col': int})
So to me it seems:
- For a column with nulls, Pandas does not accept bool or int, you must use Pandas dtypes
- For a pd.Int64Dtype column, Pandera accepts an int annotation
- For a pd.BooleanDtype column, Pandera does not accept a bool annotation
To me it seems that bools and ints are handled consistently by Pandas and inconsistently by Pandera, but maybe I'm missing something.
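The pandas side of this comparison can be reproduced without pandera. This is a minimal sketch; `cast_fails` is an illustrative helper, and it catches both TypeError and ValueError as an assumption, since the exact exception type may differ between pandas versions.

```python
import pandas as pd

bools = pd.Series([True, pd.NA])
ints = pd.Series([1, pd.NA])


def cast_fails(series, dtype):
    """Return True if series.astype(dtype) raises (illustrative helper)."""
    try:
        series.astype(dtype)
    except (TypeError, ValueError):
        return True
    return False


# The numpy-backed dtypes cannot hold pd.NA, so both casts raise.
print(cast_fails(bools, bool))  # True
print(cast_fails(ints, int))  # True

# The pandas nullable dtypes accept pd.NA.
print(cast_fails(bools, "boolean"))  # False
print(cast_fails(ints, "Int64"))  # False
```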
For a pd.Int64Dtype column, Pandera accepts an int annotation
This is a bug. This should raise an error
For a pd.BooleanDtype column, Pandera does not accept a bool annotation
This seems to be the right behavior:
df = pd.DataFrame({'col': [True, pd.NA]}).astype({'col': pd.BooleanDtype()})
class Schema(pa.DataFrameModel):
    col: bool = pa.Field(nullable=True) # this should be of type pd.BooleanDtype
Schema.validate(df) # Bang!