feature(pandas): Support string column validation for pandas 2.1.3
Is your feature request related to a problem? Please describe.
- In pandas 2.1.2, `Series.map` returns a Series with its dtype changed to `object`, even if the Series is empty.
- In pandas 2.1.3, this no longer holds: the returned dtype is left unchanged, which breaks the bitwise comparison of Series in `NpString.check`.
https://github.com/unionai-oss/pandera/blob/4425ad8012342960c98f673206a4149ce4cd22dc/pandera/engines/pandas_engine.py#L721
Originally, the bitwise OR at the line linked above always returned a Series with dtype bool, so `.all()` could then be applied to it during column validation.
Describe the solution you'd like
Always cast to dtype bool before the bitwise comparison, which is what the variable name `is_python_string` already implies.
     def check(
         self,
         pandera_dtype: dtypes.DataType,
         data_container: Optional[PandasObject] = None,
     ) -> Union[bool, Iterable[bool]]:
         if data_container is None:
             return isinstance(pandera_dtype, (numpy_engine.Object, type(self)))
         # NOTE: this is a hack to handle the following case:
         # pyspark.pandas doesn't support types with a Series of type object
         if type(data_container).__module__.startswith("pyspark.pandas"):
             is_python_string = data_container.map(lambda x: str(type(x))).isin(  # type: ignore[operator]
                 ["<class 'str'>", "<class 'numpy.str_'>"]
             )
         else:
             is_python_string = data_container.map(lambda x: isinstance(x, str))  # type: ignore[operator]
    -    return is_python_string | data_container.isna()
    +    return is_python_string.astype(bool) | data_container.isna()
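For illustration, here is a minimal standalone sketch of the proposed cast, outside of pandera. The helper name `is_string_or_null` is hypothetical and is not part of pandera. The dtype that `Series.map` returns for an empty Series depends on the pandas version, so the sketch only relies on the explicit `astype(bool)`.

```python
import pandas as pd


def is_string_or_null(data: pd.Series) -> pd.Series:
    """Hypothetical helper mirroring the proposed fix (not pandera's API)."""
    is_python_string = data.map(lambda x: isinstance(x, str))
    # On an empty Series the mapped result may keep a non-bool dtype,
    # depending on the pandas version, so cast before the bitwise OR.
    return is_python_string.astype(bool) | data.isna()


# An empty string-typed Series, the case described in this issue.
empty = pd.Series([], dtype="string")
# A mixed object Series: a str, a missing value, and a non-str value.
mixed = pd.Series(["a", None, 1], dtype=object)

empty_mask = is_string_or_null(empty)
mixed_mask = is_string_or_null(mixed)

print(empty_mask.dtype, bool(empty_mask.all()))
print(mixed_mask.tolist())
```

With the cast in place, the mask should be a bool Series in both cases, so `.all()` can be used on it regardless of what `map` returned for the empty input.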
Describe alternatives you've considered
Fix the breaking change in pandas.
Additional context
Screenshots (images not reproduced here):
- Pandas 2.1.2: `Series.map` for an empty Series; bitwise comparison of dtype=object and dtype=bool.
- Pandas 2.1.3: `Series.map` for an empty Series; bitwise comparison of dtype=string and dtype=bool.
@karlma821 please feel free to make a PR for this!