
feature(pandas): Support string column validation for pandas 2.1.3

Open • karlma821 opened this issue 2 years ago • 1 comment

Is your feature request related to a problem? Please describe.

  • In pandas 2.1.2, Series.map returns a Series whose dtype is changed to object, even when the Series is empty.
  • In pandas 2.1.3, this no longer holds: the dtype of the returned Series is left unchanged, which breaks the bitwise comparison of Series in NpString.check.
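
A minimal sketch for checking this on a given pandas installation. It assumes the empty Series has dtype "string", as the screenshot captions further down suggest. The variable names are illustrative and not taken from pandera. The printed dtype depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd

# Assumption: an empty Series with the pandas "string" extension dtype.
empty = pd.Series([], dtype="string")

# The same kind of element-wise check that NpString.check performs.
mapped = empty.map(lambda x: isinstance(x, str))

# The dtype printed here is version-dependent.
print(pd.__version__, mapped.dtype)
```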

https://github.com/unionai-oss/pandera/blob/4425ad8012342960c98f673206a4149ce4cd22dc/pandera/engines/pandas_engine.py#L721

Originally, the bitwise OR above always returned a Series with dtype bool, so .all() could be called on the result during column validation.

Describe the solution you'd like

Always cast to dtype bool before the bitwise comparison, as the variable name is_python_string suggests.

    def check(
        self,
        pandera_dtype: dtypes.DataType,
        data_container: Optional[PandasObject] = None,
    ) -> Union[bool, Iterable[bool]]:
        if data_container is None:
            return isinstance(pandera_dtype, (numpy_engine.Object, type(self)))

        # NOTE: this is a hack to handle the following case:
        # pyspark.pandas doesn't support types with a Series of type object
        if type(data_container).__module__.startswith("pyspark.pandas"):
            is_python_string = data_container.map(lambda x: str(type(x))).isin(  # type: ignore[operator]
                ["<class 'str'>", "<class 'numpy.str_'>"]
            )
        else:
            is_python_string = data_container.map(lambda x: isinstance(x, str))  # type: ignore[operator]
-       return is_python_string | data_container.isna()
+       return is_python_string.astype(bool) | data_container.isna()
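
The proposed cast can be tried in isolation with a small standalone sketch. The helper name `is_string_or_null` is hypothetical and not part of pandera. It only mirrors the non-pyspark branch of the method above and assumes plain object-dtype input.

```python
import pandas as pd


def is_string_or_null(data: pd.Series) -> pd.Series:
    """Hypothetical helper mirroring the non-pyspark branch of NpString.check."""
    is_python_string = data.map(lambda x: isinstance(x, str))
    # Cast explicitly so the bitwise OR always operates on two bool Series,
    # whatever dtype Series.map returned.
    return is_python_string.astype(bool) | data.isna()


empty_result = is_string_or_null(pd.Series([], dtype=object))
mixed_result = is_string_or_null(pd.Series(["a", None, 1], dtype=object))

print(empty_result.dtype, bool(empty_result.all()))
print(mixed_result.tolist())
```

With the explicit cast, the result has dtype bool for the empty Series as well, so calling .all() on it is well defined.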

Describe alternatives you've considered

Fix the breaking change in pandas.

Additional context

Pandas 2.1.2

Series.map for empty Series

Screenshot 2023-11-15 at 12 54 43 PM

Bitwise comparison of dtype=object and dtype=bool

Screenshot 2023-11-15 at 1 30 48 PM

Pandas 2.1.3

Series.map for empty Series

Screenshot 2023-11-15 at 12 59 26 PM

Bitwise comparison of dtype=string and dtype=bool

Screenshot 2023-11-15 at 12 33 07 PM

karlma821, Nov 15 '23 05:11

@karlma821 please feel free to make a PR for this!

cosmicBboy, Nov 15 '23 15:11