pandera Index of type category fails on validation

Validation of an index with dtype "category" fails starting with pandera version 0.8.0.

Minimal reproducible example

import pandas as pd
import pandera as pa


class Schema(pa.SchemaModel):
    categorical_field: pa.typing.Index[pa.Category]


df = (
    pd.DataFrame({"categorical_field": ["a", "b", "c"]})
    .astype({"categorical_field": "category"})
    .set_index("categorical_field")
)
Schema.validate(df)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/env/lib/python3.8/site-packages/pandera/model.py", line 256, in validate
    cls.to_schema().validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 513, in validate
    return self._validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 709, in _validate
    error_handler.collect_error("schema_component_check", err)
  File "/env/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 701, in _validate
    result = schema_component(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 2043, in __call__
    return self.validate(
  File "/env/lib/python3.8/site-packages/pandera/schema_components.py", line 390, in validate
    super().validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 1976, in validate
    error_handler.collect_error(
  File "/env/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'categorical_field' to have type category, got object

Where it fails

Here https://github.com/pandera-dev/pandera/blob/9a463e1757e2811bbfee4684562541a5f2110cc3/pandera/schema_components.py#L385-L387 the index gets converted to a NumPy array. A Categorical is not a NumPy array, so the conversion drops the category dtype (the values come back as object, as the error message shows) and validation fails.

Removing the NumPy conversion lets the validation pass, but I do not know what else it would or could influence.
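
For illustration, a minimal sketch of the dtype loss described above. It uses only pandas and does not go through pandera's actual code path; the variable names are made up for this example.

```python
import pandas as pd

# Hypothetical stand-in for the index from the reproducible example:
# a categorical index with three string labels.
index = pd.CategoricalIndex(["a", "b", "c"], name="categorical_field")

# NumPy has no categorical dtype, so converting the index to a NumPy
# array materializes the labels as plain Python objects.
arr = index.to_numpy()

print(index.dtype)  # category
print(arr.dtype)    # object
```

Any dtype check that runs on `arr` instead of `index` would therefore see object rather than category, which matches the SchemaError above.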

Apr 26 '22 10:04 aartaria

Thanks for reporting this @aartaria, this is definitely a bug!

I don't exactly remember why that to_numpy call is there. Can you see which unit tests fail if you remove it? I suspect it's there to support the pandas-like frameworks (pyspark.pandas, modin, or dask), but ideally it wouldn't need to be called.

May 06 '22 13:05 cosmicBboy

https://github.com/unionai-oss/pandera/pull/856 apparently fixed this, but I'm going to keep this open, since #856 didn't add unit tests for the changes.

Aug 06 '22 21:08 cosmicBboy
