Index of type category fails on validation

Open aartaria opened this issue 2 years ago • 2 comments

Validation for an index of type "category" fails starting from version 0.8.0

Minimal reproducible example

import pandas as pd
import pandera as pa


class Schema(pa.SchemaModel):
    categorical_field: pa.typing.Index[pa.Category]


df = (
    pd.DataFrame({"categorical_field": ["a", "b", "c"]})
    .astype({"categorical_field": "category"})
    .set_index("categorical_field")
)
Schema.validate(df)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/env/lib/python3.8/site-packages/pandera/model.py", line 256, in validate
    cls.to_schema().validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 513, in validate
    return self._validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 709, in _validate
    error_handler.collect_error("schema_component_check", err)
  File "/env/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 701, in _validate
    result = schema_component(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 2043, in __call__
    return self.validate(
  File "/env/lib/python3.8/site-packages/pandera/schema_components.py", line 390, in validate
    super().validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 1976, in validate
    error_handler.collect_error(
  File "/env/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'categorical_field' to have type category, got object

Where it fails

Here https://github.com/pandera-dev/pandera/blob/9a463e1757e2811bbfee4684562541a5f2110cc3/pandera/schema_components.py#L385-L387 the index gets converted to a NumPy array, but a Categorical is not a NumPy array, so the categorical dtype is lost and validation fails.

Removing the NumPy conversion lets the validation pass, but I do not know what else it would or could influence.
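
The dtype loss described here can be reproduced with pandas alone. The snippet below is only a sketch of the effect, not pandera's actual code: it assumes the validation path wraps the index values in a Series, and the variable names `kept` and `lost` are made up for illustration.

```python
import pandas as pd

# Build the same categorical index as in the reproducible example.
index = pd.CategoricalIndex(["a", "b", "c"], name="categorical_field")

# Wrapping the index directly in a Series keeps the categorical dtype.
kept = pd.Series(index, name=index.name)

# Converting with to_numpy() first materializes the values as a plain
# NumPy array, so the resulting Series is no longer categorical.
# (Assumption: this mirrors what happens before the dtype check.)
lost = pd.Series(index.to_numpy(), name=index.name)

print("without to_numpy:", kept.dtype)
print("with to_numpy:   ", lost.dtype)
```

Since NumPy has no categorical dtype, any check for "category" on the converted values cannot succeed, which is consistent with the SchemaError in the traceback.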

aartaria • Apr 26 '22 10:04

Thanks for reporting this, @aartaria. This is definitely a bug!

I don't exactly remember why that to_numpy call is there. Can you see which unit tests fail if you remove it? I suspect it's there to support the pandas-like frameworks (pyspark.pandas, modin, or dask), but ideally it wouldn't need to be called.

cosmicBboy • May 06 '22 13:05

https://github.com/unionai-oss/pandera/pull/856 apparently fixed this, but I'm going to keep this open since #856 didn't add unit tests for the changes.

cosmicBboy • Aug 06 '22 21:08