pandera
Index of type category fails on validation
Validation for an index of type "category" fails starting from version 0.8.0
Minimal reproducible example
import pandas as pd
import pandera as pa


class Schema(pa.SchemaModel):
    categorical_field: pa.typing.Index[pa.Category]


df = (
    pd.DataFrame({"categorical_field": ["a", "b", "c"]})
    .astype({"categorical_field": "category"})
    .set_index("categorical_field")
)
Schema.validate(df)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/env/lib/python3.8/site-packages/pandera/model.py", line 256, in validate
    cls.to_schema().validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 513, in validate
    return self._validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 709, in _validate
    error_handler.collect_error("schema_component_check", err)
  File "/env/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 701, in _validate
    result = schema_component(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 2043, in __call__
    return self.validate(
  File "/env/lib/python3.8/site-packages/pandera/schema_components.py", line 390, in validate
    super().validate(
  File "/env/lib/python3.8/site-packages/pandera/schemas.py", line 1976, in validate
    error_handler.collect_error(
  File "/env/lib/python3.8/site-packages/pandera/error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'categorical_field' to have type category, got object
Where it fails
Here https://github.com/pandera-dev/pandera/blob/9a463e1757e2811bbfee4684562541a5f2110cc3/pandera/schema_components.py#L385-L387 the index gets converted to a NumPy array. The conversion drops the categorical dtype (the error message reports object instead of category), so validation fails.
Removing the NumPy conversion lets the validation pass, but I do not know what else it would or could influence.
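To illustrate the dtype loss outside of pandera, here is a minimal pandas-only sketch. It assumes the validation path wraps the converted index values in a Series before the dtype check; that wrapping step is my reading of the linked lines, not something confirmed here. The variable names are only for illustration.

```python
import pandas as pd

# Build the same categorical index as in the reproducible example.
index = (
    pd.DataFrame({"categorical_field": ["a", "b", "c"]})
    .astype({"categorical_field": "category"})
    .set_index("categorical_field")
    .index
)

# The index itself carries the categorical dtype.
index_dtype = index.dtype

# Assumed equivalent of the failing path: convert to NumPy, then wrap in a Series.
# NumPy has no categorical dtype, so the categorical information is lost here.
series_from_numpy = pd.Series(index.to_numpy(), name=index.name)

# Wrapping the index directly, without the NumPy conversion, keeps the dtype.
series_from_index = pd.Series(index, name=index.name)

print(isinstance(index_dtype, pd.CategoricalDtype))
print(isinstance(series_from_numpy.dtype, pd.CategoricalDtype))
print(isinstance(series_from_index.dtype, pd.CategoricalDtype))
```

Only the first and third checks should report a categorical dtype, which matches the observation that dropping the conversion lets validation pass.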
Thanks for reporting this @aartaria, this is definitely a bug!
I don't exactly remember now why that to_numpy call is there. Can you see which unit tests fail if you remove it? I suspect it's there to support the pandas-like frameworks (pyspark.pandas, modin, or dask), but ideally it wouldn't need to be called.
https://github.com/unionai-oss/pandera/pull/856 apparently fixed this, but I'm going to keep this issue open, since #856 didn't add unit tests for the changes.