Non-informative message in case of categorical data
Describe the bug
If the categories do not coincide between the schema and the data frame, the resulting error message is confusing for the end user. Unfortunately, the cause of the error is almost impossible to trace.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
import pandera
import pandas as pd

schema = pandera.DataFrameSchema(
    {
        "some_column": pandera.Column(
            dtype=pd.CategoricalDtype(
                categories=['A', 'B']
            )
        )
    }
)

df = pd.DataFrame(
    {
        # NOTE ([email protected], 2022-08-31):
        # The category 'C' is not according to the schema.
        "some_column": pd.Series(
            ['C', 'C', 'A', 'B', 'A'],
            dtype=pd.CategoricalDtype(['A', 'B', 'C'])
        )
    }
)

schema.validate(df)
This gives the following error:
Traceback (most recent call last):
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 518, in validate
    return self._validate(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 716, in _validate
    error_handler.collect_error("schema_component_check", err)
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 708, in _validate
    result = schema_component(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 2074, in __call__
    return self.validate(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schema_components.py", line 215, in validate
    validate_column(check_obj, column_name)
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schema_components.py", line 188, in validate_column
    super(Column, copy(self).set_name(column_name)).validate(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 2007, in validate
    error_handler.collect_error(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'some_column' to have type category, got category
Expected behavior
We would have expected to see the categories listed, or at least some hint that the categories do not match.
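To illustrate the gap, here is a minimal pandas-only sketch. It assumes the error text is built from the plain string form of each dtype, which would explain "category, got category". The `describe` helper and the resulting message format are hypothetical, not pandera code.

```python
import pandas as pd

expected = pd.CategoricalDtype(categories=["A", "B"])
actual = pd.CategoricalDtype(categories=["A", "B", "C"])

# The two dtypes really are different, so a failing dtype check is correct ...
dtypes_match = expected == actual

# ... but both render as the same short name, which hides the difference.
expected_str = str(expected)
actual_str = str(actual)


def describe(dtype: pd.CategoricalDtype) -> str:
    """Hypothetical helper: spell out the categories and the ordering."""
    return f"category(categories={list(dtype.categories)}, ordered={dtype.ordered})"


# One possible shape for a more informative message (an assumption, not pandera output).
message = (
    f"expected series 'some_column' to have type {describe(expected)}, "
    f"got {describe(actual)}"
)
print(message)
```

With a message along these lines, the extra category 'C' would be visible directly in the error.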
Please let us know if we should create a pull request to fix this issue, and how you would like to inform the user. Thanks a lot in advance for looking into this!
I think you are defining your schema incorrectly.
This is how it should look:
schema = pandera.DataFrameSchema(
    {
        "some_column": pandera.Column(
            dtype=pd.CategoricalDtype,
            checks=pandera.Check.isin(['A', 'B'])
        )
    }
)
See here about using types and not instances for your column types.
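For context, a rough pandas-only sketch of what a value-level membership test catches compared with a comparison of the declared categories. It deliberately avoids pandera so that no library behavior is assumed. The variable names are illustrative only.

```python
import pandas as pd

series = pd.Series(
    ["C", "C", "A", "B", "A"],
    dtype=pd.CategoricalDtype(["A", "B", "C"]),
)
allowed = ["A", "B"]

# Value-level test: which rows hold a value outside the allowed set?
value_ok = series.isin(allowed)
offending_values = series[~value_ok].unique().tolist()

# Category-level test: which declared categories are not allowed,
# regardless of whether any row actually uses them?
extra_categories = sorted(set(series.cat.categories) - set(allowed))

print("offending values:", offending_values)
print("extra categories:", extra_categories)
```

The two tests are not equivalent: a series that declares 'C' as a category but never uses it would pass the value-level test while still carrying a different dtype than `CategoricalDtype(['A', 'B'])`.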
@KiaXdice yeah the str representation of the Categorical dtype should be updated. The __str__ method in the pandas_engine.Category class can override the default implementation here: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/pandas_engine.py#L568
do you have capacity to make this change ^^ @KiaXdice ?
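A minimal sketch of the kind of `__str__` override suggested above, written against a stand-in class rather than the real `pandas_engine.Category`. The class name `CategorySketch`, its fields, and the output format are assumptions for illustration; the actual pandera class may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass(frozen=True)
class CategorySketch:
    """Hypothetical stand-in for a categorical dtype wrapper; not pandera code."""

    categories: Optional[Sequence[Any]] = None
    ordered: bool = False

    def __str__(self) -> str:
        # Keep the short name when no categories were declared.
        if self.categories is None:
            return "category"
        # Otherwise include the details a user needs to spot a mismatch.
        return (
            f"category(categories={list(self.categories)}, "
            f"ordered={self.ordered})"
        )


expected = CategorySketch(categories=["A", "B"])
actual = CategorySketch(categories=["A", "B", "C"])

error_message = (
    f"expected series 'some_column' to have type {expected}, got {actual}"
)
print(error_message)
```

Falling back to the bare name when `categories` is `None` keeps the existing message unchanged for schemas that only ask for "some categorical dtype".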