Non-informative message in case of categorical data

Open KiaXdice opened this issue 3 years ago • 1 comments

Describe the bug

If the categories in the schema and in the data frame do not match, the resulting error message is confusing for the end user. Unfortunately, the message makes the cause of the error almost impossible to trace.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [ ] (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import pandera
import pandas as pd

schema = pandera.DataFrameSchema(
    {
        "some_column": pandera.Column(
            dtype=pd.CategoricalDtype(categories=['A', 'B'])
        )
    }
)

df = pd.DataFrame(
    {
        # NOTE ([email protected], 2022-08-31):
        # The category 'C' is not according to the schema.
        "some_column": pd.Series(
            ['C', 'C', 'A', 'B', 'A'],
            dtype=pd.CategoricalDtype(['A', 'B', 'C'])
        )
    }
)

schema.validate(df)

This gives the following error:

Traceback (most recent call last):
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 518, in validate
    return self._validate(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 716, in _validate
    error_handler.collect_error("schema_component_check", err)
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 708, in _validate
    result = schema_component(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 2074, in __call__
    return self.validate(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schema_components.py", line 215, in validate
    validate_column(check_obj, column_name)
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schema_components.py", line 188, in validate_column
    super(Column, copy(self).set_name(column_name)).validate(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\schemas.py", line 2007, in validate
    error_handler.collect_error(
  File "C:\Users\Kiavash\workspace\ims-inno-projects\MaintAIn\venv\lib\site-packages\pandera\error_handlers.py", line 32, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected series 'some_column' to have type category, got category

Expected behavior

We would have expected to see the categories listed, or at least some hint that the categories do not match.
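As a rough illustration of the kind of hint meant here, the sketch below compares two lists of categories and builds a message that names the differences. The helper name `describe_category_mismatch` is hypothetical and not part of pandera; it also ignores category order and the `ordered` flag, which a real fix would probably need to consider.

```python
def describe_category_mismatch(expected, actual):
    """Return a readable hint, or None when both hold the same categories.

    Hypothetical helper for illustration only; not a pandera function.
    """
    expected, actual = list(expected), list(actual)
    # Categories the schema asks for that the data does not declare.
    missing = [c for c in expected if c not in actual]
    # Categories the data declares that the schema does not allow.
    unexpected = [c for c in actual if c not in expected]
    if not missing and not unexpected:
        return None
    return (
        f"expected categories {expected}, got {actual} "
        f"(unexpected: {unexpected}, missing: {missing})"
    )


# Same categories as in the code sample above.
hint = describe_category_mismatch(["A", "B"], ["A", "B", "C"])
print(hint)
```

With pandas, the two inputs could be taken from the `categories` attribute of the schema's `pd.CategoricalDtype` and from `df["some_column"].cat.categories`.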

KiaXdice avatar Aug 31 '22 14:08 KiaXdice

Please let us know if we should create a pull request to fix this issue, and how you would like to inform the user. Thanks a lot in advance for looking into this!

KiaXdice avatar Aug 31 '22 14:08 KiaXdice

I think you are defining your schema incorrectly.

This is how it should look:

schema = pandera.DataFrameSchema(
    {
        "some_column": pandera.Column(
            dtype=pd.CategoricalDtype,
            checks=pandera.Check.isin(['A','B'])
        )
    }
)

See here about using types and not instances for your column types.

abyz0123 avatar Nov 17 '22 14:11 abyz0123

@KiaXdice yeah the str representation of the Categorical dtype should be updated. The __str__ method in the pandas_engine.Category class can override the default implementation here: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/pandas_engine.py#L568
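A minimal sketch of what such a `__str__` override could look like, using a hypothetical stand-in class rather than pandera's actual `pandas_engine.Category` (the field names `categories` and `ordered` and the output format are assumptions made for illustration):

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CategorySketch:
    """Hypothetical stand-in for a categorical dtype wrapper (not pandera's class)."""

    categories: Optional[Tuple[str, ...]] = None
    ordered: bool = False

    def __str__(self) -> str:
        # Without explicit categories, fall back to the plain dtype name.
        if self.categories is None:
            return "category"
        # Otherwise include the categories so that mismatches become visible.
        return (
            f"category(categories={list(self.categories)}, "
            f"ordered={self.ordered})"
        )


expected = CategorySketch(categories=("A", "B"))
actual = CategorySketch(categories=("A", "B", "C"))
message = f"expected series 'some_column' to have type {expected}, got {actual}"
print(message)
```

With a representation along these lines, the two sides of the error message would no longer both read as `category`.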

cosmicBboy avatar Nov 17 '22 16:11 cosmicBboy

do you have capacity to make this change ^^ @KiaXdice ?

cosmicBboy avatar Nov 17 '22 16:11 cosmicBboy