pandera Poor interaction between categorical dtype, coerce=True and nullable=True

Describe the bug (I'm not exactly sure if this is better placed as a bug, enhancement or question, it's kind of a mix)

I think this is easiest explained by example, but there is an issue where categorical dtypes with a specified category set silently cast fields not in this set to nan values - in the second case below, apple is misspelt and gets silently cast to nan:

import pandera as pa
import pandas as pd

class Test(pa.SchemaModel):
   foo : pa.typing.Series[pd.CategoricalDtype] = pa.Field(dtype_kwargs={"categories":["apple", "banana"]}, nullable=True, coerce=True)

srs = pd.DataFrame({"foo": ["apple", "banana", pd.NA]})
print(pa.__version__)
print(Test.validate(srs))

srs2 = pd.DataFrame({"foo": ["appppple", "banana", pd.NA]})
print(Test.validate(srs2))
#########################
0.9.0
      foo
0   apple
1  banana
2     NaN
      foo
0     NaN
1  banana
2     NaN

Expected behavior

Ideally, I would expect the categories to be checked before coercion accepting nullables, but my understanding is that the coercion always occurs first in pandera.

It might be more realistic then to expect an error/ warning in this case - I can't see when this silent dropping to nan would be a desirable outcome.

Additional context / alternatives

There are a couple of obvious alternatives to this (which might be why this is better suited as a question than a bug):

Don't use coerce=True - this doesn't really work in my use case as I'm using pandera to validate raw / external csv data and type fields to a discrete list of values with categories, so coerce is really valuable
Don't use nullable=True - This would be awkward as I have genuine nulls in my data, I would need to preprocess and replace them to some non null value, validate with pandera and then update all the categories to cast the fake nulls back to nulls
Don't use categorical fields, use object fields with isin checks - This is the partial solution I have gone with for now - for data with nulls I'm using pa.STRING, and those without nulls using categories. This works for the most part so long as I never accidentally pass nullable=True to something annotated as a categorical.

Feb 27 '22 02:02 m-richards

hey @m-richards, since pandera uses the casting logic of pandas, the casting of unrecognized values in a categorical dtype to nulls is expected behavior (as a pandas user):

import pandas as pd


print(
    pd.Series(["apple", "bananas", "foo"])
    .astype(pd.CategoricalDtype(["apple", "bananas"]))
)
# 0      apple
# 1    bananas
# 2        NaN
# dtype: category
# Categories (2, object): ['apple', 'bananas']

I think pandera can be more useful here with a warning, but I think this behavior is fairly reasonable given that we're delegating casting behavior to pandas.

Given your requirements, using a string or object datatype with an isin check would be the most appropriate.

If you're open to making a contribution, another possibility is to extend pandera's Category dtype to have some kind of strict kwarg that makes raises a ParserError if there are non-null values that are not in the valid set of categories. You'd have to implement the coerce method (see here) to catch those cases.

Feb 27 '22 04:02 cosmicBboy

Hey @cosmicBboy that's basically what I expected to hear - that delgating to pandas is as intended.

I will have a look at adding a strict kwarg - that sounds like it fits my use case quite well ( I left it out above to keep it simpler, but my use case doesn't contain nulls, and I was looking for a way to protect another person I'm working with from using nulls from having records dropped). We have a hacky local fix at the moment, but figured an in library solution would be better.

Feb 28 '22 08:02 m-richards

cool, let me know if you have any questions around adding the strict kwarg!

Mar 01 '22 01:03 cosmicBboy