Does pandera support validating enum.Enum or subclasses of it?

Open cosmicBboy opened this issue 3 years ago • 11 comments

Discussed in https://github.com/unionai-oss/pandera/discussions/907

Originally posted by davidandreoletti August 8, 2022 Assuming a pydantic-like class declaration with:

class SizeEnum(enum.Enum):
    BIG = "big"
    SMALL = "small"

class SummaryDFSchema(pandera...):
    size : pandera.Series[SizeEnum]
    name : ...

Currently pandera fails (it raises an exception) because it seems pandas does not recognize the Enum as a registered custom dtype.

What methods/workaround could be used to let pandera enforce/check the column contains SizeEnum types (rather than one of its string values such as "big")?

cosmicBboy avatar Aug 11 '22 14:08 cosmicBboy

The DataTypes extension api would allow support for Enums, see: https://pandera.readthedocs.io/en/stable/dtypes.html

One of the main design choices here would be: how to represent enums? There could be several options to represent these in the underlying dataframe:

  • object type
  • categorical type
  • string type

From experience, there may be some issues using the actual Enum as an object type when it comes to certain operations, tho I haven't tested it out in a while.
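To make the three options concrete, here is a minimal pandas-only sketch of what each representation could look like. It reuses the SizeEnum from the original post; the variable names are made up for illustration, and nothing here is pandera's implementation.

```python
from enum import Enum

import pandas as pd


class SizeEnum(Enum):
    BIG = "big"
    SMALL = "small"


members = [SizeEnum.BIG, SizeEnum.SMALL, SizeEnum.BIG]
values = [m.value for m in members]

# Option 1, object dtype: the column stores the Enum members themselves.
as_object = pd.Series(members, dtype=object)

# Option 2, categorical dtype: the column stores the Enum *values*,
# with the set of categories taken from the Enum definition.
size_dtype = pd.CategoricalDtype(categories=[m.value for m in SizeEnum])
as_categorical = pd.Series(values, dtype=size_dtype)

# Option 3, string dtype: the column stores the values as pandas strings,
# with no built-in restriction on which strings are allowed.
as_string = pd.Series(values, dtype="string")

print(as_object.dtype, as_categorical.dtype, as_string.dtype)
```

Only the categorical option carries the list of allowed values in the dtype itself; the other two would need a separate check for that.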

Open to ideas and perhaps a PR @davidandreoletti?

cosmicBboy avatar Aug 11 '22 14:08 cosmicBboy

This is a cool idea. It's something I've added in for a specific use case of mine, though I admit all my Enum subclasses are strings so I'm not hitting on any obscure cases. Gotta be a better way to do this than what I hacked together, but I added a step to SchemaModel.__init_subclass__ to update fields with Enum annotations to have categorical types with the categories defined by the Enum values.

One argument for using categorical type is that it can handle data types other than strings:

This one is the first example in the enum docs:

from enum import Enum
import pandas as pd
from pandera import SchemaModel
from pandera.typing import Series

class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3


class MySchema(SchemaModel):
    color: Series[Color]


df = pd.DataFrame({"color": [1, 2, 3]})
MySchema.validate(df)
  color
0     1
1     2
2     3
df = pd.DataFrame({"color": [1, 2, 3, 4]})
MySchema.validate(df)
...
pandera.errors.SchemaError: Error while coercing 'color' to type category: Could not coerce <class 'pandas.core.series.Series'> data_container into type category:
   index  failure_case
0      3             4
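For readers without the __init_subclass__ hack at hand, the underlying idea can be sketched with pandas alone: build a categorical dtype from the Enum values and look for entries that do not survive the cast. This is an illustration under that assumption, not the code used above and not pandera's coercion logic; color_dtype and failure_cases are names chosen for the example.

```python
from enum import Enum

import pandas as pd


class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3


# Categories come straight from the Enum values.
color_dtype = pd.CategoricalDtype(categories=[m.value for m in Color])

raw = pd.Series([1, 2, 3, 4])

# Casting to the categorical dtype turns values outside the categories into NaN.
coerced = raw.astype(color_dtype)

# Entries that were present in the raw data but are missing after the cast
# are the ones that do not correspond to any Enum value.
failure_cases = raw[coerced.isna() & raw.notna()]
print(failure_cases)
```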

That said, not sure what to do about Enum subclasses that have values that are not scalars:

class Planet(Enum):
    MERCURY = (3.303e+23, 2.4397e6)
    VENUS   = (4.869e+24, 6.0518e6)
    EARTH   = (5.976e+24, 6.37814e6)
    MARS    = (6.421e+23, 3.3972e6)
    JUPITER = (1.9e+27,   7.1492e7)
    SATURN  = (5.688e+26, 6.0268e7)
    URANUS  = (8.686e+25, 2.5559e7)
    NEPTUNE = (1.024e+26, 2.4746e7)
    def __init__(self, mass, radius):
        self.mass = mass       # in kilograms
        self.radius = radius   # in meters

In this case, the expected values in the series would be (float, float) tuples that correspond to the values of Planet members. Maybe that is ok.
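As a quick standard-library check of that case: tuple values are hashable, so they can be tested for membership and looked up by value like any scalar. The sketch below trims Planet to two members to stay short; allowed and cell are illustrative names.

```python
from enum import Enum


class Planet(Enum):
    MERCURY = (3.303e+23, 2.4397e6)
    EARTH = (5.976e+24, 6.37814e6)

    def __init__(self, mass, radius):
        self.mass = mass      # in kilograms
        self.radius = radius  # in meters


# The set of tuples a series cell would be allowed to contain.
allowed = {m.value for m in Planet}

cell = (5.976e+24, 6.37814e6)
is_valid = cell in allowed

# Looking a member up by its tuple value works the same as for scalars.
member = Planet(cell)
print(is_valid, member.name, member.mass)
```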

It seems an important question regards whether the expected series values are instances of the Enum (i.e. Color.RED) or the Enum values (1). I would think the values, but thoughts on that?
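The distinction matters because a plain Enum member and its value are different objects that do not compare equal. A small standard-library sketch, where to_member is a hypothetical helper written for this example:

```python
from enum import Enum


class Color(Enum):
    RED = 1
    GREEN = 2
    BLUE = 3


raw_values = [1, 2, 3]

# value -> member: calling the Enum class looks a member up by its value.
as_members = [Color(v) for v in raw_values]

# member -> value
back_to_values = [m.value for m in as_members]

# A plain Enum member is not equal to its value, so a series of members
# and a series of values would need different checks.
member_equals_value = Color.RED == 1  # False


def to_member(value):
    """Return the matching Color member, or None if the value is unknown."""
    try:
        return Color(value)
    except ValueError:
        return None
```

Enums that mix in a base type, such as class TestEnum(str, enum.Enum), behave differently here: their members do compare equal to the underlying value.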

the-matt-morris avatar Aug 12 '22 14:08 the-matt-morris

It seems an important question regards whether the expected series values are instances of the Enum (i.e. Color.RED) or the Enum values (1). I would think the values, but thoughts on that?

@davidandreoletti what do you think?

cosmicBboy avatar Aug 12 '22 15:08 cosmicBboy

For string-type enums, I've found a fairly elegant solution is to pass them directly to pandera.Field as the isin= argument. I don't know if this would work for non-hashable python objects.

import enum
import pandera
import pandas as pd
from pandera.typing import Series

class TestEnum(str, enum.Enum):
    CLASS_1 = "class 1"
    CLASS_2 = "class 2"

class Schema(pandera.SchemaModel):
    class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum)

dantheand avatar Aug 17 '22 00:08 dantheand

@cosmicBboy I think @dantheand's solution is elegant and supports most data types.

Perhaps class_col: Series[TestEnum] should be syntactic sugar for class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum)?

davidandreoletti avatar Aug 17 '22 06:08 davidandreoletti

Perhaps class_col: Series[TestEnum] should be syntactic sugar for class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum)?

So data types and checks are intentionally separated concerns in pandera... the reason being that data types have the additional capability of coercing (i.e. parsing) raw data into the desired types, which correspond to some machine-level representation (e.g. int64, str, etc) whereas checks are simply functions that validate a property of the potentially-coerced data.

Conflating the two by converting class_col: Series[TestEnum] to class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum) under the hood would introduce additional complexity to the library with not that many keystrokes saved.

The takeaway here is that class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum) is a good enough solution for supporting enums in pandera, but for a deeper integration with the type system, defining a custom data type would be necessary if you want to take advantage of encoding Enums as pandas Categorical types, for example.

Why? Because Enums are not limited to string dtypes and could potentially be ordered (pandas categoricals support this, but Python doesn't out of the box).

If anyone's down to make a PR for this, I'd welcome it! @davidandreoletti @the-matt-morris @dantheand

cosmicBboy avatar Aug 19 '22 15:08 cosmicBboy

@davidandreoletti @cosmicBboy I actually just found out that using class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum) results in a yaml.representer.RepresenterError when trying to do SchemaModel.to_yaml(), so the best solution I've found is to convert to list of strings via list comp:

class_col: Series[pd.StringDtype] = pandera.Field(isin=[entry.value for entry in TestEnum])
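A small standard-library sketch of what the list comprehension changes. Members of a str-mixin Enum are str instances, but their exact type is still the Enum class, while .value yields plain str objects. That this is the reason the YAML serialization fails is a guess on my part, not something confirmed in pandera's source.

```python
import enum


class TestEnum(str, enum.Enum):
    CLASS_1 = "class 1"
    CLASS_2 = "class 2"


# Iterating the Enum class yields members, whose exact type is TestEnum.
members = list(TestEnum)

# Taking .value yields plain built-in strings.
allowed = [entry.value for entry in TestEnum]

print([type(m).__name__ for m in members])  # exact type of each member
print([type(v).__name__ for v in allowed])  # exact type of each value
```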

@cosmicBboy When you say "custom data type", do you mean Logical Data type described in the documentation you linked?

dantheand avatar Aug 19 '22 19:08 dantheand

I actually just found out that using class_col: Series[pd.StringDtype] = pandera.Field(isin=TestEnum) results in a yaml.representer.RepresenterError when trying to do SchemaModel.to_yaml()

Ah! yeah the serialize checks logic would need to be updated to handle enums... feel free to open up a new issue if you want first class support here.

When you say "custom data type", do you mean Logical Data type described in the documentation you linked?

Yep! Although in this case a PR would make it a built-in datatype in pandera.engines.pandas_engine

cosmicBboy avatar Aug 19 '22 19:08 cosmicBboy

Hi guys, is there an update on this issue?

racash007 avatar Apr 03 '23 09:04 racash007