Support for multi type (Unions) in schemas and validation

Open vianmixtkz opened this issue 2 years ago • 9 comments

Is your feature request related to a problem? Please describe.

I would like pandera to support Union types, i.e. the validation of a Series/Column should allow multiple types. Pydantic allows this.

Here is an example of my issue:

from typing import Union
import pandas as pd
import pandera as pa
from pandera.typing import Series

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment : Series[Union[str, float]] = pa.Field()

class OutputSchema(InputSchema):
    revenue: Series[float]

df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
    "comment":["test", float("nan"), "test"]
})

InputSchema(df) # raises TypeError Cannot interpret 'typing.Union[str, float]' as a data type

Describe the solution you'd like

I think not allowing Unions is the desired behavior for now, but could you consider an option to allow them in the future?

Describe alternatives you've considered

Splitting the Union columns into multiple columns, one for each type, but this is not really something that I can control (see the next section).

Additional context

I have a valid use case for this. I am using pandas to handle CSVs where some columns contain hybrid data types. I am using pandas for the preprocessing and pydantic for the validation, and I would like to use pandera to make this process (processing + validation) more robust.

vianmixtkz avatar Apr 04 '23 14:04 vianmixtkz

@vianmixtkz Great writeup. This is something that would be great for Pandera to support.

johnkangw avatar Apr 06 '23 20:04 johnkangw

Thanks @vianmixtkz this is an interesting use case: the way pandas handles mixed-type columns is to represent the data in an object dtype column.
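As a small illustration of the object dtype point, here is a minimal sketch using only pandas. The variable names are made up for the example, and the column uses 1.5 rather than NaN so that the float value is not read as a missing string.

```python
import pandas as pd

# A column mixing str and float values, similar to the "comment" column above.
comment = pd.Series(["test", 1.5, "test"])

# pandas stores mixed Python values under the generic object dtype.
has_object_dtype = pd.api.types.is_object_dtype(comment.dtype)

# The dtype alone does not say which types are present, so the element
# types are collected by looking at each value.
value_types = {type(value) for value in comment}

print(comment.dtype, has_object_dtype, value_types)
```

The dtype only says "object"; the str/float mix is visible only at the value level, which is why the two interpretations below differ.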

One thing we should clarify in the semantics of this feature is the following: we can interpret Union[str, float] either as:

  1. the column is either a str column or a float column
  2. the column is an object column that contains either str or float values

Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:

  • if the column is str dtype, then pass
  • if the column is float dtype, then pass
  • if the column is object data type, check that values are str or float. If so, then pass.
  • fail if none of the above conditions are met.
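
The four rules above can be sketched in plain pandas as follows. This is only an illustration of the proposed semantics, not pandera's type engine: the function name `column_matches_str_or_float` is hypothetical, and treating pandas' dedicated string dtype as "str dtype" is an assumption.

```python
import pandas as pd


def column_matches_str_or_float(column: pd.Series) -> bool:
    """Apply the lenient reading of Union[str, float] to one column."""
    dtype = column.dtype

    # Rule 1: the column uses pandas' dedicated string dtype.
    if isinstance(dtype, pd.StringDtype):
        return True

    # Rule 2: the column is a float column.
    if pd.api.types.is_float_dtype(dtype):
        return True

    # Rule 3: an object column passes only if every value is a str or a float.
    # NaN is itself a float, so missing values pass this check.
    if pd.api.types.is_object_dtype(dtype):
        return all(isinstance(value, (str, float)) for value in column)

    # Rule 4: anything else fails.
    return False
```

For example, `column_matches_str_or_float(pd.Series(["test", float("nan")], dtype=object))` passes through rule 3, while an integer column falls through to rule 4.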

cosmicBboy avatar Apr 06 '23 21:04 cosmicBboy

What I described here matches case 2. That is, in a given column, I'll have for example str on some rows and floats on other rows. But it would be nice to support both cases anyway.

With something like:

Case 1

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment : Union[Series[str], Series[float]] = pa.Field() # comment is either only str or only float in a given DataFrame

Case 2

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment : Series[Union[str,float]] = pa.Field() # comment is a column containing str on some rows and float on other rows

And yeah, I think the behavior you are describing is what users would expect

Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:

  • if the column is str dtype, then pass # passes in case 1 and 2
  • if the column is float dtype, then pass # passes in case 1 and 2
  • if the column is object data type, check that values are str or float. If so, then pass. # passes only in case 2
  • fail if none of the above conditions are met.

vianmixtkz avatar Apr 06 '23 21:04 vianmixtkz

Just bumping this thread.

Any consensus on how to proceed? It seems like #1227 is stale.

aaravind100 avatar Oct 05 '23 12:10 aaravind100

Revisiting this issue and thinking about it a little bit, here's another proposal for this issue:

from typing import Annotated, Union

import pandera as pa
from pandera.engines.pandas_engine import Object

class Model(pa.DataFrameModel):
    union_column : Union[str, float]  # the column data type must be either a str or float

    object_column: Object = pa.Field(dtype_kwargs={"allowable_types": [str, float]})
    # or use the annotated types
    object_column: Annotated[Object, [str, float]]

This syntax is less ambiguous as to what the actual type of the column is vs. the values within it are. However, it does require importing a special Object type.
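
To make the value-level check behind `allowable_types` concrete, here is a rough pandas-only sketch. `allowable_types` is only the keyword proposed above, and `find_disallowed_values` is a hypothetical helper, not part of pandera.

```python
import pandas as pd


def find_disallowed_values(column: pd.Series, allowable_types: tuple) -> pd.Series:
    """Return the values of an object column whose Python type is not allowed."""
    # Build a boolean mask aligned with the column's index.
    allowed = pd.Series(
        [isinstance(value, allowable_types) for value in column],
        index=column.index,
        dtype=bool,
    )
    # Keep only the offending rows so they can be reported to the user.
    return column[~allowed]


column = pd.Series(["test", 1.5, 3, None], dtype=object)
offending = find_disallowed_values(column, (str, float))
print(offending)
```

Returning the offending rows rather than a single boolean is a design choice made for this sketch, so that a validation error could point at the failing values.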

I'm still open to the more ambiguous behavior where Union[str, float] would cover all of these cases though. Open to further discussion on this!

cosmicBboy avatar Mar 30 '24 15:03 cosmicBboy

Re: this proposal: https://github.com/unionai-oss/pandera/issues/1152#issuecomment-1499660502

Unfortunately, col: Series[TYPE] and col: TYPE in a DataFrameModel are equivalent, so Union[Series[str], Series[float]] and Series[Union[str,float]] would effectively be equivalent. Distinguishing them would also introduce more complexity to the handling of types in DataFrameModel, which I don't think would be worth it.

cosmicBboy avatar Mar 30 '24 16:03 cosmicBboy

I'm not a fan of the Union[Series[str], Series[float]] case from this comment, where the series would consist of only strings or only floats. It's very ambiguous: the output would change depending on what data you pass. These could very well be their own distinct schemas.

Series[Union[str, float]], Union[str, float], or str | float (Python 3.10+), where the values could be either string or float, is more consistent.

aaravind100 avatar Apr 01 '24 17:04 aaravind100