pandera
Support for multi type (Unions) in schemas and validation
Is your feature request related to a problem? Please describe.
I would like pandera to support Union types. That is, the validation of a Series/Column should allow multiple types. Pydantic allows this.
Here is an example of my issue:
from typing import Union

import pandas as pd
import pandera as pa
from pandera.typing import Series


class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment: Series[Union[str, float]] = pa.Field()


class OutputSchema(InputSchema):
    revenue: Series[float]


df = pd.DataFrame({
    "year": ["2001", "2002", "2003"],
    "month": ["3", "6", "12"],
    "day": ["200", "156", "365"],
    "comment": ["test", float("nan"), "test"],
})

InputSchema(df)  # raises TypeError: Cannot interpret 'typing.Union[str, float]' as a data type
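For context, the annotation in the failing line is a typing construct rather than a concrete dtype. As a minimal sketch using only the standard library (this is not pandera's implementation), a Union annotation can be unpacked into its member types and used for a per-value check. The helper names union_members and value_matches are hypothetical.

```python
import types
from typing import Union, get_args, get_origin

# typing.Union covers Union[str, float]; types.UnionType covers `str | float`
# on Python 3.10+. getattr keeps the sketch importable on older versions.
_UNION_ORIGINS = (Union, getattr(types, "UnionType", Union))


def union_members(annotation):
    """Return the member types of a Union, or a 1-tuple for a plain type."""
    if get_origin(annotation) in _UNION_ORIGINS:
        return get_args(annotation)
    return (annotation,)


def value_matches(value, annotation):
    """Check a single value against a plain type or a Union of plain types."""
    return isinstance(value, union_members(annotation))


print(union_members(Union[str, float]))                # (<class 'str'>, <class 'float'>)
print(value_matches("test", Union[str, float]))        # True
print(value_matches(float("nan"), Union[str, float]))  # True, NaN is a float
print(value_matches(3, Union[str, float]))             # False, int is not float
```

Note that NaN counts as a float here, which is why the "comment" column in the example would satisfy a per-value str-or-float check.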
Describe the solution you'd like
I think not allowing Unions is the desired behavior for now. But could you consider an option to allow it in the future?
Describe alternatives you've considered
Split the Union columns into multiple columns, one for each type, but this is not really something that I can control. See the next section.
Additional context
I have a valid use case for this. I am using pandas to handle CSVs where some columns contain hybrid data types. I am using pandas for the preprocessing and pydantic for the validation, and I would like to use pandera to make this process (processing + validation) more robust
@vianmixtkz Great writeup. This is something that would be great for Pandera to support.
Thanks @vianmixtkz this is an interesting use case: the way pandas handles mixed-type columns is to represent the data in an object dtype column.
One thing we should clarify in the semantics of this feature is the following: we can interpret Union[str, float] either as:
- the column is either a str column or a float column
- the column is an object column that contains either str or float values
Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:
- if the column is str dtype, then pass
- if the column is float dtype, then pass
- if the column is object data type, check that values are str or float. If so, then pass.
- fail if none of the above conditions are met.
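The rule list above can be sketched with plain pandas. This is only an illustration of the proposed semantics, not pandera's type engine; the function name matches_str_or_float is hypothetical, and the object column is built with an explicit dtype=object so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_object_dtype


def matches_str_or_float(series: pd.Series) -> bool:
    """Hypothetical helper: apply the dtype-then-values rules for Union[str, float]."""
    dtype = series.dtype
    # Rule 1: a dedicated string dtype passes.
    if isinstance(dtype, pd.StringDtype):
        return True
    # Rule 2: a float dtype passes.
    if is_float_dtype(dtype):
        return True
    # Rule 3: an object column passes only if every value is a str or a float.
    if is_object_dtype(dtype):
        return all(isinstance(value, (str, float)) for value in series)
    # Rule 4: anything else fails.
    return False


mixed = pd.Series(["test", float("nan"), "test"], dtype=object)
print(matches_str_or_float(mixed))                                # True
print(matches_str_or_float(pd.Series([1.0, 2.5])))                # True
print(matches_str_or_float(pd.Series([1, 2])))                    # False, int64 dtype
print(matches_str_or_float(pd.Series(["a", 1], dtype=object)))    # False, 1 is an int
```

The object branch is the only one that inspects individual values, so it is also the only one whose cost grows with the number of rows.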
What I described here matches case 2. That is, in a given column, I'll have for example str on some rows and floats on other rows. But it would be nice to support both cases anyway.
With something like:
Case 1
class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    # comment is either only str or only float in a given DataFrame
    comment: Union[Series[str], Series[float]] = pa.Field()
Case 2
class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    # comment is a column containing str on some rows and float on other rows
    comment: Series[Union[str, float]] = pa.Field()
And yeah, I think the behavior you are describing is what users would expect
Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:
if the column is str dtype, then pass  # passes in case 1 and 2
if the column is float dtype, then pass  # passes in case 1 and 2
if the column is object data type, check that values are str or float. If so, then pass.  # passes only in case 2
fail if none of the above conditions are met.
Just bumping this thread.
Any consensus on how to proceed? It seems like #1227 is stale.
Revisiting this issue and thinking about it a little bit, here's another proposal for this issue:
from typing import Annotated, Union

import pandera as pa
from pandera.engines.pandas_engine import Object


class Model(pa.DataFrameModel):
    # the column data type must be either a str or float
    union_column: Union[str, float]
    object_column: Object = pa.Field(dtype_kwargs={"allowable_types": [str, float]})
    # or use the annotated types
    object_column: Annotated[Object, [str, float]]
This syntax is less ambiguous as to what the actual type of the column is vs. the values within it are. However, it does require importing a special Object type.
I'm still open to the more ambiguous behavior where Union[str, float] would cover all of these cases though. Open to further discussion on this!
Re: this proposal: https://github.com/unionai-oss/pandera/issues/1152#issuecomment-1499660502
Unfortunately, col: Series[TYPE] and col: TYPE in a DataFrameModel are equivalent, so Union[Series[str], Series[float]] and Series[Union[str, float]] would effectively be equivalent. Distinguishing them would also introduce more complexity to the handling of types in DataFrameModel, which I don't think would be worth it.
I'm not a fan of the Union[Series[str], Series[float]] case from this comment, where the series would consist of only strings or only floats. It's very ambiguous: the output would sort of change depending on what data you pass. These could very well be their own distinct schemas.
Series[Union[str, float]], Union[str, float], or str | float (Python 3.10+), where the output could be either string or float, is the more consistent case.