
Pydantic types support

Open ghilesmeddour opened this issue 2 years ago • 11 comments

Hi,

I think it would be very convenient to support pydantic types in pandera, since pandera and pydantic are used together very often. What do you think about that?

Describe alternatives you've considered: use available types and add manual checks.

ghilesmeddour avatar Jul 19 '22 09:07 ghilesmeddour

Hi @ghilesmeddour did you check out https://pandera.readthedocs.io/en/stable/pydantic_integration.html?

You can currently use pydantic types at the DataFrameSchema level, so pydantic models whose fields are scalar types should validate your dataframe in a row-wise manner. Is this what you were thinking or do you have another use case?

cosmicBboy avatar Jul 20 '22 14:07 cosmicBboy

Hi @cosmicBboy,

Thanks for your response.

For example, let's say we want to validate some field using pydantic.UUID4. The following code

import uuid
from pydantic import UUID4

import pandas as pd

import pandera as pa
from pandera.typing import Series

class InputSchema(pa.SchemaModel):
    a: Series[UUID4] = pa.Field()
    b: Series[int] = pa.Field()
    
df = pd.DataFrame({'a': [uuid.uuid4(), uuid.uuid4()], 'b': [3, 4]})

InputSchema.validate(df)

raises TypeError: dtype '<class 'pydantic.types.UUID4'>' not understood.

I guess this is documented here, but I couldn't tell whether it is unsupported or whether it needs a special syntax.

ghilesmeddour avatar Jul 20 '22 15:07 ghilesmeddour

Hey @ghilesmeddour,

This confused me a bit at first too. The only types you can pass to the pandera.typing.Series type annotation in SchemaModel classes are the pandas dtypes (and, I think, pandera custom dtypes as well?). This type checking / coercion is done at the column level and is very quick.

If you want to do more specific type-checking than what's available as pandas dtypes (like checking that an entry is a UUID type), then you'll have to use row-wise validators, which, fair warning, can be quite slow.

To accomplish this using a standard pydantic model, you can follow the documentation linked above and define a row-wise validation. Your specific example would look something like:

from pydantic import BaseModel
import pandera as pa
import uuid
import pandas as pd
from pandera.engines.pandas_engine import PydanticModel

class Record(BaseModel):
    a: uuid.UUID
    b: int

class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True  # this is required, otherwise a SchemaInitError is raised

Trying this out should fail:

df = pd.DataFrame({'a': ['not_uuid', uuid.uuid4()], 'b': [3, 4]})
PydanticSchema.validate(df)

A more verbose method would be to define custom checks for each column that check the type.

dantheand avatar Jul 21 '22 02:07 dantheand

I agree that it would be great and intuitive to be able to pass types directly to pa.Series, similar to how pydantic works, as long as it didn't greatly slow down the validation.

dantheand avatar Jul 21 '22 02:07 dantheand

thanks for answering @dantheand !

I agree that it would be great and intuitive to be able to pass the types directly to pa.Series

Indeed, this would be the ideal UX, but there are a few technical issues blocking this, the main one being that the pydantic type system works with python types, not with pandas/numpy types. The pandera type system would, in theory, allow us to add support for pydantic types; this is definitely something I'd support contributions for! @jeffzi what do you think?

I'll keep this issue open to track this feature, I think it would be super valuable, though it seems like a fairly heavy lift.

@ghilesmeddour @dantheand besides UUID4 are there any other pydantic types that you'd like support for? We can keep a running list and prioritize high-value ones to implement.

cosmicBboy avatar Jul 21 '22 16:07 cosmicBboy

@cosmicBboy UUID isn't a pydantic type as far as I know; it's a type from the uuid standard library. I actually think the current dtype = PydanticModel(Record) solution above already provides the required functionality, so this is mostly a UX ask. It would be awesome to be able to combine pydantic-style subtyping like List[UUID] with the powerful validation methods provided by pa.Field().

Although, maybe it's best not to encourage people to put arbitrary types in pandas dataframe columns...

dantheand avatar Jul 23 '22 22:07 dantheand

Although, maybe it's best not to encourage people to put arbitrary types in pandas dataframe columns...

Agreed! In the case of uuid, it would be more efficient to store it as a string and have a check on top of it to validate its format. I'm pretty sure @ghilesmeddour does not care about having a UUID object, and actually wants to ensure the data in the column is in a valid UUID format.
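A minimal sketch of that string-plus-format-check idea, using a plain vectorized regex (the pattern below only covers the canonical lowercase 8-4-4-4-12 form, as an assumption for illustration):

```python
import pandas as pd

# Canonical 8-4-4-4-12 hex UUID layout, lowercase only.
UUID_REGEX = (
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}"
    r"-[0-9a-f]{4}-[0-9a-f]{12}$"
)

# Vectorized format validation on a string column: no per-row Python
# objects, just a single pandas string operation.
s = pd.Series(["123e4567-e89b-12d3-a456-426614174000", "not_uuid"])
valid = s.str.match(UUID_REGEX)
```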

We recently introduced the concept of logical data types (still on the dev branch), allowing us to:

  a. support non-native pandas types, e.g. Date (pandas only supports datetime) and Decimal
  b. create commonly used types for better UX. Think of it as a type with an embedded check, and the ability to coerce, which checks can't do. e.g.: Path, uuid, url

the pydantic type system works for python types, not for pandas/numpy types... the pandera type system in theory would allow us to add support for pydantic types, this would definitely be something I'd support contributions for! @jeffzi what do you think?

Pydantic does have "true" custom types, such as Color or URL, which are not based on the standard lib the way UUID is. Here is what I propose:

  1. Create logical data type counterparts to the pydantic custom types, implemented with vectorized pandas operations for maximum efficiency.
  2. Register the pydantic types as equivalents of the new logical data types. Pandera already supports registering parametrized data types, so I don't see any blockers for that approach.

If we agree on this approach, the next step is to create an issue that references all the pydantic types we plan on supporting. The list of pydantic types can be found here.

I'll keep this issue open to track this feature, I think it would be super valuable, though it seems like a fairly heavy lift.

Absolutely! I'd love to add transformation capabilities to Pandera where it makes sense (not in checks imho :clown_face:). Those logical types would be very convenient for end users and fit the current pandera approach.

jeffzi avatar Jul 24 '22 08:07 jeffzi

Agreed ! In the case of uuid, it would be more efficient to store it as a string and have a check on top of it to validate its format. I'm pretty sure @ghilesmeddour does not care about having a UUID object, and actually wants to ensure the data in the column is a valid UUID format.

For me, the question came up when trying to use pandera with FastAPI. For Pandera to integrate seamlessly with FastAPI (which is completely based on Pydantic for type validation and documentation), Pandera would, in my opinion, have to be able to "recognise" all Pydantic types.

For the actual representation of values, I guess it will be more natural for the user to keep the same behaviour as Pydantic where possible (which coerces, in the case of UUID):

import uuid
from pydantic import validate_arguments, UUID4

@validate_arguments
def foo(a: UUID4):
    print(type(a))

foo(uuid.uuid4())       # <class 'uuid.UUID'>
foo(str(uuid.uuid4()))  # <class 'uuid.UUID'>

ghilesmeddour avatar Jul 24 '22 20:07 ghilesmeddour

@ghilesmeddour Thanks for your input! The steps I proposed will enable Pandera to recognize Pydantic types in the same way Pandera recognizes multiple integer types. Internally, Pandera has its own data types that are mapped to the recognized types.

It would look like this (the SchemaModel API would work too):

import pandera as pa
from pydantic import UUID4
import numpy as np
from uuid import UUID

schema = pa.DataFrameSchema(
    {
        "uuid_a": pa.Column(UUID4), # internally mapped to pandera.dtypes.UUID
        "uuid_b": pa.Column(UUID),
        # recognized int types
        "col1": pa.Column(int),
        "col2": pa.Column(pa.Int),
        "col3": pa.Column(np.int64),
        "col4": pa.Column("int"),
    },
    coerce=True,
)

The same logic would apply to a Path or URL type.

For the actual representation of values, I guess it will be more natural for the user to keep the same behaviour as Pydantic where possible (which coerces, in the case of UUID).

uuid.UUID objects would be wrapped in a pandas object column. That's not efficient because you won't be able to use vectorized string operations. I would encode them as strings and, for seamless integration with fastapi, we could customize how Pandera exports these types (via the to/from format configuration).

For example, I think to_format = "dict" should output uuid.UUID objects but to_format = "json" should export strings. @cosmicBboy For that to work, we would have to allow Pandera DataTypes to customize export on top of their current responsibilities of customizing check and coerce.

jeffzi avatar Jul 25 '22 08:07 jeffzi

@jeffzi I really like the proposed feature. Until your comments, I didn't realize how much effort went into allowing arbitrary custom dtypes in pandera (e.g. choosing pandas-compatible column dtypes, parallelized checks, decisions on output format, etc.). It now makes sense why you wouldn't natively support arbitrary dtypes in pandas columns. The currently available pydantic record model method allows arbitrary types but is very slow.

There are also the pydantic constrained types like NegativeFloat, NegativeInt, PositiveFloat, and PositiveInt, which provide a concise and clear syntax. Most of them should be easily implementable with current check functions.

dantheand avatar Jul 26 '22 02:07 dantheand

I love this idea and appreciate the proposal! There are many situations where I use the wonderful pydantic constrained types or define a custom type. It is possible to create the same validation with pandera checks, but it would be great to reuse the same pydantic types that are used elsewhere in the codebase. I would think there exists some solution that uses __get_validators__ methods, though it would be tricky to separate this into check and coerce functionality, as validation and coercion are typically both performed in __get_validators__.

Example

A (somewhat) typical situation:

Define custom pydantic type(s)

from datetime import date, datetime

import pandas as pd
from pydantic.types import ConstrainedInt
from pydantic.validators import number_size_validator


def to_epoch_time(v) -> int:
    if isinstance(v, date):
        return int(v.strftime("%s"))
    if isinstance(v, (int, float)) or pd.api.types.is_numeric_dtype(v):
        return int(v)
    raise TypeError(f"{v} is not a valid date or numeric type")


class EpochTime(ConstrainedInt):
    """Number of seconds since 1970"""

    @classmethod
    def __get_validators__(cls):
        yield to_epoch_time
        yield number_size_validator


class PastEpochTime(EpochTime):
    lt = datetime.now().timestamp()


class FutureEpochTime(EpochTime):
    gt = datetime.now().timestamp()

Currently

What (I think) is required to do this currently:

from datetime import datetime

import pandas as pd

from pandera import Field, SchemaModel, dtypes
from pandera.engines import pandas_engine
from pandera.typing import Series

# (reuses to_epoch_time from the snippet above)


# Create custom data type
@pandas_engine.Engine.register_dtype
@dtypes.immutable
class EpochTime(pandas_engine.INT64):
    def coerce(self, series: pd.Series) -> pd.Series:
        if pd.api.types.is_datetime64_any_dtype(series):
            series = series.map(to_epoch_time)
        return series.astype("int64")


class MySchema(SchemaModel):
    timestamp: Series[EpochTime] = Field(lt=datetime.now().timestamp(), coerce=True)
    ...

Future??

class MySchema(SchemaModel):
    timestamp: Series[PastEpochTime]

There are likely a ton of complexities I'm unaware of, and I recognize this is likely not in scope for this issue, as adding support for pydantic defined types covers nearly all use cases. That said, if there was enough interest to look into supporting pydantic custom types, I'd be happy to collaborate on such a feature.

the-matt-morris avatar Aug 01 '22 15:08 the-matt-morris

Is this feature-request dead?

Bolognafingers avatar Apr 15 '23 11:04 Bolognafingers

Hello, according to the pandera docs you are now able (at least in current versions) to use a pydantic model inside a pandera schema (and vice versa).

Maybe you can solve your problem with this:

import uuid

import pandas as pd
from pydantic import UUID4, BaseModel

import pandera as pa
from pandera.engines.pandas_engine import PydanticModel

class MyRecord(BaseModel):
    a: UUID4
    b: int

class InputSchema(pa.DataFrameModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(MyRecord)
        coerce = True  # this is required, otherwise a SchemaInitError is raised

df = pd.DataFrame({'a': [uuid.uuid4(), uuid.uuid4()], 'b': [3, 4]})

InputSchema.validate(df)

What is cool about this combo is that you can reuse InputSchema in another pydantic model:

from pandera.typing import DataFrame

class MyCompleteData(BaseModel):
    metadata: str
    data: DataFrame[InputSchema]

good_pydantic_object = MyCompleteData(
    data=pd.DataFrame({'a': [uuid.uuid4(), uuid.uuid4()], 'b': [3, 4]}),
    metadata="Will work",
)
print(good_pydantic_object)

bad_pydantic_object = MyCompleteData(
    data=pd.DataFrame({'a': [uuid.uuid4(), "NOT A UUID"], 'b': [3, 4]}),
    metadata="a validation error should occur",
)

Hope this helps 👍

LucienChassin avatar Apr 21 '23 11:04 LucienChassin

I have a requirement where a pydantic model is a column in a dataframe alongside other columns. Currently, the last line of the following code raises pandera.errors.SchemaInitError: PydanticModel dtype can only be specified as a DataFrameSchema dtype.

import pandas as pd
from pydantic import BaseModel
from pandera import DataFrameModel
from pandera.typing import Series
from pandera.engines.pandas_engine import PydanticModel

class MyModel(BaseModel):
    x: int
    y: str

class MySchema(DataFrameModel):
    a: Series[int]
    b: Series[PydanticModel(MyModel)]

    class Config:
        strict = "filter"
        coerce = True

df = pd.DataFrame({'a': [1,2], 'b':[{'x': 11, 'y': 'data1'}, {'x': 22, 'y': 'data2'}]})
df = MySchema.validate(df)

I want MySchema.validate(df) to coerce column "b" in each row to the MyModel type (screenshot of the expected result omitted).
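As a hypothetical stopgap (not the requested column-level PydanticModel support), the dict column can be coerced to model instances manually with a map, before or after pandera validates the other columns:

```python
import pandas as pd
from pydantic import BaseModel


class MyModel(BaseModel):
    x: int
    y: str


df = pd.DataFrame(
    {"a": [1, 2], "b": [{"x": 11, "y": "data1"}, {"x": 22, "y": "data2"}]}
)

# Coerce each dict in "b" to a MyModel instance; invalid dicts raise
# a pydantic ValidationError here, giving per-row validation.
df["b"] = df["b"].map(lambda d: MyModel(**d))
```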

karunpoudel-chr avatar Sep 13 '23 00:09 karunpoudel-chr