pandera
Pydantic types support
Hi,
I think it would be very convenient to support pydantic types in pandera, since pandera and pydantic are used together very often. What do you think about that?
Describe alternatives you've considered: Use available types and add manual checks.
Hi @ghilesmeddour did you check out https://pandera.readthedocs.io/en/stable/pydantic_integration.html?
You can currently use pydantic types at the DataFrameSchema level, so pydantic models whose fields are scalar types should validate your dataframe in a row-wise manner. Is this what you were thinking, or do you have another use case?
Hi @cosmicBboy,
Thanks for your response.
For example, let's say we want to validate some field using pydantic.UUID4. The following code
import uuid
from pydantic import UUID4
import pandas as pd
import pandera as pa
from pandera.typing import Series
class InputSchema(pa.SchemaModel):
    a: Series[UUID4] = pa.Field()
    b: Series[int] = pa.Field()
df = pd.DataFrame({'a': [uuid.uuid4(), uuid.uuid4()], 'b': [3, 4]})
InputSchema.validate(df)
raises TypeError: dtype '<class 'pydantic.types.UUID4'>' not understood.
I guess this is documented here, but I couldn't understand if it is not supported or if it needs a special syntax.
Hey @ghilesmeddour,
This confused me a bit at first too. The only types you can pass to the pandera.typing.Series type annotation in SchemaModel classes are the pandas dtypes (and I think pandera custom dtypes as well?). This type checking / coercion is done at the column level and is very quick.
If you want to do more specific type-checking than those available as pandas dtypes (like checking that an entry is a UUID type), then you'll have to use row-wise validators, which, fair warning, can be quite slow.
To accomplish this using a standard pydantic model, you can follow the documentation linked above and define a row-wise validation. Your specific example would look something like:
from pydantic import BaseModel
import pandera as pa
import uuid
import pandas as pd
from pandera.engines.pandas_engine import PydanticModel
class Record(BaseModel):
    a: uuid.UUID
    b: int
class PydanticSchema(pa.SchemaModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(Record)
        coerce = True  # this is required, otherwise a SchemaInitError is raised
Trying this out should fail:
df = pd.DataFrame({'a': ['not_uuid', uuid.uuid4()], 'b': [3, 4]})
PydanticSchema.validate(df)
A more verbose method would be to define a custom check for each column that checks the type.
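For instance, a minimal sketch of that approach (the is_uuid helper and the schema are illustrative, not from pandera):

import uuid
import pandas as pd
import pandera as pa

def is_uuid(x) -> bool:
    """Return True if x parses as a UUID."""
    try:
        uuid.UUID(str(x))
        return True
    except ValueError:
        return False

schema = pa.DataFrameSchema(
    {
        # element-wise parsing via a custom check: correct but slow on big frames
        "a": pa.Column(checks=pa.Check(lambda s: s.map(is_uuid))),
        "b": pa.Column(int),
    }
)

schema.validate(pd.DataFrame({"a": [uuid.uuid4()], "b": [3]}))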
I agree that it would be great and intuitive to be able to pass the types directly to pa.Series, similar to how pydantic works, as long as it didn't greatly slow down the validation.
Thanks for answering, @dantheand!
I agree that it would be great and intuitive to be able to pass the types directly to pa.Series
Indeed, this would be the ideal UX, but there are a few technical issues blocking this, the main one being that the pydantic type system works on python types, not pandas/numpy types... The pandera type system would in theory allow us to add support for pydantic types; this is definitely something I'd support contributions for! @jeffzi what do you think?
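To illustrate the mismatch, here's a small sketch: pydantic validates individual python values, while pandas only exposes a column-level dtype for a type system to hook into.

import uuid
import pandas as pd

# pandas has no UUID dtype; a column of uuid.UUID values is just "object",
# so there is no column-level dtype a pydantic type could map onto
s = pd.Series([uuid.uuid4(), uuid.uuid4()])
print(s.dtype)  # object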
I'll keep this issue open to track this feature, I think it would be super valuable, though it seems like a fairly heavy lift.
@ghilesmeddour @dantheand besides UUID4, are there any other pydantic types that you'd like support for? We can keep a running list and prioritize high-value ones to implement.
@cosmicBboy UUID isn't a pydantic type as far as I know; it's a type from the uuid library. I actually think the current dtype = PydanticModel(Record) solution above already provides the required functionality. It's mostly a UX ask here. It would be awesome to be able to combine pydantic-style subtyping like List[UUID] with the powerful validation methods provided by pa.Field().
Although, maybe it's best not to encourage people to put arbitrary types in pandas dataframe columns...
Although, maybe it's best not to encourage people to put arbitrary types in pandas dataframe columns...
Agreed! In the case of uuid, it would be more efficient to store it as a string and have a check on top of it to validate its format. I'm pretty sure @ghilesmeddour does not care about having a UUID object, and actually wants to ensure the data in the column is in a valid UUID format.
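A sketch of that string-plus-check approach (the regex is the usual UUIDv4 pattern; names are illustrative):

import pandera as pa

UUID4_REGEX = (
    r"^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}"
    r"-[89ab][0-9a-f]{3}-[0-9a-f]{12}$"
)

schema = pa.DataFrameSchema(
    {
        # vectorized string matching instead of storing UUID objects
        "a": pa.Column(str, checks=pa.Check.str_matches(UUID4_REGEX)),
    }
)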
We recently introduced the concept of logical data types (still in the dev branch), allowing us to:
a. support non-native pandas types, e.g. Date (pandas only supports datetime), Decimal
b. create commonly used types for better UX. Think of it as a type with an embedded check, and the ability to coerce, which checks can't do. e.g.: Path, uuid, url
the pydantic type system works for python types, not for pandas/numpy types... the pandera type system in theory would allow us to add support for pydantic types, this would definitely be something I'd support contributions for! @jeffzi what do you think?
Pydantic does have "true" custom types such as Color or URL which are not based on the standard lib like UUID is. Here is what I propose:
- Create logical data type counterparts to the pydantic custom types, implemented with vectorized pandas operations for maximum efficiency.
- Register pydantic types as equivalents of the new logical data types. Pandera already supports registering parametrized data types, so I don't see any blockers for that approach to work (see the sketch below).
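A rough sketch of what such a registration could look like, assuming a string-backed logical type (the UUIDString class and its mapping are hypothetical, not the final design):

import uuid

import pandas as pd
from pydantic import UUID4

from pandera import dtypes
from pandera.engines import pandas_engine

# hypothetical logical dtype: stores UUIDs as strings and is recognized
# whenever a schema is annotated with pydantic's UUID4 or stdlib uuid.UUID
@pandas_engine.Engine.register_dtype(equivalents=[UUID4, uuid.UUID])
@dtypes.immutable
class UUIDString(pandas_engine.NpString):
    def coerce(self, series: pd.Series) -> pd.Series:
        # coerce to the canonical string form for vectorized operations
        return series.astype(str)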
If we agree on this approach, the next step is to create an issue that references all the pydantic types we plan on supporting. The list of pydantic types can be found here.
I'll keep this issue open to track this feature, I think it would be super valuable, though it seems like a fairly heavy lift.
Absolutely! I'd love to add transformation capabilities to Pandera where it makes sense (not in checks imho :clown_face:). Those logical types would be very convenient for end users and fit the current pandera approach.
Agreed ! In the case of uuid, it would be more efficient to store it as a string and have a check on top of it to validate its format. I'm pretty sure @ghilesmeddour does not care about having a UUID object, and actually wants to ensure the data in the column is a valid UUID format.
For me, the question came up when trying to use pandera with FastAPI. For Pandera to integrate seamlessly with FastAPI (which is completely based on Pydantic for type validation and documentation), Pandera would have to be able to "recognise" all Pydantic types, in my opinion.
For the actual representation of values, I guess it will be more natural for the user to keep the same behaviour as Pydantic when possible (which coerces in the case of UUID):
import uuid
from pydantic import validate_arguments, UUID4
@validate_arguments
def foo(a: UUID4):
    print(type(a))
foo(uuid.uuid4())
foo(str(uuid.uuid4()))
Output:
<class 'uuid.UUID'>
<class 'uuid.UUID'>
@ghilesmeddour Thanks for your input! The steps I proposed will enable Pandera to recognize Pydantic types in the same way Pandera recognizes multiple integer types. Internally, Pandera has its own data types that are mapped to the recognized types.
It would look like this (SchemaModel api would work too):
import pandera as pa
from pydantic import UUID4
import numpy as np
from uuid import UUID
schema = pa.DataFrameSchema(
    {
        "uuid_a": pa.Column(UUID4),  # internally mapped to pandera.dtypes.UUID
        "uuid_b": pa.Column(UUID),
        # recognized int types
        "col1": pa.Column(int),
        "col2": pa.Column(pa.Int),
        "col3": pa.Column(np.int64),
        "col4": pa.Column("int"),
    },
    coerce=True,
)
The same logic would apply to a Path or URL type.
For the actual representation of values, I guess it will be more natural for the user to keep the same behaviour as Pydantic when possible (which coerces in the case of UUID).
uuid.UUID objects would be wrapped in a pandas object column. That's not efficient because you won't be able to use vectorized string operations. I would encode them as strings and, for seamless integration with fastapi, we could customize how Pandera exports these types (to/from configuration).
For example, I think to_format = "dict" should output uuid.UUID objects, but to_format = "json" should export strings. @cosmicBboy For that to work, we would have to allow Pandera DataTypes to customize export on top of their current responsibilities of customizing check and coerce.
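For reference, a minimal sketch of the existing Config-level serialization hook this would build on (the per-dtype export customization itself is the proposal, not current behavior; OutSchema is illustrative):

import pandera as pa
from pandera.typing import Series

class OutSchema(pa.SchemaModel):
    uuid_col: Series[str]

    class Config:
        # existing hook: output of @pa.check_types-decorated functions
        # returning DataFrame[OutSchema] is serialized to this format
        to_format = "json"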
@jeffzi I really like the proposed feature. Until your comments, I didn't realize how much effort went into allowing arbitrary custom dtypes in pandera (e.g. choosing pandas-compatible column dtypes, parallelized checks, decisions on output format, etc.). It now makes sense why you wouldn't natively support arbitrary dtypes in pandas columns. The currently available pydantic record model method allows arbitrary types but is very slow.
There are also the pydantic constrained types like NegativeFloat, NegativeInt, PositiveFloat, and PositiveInt, which provide a concise and clear syntax. Most of them should be easily implementable with current check functions.
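For example, a sketch of how a couple of the constrained types map onto existing Field arguments (schema name is illustrative):

import pandera as pa
from pandera.typing import Series

class ConstrainedSchema(pa.SchemaModel):
    # rough equivalents of pydantic's NegativeInt / PositiveFloat
    neg: Series[int] = pa.Field(lt=0)
    pos: Series[float] = pa.Field(gt=0)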
I love this idea and appreciate the proposal! There are many situations where I use the wonderful pydantic constrained types or define a custom type. It is possible to create the same validation with pandera checks, but it would be great to reuse the same pydantic types that are used elsewhere in the codebase. I would think there exists some solution that uses __get_validators__ methods, though it would be tricky to separate this out into check and coerce functionality, as validation and coercion are typically both performed in __get_validators__.
Example
A (somewhat) typical situation:
Define custom pydantic type(s)
from datetime import date, datetime
import pandas as pd
from pydantic.types import ConstrainedInt
from pydantic.validators import number_size_validator
def to_epoch_time(v) -> int:
    if isinstance(v, date):
        return int(v.strftime("%s"))
    if isinstance(v, (int, float)) or pd.api.types.is_numeric_dtype(v):
        return int(v)
    raise TypeError(f"{v} is not a valid date or numeric type")
class EpochTime(ConstrainedInt):
    """Number of seconds since 1970"""

    @classmethod
    def __get_validators__(cls):
        yield to_epoch_time
        yield number_size_validator

class PastEpochTime(EpochTime):
    lt = datetime.now().timestamp()

class FutureEpochTime(EpochTime):
    gt = datetime.now().timestamp()
Currently
What (I think) is required to do this currently:
from datetime import datetime

import pandas as pd

from pandera import Field, SchemaModel, dtypes
from pandera.engines import pandas_engine
from pandera.typing import Series

# Create custom data type
@pandas_engine.Engine.register_dtype
@dtypes.immutable
class EpochTime(pandas_engine.INT64):
    def coerce(self, series: pd.Series) -> pd.Series:
        if pd.api.types.is_datetime64_any_dtype(series):
            series = series.map(to_epoch_time)
        return series.astype("int64")
class MySchema(SchemaModel):
    timestamp: Series[EpochTime] = Field(lt=datetime.now().timestamp(), coerce=True)
    ...
Future??
class MySchema(SchemaModel):
    timestamp: Series[PastEpochTime]
There are likely a ton of complexities I'm unaware of, and I recognize this is likely not in scope for this issue, as adding support for pydantic-defined types covers nearly all use cases. That said, if there were enough interest in supporting pydantic custom types, I'd be happy to collaborate on such a feature.
Is this feature request dead?
Hello, according to the pandera docs you are now able to use a pydantic model in a pandera schema (and vice versa). Maybe you can solve your problem with this:
import uuid
from pydantic import UUID4,BaseModel
import pandas as pd
import pandera as pa
from pandera.typing import Series
from pandera.engines.pandas_engine import PydanticModel
class MyRecord(BaseModel):
    a: UUID4
    b: int

class InputSchema(pa.DataFrameModel):
    """Pandera schema using the pydantic model."""

    class Config:
        """Config with dataframe-level data type."""

        dtype = PydanticModel(MyRecord)
        coerce = True  # this is required, otherwise a SchemaInitError is raised
df = pd.DataFrame({'a': [uuid.uuid4(), uuid.uuid4()], 'b': [3, 4]})
InputSchema.validate(df)
What is cool with this combo is that you can reuse InputSchema in another pydantic model:
from pandera.typing import DataFrame
class MyCompleteData(BaseModel):
    metadata: str
    data: DataFrame[InputSchema]
good_pydantic_object = MyCompleteData(
    data=pd.DataFrame({'a': [uuid.uuid4(), uuid.uuid4()], 'b': [3, 4]}),
    metadata="Will work"
)
print(good_pydantic_object)

bad_pydantic_object = MyCompleteData(
    data=pd.DataFrame({'a': [uuid.uuid4(), "NOT A UUID"], 'b': [3, 4]}),
    metadata="validation error should occur"
)
Hope this helps 👍
I have a requirement where a pydantic model is a column in a dataframe along with other columns. Currently, the last line in the following code gives a pandera.errors.SchemaInitError: PydanticModel dtype can only be specified as a DataFrameSchema dtype error.
import pandas as pd
from pydantic import BaseModel
from pandera import DataFrameModel
from pandera.typing import Series
from pandera.engines.pandas_engine import PydanticModel
class MyModel(BaseModel):
    x: int
    y: str

class MySchema(DataFrameModel):
    a: Series[int]
    b: Series[PydanticModel(MyModel)]

    class Config:
        strict = "filter"
        coerce = True
df = pd.DataFrame({'a': [1,2], 'b':[{'x': 11, 'y': 'data1'}, {'x': 22, 'y': 'data2'}]})
df = MySchema.validate(df)
I want MySchema.validate(df) to coerce column "b" in each row to the MyModel type, like:
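A hypothetical sketch of that coercion as a manual pre-processing step, since pandera doesn't currently support column-level PydanticModel dtypes:

# hypothetical workaround: turn each dict in "b" into a MyModel instance
# before validation, rather than relying on the schema's coerce step
df["b"] = df["b"].map(lambda v: MyModel(**v) if isinstance(v, dict) else v)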