pandera
Bugfix/1446: Ensure Pydantic Models Can Be Created with `typing.pyspark.DataFrame` or `typing.pyspark_sql.DataFrame` Generic
In this PR we resolve the issue reported in #1446, where any Pydantic model with a `pandera.typing.pyspark.DataFrame` or `pandera.typing.pyspark_sql.DataFrame` field always throws a confusing `ValidationError`.
For clarity, we want to make sure the following leads to the expected behavior:
import pyspark.sql.types as T
from pandera.pyspark import DataFrameModel, Field
from pandera.typing.pyspark_sql import DataFrame
from pydantic import BaseModel
from pyspark.sql import SparkSession

class SampleSchema(DataFrameModel):
    """
    Sample schema model with data checks.
    """

    product: T.StringType() = Field()
    price: T.IntegerType() = Field()


class PydanticContainer(BaseModel):
    """
    Pydantic container with a DataFrameModel as a field.
    """

    data: DataFrame[SampleSchema]

    class Config:
        arbitrary_types_allowed = True

We do this by creating a `_PydanticIntegrationMixIn` that can be used by both `pandera.typing.pyspark_sql.DataFrame` and `pandera.typing.pyspark.DataFrame`. The content of the mixin is a variation of the methods used in `pandera.typing.pandas.DataFrame`.
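To illustrate the general shape of such a mixin, here is a minimal sketch that assumes pydantic v2 and uses a toy container in place of a Spark DataFrame. The names `PydanticHookSketch`, `Rows`, `ProductSchema`, and `Container` are hypothetical and are not taken from the PR; the real mixin may differ in its details.

```python
import functools
from typing import Any, Generic, TypeVar, get_args

from pydantic import BaseModel, ValidationError
from pydantic_core import core_schema

SchemaT = TypeVar("SchemaT")


class PydanticHookSketch:
    """Hypothetical stand-in for the mixin; not pandera's actual code."""

    @classmethod
    def _validate_with(cls, value: Any, schema: Any) -> Any:
        # Validate eagerly: any error raised here surfaces immediately
        # as a pydantic ValidationError when the model is constructed.
        return schema.validate(value)

    @classmethod
    def __get_pydantic_core_schema__(cls, source_type: Any, handler: Any):
        # source_type is the parametrized generic, e.g. Rows[ProductSchema];
        # its first type argument is the schema to validate against.
        schema = get_args(source_type)[0]
        return core_schema.no_info_plain_validator_function(
            functools.partial(cls._validate_with, schema=schema)
        )


class Rows(PydanticHookSketch, Generic[SchemaT]):
    """Toy generic container standing in for a typed DataFrame."""


class ProductSchema:
    """Toy schema: every row must carry an integer price."""

    @staticmethod
    def validate(rows):
        for row in rows:
            if not isinstance(row.get("price"), int):
                raise ValueError(f"price must be an int, got {row.get('price')!r}")
        return rows


class Container(BaseModel):
    data: Rows[ProductSchema]
```

Constructing `Container(data=[{"product": "apple", "price": 3}])` passes the rows through `ProductSchema.validate`, while a row with a non-integer price raises a `ValidationError` at construction time. Because the hook lives in a mixin, several generic types can share it without duplicating the pydantic plumbing.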
Note: We assume that any pyspark dataframe used in a pydantic model will be validated eagerly for both pyspark.pandas and pyspark_sql. The default behavior for pyspark_sql dataframes is normally lazy validation, but this makes less sense to me when using a Pydantic model.
Thanks for the PR @brayan07! Looks like there are some lint and unit test errors. Be sure to run tests and set up pre-commit in your dev env to make sure those are passing.
Still running into issues with tests unrelated to new code locally. Will try to resolve before pushing again. Thanks!
I'm getting the same failed tests locally for the `main` branch, as well as for this branch, with `make nox-conda`. I don't think it's what I added but something in the dev setup. Would it be alright if we ran the CI workflow one more time to help me debug?
Hi @brayan07 sorry for the delayed review on this!
I believe the test errors are coming from `from pydantic import GetCoreSchemaHandler`. Will need to move that import into the `PYDANTIC_V2` conditional
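
A rough sketch of what guarding that import could look like. How `PYDANTIC_V2` is computed here is an assumption for illustration only; the project may define the flag differently and import it from elsewhere.

```python
from importlib.metadata import PackageNotFoundError, version

try:
    # Assumption: the flag reflects the installed pydantic major version.
    PYDANTIC_V2 = int(version("pydantic").split(".")[0]) >= 2
except PackageNotFoundError:
    # pydantic is not installed at all.
    PYDANTIC_V2 = False

if PYDANTIC_V2:
    # GetCoreSchemaHandler only exists in pydantic v2, so importing it
    # unconditionally raises ImportError under pydantic v1.
    from pydantic import GetCoreSchemaHandler
    from pydantic_core import core_schema
else:
    # Placeholders so later references to the names do not fail on v1.
    GetCoreSchemaHandler = None
    core_schema = None
```

With the import inside the conditional, the module still loads in a pydantic v1 environment, and the v2-only hook can be defined under the same flag.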