Cannot create a pydantic model with a `pandera.typing.pyspark.DataFrame` type.
Describe the bug
Pydantic models always throw `is_instance_of` validation errors if a `pandera.typing.pyspark.DataFrame` type is used. Pydantic integration with PySpark DataFrames is broken.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [x] (optional) I have confirmed this bug exists on the master branch of pandera.
Code Sample, a copy-pastable example
import pyspark.sql.types as T
from pandera.pyspark import DataFrameModel, Field
from pandera.typing.pyspark import DataFrame
from pydantic import BaseModel
from pyspark.sql import SparkSession


class SampleSchema(DataFrameModel):
    """
    Sample schema model with data checks.
    """

    product: T.StringType() = Field()
    price: T.IntegerType() = Field()


class PydanticContainer(BaseModel):
    """
    Pydantic container with a DataFrameModel as a field.
    """

    data: DataFrame[SampleSchema]

    class Config:
        arbitrary_types_allowed = True


data = [("Bread", 9), ("Butter", 15)]
schema = T.StructType(
    [
        T.StructField("product", T.StringType()),
        T.StructField("price", T.IntegerType()),
    ]
)

spark = SparkSession.builder.appName("Pandera Pyspark Testing").getOrCreate()
data_df = spark.createDataFrame(data, schema=schema)

# Instantiating the PydanticContainer leads to a ValidationError
my_container = PydanticContainer(data=data_df)
The above leads to the following error:
tests/pyspark/test_scratch.py:38 (test_run)
def test_run():
    spark = SparkSession.builder.appName("Pandera Pyspark Testing").getOrCreate()
    data_df = spark.createDataFrame(data, schema=schema)
>   my_container = PydanticContainer(data=data_df)
E   pydantic_core._pydantic_core.ValidationError: 1 validation error for PydanticContainer
E   data
E     Input should be an instance of DataFrame [type=is_instance_of, input_value=DataFrame[product: string, price: int], input_type=DataFrame]
E     For further information visit https://errors.pydantic.dev/2.5/v/is_instance_of
test_scratch.py:42: ValidationError
Expected behavior
We would expect the `PydanticContainer` to instantiate successfully. The error says that the `DataFrame` we're feeding in is not a `DataFrame`.
Desktop (please complete the following information):
- OS: macOS Ventura 13.5
- Browser: Chrome
- Version: 119.0.6045.199
The pyspark.sql pandera backend does not currently support pydantic types. The current behavior is designed to only work with pyspark types.
Going to change this to an enhancement ticket; it will need discussion with the de facto code owners for the pyspark.sql integration: @NeerajMalhotra-QB @jaskaransinghsidana.
Ah, okay, I misread this issue! You want to use a pandera pyspark.sql schema in your pydantic models, correct? This should actually work, so I'm reverting this to a bug.
Open to contributions for this.
Just looking at the code above, I suspect the issue is your import `from pandera.typing.pyspark import DataFrame`, which might be pointing to `pyspark.pandas.DataFrame` and not the PySpark SQL `DataFrame`. I haven't dug into this, but it appears to be the issue to me.
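If it helps, a quick diagnostic along these lines can show which DataFrame implementation each pandera typing alias is built on. This is a hedged sketch: it assumes both aliases are plain classes in your pandera version (they may not be importable if the corresponding pyspark extras are missing).

# Hypothetical diagnostic: inspect the MRO of each pandera typing alias to see
# whether it subclasses pyspark.pandas.DataFrame or pyspark.sql.DataFrame.
from pandera.typing.pyspark import DataFrame as PysparkTypingDataFrame
from pandera.typing.pyspark_sql import DataFrame as PysparkSqlTypingDataFrame

for alias in (PysparkTypingDataFrame, PysparkSqlTypingDataFrame):
    # The MRO lists every base class, so the underlying DataFrame
    # implementation shows up here.
    print([f"{cls.__module__}.{cls.__qualname__}" for cls in alias.__mro__])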
I get the same error with both `from pandera.typing.pyspark import DataFrame` and `from pandera.typing.pyspark_sql import DataFrame`.
I have a fix working locally and will submit a PR for this in the next couple of days.
Submitted a bugfix in #1447 for review.
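In the meantime, a possible workaround is to annotate the field with the plain `pyspark.sql.DataFrame` type and run the pandera schema inside a pydantic validator. This is a minimal sketch, not the fix in #1447; it assumes pydantic v2, the `SampleSchema` model from the code sample above, and the pandera pyspark behavior of collecting validation errors on the returned dataframe rather than raising.

# Hypothetical workaround sketch: validate with pandera inside a pydantic
# field validator instead of using the pandera typing alias as the annotation.
import pyspark.sql
from pydantic import BaseModel, ConfigDict, field_validator


class PydanticContainerWorkaround(BaseModel):
    model_config = ConfigDict(arbitrary_types_allowed=True)

    data: pyspark.sql.DataFrame

    @field_validator("data", mode="before")
    @classmethod
    def _check_schema(cls, df: pyspark.sql.DataFrame) -> pyspark.sql.DataFrame:
        # Run the pandera pyspark schema; with the pyspark backend, errors are
        # reported on the returned dataframe rather than raised as exceptions.
        return SampleSchema.validate(df)


# Usage, given the data_df created in the code sample above:
# my_container = PydanticContainerWorkaround(data=data_df)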