Cannot create a pydantic model with a `pandera.typing.pyspark.DataFrame` type.

Open brayan07 opened this issue 1 year ago • 5 comments

Describe the bug

Pydantic models always raise `is_instance_of` validation errors when a field is typed as `pandera.typing.pyspark.DataFrame`. The pydantic integration with pyspark dataframes is broken.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [x] (optional) I have confirmed this bug exists on the master branch of pandera.

Code Sample, a copy-pastable example

import pyspark.sql.types as T

from pandera.pyspark import DataFrameModel, Field
from pandera.typing.pyspark import DataFrame
from pydantic import BaseModel
from pyspark.sql import SparkSession


class SampleSchema(DataFrameModel):
    """
    Sample schema model with data checks.
    """

    product: T.StringType() = Field()
    price: T.IntegerType() = Field()


class PydanticContainer(BaseModel):
    """
    Pydantic container with a DataFrameModel as a field.
    """

    data: DataFrame[SampleSchema]

    class Config:
        arbitrary_types_allowed = True


data = [("Bread", 9), ("Butter", 15)]
schema = (
    T.StructType(
        [
            T.StructField("product", T.StringType()),
            T.StructField("price", T.IntegerType()),
        ],
    )
)

spark = SparkSession.builder.appName("Pandera Pyspark Testing").getOrCreate()
data_df = spark.createDataFrame(data, schema=schema)

# Instantiating the PydanticContainer leads to a ValidationError
my_container = PydanticContainer(data=data_df)

The above leads to the following error:

tests/pyspark/test_scratch.py:38 (test_run)
def test_run():
        spark = SparkSession.builder.appName("Pandera Pyspark Testing").getOrCreate()
        data_df = spark.createDataFrame(data, schema=schema)
>       my_container = PydanticContainer(data=data_df)
E       pydantic_core._pydantic_core.ValidationError: 1 validation error for PydanticContainer
E       data
E         Input should be an instance of DataFrame [type=is_instance_of, input_value=DataFrame[product: string, price: int], input_type=DataFrame]
E           For further information visit https://errors.pydantic.dev/2.5/v/is_instance_of

test_scratch.py:42: ValidationError

Expected behavior

We would expect the PydanticContainer to instantiate successfully. Instead, the error says that the DataFrame we're feeding in is not a DataFrame.
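
One possible reading of that message, offered as an assumption rather than something confirmed in this thread: the annotation may resolve to a typing wrapper class that shares the short name `DataFrame` with the runtime class, so an `isinstance` check against the wrapper rejects a plain instance even though both sides print the same name. The sketch below uses hypothetical stand-in classes (`EngineDataFrame`, `TypedDataFrame`); it is not pandera or pyspark code.

```python
class EngineDataFrame:
    """Hypothetical stand-in for the DataFrame class the engine returns."""


class TypedDataFrame(EngineDataFrame):
    """Hypothetical stand-in for a typing wrapper that subclasses it."""


# Give both classes the same short name, mirroring the error message,
# where the expected and received types are both printed as "DataFrame".
EngineDataFrame.__name__ = "DataFrame"
TypedDataFrame.__name__ = "DataFrame"

df = EngineDataFrame()

# A plain instance satisfies a check against its own class...
passes_engine_check = isinstance(df, EngineDataFrame)
# ...but not a check against a subclass of that class.
passes_wrapper_check = isinstance(df, TypedDataFrame)

print(type(df).__name__, passes_engine_check, passes_wrapper_check)
```

If this is what happens here, the validator would need something other than a bare instance check against the annotated type to accept a regular DataFrame.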

Desktop (please complete the following information):

  • OS: macOS Ventura 13.5
  • Browser: Chrome
  • Version: 119.0.6045.199

brayan07 avatar Dec 12 '23 14:12 brayan07

The pyspark.sql pandera backend does not currently support pydantic types. The current behavior is designed to only work with pyspark types.

I'm going to change this to an enhancement ticket. It will need discussion with the de facto code owners for the pyspark.sql integration: @NeerajMalhotra-QB @jaskaransinghsidana.

cosmicBboy avatar Dec 12 '23 15:12 cosmicBboy

Ah, okay I misread this issue! You want to use a pandera pyspark.sql schema in your pydantic models, correct? This should actually work, reverting this to a bug.

Open to contributions for this.

cosmicBboy avatar Dec 12 '23 15:12 cosmicBboy

Just looking at the code above, I suspect the issue is your import `from pandera.typing.pyspark import DataFrame`, which might be pointing to `pyspark.pandas.DataFrame` and not PySpark SQL. I haven't dug into this, but it appears to be the issue to me.
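
A quick way to check that suspicion is to print the class lineage of each `DataFrame` import and see which base class it inherits from. This is a minimal sketch: `class_lineage` is a hypothetical helper written for illustration, and it returns `None` when the module or attribute is not available in the current environment.

```python
import importlib


def class_lineage(module_name, attr="DataFrame"):
    """Return the fully qualified names in the MRO of module_name.attr.

    Hypothetical helper; returns None if the module cannot be imported
    or the attribute is missing or is not a class.
    """
    try:
        cls = getattr(importlib.import_module(module_name), attr)
        return [f"{base.__module__}.{base.__qualname__}" for base in cls.__mro__]
    except (ImportError, AttributeError):
        return None


# The two import paths mentioned in this thread.
for name in ("pandera.typing.pyspark", "pandera.typing.pyspark_sql"):
    print(name, "->", class_lineage(name))
```

The base classes listed for each module show which underlying DataFrame type the annotation is tied to.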

NeerajMalhotra-QB avatar Dec 12 '23 20:12 NeerajMalhotra-QB

I get the same error with both `from pandera.typing.pyspark import DataFrame` and `from pandera.typing.pyspark_sql import DataFrame`.

I have a fix working locally and will submit a PR for this in the next couple of days.

brayan07 avatar Dec 13 '23 14:12 brayan07

Submitted a bugfix in #1447 for review.

brayan07 avatar Dec 15 '23 15:12 brayan07