
Pydantic compatibility issue

Open riziles opened this issue 1 year ago • 1 comments

I believe that the latest versions of Pydantic and Pandera are not fully compatible.

This relates to https://github.com/unionai-oss/pandera/issues/1395, which was closed but which I think should still be open.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.

This code throws an error:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

print(PydanticModel.model_json_schema())

error message:

Exception has occurred: PydanticInvalidForJsonSchema
Cannot generate a JsonSchema for core_schema.PlainValidatorFunctionSchema ({'type': 'no-info', 'function': functools.partial(<bound method DataFrame.pydantic_validate of <class 'pandera.typing.pandas.DataFrame'>>, schema_model=SimpleSchema)})

For further information visit https://errors.pydantic.dev/2.7/u/invalid-for-json-schema
  File "C:\LocalTemp\Repos\RA\RiskCalcs\scratch.py", line 18, in <module>
    print(PydanticModel.model_json_schema())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic.errors.PydanticInvalidForJsonSchema: Cannot generate a JsonSchema for core_schema.PlainValidatorFunctionSchema ({'type': 'no-info', 'function': functools.partial(<bound method DataFrame.pydantic_validate of <class 'pandera.typing.pandas.DataFrame'>>, schema_model=SimpleSchema)})

For further information visit https://errors.pydantic.dev/2.7/u/invalid-for-json-schema

I have tried various config options to get around this error to no avail.

  • OS: Windows
  • Pydantic version: 2.7.3
  • Pandera version: 0.19.3

riziles avatar Jun 10 '24 16:06 riziles

Here is my real hacky workaround (no idea if it is right):

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as _DataFrame, Series

from pydantic_core import core_schema, CoreSchema
from pydantic import GetCoreSchemaHandler, BaseModel
from typing import TypeVar, Generic, Any

T = TypeVar("T")  

class DataFrame(_DataFrame, Generic[T]):

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> CoreSchema:

        schema = source_type().__orig_class__.__args__[0].to_schema()

        type_map = {
            "str": core_schema.str_schema(),
            "int64": core_schema.int_schema(),
            "float64": core_schema.float_schema(),
            "bool": core_schema.bool_schema(),
            "datetime64[ns]": core_schema.datetime_schema()
        }

        # Describe the DataFrame as a list of row dicts, one typed key per column.
        return core_schema.list_schema(
            core_schema.typed_dict_schema(
                {
                    name: core_schema.typed_dict_field(type_map[str(column.dtype)])
                    for name, column in schema.columns.items()
                },
            )
        )


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

class PydanticModel(BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

riziles avatar Jun 10 '24 20:06 riziles

@riziles @cosmicBboy any update on this pydantic compatibility issue with json schema and a possible fix in pandera? I am running into this same error in pandera 0.22.1. Looks like the fix PR did not get merged.

eharkins avatar Jan 23 '25 18:01 eharkins

Looks like https://github.com/unionai-oss/pandera/pull/1704 addresses this, but it still has CI test errors

cosmicBboy avatar Jan 23 '25 19:01 cosmicBboy

Any update on this? This issue blocks generating the docs page for FastAPI.

ragrawal avatar Jan 26 '25 23:01 ragrawal

@ragrawal , you're welcome to take a swing at figuring out why some tests are failing. I don't have the bandwidth to work on this right now.

riziles avatar Jan 27 '25 01:01 riziles

@riziles, I looked into the PR but was not able to get it working. I am having trouble setting up the development environment. I also don't think the PR is generic enough; it tries to handle a very special case. I don't have an in-depth understanding of pydantic or pandera, so I would appreciate it if someone could suggest another hack to get past the above issue.

ragrawal avatar Jan 28 '25 00:01 ragrawal

@ragrawal ,

This works:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series

from pydantic import BaseModel, WithJsonSchema
from typing import Annotated

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]


class PydanticModel3(BaseModel):
    y: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ]



@app.post("/input_api")
def input_this(pm3: PydanticModel3) -> list[str]:
    return pm3.y["str_col"].to_list()
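
For anyone wondering why the annotation helps: as far as I can tell, the original error arises because the field is validated by a plain function, which gives pydantic nothing to derive a JSON schema from, and `WithJsonSchema` supplies one by hand. Below is a minimal sketch of that mechanism with pandera taken out of the picture. `Opaque`, `Broken`, and `Patched` are hypothetical names made up for illustration, and the `{"type": "object"}` schema is a placeholder, not what `SimpleSchema.to_json_schema()` returns.

```python
from typing import Annotated, Any

from pydantic import BaseModel, GetCoreSchemaHandler, WithJsonSchema
from pydantic.errors import PydanticInvalidForJsonSchema
from pydantic_core import core_schema


class Opaque:
    """Hypothetical stand-in for a type that is validated by a plain function."""

    def __init__(self, value: Any) -> None:
        self.value = value

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        # A plain validator function carries no type information,
        # so pydantic cannot turn it into a JSON schema on its own.
        return core_schema.no_info_plain_validator_function(cls._validate)

    @classmethod
    def _validate(cls, value: Any) -> "Opaque":
        return value if isinstance(value, cls) else cls(value)


class Broken(BaseModel):
    payload: Opaque


class Patched(BaseModel):
    # Placeholder schema (an assumption for this sketch), supplied by hand.
    payload: Annotated[Opaque, WithJsonSchema({"type": "object"})]


try:
    Broken.model_json_schema()
    failed = False
except PydanticInvalidForJsonSchema:
    failed = True

patched_schema = Patched.model_json_schema()
print(failed, patched_schema["properties"]["payload"])
```

Here `failed` ends up `True` while `Patched` produces a schema, which mirrors the before and after behaviour reported in this thread.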

riziles avatar Jan 28 '25 01:01 riziles

...if you specify a to_format in your Pandera config, then you can output a dataframe, too:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series

from pydantic import BaseModel, WithJsonSchema
from typing import Annotated

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

    class Config:
        to_format = "dict"


class PydanticModel3(BaseModel):
    y: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ]



@app.post("/input_api")
def input_this(pm3: PydanticModel3) -> PydanticModel3:
    return pm3

riziles avatar Jan 28 '25 01:01 riziles

... also, you can just use Annotated directly with FastAPI. You don't need to nest it in a Pydantic object:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

    class Config:
        to_format = "dict"


@app.post("/input_api")
def input_this(
    pm3: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ],
) -> Annotated[
    DataFrame[SimpleSchema],
    WithJsonSchema(SimpleSchema.to_json_schema()),
]:
    return pm3

riziles avatar Jan 28 '25 01:01 riziles

Thanks @riziles, this works great. Do you know how I can provide input data in "records" format? I tried adding

from_format = "dict"
from_format_kwargs = {"orient": "records"}

However, I got this error message: "Value error, Expected 'index', 'columns' or 'tight' for orient parameter. Got 'records' instead".

Below is my full code:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    col1: Series[str]
    col2: Series[int]

    class Config:
        to_format = "dict"
        from_format = "dict"
        from_format_kwargs = {"orient": 'records'}



@app.post("/input_api")
def input_this(
    pm3: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ],
) -> Annotated[
    DataFrame[SimpleSchema],
    WithJsonSchema(SimpleSchema.to_json_schema()),
]:
    return pm3
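
One possible way around the orient error, sketched under the assumption that `from_format = "dict"` ends up in pandas' `DataFrame.from_dict` (which would explain why only 'index', 'columns' or 'tight' are accepted): reshape the records into a column-oriented dict before they reach the schema. `records_to_columns` is a hypothetical helper, not part of pandera or pandas, and it assumes every record has the same keys.

```python
from typing import Any


def records_to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot a list of row dicts into a dict of column lists."""
    if not records:
        return {}
    expected_keys = set(records[0])
    columns: dict[str, list[Any]] = {key: [] for key in records[0]}
    for position, row in enumerate(records):
        # Refuse ragged input rather than silently misaligning columns.
        if set(row) != expected_keys:
            raise ValueError(f"record {position} has different keys: {sorted(row)}")
        for key, value in row.items():
            columns[key].append(value)
    return columns


records = [{"col1": "a", "col2": 1}, {"col1": "b", "col2": 2}]
columns = records_to_columns(records)
print(columns)
```

The result is the default "columns" layout, so no `from_format_kwargs` should be needed once the payload is in this shape.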

ragrawal avatar Jan 28 '25 18:01 ragrawal

@ragrawal , I'd recommend creating your own custom Pydantic class to read in whatever format you want if you don't want to use Pandera's default config. For example, something like this:

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema, BaseModel

from fastapi import FastAPI

app = FastAPI()

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]
    str_col2: Series[str]

    class Config:
        to_format = "dict"

class InputModel(BaseModel):
    str_col: str
    str_col2: str

@app.post("/input_api")
def input_this(
    pm3: list[InputModel],
) -> list[str]:
    
    df = DataFrame[SimpleSchema](pd.DataFrame([vars(i) for i in pm3]))

    print(pm3)
    print(type(df))
    return df["str_col2"].to_list()

riziles avatar Jan 28 '25 20:01 riziles

@ragrawal , can we close this issue?

riziles avatar Jan 30 '25 13:01 riziles

Sure..appreciate your help on this.

ragrawal avatar Jan 30 '25 18:01 ragrawal

@riziles I think the issue is still relevant despite the above workaround, since ideally pandera would work without special annotations when generating schemas in pydantic and FastAPI.

eharkins avatar Jan 30 '25 18:01 eharkins

Wait a second. Just realizing that I opened this issue. I'm closing it as resolved because this project is awesome and @cosmicBboy probably has better things to work on.

riziles avatar Jan 30 '25 22:01 riziles

Agree with @eharkins. FastAPI is already very popular and it is likely to become the most popular python web framework in the future. I believe that having full compatibility on documentation generation would be beneficial for pandera usage in production environments.

alejandro-yousef avatar Jan 31 '25 09:01 alejandro-yousef

@riziles let's open it back up! There's a WIP PR that addresses it https://github.com/unionai-oss/pandera/pull/1704 but there are still some unit test issues on it.

@imseananriley not sure if you still have capacity to work on this, if not perhaps someone on the thread can look into making tests pass

cosmicBboy avatar Jan 31 '25 18:01 cosmicBboy

imseanriley is preoccupied at the moment. I might be able to throw some resources at it this summer, but I'd much rather focus on killing the Pandas dependency. We're very intent on migrating to Polars/Lance/DuckDB. Right now there is a competing project that has better Polars support: https://github.com/JakobGM/patito . I'd prefer to leave our Pandera models intact, but not if I have to keep Pandas in our containers.

riziles avatar Jan 31 '25 21:01 riziles

thanks @riziles, let me digest this feedback. It might be time to do pandera 1.0 and force users to install pandas so that it's not a core dependency.

> Right now there is a competing project that has better Polars support

What are some of the deltas you see in patito that are missing in the pandera-polars integration?

cosmicBboy avatar Jan 31 '25 21:01 cosmicBboy

In the meantime I'll look into fixing up #1704 to unblock this issue.

cosmicBboy avatar Jan 31 '25 21:01 cosmicBboy

> What are some of the deltas you see in patito that are missing in the pandera-polars integration?

It's just the removal of the Pandas dependency. Pandas is a heavy package that takes up a lot of space when spinning up environments and slows down start times if it needs to be imported.
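
A rough way to check the start-up cost on your own machine is to time a cold import. This is only a sketch: `time_import` is a hypothetical helper, the number depends heavily on hardware and disk cache, and a module that is already loaded returns almost instantly. `json` is used below so the snippet runs anywhere; swap in `pandas` to measure it.

```python
import importlib
import sys
import time


def time_import(module_name: str) -> float:
    """Return the seconds spent importing module_name in this process."""
    already_loaded = module_name in sys.modules
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    if already_loaded:
        # Cached in sys.modules, so this is not a cold-start measurement.
        print(f"{module_name} was already imported; timing is not meaningful")
    return elapsed


elapsed = time_import("json")
print(f"import json took {elapsed:.4f}s")
```

CPython's `-X importtime` flag gives a per-module breakdown if more detail is needed.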

riziles avatar Feb 01 '25 05:02 riziles

@ragrawal , I just discovered @cosmicBboy 's PydanticModel adapter here: https://pandera.readthedocs.io/en/stable/pydantic_integration.html#using-pydantic-models-in-pandera-schemas

Easier way to do what you are looking for:

import pandera as pa
from pandera.typing import DataFrame as DataFrame
from pydantic import BaseModel, TypeAdapter
from pandera.engines.pandas_engine import PydanticModel

from fastapi import FastAPI

app = FastAPI()

class InputModel(BaseModel):
    str_col: str
    str_col2: str


class SimpleSchema(pa.DataFrameModel):
    class Config:  # type: ignore
        dtype = PydanticModel(InputModel)
        coerce = True


@app.post("/test")
def input_this(pm3: list[InputModel]) -> list[str]:
    df = DataFrame[SimpleSchema](TypeAdapter(list[InputModel]).dump_python(pm3))

    return df["str_col2"].to_list()

riziles avatar Feb 02 '25 19:02 riziles

Hi @riziles, thanks for the suggestion. I have used PydanticModel before and had two concerns:

  1. I am not sure which of PydanticModel and DataFrameModel is more natively supported within Pandera. I couldn't find good documentation on the differences or similarities between the two. I also read somewhere that PydanticModel tends to be slower because it evaluates one row at a time.
  2. I feel PydanticModel has a lot of overhead. For instance, instead of a single schema, I have to define two different schemas: InputModel and SimpleSchema. As the number of schemas grows, this becomes a problem. With DataFrameModel, I only have to define a single schema, which looks cleaner.

ragrawal avatar Feb 03 '25 16:02 ragrawal

@ragrawal , if you want to input row wise data, there's always going to be more overhead. The whole reason Pandas, Polars, Arrow, Lance and DuckDB are so fast is that the data is stored in column vectors.

riziles avatar Feb 03 '25 18:02 riziles

fixed by #1904

cosmicBboy avatar Feb 12 '25 15:02 cosmicBboy

Hi @cosmicBboy, now that #1904 is merged, how can the solution below be simplified?

import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema

from fastapi import FastAPI

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

    class Config:
        to_format = "dict"


@app.post("/input_api")
def input_this(
    pm3: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ],
) -> Annotated[
    DataFrame[SimpleSchema],
    WithJsonSchema(SimpleSchema.to_json_schema()),
]:
    return pm3

ragrawal avatar Feb 12 '25 18:02 ragrawal

@ragrawal I can test it out and see if we can simplify. Can you share full repro code on starting the server and making a call to the /input_api endpoint?

cosmicBboy avatar Feb 12 '25 18:02 cosmicBboy

If I understand correctly, the above code is a workaround to enable proper JSON schema generation for the OpenAPI docs. E.g., running the above code stored in app.py with fastapi dev app.py and checking http://127.0.0.1:8000/docs results in a working docs page. With pydantic v1, a simpler definition worked:

from typing import Annotated
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series

from fastapi import FastAPI, Body

app = FastAPI()


class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]


@app.post("/input_api")
def input_this(
    pm3: Annotated[DataFrame[SimpleSchema], Body()],
) -> DataFrame[SimpleSchema]:
    return pm3

vilmar-hillow avatar Feb 13 '25 15:02 vilmar-hillow

> It's just the removal of the Pandas dependency. Pandas is a heavy package that takes up a lot of space when spinning up environments and slows down start times if it needs to be imported.

Hey @riziles just to follow up here: I made a PR that removes the pandas dependency from polars, and makes it the user's responsibility to install pandas explicitly (or use the pandera[pandas] extra)

https://github.com/unionai-oss/pandera/pull/1926

cosmicBboy avatar Mar 07 '25 15:03 cosmicBboy

Thank you @cosmicBboy !

riziles avatar Mar 07 '25 17:03 riziles