Pydantic compatibility issue
I believe that the latest versions of Pydantic and Pandera are not fully compatible.
This relates to https://github.com/unionai-oss/pandera/issues/1395, which was closed but which I think should still be open.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
This code throws an error:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame, Series
import pydantic

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str] = pa.Field(unique=True)

class PydanticModel(pydantic.BaseModel):
    x: int
    df: DataFrame[SimpleSchema]

print(PydanticModel.model_json_schema())
error message:
Exception has occurred: PydanticInvalidForJsonSchema
Cannot generate a JsonSchema for core_schema.PlainValidatorFunctionSchema ({'type': 'no-info', 'function': functools.partial(<bound method DataFrame.pydantic_validate of <class 'pandera.typing.pandas.DataFrame'>>, schema_model=SimpleSchema)})
For further information visit https://errors.pydantic.dev/2.7/u/invalid-for-json-schema
File "C:\LocalTemp\Repos\RA\RiskCalcs\scratch.py", line 18, in <module>
print(PydanticModel.model_json_schema())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic.errors.PydanticInvalidForJsonSchema: Cannot generate a JsonSchema for core_schema.PlainValidatorFunctionSchema ({'type': 'no-info', 'function': functools.partial(<bound method DataFrame.pydantic_validate of <class 'pandera.typing.pandas.DataFrame'>>, schema_model=SimpleSchema)})
For further information visit https://errors.pydantic.dev/2.7/u/invalid-for-json-schema
I have tried various config options to get around this error to no avail.
- OS: Windows
- Pydantic version: 2.7.3
- Pandera version: 0.19.3
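For context, the error does not appear to be specific to Pandera: Pydantic v2 cannot derive a JSON schema for any type whose core schema is a plain validator function. The sketch below reproduces this with a hypothetical `Opaque` class (not part of Pandera or Pydantic) and shows that attaching an explicit schema with `WithJsonSchema` avoids the error. It assumes Pydantic 2.x.

```python
from typing import Annotated, Any

from pydantic import BaseModel, GetCoreSchemaHandler, WithJsonSchema
from pydantic.errors import PydanticInvalidForJsonSchema
from pydantic_core import core_schema


class Opaque:
    """Hypothetical stand-in for a type validated by a plain function."""

    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> core_schema.CoreSchema:
        # A plain validator describes how to validate a value, but not
        # what shape the value has, so no JSON schema can be derived.
        return core_schema.no_info_plain_validator_function(lambda value: value)


class Broken(BaseModel):
    payload: Opaque


class Patched(BaseModel):
    # The explicit JSON schema replaces the one pydantic cannot infer.
    payload: Annotated[Opaque, WithJsonSchema({"type": "object"})]


try:
    Broken.model_json_schema()
    broken_raises = False
except PydanticInvalidForJsonSchema:
    broken_raises = True

patched_schema = Patched.model_json_schema()
print(broken_raises, patched_schema["properties"]["payload"])
```

The placeholder schema `{"type": "object"}` is an assumption made for brevity; a real model would want a schema that describes the actual payload.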
Here is my real hacky workaround (no idea if it is right):
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as _DataFrame, Series
from pydantic_core import core_schema, CoreSchema
from pydantic import GetCoreSchemaHandler, BaseModel
from typing import TypeVar, Generic, Any

T = TypeVar("T")

class DataFrame(_DataFrame, Generic[T]):
    @classmethod
    def __get_pydantic_core_schema__(
        cls, source_type: Any, handler: GetCoreSchemaHandler
    ) -> CoreSchema:
        schema = source_type().__orig_class__.__args__[0].to_schema()
        type_map = {
            "str": core_schema.str_schema(),
            "int64": core_schema.int_schema(),
            "float64": core_schema.float_schema(),
            "bool": core_schema.bool_schema(),
            "datetime64[ns]": core_schema.datetime_schema(),
        }
        return core_schema.list_schema(
            core_schema.typed_dict_schema(
                {
                    name: core_schema.typed_dict_field(type_map[str(column.dtype)])
                    for name, column in schema.columns.items()
                },
            )
        )

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

class PydanticModel(BaseModel):
    x: int
    df: DataFrame[SimpleSchema]
@riziles @cosmicBboy any update on this pydantic compatibility issue with json schema and a possible fix in pandera? I am running into this same error in pandera 0.22.1. Looks like the fix PR did not get merged.
Looks like https://github.com/unionai-oss/pandera/pull/1704 addresses this, but it still has CI test errors
Any update on this? This issue blocks generating the docs page for FastAPI.
@ragrawal , you're welcome to take a swing at figuring out why some tests are failing. I don't have the bandwidth to work on this right now.
@riziles -- I looked into the PR but was not able to get it working. I am having trouble setting up the development environment. Also, I don't think the PR is generic enough; it tries to handle a very special case. I don't have an in-depth understanding of pydantic or pandera. I would appreciate it if someone could suggest another hack to get past the above issue.
@ragrawal ,
This works:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from pydantic import BaseModel, WithJsonSchema
from typing import Annotated
from fastapi import FastAPI

app = FastAPI()

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

class PydanticModel3(BaseModel):
    y: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ]

@app.post("/input_api")
def input_this(pm3: PydanticModel3) -> list[str]:
    return pm3.y["str_col"].to_list()
...if you specify a to_format in your Pandera config then you can output a dataframe, too:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from pydantic import BaseModel, WithJsonSchema
from typing import Annotated
from fastapi import FastAPI

app = FastAPI()

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

    class Config:
        to_format = "dict"

class PydanticModel3(BaseModel):
    y: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ]

@app.post("/input_api")
def input_this(pm3: PydanticModel3) -> PydanticModel3:
    return pm3
... also, you can just use Annotated directly with FastAPI. You don't need to nest it in a Pydantic object:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema
from fastapi import FastAPI

app = FastAPI()

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

    class Config:
        to_format = "dict"

@app.post("/input_api")
def input_this(
    pm3: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ],
) -> Annotated[
    DataFrame[SimpleSchema],
    WithJsonSchema(SimpleSchema.to_json_schema()),
]:
    return pm3
Thanks @riziles .. this works great. Do you know how I can provide input data in "records" format? I tried adding
from_format = "dict"
from_format_kwargs = {"orient": "records"}
However, I got this error message: "Value error, Expected 'index', 'columns' or 'tight' for orient parameter. Got 'records' instead".
Below is my full code:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema
from fastapi import FastAPI

app = FastAPI()

class SimpleSchema(pa.DataFrameModel):
    col1: Series[str]
    col2: Series[int]

    class Config:
        to_format = "dict"
        from_format = "dict"
        from_format_kwargs = {"orient": "records"}

@app.post("/input_api")
def input_this(
    pm3: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ],
) -> Annotated[
    DataFrame[SimpleSchema],
    WithJsonSchema(SimpleSchema.to_json_schema()),
]:
    return pm3
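The wording of that error matches the `ValueError` raised by `pandas.DataFrame.from_dict`, which suggests (an assumption, not checked against Pandera's source) that `from_format = "dict"` forwards `from_format_kwargs` to it. `from_dict` only accepts the `'columns'`, `'index'` and `'tight'` orientations, whereas the plain `DataFrame` constructor accepts a list of row dicts directly. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up rows in "records" layout: one dict per row.
records = [{"col1": "a", "col2": 1}, {"col1": "b", "col2": 2}]

# from_dict rejects orient="records" before it looks at the data.
try:
    pd.DataFrame.from_dict(records, orient="records")
    error = None
except ValueError as exc:
    error = str(exc)

# The DataFrame constructor handles a list of row dicts on its own.
df = pd.DataFrame(records)

print(error)
print(df)
```

So "records" is a valid orientation for `DataFrame.to_dict`, but not for `DataFrame.from_dict`, which is why the same keyword works on output and fails on input.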
@ragrawal , I'd recommend creating your own custom Pydantic class to read in whatever format you want if you don't want to use Pandera's default config. For example, something like this:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema, BaseModel
from fastapi import FastAPI

app = FastAPI()

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]
    str_col2: Series[str]

    class Config:
        to_format = "dict"

class InputModel(BaseModel):
    str_col: str
    str_col2: str

@app.post("/input_api")
def input_this(
    pm3: list[InputModel],
) -> list[str]:
    df = DataFrame[SimpleSchema](pd.DataFrame([vars(i) for i in pm3]))
    print(pm3)
    print(type(df))
    return df["str_col2"].to_list()
@ragrawal , can we close this issue?
Sure..appreciate your help on this.
@riziles I think the issue is still relevant despite the above workaround since ideally pandera would work without special annotation when generating schema in pydantic and fastapi
Wait a second. Just realizing that I opened this issue. I'm closing it as resolved because this project is awesome and @cosmicBboy probably has better things to work on.
Agree with @eharkins. FastAPI is already very popular and it is likely to become the most popular python web framework in the future. I believe that having full compatibility on documentation generation would be beneficial for pandera usage in production environments.
@riziles let's open it back up! There's a WIP PR that addresses it https://github.com/unionai-oss/pandera/pull/1704 but there are still some unit test issues on it.
@imseananriley not sure if you still have capacity to work on this, if not perhaps someone on the thread can look into making tests pass
imseanriley is preoccupied at the moment. I might be able to throw some resources at it this summer, but I'd much rather focus on killing the Pandas dependency. We're very intent on migrating to Polars/Lance/DuckDB. Right now there is a competing project that has better Polars support: https://github.com/JakobGM/patito . I'd prefer to leave our Pandera models intact, but not if I have to keep Pandas in our containers.
thanks @riziles, let me digest this feedback. It might be time to do pandera 1.0 and force users to install pandas so that it's not a core dependency.
> Right now there is a competing project that has better Polars support

What are some of the deltas you see in patito that are missing in the pandera-polars integration?
In the meantime I'll look into fixing up #1704 to unblock this issue.
> What are some of the deltas you see in patito that are missing in the pandera-polars integration?

It's just the removal of the Pandas dependency. Pandas is a heavy package that takes up a lot of space when spinning up environments and slows down start times if it needs to be imported.
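One common way for a library to avoid paying for a heavy dependency at import time is to look it up lazily. The helper below is an illustrative sketch only; `optional_import` is a hypothetical name and this is not a description of how Pandera is implemented.

```python
import importlib
import importlib.util
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Import a top-level module only if it is installed, else return None."""
    # find_spec locates the module without executing it, so a missing
    # package costs almost nothing and raises no ImportError.
    if importlib.util.find_spec(name) is None:
        return None
    return importlib.import_module(name)


# pandas is imported here only if it is installed in the environment.
pd = optional_import("pandas")
print("pandas available:", pd is not None)
```

Code paths that never touch pandas then work in an environment where it is absent, at the cost of a `None` check wherever the module is used.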
@ragrawal , I just discovered @cosmicBboy 's PydanticModel adapter here:
https://pandera.readthedocs.io/en/stable/pydantic_integration.html#using-pydantic-models-in-pandera-schemas
Easier way to do what you are looking for:
import pandera as pa
from pandera.typing import DataFrame as DataFrame
from pydantic import BaseModel, TypeAdapter
from pandera.engines.pandas_engine import PydanticModel
from fastapi import FastAPI

app = FastAPI()

class InputModel(BaseModel):
    str_col: str
    str_col2: str

class SimpleSchema(pa.DataFrameModel):
    class Config:  # type: ignore
        dtype = PydanticModel(InputModel)
        coerce = True

@app.post("/test")
def input_this(pm3: list[InputModel]) -> list[str]:
    df = DataFrame[SimpleSchema](TypeAdapter(list[InputModel]).dump_python(pm3))
    return df["str_col2"].to_list()
Hi @riziles -- Thanks for the suggestion. I have used PydanticModel before and had two concerns:
- I am not sure which of PydanticModel and DataFrameModel is more natively supported within Pandera. I couldn't find good documentation on the differences or similarities between the two. I also read somewhere that PydanticModel tends to be slower because it evaluates one row at a time.
- I feel PydanticModel has a lot of overhead. For instance, instead of a single schema, I have to define two different schemas: InputModel and SimpleSchema. As the number of schemas grows, this becomes a problem. With DataFrameModel, I only have to define a single schema, which looks cleaner.
@ragrawal , if you want to input row wise data, there's always going to be more overhead. The whole reason Pandas, Polars, Arrow, Lance and DuckDB are so fast is that the data is stored in column vectors.
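To illustrate the difference, row-wise records have to be pivoted into one sequence per column before a columnar engine can work on them. A pure-Python sketch of that pivot, using the column names from the examples above (`rows_to_columns` is a hypothetical helper, not a Pandera or pandas function):

```python
from collections import defaultdict
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot a list of row dicts into a dict of column lists."""
    columns: dict[str, list[Any]] = defaultdict(list)
    # Every cell is visited once: this pass is the extra cost of
    # accepting row-wise input.
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


rows = [
    {"str_col": "a", "str_col2": "x"},
    {"str_col": "b", "str_col2": "y"},
]
columns = rows_to_columns(rows)
print(columns)
```

Once the data is in this layout, a per-column check can run over a single contiguous list instead of reaching into every row.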
fixed by #1904
hi @cosmicBboy -- with #1904 now merged, I'm wondering how to simplify the solution below:
import pandas as pd
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from typing import Annotated
from pydantic import WithJsonSchema
from fastapi import FastAPI

app = FastAPI()

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

    class Config:
        to_format = "dict"

@app.post("/input_api")
def input_this(
    pm3: Annotated[
        DataFrame[SimpleSchema],
        WithJsonSchema(SimpleSchema.to_json_schema()),
    ],
) -> Annotated[
    DataFrame[SimpleSchema],
    WithJsonSchema(SimpleSchema.to_json_schema()),
]:
    return pm3
@ragrawal I can test it out and see if we can simplify. Can you share full repro code on starting the server and making a call to the /input_api endpoint?
If I understand correctly, the above code is a workaround to enable proper JSON schema generation for the OpenAPI docs. E.g. running the above code stored in app.py with fastapi dev app.py and checking http://127.0.0.1:8000/docs results in a working docs page. With pydantic v1, a simpler definition worked:
from typing import Annotated
import pandera as pa
from pandera.typing import DataFrame as DataFrame, Series
from fastapi import FastAPI, Body

app = FastAPI()

class SimpleSchema(pa.DataFrameModel):
    str_col: Series[str]

@app.post("/input_api")
def input_this(
    pm3: Annotated[DataFrame[SimpleSchema], Body()],
) -> DataFrame[SimpleSchema]:
    return pm3
> It's just the removal of the Pandas dependency. Pandas is a heavy package that takes up a lot of space when spinning up environments and slows down start times if it needs to be imported.

Hey @riziles just to follow up here: I made a PR that removes the pandas dependency from polars, and makes it the user's responsibility to install pandas explicitly (or use the pandera[pandas] extra)
https://github.com/unionai-oss/pandera/pull/1926
Thank you @cosmicBboy !