pandera
Can you use Pydantic Field Aliasing with Pandera / PydanticModel schema definitions?
How to use Pydantic Field Alias with pandera
I am processing a CSV and I am trying to use Pandera to validate the data. The names in the CSV header row are not what I want the names in my model to be. I haven't figured out how to achieve field aliasing. Any suggestions?
Here is a snippet that reproduces the error I am getting.
import io
import pydantic
import pandas as pd
import pandera as pa
from pandera.engines.pandas_engine import PydanticModel

class AliasedRecord(pydantic.BaseModel):
    name: str = pydantic.Field(alias="Name")
    amt_in_local: float = pydantic.Field(alias="Amount in local currency")

class AliasDFSchema(pa.DataFrameModel):
    """Pandera schema using the pydantic model."""
    class Config:
        """Config with dataframe-level data type."""
        dtype = PydanticModel(AliasedRecord)
        strict = True
        coerce = True  # this is required, otherwise a SchemaInitError is raised

# Direct Pydantic Model Validation
ar_m = AliasedRecord.model_validate({"Name": "Foo", "Amount in local currency": 1.32})
print(f"My Model is: {ar_m}")

# Now try validating a DataFrame
# Generate data similar to the source CSV
f = io.StringIO('Name,Amount in local currency\nfoo,1.32\nbar,3.34')
df1 = pd.read_csv(f)
validated_df = AliasDFSchema(df1)
Output
The successful Model:
My Model is: name='Foo' amt_in_local=1.32
The DataFrame / Pandera error ...
... bunch of stuff removed for brevity
SchemaError: column 'Name' not in DataFrameSchema {}
df1 is correctly created
Looks like PydanticModel doesn't interact well with strict=True. This works:
class AliasDFSchema(pa.DataFrameModel):
    """Pandera schema using the pydantic model."""
    class Config:
        """Config with dataframe-level data type."""
        dtype = PydanticModel(AliasedRecord)
        coerce = True  # this is required, otherwise a SchemaInitError is raised
One potential fix would be to update the DataFrameSchema.__init__ method to special-case dtype = PydanticModel: pull the column names/aliases out of the pydantic model and build a column dictionary from them.
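A minimal sketch of the "pull out the column names/aliases" step, assuming pydantic v2 (the snippet above already uses model_validate). columns_from_model is a hypothetical helper name, not part of pandera or pydantic, and it only shows how the alias-to-type mapping could be derived; it does not touch DataFrameSchema.__init__.

```python
import pydantic


class AliasedRecord(pydantic.BaseModel):
    name: str = pydantic.Field(alias="Name")
    amt_in_local: float = pydantic.Field(alias="Amount in local currency")


def columns_from_model(model):
    """Hypothetical helper: map each expected column name to its annotation.

    The column name is the field's alias when one is set, otherwise the
    field name itself.
    """
    return {
        (info.alias or field_name): info.annotation
        for field_name, info in model.model_fields.items()
    }


columns = columns_from_model(AliasedRecord)
print(columns)
```

A dictionary like this could then be used to decide which dataframe columns a strict schema should accept.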
Turning this into a bug issue in case anyone wants to open a PR!
I would like to have a crack at this please
One thing that would be nice to add to the pandera/pydantic integration is support for outputting field aliases, for example something like PydanticModel(AliasedRecord, by_alias=True). Otherwise I don't think we're able to output a validated dataframe with aliased column names.
As an example of what I'm talking about:
import pandas as pd
import pandera as pa
from pandera.engines.pandas_engine import PydanticModel
from pydantic import BaseModel, Field

class Schema(pa.DataFrameModel):
    col_2020: pa.typing.Series[int] = pa.Field(alias="Col 2020")

df = pd.DataFrame({"Col 2020": [99, 100]})
print(Schema.validate(df))
#    Col 2020
# 0        99
# 1       100

class SchemaRow(BaseModel):
    col_2020: int = Field(..., alias="Col 2020")

class PydanticSchema(pa.DataFrameModel):
    class Config:
        dtype = PydanticModel(SchemaRow)
        coerce = True

print(PydanticSchema.validate(df))
#    col_2020
# 0        99
# 1       100
If you build a PydanticModel+Pandera equivalent of a standard Pandera model with an alias, the validation behavior differs: the standard Pandera model retains the column alias, whereas the PydanticModel+Pandera version reverts from the field alias to the field name. Because of this, I had to abandon the convenient @pa.check_types decorator for some functions in an app I'm working on.
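One possible workaround, sketched here under assumptions rather than taken from pandera itself, is to rename the columns back to their aliases after validation. restore_aliases is a hypothetical helper, and the dataframe below is a stand-in for the output of PydanticSchema.validate(df) from the example above, so the sketch runs without pandera. It assumes pydantic v2.

```python
import pandas as pd
from pydantic import BaseModel, Field


class SchemaRow(BaseModel):
    col_2020: int = Field(..., alias="Col 2020")


def restore_aliases(df, model):
    """Hypothetical helper: rename field-name columns back to their aliases."""
    name_to_alias = {
        field_name: info.alias
        for field_name, info in model.model_fields.items()
        if info.alias  # fields without an alias keep their name
    }
    # rename returns a new dataframe; the input is left unchanged
    return df.rename(columns=name_to_alias)


# Stand-in for the validated output, which uses field names as columns.
validated = pd.DataFrame({"col_2020": [99, 100]})
restored = restore_aliases(validated, SchemaRow)
print(list(restored.columns))
```

This does not help with @pa.check_types directly, since the decorator returns the validated dataframe as-is; the rename would have to happen in the calling code.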