Time-agnostic DateTime with pandera-native polars datatype using DataFrameModel not working
Using dtype_kwargs with the pandera.engines.polars_engine.DateTime dtype, as demonstrated in the documentation examples, does not seem to work when used with a DataFrameModel.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera -> using 0.19.2.
- [ ] (optional) I have confirmed this bug exists on the main branch of pandera.
I tried all 3 variations of the DataFrameModel, as well as a DataFrameSchema, and only the latter seems to work as expected:
Code Sample:
import datetime as dt
from typing import Annotated

import pandera.polars as pa
import polars as pl
from pandera.engines import polars_engine as pe

# Sample frame whose datetime column carries a UTC time zone.
df = pl.DataFrame(
    schema={
        "column_1": pl.Utf8,
        "column_2": pl.Datetime(time_zone="UTC"),
    },
    data={
        "column_1": ["value1", "value2"],
        "column_2": [dt.datetime(2024, 1, 1), dt.datetime(2024, 1, 2)],
    },
)


class SchemaDataFrameModel1(pa.DataFrameModel):
    column_1: str
    column_2: dt.datetime


class SchemaDataFrameModel2(pa.DataFrameModel):
    column_1: str
    column_2: pe.DateTime = pa.Field(dtype_kwargs={"time_zone_agnostic": True})


class SchemaDataFrameModel3(pa.DataFrameModel):
    column_1: str
    column_2: Annotated[pe.DateTime, True]


schema_dataframeschema_1 = pa.DataFrameSchema(
    {
        "column_1": pa.Column(str),
        "column_2": pa.Column(dt.datetime),
    }
)

schema_dataframeschema_2 = pa.DataFrameSchema(
    {
        "column_1": pa.Column(str),
        "column_2": pa.Column(pe.DateTime(time_zone_agnostic=True)),
    }
)

cases = {
    "DataFrameModel (Field) - without `time_zone_agnostic=True`": SchemaDataFrameModel1,
    "DataFrameModel (Field) - with `time_zone_agnostic=True": SchemaDataFrameModel2,
    "DataFrameModel (Annotated)": SchemaDataFrameModel3,
    "DataFrameSchema - without `time_zone_agnostic=True`": schema_dataframeschema_1,
    "DataFrameSchema - with `time_zone_agnostic=True`": schema_dataframeschema_2,
}

for case, schema in cases.items():
    print(f"Case: {case}")
    try:
        # The model entries are classes, so test with issubclass rather than
        # comparing type(schema), which would be the metaclass.
        if isinstance(schema, type) and issubclass(schema, pa.DataFrameModel):
            schema.to_schema().validate(df)
        else:
            schema.validate(df)
        print("\t✅ Validation successful")
    except Exception as e:
        print(f"\t❌ Validation Failed: {e}")
Output:
Case: DataFrameModel (Field) - without `time_zone_agnostic=True`
❌ Validation Failed: expected column 'column_2' to have type Datetime(time_unit='us', time_zone=None), got Datetime(time_unit='us', time_zone='UTC')
Case: DataFrameModel (Field) - with `time_zone_agnostic=True
❌ Validation Failed: 'Datetime' object is not callable
Case: DataFrameModel (Annotated)
❌ Validation Failed: Annotation 'DateTime' requires all positional arguments ['time_zone_agnostic', 'time_zone', 'time_unit'].
Case: DataFrameSchema - without `time_zone_agnostic=True`
❌ Validation Failed: expected column 'column_2' to have type Datetime(time_unit='us', time_zone=None), got Datetime(time_unit='us', time_zone='UTC')
Case: DataFrameSchema - with `time_zone_agnostic=True`
✅ Validation successful
Expected behaviour
I expect the validation to fail on schemas that don't provide time_zone_agnostic=True (which is the case), and for it to pass validation when setting time_zone_agnostic=True.
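To make the expected semantics concrete, here is a minimal sketch that does not use pandera at all. `DatetimeType` and `dtype_matches` are hypothetical names invented for illustration, and ignoring only the time zone while still comparing the time unit is my assumption about what "time-zone agnostic" should mean, not a description of pandera's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatetimeType:
    """Hypothetical stand-in for a datetime dtype (not a pandera or polars class)."""

    time_unit: str = "us"
    time_zone: Optional[str] = None


def dtype_matches(
    expected: DatetimeType,
    actual: DatetimeType,
    time_zone_agnostic: bool = False,
) -> bool:
    """Compare two dtypes, optionally ignoring the time zone (assumed semantics)."""
    if time_zone_agnostic:
        # Only the time unit has to agree; any time zone is accepted.
        return expected.time_unit == actual.time_unit
    # Strict comparison: time unit and time zone must both agree.
    return expected == actual


expected = DatetimeType()                # time_zone=None, like the schema dtype
actual = DatetimeType(time_zone="UTC")   # like the column in the sample frame

strict_result = dtype_matches(expected, actual)
agnostic_result = dtype_matches(expected, actual, time_zone_agnostic=True)
print(strict_result, agnostic_result)
```

Under these assumptions the strict comparison rejects the UTC column and the agnostic one accepts it, which is the behaviour I expected from all of the `time_zone_agnostic=True` schemas.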
Actual behaviour
The use of pa.Field fails with 'Datetime' object is not callable and the use of Annotated fails with Annotation 'DateTime' requires all positional arguments ['time_zone_agnostic', 'time_zone', 'time_unit'].
For the pa.Field case, it looks like engine_dtype = pe.Engine.dtype(annotation.raw_annotation) in _build_columns() of class DataFrameModel returns an instance of pl.Datetime rather than a class. That instance is then called with dtype(**self.dtype_kwargs) in _get_schema_properties() of class FieldInfo, which throws the error.
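The failure mode described above can be reproduced in isolation with plain Python. This is only a sketch of the suspected mechanism: `Datetime`, `resolve_dtype`, `build_with_kwargs` and `build_with_kwargs_guarded` are hypothetical stand-ins, not pandera or polars code, and the guarded variant is just one possible way to avoid the error, not necessarily what the eventual fix does.

```python
class Datetime:
    """Hypothetical stand-in for a dtype class; it defines no __call__."""

    def __init__(self, time_unit="us", time_zone=None):
        self.time_unit = time_unit
        self.time_zone = time_zone


def resolve_dtype(annotation):
    """Stand-in for an engine lookup that hands back an *instance*."""
    return annotation() if isinstance(annotation, type) else annotation


def build_with_kwargs(dtype, dtype_kwargs):
    """Mirrors dtype(**dtype_kwargs); only works when dtype is a class."""
    return dtype(**dtype_kwargs)


def build_with_kwargs_guarded(dtype, dtype_kwargs):
    """One possible guard: rebuild from the class when given an instance."""
    dtype_cls = dtype if isinstance(dtype, type) else type(dtype)
    return dtype_cls(**dtype_kwargs)


resolved = resolve_dtype(Datetime)  # already an instance at this point

try:
    build_with_kwargs(resolved, {"time_zone": "UTC"})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

rebuilt = build_with_kwargs_guarded(resolved, {"time_zone": "UTC"})
print(rebuilt.time_zone)
```

Calling the already-instantiated dtype raises the same kind of TypeError as in the output above, while constructing from the class succeeds.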
Desktop
- OS: Ubuntu 22.04 LTS (WSL)
good catch! https://github.com/unionai-oss/pandera/pull/1638 should fix this.
it also updates the docs to state that using Annotated types requires passing in all of the positional and keyword arguments:
class ModelTZAgnosticAnnotated(DataFrameModel):
    datetime_col: Annotated[pe.DateTime, True, "us", None]  # time_zone_agnostic, unit, time_zone
Thanks a lot for the quick fix!