Time-agnostic DateTime with pandera-native polars datatype using DataFrameModel not working

Open CasperTeirlinck opened this issue 1 year ago • 2 comments

Using dtype_kwargs with the pandera.engines.polars_engine.DateTime dtype, as demonstrated in the documentation examples, does not seem to work with a DataFrameModel.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera -> using 0.19.2.
  • [ ] (optional) I have confirmed this bug exists on the main branch of pandera.

I tried all three variations of the DataFrameModel as well as a DataFrameSchema, and only the DataFrameSchema with time_zone_agnostic=True works as expected:

Code Sample:


import datetime as dt
from typing import Annotated

import pandera.polars as pa
import polars as pl
from pandera.engines import polars_engine as pe

df = pl.DataFrame(
    schema={
        "column_1": pl.Utf8,
        "column_2": pl.Datetime(time_zone="UTC"),
    },
    data={
        "column_1": ["value1", "value2"],
        "column_2": [dt.datetime(2024, 1, 1), dt.datetime(2024, 1, 2)],
    },
)


class SchemaDataFrameModel1(pa.DataFrameModel):
    column_1: str
    column_2: dt.datetime


class SchemaDataFrameModel2(pa.DataFrameModel):
    column_1: str
    column_2: pe.DateTime = pa.Field(dtype_kwargs={"time_zone_agnostic": True})


class SchemaDataFrameModel3(pa.DataFrameModel):
    column_1: str
    column_2: Annotated[pe.DateTime, True]


schema_dataframeschema_1 = pa.DataFrameSchema(
    {
        "column_1": pa.Column(str),
        "column_2": pa.Column(dt.datetime),
    }
)


schema_dataframeschema_2 = pa.DataFrameSchema(
    {
        "column_1": pa.Column(str),
        "column_2": pa.Column(pe.DateTime(time_zone_agnostic=True)),
    }
)


cases = {
    "DataFrameModel (Field) - without `time_zone_agnostic=True`": SchemaDataFrameModel1,
    "DataFrameModel (Field) - with `time_zone_agnostic=True": SchemaDataFrameModel2,
    "DataFrameModel (Annotated)": SchemaDataFrameModel3,
    "DataFrameSchema - without `time_zone_agnostic=True`": schema_dataframeschema_1,
    "DataFrameSchema - with `time_zone_agnostic=True`": schema_dataframeschema_2,
}

for case, schema in cases.items():
    print(f"Case: {case}")
    try:
        # The models are classes, so test for a DataFrameModel subclass
        if isinstance(schema, type) and issubclass(schema, pa.DataFrameModel):
            schema.to_schema().validate(df)
        else:
            schema.validate(df)
        print("\t✅ Validation successful")
    except Exception as e:
        print(f"\t❌ Validation Failed: {e}")


Output:

Case: DataFrameModel (Field) - without `time_zone_agnostic=True`
        ❌ Validation Failed: expected column 'column_2' to have type Datetime(time_unit='us', time_zone=None), got Datetime(time_unit='us', time_zone='UTC')
Case: DataFrameModel (Field) - with `time_zone_agnostic=True
        ❌ Validation Failed: 'Datetime' object is not callable
Case: DataFrameModel (Annotated)
        ❌ Validation Failed: Annotation 'DateTime' requires all positional arguments ['time_zone_agnostic', 'time_zone', 'time_unit'].
Case: DataFrameSchema - without `time_zone_agnostic=True`
        ❌ Validation Failed: expected column 'column_2' to have type Datetime(time_unit='us', time_zone=None), got Datetime(time_unit='us', time_zone='UTC')
Case: DataFrameSchema - with `time_zone_agnostic=True`
        ✅ Validation successful

Expected behaviour

I expect the validation to fail on schemas that don't provide time_zone_agnostic=True (which is the case), and for it to pass validation when setting time_zone_agnostic=True.

Actual behaviour

The use of pa.Field fails with 'Datetime' object is not callable and the use of Annotated fails with Annotation 'DateTime' requires all positional arguments ['time_zone_agnostic', 'time_zone', 'time_unit'].
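The Annotated failure can be illustrated without pandera. The following is only a sketch under stated assumptions: the `DateTime` class and the `dtype_from_annotation` helper are hypothetical stand-ins, with the parameter names taken from the error message above, and they are not pandera's actual implementation. It shows how a check that maps `Annotated` metadata onto constructor parameters would reject a partial argument list.

```python
import inspect
from typing import Annotated, get_args


class DateTime:
    """Stand-in dtype (hypothetical); parameter names copied from the error message."""

    def __init__(self, time_zone_agnostic=False, time_zone=None, time_unit=None):
        self.time_zone_agnostic = time_zone_agnostic
        self.time_zone = time_zone
        self.time_unit = time_unit


def dtype_from_annotation(annotation):
    """Build a dtype from Annotated[cls, ...], requiring one value per __init__ parameter."""
    cls, *metadata = get_args(annotation)
    names = [n for n in inspect.signature(cls.__init__).parameters if n != "self"]
    if len(metadata) != len(names):
        raise TypeError(
            f"Annotation '{cls.__name__}' requires all positional arguments {names}."
        )
    # Pair each metadata value with the parameter at the same position.
    return cls(**dict(zip(names, metadata)))


# Only one of three values supplied: rejected.
try:
    dtype_from_annotation(Annotated[DateTime, True])
    partial_error = None
except TypeError as exc:
    partial_error = str(exc)

# One value per parameter: accepted.
full = dtype_from_annotation(Annotated[DateTime, True, None, None])

print(partial_error)
print(full.time_zone_agnostic)
```

Under this assumption, `Annotated[pe.DateTime, True]` in the code sample fails because it supplies one value where three are expected.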

For the case of pa.Field, it looks like engine_dtype = pe.Engine.dtype(annotation.raw_annotation) in _build_columns() of class DataFrameModel returns an instance of pl.Datetime. That instance is then called again as dtype(**self.dtype_kwargs) in _get_schema_properties() of class FieldInfo, which raises the error.
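The suspected mechanism can be reproduced in plain Python. This is a minimal sketch, assuming the dtype has already been resolved to an instance by the time the keyword arguments are applied; `Datetime`, `resolve_dtype`, and `build_with_kwargs` are hypothetical stand-ins, not pandera or polars code.

```python
class Datetime:
    """Stand-in dtype class (hypothetical); it defines no __call__ method."""

    def __init__(self, time_zone_agnostic=False):
        self.time_zone_agnostic = time_zone_agnostic


def resolve_dtype(annotation):
    """Mimic an engine that eagerly turns a dtype class into an instance."""
    return annotation() if isinstance(annotation, type) else annotation


def build_with_kwargs(dtype, dtype_kwargs):
    """Mimic applying Field(dtype_kwargs=...) by calling the dtype."""
    return dtype(**dtype_kwargs)


kwargs = {"time_zone_agnostic": True}

# Works: the class itself is callable.
ok = build_with_kwargs(Datetime, kwargs)

# Fails: the annotation was already resolved to an instance.
try:
    build_with_kwargs(resolve_dtype(Datetime), kwargs)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # 'Datetime' object is not callable
```

If this reading is right, the fix would need to apply the keyword arguments to the dtype class before it is instantiated, rather than to the resolved instance.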

Desktop

  • OS: Ubuntu 22.04 LTS (WSL)

CasperTeirlinck avatar May 11 '24 12:05 CasperTeirlinck

good catch! https://github.com/unionai-oss/pandera/pull/1638 should fix this.

it also updates the docs to show that using Annotated types requires passing in all of the positional and keyword arguments:

    class ModelTZAgnosticAnnotated(DataFrameModel):
        datetime_col: Annotated[pe.DateTime, True, "us", None]  # time_zone_agnostic, unit, time_zone

cosmicBboy avatar May 11 '24 15:05 cosmicBboy

Thanks a lot for the quick fix!

CasperTeirlinck avatar May 12 '24 17:05 CasperTeirlinck