Regression between Pandera 0.25.0 and 0.26.0 in MultiIndex validation
Describe the bug
In Pandera 0.26.0, validation of MultiIndexes that used to pass in Pandera 0.25.0 has started failing.
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
import pandas as pd
from pandera.engines.pandas_engine import DateTime
import pandera.pandas as pa
from pandera.typing import Index, Series


def test_multiindex_empty_df_behaviour_changes() -> None:
    class MultiIndexedTimeFrame(pa.DataFrameModel):
        IDX_1: Index[DateTime] = pa.Field(
            nullable=False,
            dtype_kwargs={
                "unit": "ns",
                "tz": "America/New_York",
                "time_zone_agnostic": True,
            },
        )
        IDX_2: Index[DateTime] = pa.Field(
            nullable=False,
            dtype_kwargs={
                "unit": "ns",
                "tz": "America/New_York",
                "time_zone_agnostic": True,
            },
        )

        @pa.check("IDX_2")
        def _check_same_length_per_idx2(self, idx2_datetimes: Series) -> bool:
            return len(idx2_datetimes.groupby("IDX_2").size().unique()) == 1

        class Config:
            multiindex_unique = ("IDX_1", "IDX_2")
            strict = True

    df_rolling = pd.DataFrame(
        {
            "IDX_1": pd.date_range(
                "2024-06-03 20:20:00",
                "2024-06-03 21:15:00",
                freq="5min",
                tz="America/New_York",
            ),
            "IDX_2": [
                pd.Timestamp("2024-06-03 20:15:00", tz="America/New_York")
            ]
            * 12,
        }
    ).set_index(["IDX_1", "IDX_2"])

    # Succeeds in Pandera 0.25.0, errors with
    # pandera.errors.SchemaError: KeyError("IDX_2") in Pandera 0.26.0
    validated_df = MultiIndexedTimeFrame.validate(df_rolling)
Expected behavior
I expected validation to pass in this case, but it failed.
Desktop (please complete the following information):
- OS: Ubuntu 22.04.5 LTS
Error shown
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pandera.api.base.error_handler.ErrorHandler object at 0x7047c9712cf0>
error_type = <ValidationScope.DATA: 'data'>
reason_code = <SchemaErrorReason.CHECK_ERROR: 'check_error'>
schema_error = SchemaError('KeyError("IDX_2")')
original_exc = KeyError('IDX_2')
    def collect_error(
        self,
        error_type: ErrorCategory,
        reason_code: Optional[SchemaErrorReason],
        schema_error: SchemaError,
        original_exc: Union[BaseException, None] = None,
    ):
        """Collect schema error, raising exception if lazy is False.

        :param error_type: type of error
        :param reason_code: string representing reason for error
        :param schema_error: ``SchemaError`` object.
        """
        if not self._lazy:
>           raise schema_error from original_exc
E           pandera.errors.SchemaError: KeyError("IDX_2")
.venv/lib/python3.13/site-packages/pandera/api/base/error_handler.py:54: SchemaError
=========================== short test summary info ============================
FAILED tests/test_pandera.py::test_multiindex_empty_df_behaviour_changes - pa...
============================== 1 failed in 0.43s ===============================
Finished running tests!
Hi @Owen-OptiGrid, it looks like this will be addressed by https://github.com/unionai-oss/pandera/pull/2114
I'm not entirely convinced that the new behavior here is wrong, but the old behavior makes sense given how the old backend worked. The check scoped to IDX_2 seems to assume it receives a Series that has IDX_2 as its index. A check defined on a column would raise an analogous error if it tried to group by the column name, and the check works as intended if defined as a dataframe_check. Defining the check this way would also work:
@pa.check("IDX_2")
def _check_same_length_per_idx2(self, idx2_datetimes: Series) -> bool:
    return len(idx2_datetimes.groupby(idx2_datetimes).size().unique()) == 1
That said, I've opened #2116 to restore previous behavior.
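For anyone following along, the difference can be reproduced with pandas alone. This is only a sketch: it assumes that in 0.26.0 the check receives the IDX_2 values as a Series whose index has no level named IDX_2, which is inferred from the KeyError above and not confirmed against the pandera source. The variable names are made up for illustration.

```python
import pandas as pd

tz = "America/New_York"
idx2_values = [pd.Timestamp("2024-06-03 20:15:00", tz=tz)] * 12

# Assumed 0.26.0 shape: values only, default RangeIndex, no "IDX_2" level.
plain = pd.Series(idx2_values, name="IDX_2")

# Assumed 0.25.0 shape: the Series carries "IDX_2" as its index name.
indexed = pd.Series(idx2_values, index=pd.Index(idx2_values, name="IDX_2"))

# Grouping by the string "IDX_2" needs an index level with that name.
try:
    plain.groupby("IDX_2").size()
    raised = False
except KeyError:
    raised = True

# With the named index the original check body works.
same_size_by_level = len(indexed.groupby("IDX_2").size().unique()) == 1

# Grouping by the values themselves works regardless of the index.
same_size_by_values = len(plain.groupby(plain).size().unique()) == 1

print(raised, same_size_by_level, same_size_by_values)
```

Grouping by the Series itself does not depend on how the index is labelled, so that form of the check should behave the same under either backend.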