
Regression between Pandera 0.25.0 and 0.26.0 in MultiIndex validation

Open Owen-OptiGrid opened this issue 5 months ago • 2 comments

Describe the bug

In Pandera 0.26.0, MultiIndex validation that passed in Pandera 0.25.0 has started failing.

  • [x] I have checked that this issue has not already been reported.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [ ] (optional) I have confirmed this bug exists on the main branch of pandera.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd
from pandera.engines.pandas_engine import DateTime
import pandera.pandas as pa
from pandera.typing import Index, Series


def test_multiindex_empty_df_behaviour_changes() -> None:
    class MultiIndexedTimeFrame(pa.DataFrameModel):
        IDX_1: Index[DateTime] = pa.Field(
            nullable=False,
            dtype_kwargs={
                "unit": "ns",
                "tz": "America/New_York",
                "time_zone_agnostic": True,
            },
        )
        IDX_2: Index[DateTime] = pa.Field(
            nullable=False,
            dtype_kwargs={
                "unit": "ns",
                "tz": "America/New_York",
                "time_zone_agnostic": True,
            },
        )

        @pa.check("IDX_2")
        def _check_same_length_per_idx2(self, idx2_datetimes: Series) -> bool:
            return len(idx2_datetimes.groupby("IDX_2").size().unique()) == 1

        class Config:
            multiindex_unique = ("IDX_1", "IDX_2")
            strict = True

    df_rolling = pd.DataFrame(
        {
            "IDX_1": pd.date_range(
                "2024-06-03 20:20:00",
                "2024-06-03 21:15:00",
                freq="5min",
                tz="America/New_York",
            ),
            "IDX_2": [
                pd.Timestamp("2024-06-03 20:15:00", tz="America/New_York")
            ]
            * 12,
        }
    ).set_index(["IDX_1", "IDX_2"])

    # Succeeds in Pandera 0.25.0, errors with
    # pandera.errors.SchemaError: KeyError("IDX_2") in Pandera 0.26.0
    validated_df = MultiIndexedTimeFrame.validate(df_rolling)

Expected behavior

I expected validation to pass in this case, but it failed.

Desktop (please complete the following information):

  • OS: Ubuntu 22.04.5 LTS

Error shown

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <pandera.api.base.error_handler.ErrorHandler object at 0x7047c9712cf0>
error_type = <ValidationScope.DATA: 'data'>
reason_code = <SchemaErrorReason.CHECK_ERROR: 'check_error'>
schema_error = SchemaError('KeyError("IDX_2")')
original_exc = KeyError('IDX_2')

    def collect_error(
        self,
        error_type: ErrorCategory,
        reason_code: Optional[SchemaErrorReason],
        schema_error: SchemaError,
        original_exc: Union[BaseException, None] = None,
    ):
        """Collect schema error, raising exception if lazy is False.
    
        :param error_type: type of error
        :param reason_code: string representing reason for error
        :param schema_error: ``SchemaError`` object.
        """
        if not self._lazy:
>           raise schema_error from original_exc
E           pandera.errors.SchemaError: KeyError("IDX_2")

.venv/lib/python3.13/site-packages/pandera/api/base/error_handler.py:54: SchemaError
=========================== short test summary info ============================
FAILED tests/test_pandera.py::test_multiindex_empty_df_behaviour_changes - pa...
============================== 1 failed in 0.43s ===============================
Finished running tests!

Owen-OptiGrid, Aug 15 '25 04:08

Hi @Owen-OptiGrid, it looks like this will be addressed by https://github.com/unionai-oss/pandera/pull/2114

cosmicBboy, Aug 15 '25 18:08

I'm not entirely convinced that the new behavior here is wrong, but I guess the old behavior makes sense given how the old backend worked. The check scoped to IDX_2 seems to assume access to a Series that has IDX_2 as a level of its index. A check defined on a column would raise an analogous error if it tried to group by the column name, and the check works as intended if defined as a dataframe_check. Defining the check this way would also work:

        @pa.check("IDX_2")
        def _check_same_length_per_idx2(self, idx2_datetimes: Series) -> bool:
            return len(idx2_datetimes.groupby(idx2_datetimes).size().unique()) == 1

That said, I've opened #2116 to restore previous behavior.
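The difference can be reproduced with pandas alone. The sketch below is illustrative only: it assumes the old backend passed the check a Series indexed by the full MultiIndex, and that the new backend passes the level values as a plain Series without that index. The variable names (`indexed`, `plain`) are hypothetical and not taken from pandera.

```python
import pandas as pd

# Rebuild the index from the report: 12 distinct IDX_1 values, one repeated IDX_2 value.
idx1 = pd.date_range(
    "2024-06-03 20:20:00", periods=12, freq="5min", tz="America/New_York"
)
idx2 = [pd.Timestamp("2024-06-03 20:15:00", tz="America/New_York")] * 12
multi = pd.MultiIndex.from_arrays([idx1, idx2], names=["IDX_1", "IDX_2"])

# Assumption: a Series that still carries the MultiIndex, so "IDX_2"
# resolves as an index level name in groupby.
indexed = pd.Series(idx2, index=multi, name="IDX_2")
by_level_ok = len(indexed.groupby("IDX_2").size().unique()) == 1

# Assumption: a Series holding only the level values, with a default RangeIndex.
# Grouping by the string "IDX_2" now has nothing to resolve against.
plain = pd.Series(idx2, name="IDX_2")
try:
    plain.groupby("IDX_2")
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Grouping by the values themselves does not depend on how the Series is indexed.
by_values_ok = len(plain.groupby(plain).size().unique()) == 1
by_values_ok_indexed = len(indexed.groupby(indexed).size().unique()) == 1
```

Grouping by the Series itself, as in the rewritten check above, works for both shapes, so it should not depend on which backend produced the Series.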

amerberg, Aug 15 '25 20:08