
Checks with element_wise = True while using PolarsCheckBackend are called twice.

Open jcadam14 opened this issue 1 year ago • 2 comments

Describe the bug

This may be an issue within either Python polars or Rust polars, but I figured I'd start here. I also posted this on Stack Overflow.

I'm using pandera 0.19.3 with the polars 0.20.31 backend. While running my schema validation under cProfile, I noticed that all my custom validation checks are being called twice. With the pandas backend, the check functions are called only once. I do not have any schema/container-wide checks, just column/component checks.

As stated in the Stack Overflow post, I traced this to line 4837 of polars.expr.expr.py, which calls self._pyexpr.map_batches. That call wraps col('uid').map_list() (a Rust function, I believe), and this is where the check function gets called twice, eventually from line 5518 of polars.series.series.py.

I'm not sure whether this is an issue with pandera or with polars, but figured I'd start here.
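One way to see where each invocation comes from, without stepping through a debugger, is to record the Python call stack inside the check function. The sketch below uses only the standard library. `record_callers`, `call_sites` and `traced_check` are hypothetical names introduced for illustration. Only Python frames are visible this way, so anything that happens on the Rust side of polars will not show up.

```python
import functools
import re
import traceback

# Hypothetical store: one entry per invocation, each a list of caller descriptions.
call_sites = []


def record_callers(fn, depth=5):
    """Wrap fn so every call records the innermost `depth` Python callers."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # extract_stack() ends with this wrapper's own frame, so drop it.
        stack = traceback.extract_stack()[:-1]
        call_sites.append(
            [f"{frame.name} ({frame.filename}:{frame.lineno})" for frame in stack[-depth:]]
        )
        return fn(*args, **kwargs)

    return wrapper


def has_valid_format(value: str, regex: str) -> bool:
    return bool(re.match(regex, value))


# Assumption: this wrapped function is passed to the Check in place of has_valid_format.
traced_check = record_callers(has_valid_format)
```

After schema.validate(df, lazy=True) returns, printing call_sites should show one list per invocation, which makes it easier to tell whether both calls arrive through the same polars code path.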

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of pandera.
  • [x] (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import re
from copy import deepcopy
from typing import Any, Callable, Dict, Type

import polars as pl
import pandera.polars as pa
from pandera import Check
from pandera.backends.base import BaseCheckBackend
from pandera.backends.polars.checks import PolarsCheckBackend
from pandera.errors import SchemaErrors
from pandera.polars import Column

def get_template() -> Dict:
    return deepcopy(_schema_template)

def has_valid_format(value: str, regex: str) -> bool:
    return bool(re.match(regex, value))


class MyCheck(Check):
    def __init__(
        self,
        check_fn: Callable,
        id: str,
        name: str,
        description: str,
        severity: str,
        scope: str,
        **check_kwargs
    ):
        self.severity = severity
        self.scope = scope

        super().__init__(check_fn, title=id, name=name, description=description, **check_kwargs)

    @classmethod
    def get_backend(cls, check_obj: Any) -> Type[BaseCheckBackend]:
        return PolarsCheckBackend

_schema_template = {
    "uid": Column(
        str,
        title="Field 1: Unique identifier",
        checks=[MyCheck(
                    has_valid_format,
                    id="E0002",
                    name="uid.invalid_text_pattern",
                    description="Checks the format",
                    severity="Error",
                    scope="Syntax",
                    element_wise=True,
                    regex="^[A-Z0-9]+$",
                )],
    ),
    "action": Column(
        str,
        title="Action Taken",
        checks=[],
    )
}

def validate_schedule():
    data = {"uid": ["A12B"], "action": ["stop"]}
    df = pl.DataFrame(data)
    
    schema = pa.DataFrameSchema(get_template())
    try:
        import cProfile
        import pstats
        profiler = cProfile.Profile()
        profiler.enable()
        schema.validate(df, lazy=True)
        profiler.disable()
        pstats.Stats(profiler).strip_dirs().sort_stats('cumulative').print_stats("has_valid_format")
        print("Passed validation")
    except SchemaErrors as err:
        for schema_error in err.schema_errors:
            print(f"Schema Error: {schema_error}, Check: {schema_error.check}")
            
if __name__ == '__main__':
    validate_schedule()

Expected behavior

Running this with polars results in:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        2    0.000    0.000    0.000    0.000 sample_issue.py:24(has_valid_format)

whereas switching it to use pandas and pandera.pandas results in:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 sample_issue.py:24(has_valid_format)

I would expect only a single call to the check function in the above example code.
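The expectation above can also be checked without cProfile by counting invocations directly, which is easier to assert on in a regression test. This is a minimal sketch using only the standard library. `count_calls` is a hypothetical helper, not part of pandera or polars.

```python
import functools
import re


def count_calls(fn):
    """Wrap fn and keep a running count of invocations on wrapper.calls."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def has_valid_format(value: str, regex: str) -> bool:
    return bool(re.match(regex, value))
```

With the decorated function passed to the Check, has_valid_format.calls after validating the one-row DataFrame would be expected to equal 1. The profiler output above suggests it would be 2 with the polars backend.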

Desktop (please complete the following information):

  • OS:
    • ProductName: macOS
    • ProductVersion: 13.6.3
    • BuildVersion: 22G436
  • Browser: N/A (python command line)

jcadam14, Jun 24 '24 13:06

I've found the culprit. Using element_wise=True in a Check appears to cause the double call. If I change that to groupby='uid' and change the function to take grouped_data, the check function is called only once.

I extended this example by adding the following to the schema:

    "app_date": Column(
        str,
        title="Field 2: Application date",
        checks=[pa.Check(is_date, element_wise=True)],
    ),

The is_date check function also gets called twice, so this isn't limited to custom Check subclasses. If I switch that Check to use groupby="app_date", the check function gets called once.

jcadam14, Jun 28 '24 15:06

Thanks for unearthing this, @jcadam14! Would you mind making a PR to fix this?

cosmicBboy, Jun 28 '24 15:06