pandera
Checks with element_wise = True while using PolarsCheckBackend are called twice.
Describe the bug
This may be an issue in either Python polars or Rust polars, but I figured I'd start here. I also posted this question on Stack Overflow.
I'm using pandera 0.19.3 with the polars 0.20.31 backend. While running my schema validation under cProfile, I noticed that all my custom validation checks are called twice. With the pandas backend, the same check functions are called only once. I do not have any schema/container-wide checks, just column/component checks.
As stated in the Stack Overflow post, I traced this down to the polars.expr.expr.py file at line 4837, which calls self._pyexpr.map_batches. That wraps a call to col('uid').map_list() (I believe a Rust function), and this is where the check function gets called twice, eventually in polars.series.series.py at line 5518.
So I am not sure whether this is an issue with pandera or with polars.
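For anyone who wants to trace which frame triggers the extra call, `pstats.Stats.print_callers` lists the callers of a profiled function. The sketch below uses only the standard library; `run_twice` is a hypothetical stand-in for whatever backend code invokes the check twice, not pandera or polars code.

```python
import cProfile
import io
import pstats
import re


def has_valid_format(value: str, regex: str) -> bool:
    return bool(re.match(regex, value))


def run_twice(value: str) -> bool:
    # Hypothetical stand-in for a backend that invokes the check twice.
    first = has_valid_format(value, "^[A-Z0-9]+$")
    second = has_valid_format(value, "^[A-Z0-9]+$")
    return first and second


def callers_report(func, *args) -> str:
    """Profile func(*args) and return the callers table for has_valid_format."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer).strip_dirs()
    # Restrict the report to entries whose name matches the check function.
    stats.print_callers("has_valid_format")
    return buffer.getvalue()


report = callers_report(run_twice, "A12B")
print(report)
```

In the real validation run, the caller column should point at the polars or pandera frame responsible for each call.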
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandera.
- [x] (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample, a copy-pastable example
import polars as pl
import pandera as pd
import re
from copy import deepcopy
from typing import Dict
from pandera.polars import Column
from typing import Any, Callable, Type
from pandera import Check
from pandera.backends.base import BaseCheckBackend
from pandera.backends.polars.checks import PolarsCheckBackend
from pandera.errors import SchemaErrors
import pandera.polars as pa


def get_template() -> Dict:
    return deepcopy(_schema_template)


def has_valid_format(value: str, regex: str) -> bool:
    return bool(re.match(regex, value))


class MyCheck(Check):
    def __init__(
        self,
        check_fn: Callable,
        id: str,
        name: str,
        description: str,
        severity: str,
        scope: str,
        **check_kwargs
    ):
        self.severity = severity
        self.scope = scope
        super().__init__(check_fn, title=id, name=name, description=description, **check_kwargs)

    @classmethod
    def get_backend(cls, check_obj: Any) -> Type[BaseCheckBackend]:
        return PolarsCheckBackend


_schema_template = {
    "uid": Column(
        str,
        title="Field 1: Unique identifier",
        checks=[MyCheck(
            has_valid_format,
            id="E0002",
            name="uid.invalid_text_pattern",
            description="Checks the format",
            severity="Error",
            scope="Syntax",
            element_wise=True,
            regex="^[A-Z0-9]+$",
        )],
    ),
    "action": Column(
        str,
        title="Action Taken",
        checks=[],
    )
}


def validate_schedule():
    data = {"uid": ["A12B"], "action": ["stop"]}
    df = pl.DataFrame(data)
    schema = pa.DataFrameSchema(get_template())
    try:
        import cProfile
        import pstats
        profiler = cProfile.Profile()
        profiler.enable()
        schema.validate(df, lazy=True)
        profiler.disable()
        pstats.Stats(profiler).strip_dirs().sort_stats('cumulative').print_stats("has_valid_format")
        print("Passed validation")
    except SchemaErrors as err:
        for schema_error in err.schema_errors:
            print(f"Schema Error: {schema_error}, Check: {schema_error.check}")


if __name__ == '__main__':
    validate_schedule()
Expected behavior
Running this with polars results in:
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.000 0.000 0.000 0.000 sample_issue.py:24(has_valid_format)
whereas switching it to use pandas and pandera.pandas results in:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 sample_issue.py:24(has_valid_format)
I would expect only a single call to the check function in the above example code.
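A lighter way to confirm the call count without cProfile is to wrap the check function in a counter. This is a minimal sketch; `count_calls` is a hypothetical helper, not part of pandera, and I have not verified how pandera handles a wrapped function when it is passed to `Check`.

```python
import functools
import re
from typing import Callable


def count_calls(fn: Callable) -> Callable:
    """Wrap fn so that each invocation increments wrapper.calls."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


@count_calls
def has_valid_format(value: str, regex: str) -> bool:
    return bool(re.match(regex, value))


# Called directly here; in the real case the schema validation would call it.
has_valid_format("A12B", regex="^[A-Z0-9]+$")
print(has_valid_format.calls)
```

After running `schema.validate(...)` with the wrapped function as the check, `has_valid_format.calls` would show how many times the backend invoked it.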
Desktop (please complete the following information):
- OS:
- ProductName: macOS
- ProductVersion: 13.6.3
- BuildVersion: 22G436
- Browser: N/A (python command line)
I've found the culprit. It seems that using element_wise=True in a Check is causing the double call. If I change that to groupby="uid" and change the function to take grouped_data, the check function is only called once.
I extended this example by adding the following to the schema:
    "app_date": Column(
        str,
        title="Field 2: Application date",
        checks=[pa.Check(is_date, element_wise=True)],
    ),
And the is_date check function also gets called twice, so it's not just custom checks. If I switch that Check to use groupby="app_date" then the check function gets called once.
Thanks for unearthing this, @jcadam14! Would you mind making a PR to fix this?