Polars checks not being evaluated correctly
Describe the bug
The column checks on polars LazyFrames are not registering errors when they should. Values outside of a defined range pass validation with no warnings or errors. The same checks on a polars DataFrame do register an error.
It looks like this was addressed in a recent PR but I am still seeing the bug in the 0.19.3 release.
- [ ] I have checked that this issue has not already been reported.
- The issue has been reported and a fix merged to main, but it still persists in the most recent release
- [x] I have confirmed this bug exists on the latest version of pandera.
- [ ] (optional) I have confirmed this bug exists on the main branch of pandera.
Code Sample
# This code is taken from the examples page [here](https://pandera--1373.org.readthedocs.build/en/1373/polars.html)
# with values changed to be outside the defined range.
import pandera.polars as pa
import polars as pl

schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)),  # check is defined
    }
)

lf = pl.LazyFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180],  # 2 and 180 are outside the defined range
    }
)

print(schema.validate(lf).collect())  # no errors are raised
Expected behavior
I would expect a pandera.errors.SchemaError to be raised. Note that the polars.DataFrame version of this code does raise an error.
import pandera.polars as pa
import polars as pl

schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)),
    }
)

# Same data as above, but as an eager DataFrame.
df = pl.DataFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180],
    }
)

print(schema.validate(df))
Desktop (please complete the following information):
- OS: Windows 10
- Browser: Chrome
- Version: pandera: 0.19.3, polars: 0.20.28
https://pandera.readthedocs.io/en/stable/polars.html#how-it-works
I think this behaviour is expected. pa.Check.in_range(min_value=5, max_value=20) cannot be performed on a pl.LazyFrame object because it requires reading the data.
So are checks never assessed for LazyFrame objects?
I feel like the documentation should make this more explicit or a warning should be issued. The top example comes directly from Pandera documentation and having a check that is never assessed creates a false sense of coverage.
Checks are assessed for LazyFrame objects, but only those that don't require the data to be in memory are evaluated, most importantly data type checks.
This is expected behavior @mxblsdl.
> I feel like the documentation should make this more explicit
I believe it already does; see https://pandera.readthedocs.io/en/stable/polars.html#how-it-works, which @kacper-sellforte linked above.
> or a warning should be issued
This is also a good idea. I think a better logging experience here would be helpful. Would you mind opening up a separate issue for this request?
The correct way to support this would be for polars to provide a first-class expression that asserts whether a column contains any False values, so that pandera could raise the error lazily when the LazyFrame is evaluated. I opened an issue in the polars project: https://github.com/pola-rs/polars/issues/16120
Also see https://pandera.readthedocs.io/en/stable/polars.html#data-level-validation-with-lazyframes. If you set the environment variable with export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA, pandera will call LazyFrame.collect under the hood, run the data-level checks, and convert the result back into a LazyFrame.
Okay, thank you for taking a look at this. I guess I was just confused about the limits of LazyFrame evaluation. I will experiment with the env variable mentioned above and close the issue.
Seconding this - seems like very dangerous behavior. Maybe we need to add a big warning to the docs.