pandera

Polars checks not being evaluated correctly

Open mxblsdl opened this issue 1 year ago • 3 comments

Describe the bug

The column checks on polars LazyFrames do not register errors when they should: values outside a defined range pass validation with no warnings or errors. The same checks on a polars DataFrame do register an error.

It looks like this was addressed in a recent PR but I am still seeing the bug in the 0.19.3 release.

  • [ ] I have checked that this issue has not already been reported.
    • The issue has been reported and a fix merged to main, but the bug still persists in the most recent release.
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • [ ] (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample,

# This code is taken from the examples page [here](https://pandera--1373.org.readthedocs.build/en/1373/polars.html)
# with values changed to fall outside the defined range.

import pandera.polars as pa
import polars as pl


schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)), # check is defined
    }
)


lf = pl.LazyFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180], # values outside of defined range are passed
    }
)
print(schema.validate(lf).collect()) # no errors are raised

Expected behavior

I would expect a pandera.errors.SchemaError to be raised. Note that the polars.DataFrame version of this code does raise an error.

import pandera.polars as pa
import polars as pl


schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)),
    }
)


df = pl.DataFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180],  # 2 and 180 are outside the range
    }
)
print(schema.validate(df))  # raises pandera.errors.SchemaError

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Chrome
  • Version: pandera: 0.19.3, polars: 0.20.28

mxblsdl • May 30 '24 18:05


https://pandera.readthedocs.io/en/stable/polars.html#how-it-works

I think this behaviour is expected. pa.Check.in_range(min_value=5, max_value=20) cannot be performed on a pl.LazyFrame object, because it requires reading the data.

kacper-sellforte • Jun 12 '24 18:06

So are checks never assessed for LazyFrame objects?

I feel like the documentation should make this more explicit or a warning should be issued. The top example comes directly from Pandera documentation and having a check that is never assessed creates a false sense of coverage.

mxblsdl • Jun 17 '24 17:06

Checks are assessed for LazyFrame objects, but only those that don't require the data to be present in memory are evaluated. Most importantly, that means data types.

kacper-sellforte • Jun 18 '24 06:06

This is expected behavior @mxblsdl.

> I feel like the documentation should make this more explicit

I believe it already does: see https://pandera.readthedocs.io/en/stable/polars.html#how-it-works, which @kacper-sellforte already linked.

> or a warning should be issued

This is also a good idea. I think a better logging experience here would be helpful. Would you mind opening up a separate issue for this request?

The correct way to support this would be for polars to have a first-class expression that asserts whether a column contains any False values, in which case pandera could catch the error lazily when the LazyFrame is evaluated. I opened an issue in the polars project: https://github.com/pola-rs/polars/issues/16120
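To illustrate the idea of a deferred assertion, here is a plain-Python analogy using a generator. It is not polars or pandera API, and the helper name assert_all is hypothetical; it only shows a check that is attached when the pipeline is built and fails when the pipeline is consumed:

```python
from typing import Callable, Iterable, Iterator


def assert_all(
    values: Iterable[int], predicate: Callable[[int], bool]
) -> Iterator[int]:
    """Lazily yield values, raising as soon as one fails the predicate."""
    for value in values:
        if not predicate(value):
            raise ValueError(f"check failed for value {value!r}")
        yield value


prices = [12, 10, 180]

# Building the pipeline does no work and raises nothing.
checked = assert_all(prices, lambda p: 5 <= p <= 20)

# The failure surfaces only when the pipeline is consumed ("collected").
try:
    list(checked)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```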

cosmicBboy • Jul 16 '24 12:07

Also see https://pandera.readthedocs.io/en/stable/polars.html#data-level-validation-with-lazyframes. You can set the environment variable with export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA, and pandera will call LazyFrame.collect under the hood, run the data-level checks, and convert the result back into a LazyFrame.

cosmicBboy • Jul 16 '24 13:07

Okay, thank you for taking a look at this. I guess I was just confused about the limits of LazyFrame evaluation. I will experiment with the env variable mentioned above and close the issue.

mxblsdl • Jul 16 '24 17:07

Seconding this - seems like very dangerous behavior. Maybe we need to add a big warning to the docs.

Filimoa • Nov 29 '24 17:11