pandera icon indicating copy to clipboard operation
pandera copied to clipboard

Support polars LazyFrame *actual* lazy validation

Open cosmicBboy opened this issue 1 year ago • 0 comments

Is your feature request related to a problem? Please describe.

First of all, pandera's "lazy" validation is a misnomer. It should really be called raise_eagerly=True or something like that, since all it does is apply all the validation checks before raising a SchemaErrors exception.

With that out of the way, this issue describes supporting polars' lazy evaluation API via LazyFrames. This is a follow-up enhancement once basic polars support is added #1373.

Describe the solution you'd like

By definition pandera needs to do non-lazy operations on the data to to the run-time value checks. Pandera can run metadata checks, e.g. data type checks, column name uniqueness, etc.

This is because the LazyFrame API propagates type information through a lazy query, but it cannot do run-time value checks without materializing the data at validation time.

Therefore, checks that require examining the values of the data to raise an error will do a series of non-lazy operations on the data, ideally in parallel, before raising a runtime error on collect.

Calling schema.validate should run an implicit collect(), and may also do an implicit lazy() to continue the lazy operations.

Proposal

We formalize two modes of validation:

  1. Metadata validation: check metadata such as primitive datatypes, e.g. int64, string, etc.
  2. Data value validation: check actual values.

In the polars programming model, we can do metadata validation before even running the query, but we need to actually run the query to gather the failure cases for data values that don't pass run-time checks (e.g. col >= 0).

In order to lazily raise a data value error, pandera can introduce a namespace:

(
   ldf
   .pandera.validate(schema, collect=False)  # raises metadata errors
   .with_columns(...)  # do stuff
   .pandera.collect()  # this runs the query, raising a data value error.
                       # collect() also materializes a pl.DataFrame
   .lazy()             # convert back to lazy as desired
)

Supporting this would require adding support for lazy evaluation of checks, so instead of CoreCheckResult and CheckResult, it would require a CoreCheckPromise, CheckPromise, which would contain LazyFrames or some other promise of an actual result. These would then be run by calling polars.collect_all() when pandera.collect is invoked.

Describe alternatives you've considered

The alternative would be to simply do a collect() on schema.validate, which is what the basic polars integration would support.

cosmicBboy avatar Oct 31 '23 03:10 cosmicBboy