pandera
pandera copied to clipboard
Support polars LazyFrame *actual* lazy validation
Is your feature request related to a problem? Please describe.
First of all, pandera's "lazy" validation is a misnomer. It should really be called raise_eagerly=True
or something like that, since all it does is apply all the validation checks before raising a SchemaErrors
exception.
With that out of the way, this issue describes supporting polars' lazy evaluation API via LazyFrames. This is a follow-up enhancement once basic polars support is added #1373.
Describe the solution you'd like
By definition pandera needs to do non-lazy operations on the data to to the run-time value checks. Pandera can run metadata checks, e.g. data type checks, column name uniqueness, etc.
This is because the LazyFrame API propagates type information through a lazy query, but it cannot do run-time value checks without materializing the data at validation time.
Therefore, checks that require examining the values of the data to raise an error will do a series of non-lazy operations on the data, ideally in parallel, before raising a runtime error on collect
.
Calling schema.validate should run an implicit collect()
, and may also
do an implicit lazy()
to continue the lazy operations.
Proposal
We formalize two modes of validation:
- Metadata validation: check metadata such as primitive datatypes, e.g. int64, string, etc.
- Data value validation: check actual values.
In the polars programming model, we can do metadata validation before even running the query, but we need to actually run the query to gather the failure cases for data values that don't pass run-time checks (e.g. col >= 0).
In order to lazily raise a data value error, pandera can introduce a namespace:
(
ldf
.pandera.validate(schema, collect=False) # raises metadata errors
.with_columns(...) # do stuff
.pandera.collect() # this runs the query, raising a data value error.
# collect() also materializes a pl.DataFrame
.lazy() # convert back to lazy as desired
)
Supporting this would require adding support for lazy evaluation of
checks, so instead of CoreCheckResult
and CheckResult
, it would
require a CoreCheckPromise
, CheckPromise
, which would contain
LazyFrames or some other promise of an actual result. These would then
be run by calling polars.collect_all()
when pandera.collect
is
invoked.
Describe alternatives you've considered
The alternative would be to simply do a collect()
on schema.validate, which is what the basic polars integration would support.