polars
polars copied to clipboard
Allow custom lazy scanning
Describe your feature request
My understanding of the polars query planner is modest at best, so please correct me if I'm (ab)using the terms used below or if I have a conceptual misunderstanding:
pl.scan_parquet
is great for creating a pl.LazyFrame
from a large collection of parquet files and push down predicates/filters/projections/etc. Polars also support a fairly wide selection of reads/scans from other data sources. Is it possible to allow someone to pass in a python function that takes in predicates/filters/projections/etc and returns a pl.LazyFrame
? We would consider it a part of a query plan and is essentially a backdoor to augmenting the query with extra optimizations, extra data source reads, extra asserts, etc.
If I'm not mistaken, this particular PythonScanExec actually allows you to call an arbitrary function PyBytes::new(py, &self.options.scan_fn)
and get the output. It seems like it only takes in the columns but not the predicates/filters/projections/etc, and I'm not sure if this interface is exposed in python. Would we be able to build on top of this?
Yes, you can use that node. This file is an example. There we create a scan_ds
and a scan_parquet_fsspec
function.
We push projections to the node and pass that as a with_columns: list[str]
argument. The predicates are pushed down to after that node. I am not really sure how we can pass a predicate to the function itself and what benefit does it have.
What I mean by that is that predicates internally are represented as a virtual function that produces a boolean mask. Not something that's usable on the python side. And if we pass the predicate as expression then you have to apply that predicate in the function itself and that does not seem to have much benefit to us applying that predicate when that function is finished.
This is already supported in Rust via AnonymousScan
trait. This trait allows for predicate, projection & slice pushdowns. We could potentially look at exposing a python interface to interact with that trait?
Alternatively, You could even create your own extension library, with rust bindings that leverages the trait.
this is an example of such an extension. https://github.com/universalmind303/polars-mongo
Yeah I think having a python interface would be awesome! Currently the effort to build rust at my company is nascent at best, so having a python fn drop-in would require the least amount of hoops to jump