polars icon indicating copy to clipboard operation
polars copied to clipboard

Allow custom lazy scanning

Open OneRaynyDay opened this issue 1 year ago • 3 comments

Describe your feature request

My understanding of the polars query planner is modest at best, so please correct me if I'm (ab)using the terms used below or if I have a conceptual misunderstanding:

pl.scan_parquet is great for creating a pl.LazyFrame from a large collection of parquet files and push down predicates/filters/projections/etc. Polars also support a fairly wide selection of reads/scans from other data sources. Is it possible to allow someone to pass in a python function that takes in predicates/filters/projections/etc and returns a pl.LazyFrame? We would consider it a part of a query plan and is essentially a backdoor to augmenting the query with extra optimizations, extra data source reads, extra asserts, etc.

If I'm not mistaken, this particular PythonScanExec actually allows you to call an arbitrary function PyBytes::new(py, &self.options.scan_fn) and get the output. It seems like it only takes in the columns but not the predicates/filters/projections/etc, and I'm not sure if this interface is exposed in python. Would we be able to build on top of this?

OneRaynyDay avatar Aug 09 '22 16:08 OneRaynyDay

Yes, you can use that node. This file is an example. There we create a scan_ds and a scan_parquet_fsspec function.

We push projections to the node and pass that as a with_columns: list[str] argument. The predicates are pushed down to after that node. I am not really sure how we can pass a predicate to the function itself and what benefit does it have.

What I mean by that is that predicates internally are represented as a virtual function that produces a boolean mask. Not something that's usable on the python side. And if we pass the predicate as expression then you have to apply that predicate in the function itself and that does not seem to have much benefit to us applying that predicate when that function is finished.

ritchie46 avatar Aug 09 '22 18:08 ritchie46

This is already supported in Rust via AnonymousScan trait. This trait allows for predicate, projection & slice pushdowns. We could potentially look at exposing a python interface to interact with that trait?

Alternatively, You could even create your own extension library, with rust bindings that leverages the trait.

this is an example of such an extension. https://github.com/universalmind303/polars-mongo

universalmind303 avatar Aug 11 '22 03:08 universalmind303

Yeah I think having a python interface would be awesome! Currently the effort to build rust at my company is nascent at best, so having a python fn drop-in would require the least amount of hoops to jump

OneRaynyDay avatar Aug 18 '22 15:08 OneRaynyDay