PyIceberg-Core: Push down Parquet reading to Iceberg-Rust
Is your feature request related to a problem or challenge?
As a next step in integrating PyIceberg and Iceberg-Rust, it would be great to push down the Parquet reading (including all the schema evolution) to Iceberg-Rust. Today, PyIceberg iterates over each record batch in Python, which puts a lot of pressure on the GIL. All of this logic (schema evolution, projecting missing columns, renames, re-ordering, etc.) should happen inside the Parquet reader, but PyArrow does not give us the flexibility to project by field ID, so this is what we ended up with.
The most logical separation would be to pass the FileScanTask into Iceberg-Rust.
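To make that hand-off concrete, here is a minimal sketch of what the Rust-side entry point could look like, assuming the task is serialized to JSON on the Python side as a first iteration. `ScanTaskDto` and `read_scan_task` are illustrative names, not the actual pyiceberg-core or iceberg-rust API, and the real FileScanTask carries more than these fields (schema, residual predicate, delete files, etc.):

```rust
use pyo3::prelude::*;
use serde::Deserialize;

/// Hypothetical mirror of the fields a FileScanTask would carry across the
/// boundary; the real iceberg-rust struct is richer than this.
#[derive(Deserialize)]
struct ScanTaskDto {
    data_file_path: String,
    start: u64,
    length: u64,
    project_field_ids: Vec<i32>,
}

/// Illustrative entry point: Python crosses the boundary once per file scan
/// task, not once per record batch, which is where the GIL relief comes from.
#[pyfunction]
fn read_scan_task(task_json: &str) -> PyResult<String> {
    let task: ScanTaskDto = serde_json::from_str(task_json)
        .map_err(|e| pyo3::exceptions::PyValueError::new_err(e.to_string()))?;
    // Hand `task` to the Rust-side Parquet reader here; echoing the data file
    // path back is only a placeholder for this sketch.
    Ok(task.data_file_path)
}
```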
We can break it down into building blocks:
- Ability to leverage the Iceberg-Rust FileIO in PyIceberg to open up streams (see the sketch after this list).
- Ability to pass down a PyIceberg schema into Iceberg-Rust.
  - Can we serialize it to JSON? That seems costly; ideally we want to reuse objects rather than copy them from one side to the other.
- Ability to pass down expressions.
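For the FileIO building block, here is a rough sketch of what the binding could look like, assuming iceberg-rust's `FileIOBuilder`/`FileIO::new_input` surface and a Tokio runtime owned by the wrapper; `PyFileIO` and its method names are placeholders, not the final pyiceberg-core API:

```rust
use iceberg::io::{FileIO, FileIOBuilder};
use pyo3::prelude::*;

/// Sketch of exposing iceberg-rust's FileIO to Python.
#[pyclass]
struct PyFileIO {
    inner: FileIO,
    runtime: tokio::runtime::Runtime,
}

#[pymethods]
impl PyFileIO {
    #[new]
    fn new(scheme: &str) -> PyResult<Self> {
        let inner = FileIOBuilder::new(scheme)
            .build()
            .map_err(|e| pyo3::exceptions::PyIOError::new_err(e.to_string()))?;
        Ok(Self { inner, runtime: tokio::runtime::Runtime::new()? })
    }

    /// Read a whole object into memory. A streaming variant would hand back a
    /// file-like object instead, but the GIL-release pattern is the same.
    fn read(&self, py: Python<'_>, path: &str) -> PyResult<Vec<u8>> {
        // Release the GIL while Rust does the I/O.
        py.allow_threads(|| {
            self.runtime.block_on(async {
                let input = self
                    .inner
                    .new_input(path)
                    .map_err(|e| pyo3::exceptions::PyIOError::new_err(e.to_string()))?;
                let bytes = input
                    .read()
                    .await
                    .map_err(|e| pyo3::exceptions::PyIOError::new_err(e.to_string()))?;
                Ok(bytes.to_vec())
            })
        })
    }
}
```

For the schema building block, iceberg-rust already parses the spec's JSON schema representation when it reads table metadata, so a first, copy-based version could go through JSON (e.g. pydantic's `model_dump_json()` on the PyIceberg side, serde on the Rust side) before investing in zero-copy object sharing.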
From the callgraph: [callgraph image attached to the original issue]
Describe the solution you'd like
No response
Willingness to contribute
None
This can be reasonably straightforward, depending on how you want to handle predicates: SQL strings or expressions.
A lazy record-batch iterator can be built the way delta-rs does it: https://github.com/delta-io/delta-rs/blob/main/python/src/reader.rs
Then you would construct the TableProvider, scan and project it, and return that lazy iterator.
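For reference, here is a condensed sketch of that iterator with pyo3, under the assumption that the Rust side hands back a boxed Arrow record-batch stream (e.g. the output of a DataFusion scan or iceberg-rust's Arrow reader); the struct name, the `iceberg::Error` choice, and the error mapping are all illustrative:

```rust
use arrow::pyarrow::ToPyArrow;
use arrow::record_batch::RecordBatch;
use futures::stream::BoxStream;
use futures::StreamExt;
use pyo3::prelude::*;

/// Lazy record-batch iterator in the spirit of the delta-rs reader linked above.
#[pyclass]
struct PyRecordBatchStream {
    stream: BoxStream<'static, Result<RecordBatch, iceberg::Error>>,
    runtime: tokio::runtime::Runtime,
}

#[pymethods]
impl PyRecordBatchStream {
    fn __iter__(slf: PyRef<'_, Self>) -> PyRef<'_, Self> {
        slf
    }

    fn __next__(&mut self, py: Python<'_>) -> PyResult<Option<PyObject>> {
        // Pull the next batch with the GIL released, so Parquet decoding and
        // projection run in Rust without blocking other Python threads.
        let next = py.allow_threads(|| self.runtime.block_on(self.stream.next()));
        match next {
            Some(Ok(batch)) => Ok(Some(batch.to_pyarrow(py)?)),
            Some(Err(e)) => Err(pyo3::exceptions::PyRuntimeError::new_err(e.to_string())),
            None => Ok(None), // end of stream -> StopIteration on the Python side
        }
    }
}
```

Python only re-enters Rust once per batch, and each batch crosses the boundary as a PyArrow RecordBatch via the Arrow C data interface, so the per-row work stays on the Rust side.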
Hey @Fokko, interested in picking this up!