PyIceberg-Core: Push down Parquet reading to Iceberg-Rust

Open Fokko opened this issue 9 months ago • 2 comments

Is your feature request related to a problem or challenge?

As a next step in integrating PyIceberg and Iceberg-Rust, it would be great to push down the Parquet reading (including all the schema evolution) to Iceberg-Rust. Today, in PyIceberg, we go over each of the record batches in Python, which puts a lot of pressure on the GIL. This logic (schema evolution, projecting missing columns, renames, re-ordering, etc.) should all happen in the Parquet reader, but PyArrow doesn't give us the flexibility to project by field ID, so this is what we ended up with.

The most logical separation would be to pass the FileScanTask into Iceberg-Rust.

We can break it down into building blocks:

  • Ability to leverage the Iceberg-Rust FileIO in PyIceberg to open up streams
  • Ability to pass down a PyIceberg schema into Iceberg-Rust.
    • Can we serialize it into JSON? That seems costly, though. Ideally, we want to reuse objects and not have to copy them from one side to the other.
  • Pass down expressions.
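As a rough sketch of the JSON option for the schema: the dict below follows the struct layout from the Iceberg table spec (`type`, `schema-id`, `fields` with `id`/`name`/`required`/`type`), while `build_scan_requests` and the request keys `data_file_path` and `schema_json` are hypothetical names made up for this example. The point it illustrates is that the schema could be serialized once per scan and the same payload reused for every file, rather than once per file or per batch. Whether that cost is acceptable in practice is not measured here.

```python
import json
from typing import Any, Dict, List

# A table schema in the JSON shape described by the Iceberg table spec.
schema: Dict[str, Any] = {
    "type": "struct",
    "schema-id": 0,
    "fields": [
        {"id": 1, "name": "id", "required": True, "type": "long"},
        {"id": 2, "name": "name", "required": False, "type": "string"},
    ],
}


def build_scan_requests(
    schema: Dict[str, Any], data_file_paths: List[str]
) -> List[Dict[str, str]]:
    """Hypothetical helper: serialize the schema once, then reuse it per file."""
    schema_json = json.dumps(schema, separators=(",", ":"))  # done once per scan
    return [
        {"data_file_path": path, "schema_json": schema_json}
        for path in data_file_paths
    ]


requests = build_scan_requests(schema, ["a.parquet", "b.parquet"])
restored = json.loads(requests[0]["schema_json"])
```

This still copies the schema across the boundary once, so it does not meet the "reuse objects" goal; it only bounds the serialization cost to one call per scan.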

From the callgraph:

Image

Image

Describe the solution you'd like

No response

Willingness to contribute

None

Fokko, Mar 27 '25 09:03

This can be reasonably straightforward, depending on how you want to handle predicates: as SQL strings or as expressions.

A lazy record batch iterator can be implemented along the lines of this: https://github.com/delta-io/delta-rs/blob/main/python%2Fsrc%2Freader.rs

Then you would construct the TableProvider, scan it, apply the projection, and return that lazy iterator.
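For readers unfamiliar with the pattern, the following is a minimal pure-Python sketch of what "lazy" means here. `LazyBatchReader` is a hypothetical class for illustration only; it is not the delta-rs reader linked above and uses plain lists instead of Arrow record batches. Each batch is produced only when the consumer asks for it.

```python
from typing import Any, List


class LazyBatchReader:
    """Hypothetical iterator that slices rows into batches on demand."""

    def __init__(self, rows: List[Any], batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._rows = rows
        self._batch_size = batch_size
        self._offset = 0
        self.batches_produced = 0  # exposed so the laziness is observable

    def __iter__(self) -> "LazyBatchReader":
        return self

    def __next__(self) -> List[Any]:
        if self._offset >= len(self._rows):
            raise StopIteration
        batch = self._rows[self._offset : self._offset + self._batch_size]
        self._offset += self._batch_size
        self.batches_produced += 1
        return batch


reader = LazyBatchReader(list(range(5)), batch_size=2)
produced_before_first_pull = reader.batches_produced  # nothing read yet
first_batch = next(reader)                            # reads exactly one batch
produced_after_first_pull = reader.batches_produced
remaining_batches = list(reader)                      # drains the rest
```

In the real setup the batches would come from the Rust side and the Python object would only pull the next one, so no per-batch transformation needs to run under the GIL.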

ion-elgreco, Apr 15 '25 21:04

Hey @Fokko, interested in picking this up!

kaushiksrini, Oct 14 '25 21:10