iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

Use Iceberg-Rust for parsing the ManifestList and Manifests

Open Fokko opened this issue 7 months ago • 7 comments

Rationale for this change

This replaces the Cython implementation for reading Avro with Iceberg-Rust. This would greatly simplify the PyIceberg project since we don't have to publish Python wheels anymore.

Are these changes tested?

Are there any user-facing changes?

Fokko avatar May 14 '25 19:05 Fokko

@roeap Thanks, your patch fixed correctly passing through the 102: partition field 🥳

Fokko avatar Jun 26 '25 21:06 Fokko

@Fokko This is awesome! thanks for your work on this one. Do we expect any performance/memory differences with reading it with the rust module?

yogevyuval avatar Jul 08 '25 19:07 yogevyuval

@yogevyuval Thanks for asking, and yes, I do expect performance impact since just a part of the deserialization is cythonized. With this change, much more is pushed into Rust. This will also reduce the GIL pressure since we don't have to build the readers anymore :)

I left this PR small to focus on the essentials, but once this is in, we can also clean up a LOT of code 👍

Fokko avatar Jul 08 '25 20:07 Fokko

@yogevyuval Thanks for asking, and yes, I do expect performance impact since just a part of the deserialization is cythonized. With this change, much more is pushed into Rust. This will also reduce the GIL pressure since we don't have to build the readers anymore :)

I left this PR small to focus on the essentials, but once this is in, we can also clean up a LOT of code 👍

Interesting, we have quite a bit of concurrency going on so curious to see the results. Once https://github.com/apache/iceberg-rust/pull/1328 gets merged and release i'll try and create some basic benchmarks and share the results

yogevyuval avatar Jul 20 '25 22:07 yogevyuval

Waiting for this one https://github.com/apache/iceberg-rust/issues/1587 as it will make it much easier for PyIceberg to read both versions (and less work on the PyIceberg side :)

Fokko avatar Aug 21 '25 19:08 Fokko

For documentation - I tested out 0.7rc and in a manifest with 50K entries the new rust version in python took more than 2X than the original version. I'm not sure if that's because a different parsing strategy (laziness) as @Fokko suggested, or something regarding objects transefrring from rust to python, but I think we should investigate before merging, as this code path is already one of the slowest in pyiceberg

yogevyuval avatar Sep 27 '25 20:09 yogevyuval

@yogevyuval That's not good news, but thanks for checking 👍 I agree that we should find out what's going on before merging this. Thanks for doing some testing, that's very much appreciated

Fokko avatar Oct 21 '25 18:10 Fokko