polars
polars copied to clipboard
Add Read support for Apache Iceberg
Problem description
Would like to see read support for Apache Iceberg similar to the support for delta. Cc @asheeshgarg
Discussion on Rust iceberg sdk https://github.com/apache/iceberg/issues/5122
Is this feature already in the Polars team roadmap?
Could someone please advise where we can access the roadmap and verify if iceberg has been included in the Polars roadmap?
I'd love to have it!
Would like to see read support for Apache Iceberg similar to the support for delta.
FYI: I'm looking at further generalising our database support in Python (ref: https://github.com/pola-rs/polars/issues/10121). Are you intending to use directly from Rust, or the usual Python API? (If the latter, which driver do you usually use?)
@alexander-beedie, I noticed Delta is implemented in Python + Arrow.
Iceberg comitters are working on improving Arrow compatibility, but I wonder if you all prefer Rust support? This is also almost complete as well. What's the best way forward in your opinion as both options are available?
- Brian (DevRel Iceberg)
Also related: apache/iceberg#7067
In my case, python @alexander-beedie
My 2 cents on the topic, we can mimic the way we implemented things for delta and keep this on the python side of things via the scan_ds
IMO, it'd be preferred to do this on the rust side. That way we can have support for it in sql, python, ....
a datafusion based project glardb recently added iceberg support. Looks like it may be easy to port their datafusion logic over to polars. The dependencies also seem pretty lightweight, only iceberg
which only depends on apache-avro
. (no dependencies on apache-arrow
!)
I got a working version using PyIceberg here: https://github.com/pola-rs/polars/pull/10375
a datafusion based project glardb recently added iceberg support. Looks like it may be easy to port their datafusion logic over to polars.
Unfortunately, a lot of the Rust implementations out there are far from complete. Looking at the implementation at GlareDB, a couple of things are missing:
- No field-id schema resolution, and therefore it cannot read tables with evolved schemas (or it causes correctness issues!)
- Pruning of partitions
- Pruning of files, by using the metrics stored in the manifest files
The implementation only does file discovery, which is a pity since Iceberg has so much to offer.
Iceberg comitters are working on improving Arrow compatibility, but I wonder if you all prefer Rust support?
Apologies, and thanks for reaching out to us! This seemed to slip my radar; looks like @Fokko is well underway building a scan_
based approach in the Rust layer already, which looks exciting. For my own part I was looking at generalising our SQL interop so that we can handle user-instantiated connections. I'm primarily looking into Python-side SQL connectivity (I think there is also a SQL interface to Iceberg? I could be conflating DuckDB interacting with an underlying Iceberg store for native SQL though ;)
Hey @alexander-beedie, thanks for jumping in here. What kind of SQL interface are you thinking of?
We're supporting SQL-like syntax for the expressions:
large_rides_in_march = tbl.scan().filter("dt >= '2023-01-01' and dt < '2023-04-01' and passenger_count > 4").to_arrow()
We could also accept this kind of expression Polars. Also, DuckDB has recently opened up their Iceberg support, however, this is also still in an experimental state and many of the features that make Iceberg shine are still missing.
Hey @alexander-beedie, thanks for jumping in here. What kind of SQL interface are you thinking of? // ...
Great, I see it now - thanks for the clarification. I think I was conflating a quickly-scanned article involving DuckDB & Iceberg with there actually being a fully-fledged SQL query interface (and therefore also Python-side DBAPI -or equivalent- drivers), in which case things would "just work" once we start accepting user-created connection objects and their associated queries.
Looks to me like the scan_iceberg
PR you're working on is the way to go here as it'll be able to take advantage of a fuller range of features; I will defer on the implementation details review/comments to those with more Rust expertise than myself (@ritchie46, @orlp, @universalmind303), as I am merely a casual dilettante in that arena ;)