polars Add Read support for Apache Iceberg

Problem description

Would like to see read support for Apache Iceberg similar to the support for delta. Cc @asheeshgarg

Jan 14 '23 17:01 nazq

Discussion on Rust iceberg sdk https://github.com/apache/iceberg/issues/5122

Jan 14 '23 17:01 nazq

Is this feature already in the Polars team roadmap?

Feb 13 '23 07:02 Guillem96

Could someone please advise where we can access the roadmap and verify if iceberg has been included in the Polars roadmap?

Jul 21 '23 16:07 luancaarvalho

I'd love to have it!

Jul 28 '23 05:07 tomaszdudek7

Would like to see read support for Apache Iceberg similar to the support for delta.

FYI: I'm looking at further generalising our database support in Python (ref: https://github.com/pola-rs/polars/issues/10121). Are you intending to use directly from Rust, or the usual Python API? (If the latter, which driver do you usually use?)

Jul 28 '23 07:07 alexander-beedie

@alexander-beedie, I noticed Delta is implemented in Python + Arrow.

Iceberg comitters are working on improving Arrow compatibility, but I wonder if you all prefer Rust support? This is also almost complete as well. What's the best way forward in your opinion as both options are available?

Brian (DevRel Iceberg)

Aug 03 '23 19:08 bitsondatadev

Also related: apache/iceberg#7067

Aug 03 '23 19:08 bitsondatadev

In my case, python @alexander-beedie

Aug 03 '23 19:08 luancaarvalho

My 2 cents on the topic, we can mimic the way we implemented things for delta and keep this on the python side of things via the scan_ds

Aug 04 '23 06:08 chitralverma

IMO, it'd be preferred to do this on the rust side. That way we can have support for it in sql, python, ....

a datafusion based project glardb recently added iceberg support. Looks like it may be easy to port their datafusion logic over to polars. The dependencies also seem pretty lightweight, only iceberg which only depends on apache-avro. (no dependencies on apache-arrow!)

Aug 08 '23 02:08 universalmind303

I got a working version using PyIceberg here: https://github.com/pola-rs/polars/pull/10375

a datafusion based project glardb recently added iceberg support. Looks like it may be easy to port their datafusion logic over to polars.

Unfortunately, a lot of the Rust implementations out there are far from complete. Looking at the implementation at GlareDB, a couple of things are missing:

No field-id schema resolution, and therefore it cannot read tables with evolved schemas (or it causes correctness issues!)
Pruning of partitions
Pruning of files, by using the metrics stored in the manifest files

The implementation only does file discovery, which is a pity since Iceberg has so much to offer.

Aug 08 '23 18:08 Fokko

Iceberg comitters are working on improving Arrow compatibility, but I wonder if you all prefer Rust support?

Apologies, and thanks for reaching out to us! This seemed to slip my radar; looks like @Fokko is well underway building a scan_ based approach in the Rust layer already, which looks exciting. For my own part I was looking at generalising our SQL interop so that we can handle user-instantiated connections. I'm primarily looking into Python-side SQL connectivity (I think there is also a SQL interface to Iceberg? I could be conflating DuckDB interacting with an underlying Iceberg store for native SQL though ;)

Aug 09 '23 07:08 alexander-beedie

Hey @alexander-beedie, thanks for jumping in here. What kind of SQL interface are you thinking of?

We're supporting SQL-like syntax for the expressions:

large_rides_in_march = tbl.scan().filter("dt >= '2023-01-01' and dt < '2023-04-01' and passenger_count > 4").to_arrow()

We could also accept this kind of expression Polars. Also, DuckDB has recently opened up their Iceberg support, however, this is also still in an experimental state and many of the features that make Iceberg shine are still missing.

Aug 09 '23 11:08 Fokko

Hey @alexander-beedie, thanks for jumping in here. What kind of SQL interface are you thinking of? // ...

Great, I see it now - thanks for the clarification. I think I was conflating a quickly-scanned article involving DuckDB & Iceberg with there actually being a fully-fledged SQL query interface (and therefore also Python-side DBAPI -or equivalent- drivers), in which case things would "just work" once we start accepting user-created connection objects and their associated queries.

Looks to me like the scan_iceberg PR you're working on is the way to go here as it'll be able to take advantage of a fuller range of features; I will defer on the implementation details review/comments to those with more Rust expertise than myself (@ritchie46, @orlp, @universalmind303), as I am merely a casual dilettante in that arena ;)

Aug 09 '23 12:08 alexander-beedie

polars polars copied to clipboard

Add Read support for Apache Iceberg

Problem description

polars
polars copied to clipboard