polars icon indicating copy to clipboard operation
polars copied to clipboard

Add Read support for Apache Iceberg

Open nazq opened this issue 2 years ago • 1 comments

Problem description

Would like to see read support for Apache Iceberg similar to the support for delta. Cc @asheeshgarg

nazq avatar Jan 14 '23 17:01 nazq

Discussion on Rust iceberg sdk https://github.com/apache/iceberg/issues/5122

nazq avatar Jan 14 '23 17:01 nazq

Is this feature already in the Polars team roadmap?

Guillem96 avatar Feb 13 '23 07:02 Guillem96

Could someone please advise where we can access the roadmap and verify if iceberg has been included in the Polars roadmap?

luancaarvalho avatar Jul 21 '23 16:07 luancaarvalho

I'd love to have it!

tomaszdudek7 avatar Jul 28 '23 05:07 tomaszdudek7

Would like to see read support for Apache Iceberg similar to the support for delta.

FYI: I'm looking at further generalising our database support in Python (ref: https://github.com/pola-rs/polars/issues/10121). Are you intending to use directly from Rust, or the usual Python API? (If the latter, which driver do you usually use?)

alexander-beedie avatar Jul 28 '23 07:07 alexander-beedie

@alexander-beedie, I noticed Delta is implemented in Python + Arrow.

Iceberg comitters are working on improving Arrow compatibility, but I wonder if you all prefer Rust support? This is also almost complete as well. What's the best way forward in your opinion as both options are available?

  • Brian (DevRel Iceberg)

bitsondatadev avatar Aug 03 '23 19:08 bitsondatadev

Also related: apache/iceberg#7067

bitsondatadev avatar Aug 03 '23 19:08 bitsondatadev

In my case, python @alexander-beedie

luancaarvalho avatar Aug 03 '23 19:08 luancaarvalho

My 2 cents on the topic, we can mimic the way we implemented things for delta and keep this on the python side of things via the scan_ds

chitralverma avatar Aug 04 '23 06:08 chitralverma

IMO, it'd be preferred to do this on the rust side. That way we can have support for it in sql, python, ....

a datafusion based project glardb recently added iceberg support. Looks like it may be easy to port their datafusion logic over to polars. The dependencies also seem pretty lightweight, only iceberg which only depends on apache-avro. (no dependencies on apache-arrow!)

universalmind303 avatar Aug 08 '23 02:08 universalmind303

I got a working version using PyIceberg here: https://github.com/pola-rs/polars/pull/10375

a datafusion based project glardb recently added iceberg support. Looks like it may be easy to port their datafusion logic over to polars.

Unfortunately, a lot of the Rust implementations out there are far from complete. Looking at the implementation at GlareDB, a couple of things are missing:

  • No field-id schema resolution, and therefore it cannot read tables with evolved schemas (or it causes correctness issues!)
  • Pruning of partitions
  • Pruning of files, by using the metrics stored in the manifest files

The implementation only does file discovery, which is a pity since Iceberg has so much to offer.

Fokko avatar Aug 08 '23 18:08 Fokko

Iceberg comitters are working on improving Arrow compatibility, but I wonder if you all prefer Rust support?

Apologies, and thanks for reaching out to us! This seemed to slip my radar; looks like @Fokko is well underway building a scan_ based approach in the Rust layer already, which looks exciting. For my own part I was looking at generalising our SQL interop so that we can handle user-instantiated connections. I'm primarily looking into Python-side SQL connectivity (I think there is also a SQL interface to Iceberg? I could be conflating DuckDB interacting with an underlying Iceberg store for native SQL though ;)

alexander-beedie avatar Aug 09 '23 07:08 alexander-beedie

Hey @alexander-beedie, thanks for jumping in here. What kind of SQL interface are you thinking of?

We're supporting SQL-like syntax for the expressions:

large_rides_in_march = tbl.scan().filter("dt >= '2023-01-01' and dt < '2023-04-01' and passenger_count > 4").to_arrow()

We could also accept this kind of expression Polars. Also, DuckDB has recently opened up their Iceberg support, however, this is also still in an experimental state and many of the features that make Iceberg shine are still missing.

Fokko avatar Aug 09 '23 11:08 Fokko

Hey @alexander-beedie, thanks for jumping in here. What kind of SQL interface are you thinking of? // ...

Great, I see it now - thanks for the clarification. I think I was conflating a quickly-scanned article involving DuckDB & Iceberg with there actually being a fully-fledged SQL query interface (and therefore also Python-side DBAPI -or equivalent- drivers), in which case things would "just work" once we start accepting user-created connection objects and their associated queries.

Looks to me like the scan_iceberg PR you're working on is the way to go here as it'll be able to take advantage of a fuller range of features; I will defer on the implementation details review/comments to those with more Rust expertise than myself (@ritchie46, @orlp, @universalmind303), as I am merely a casual dilettante in that arena ;)

alexander-beedie avatar Aug 09 '23 12:08 alexander-beedie