iceberg-python icon indicating copy to clipboard operation
iceberg-python copied to clipboard

add support for DuckDB views as a valid data format

Open djouallah opened this issue 1 year ago • 3 comments

Feature Request / Improvement

arrow table is taking a lot of memory and crash the system with any non trivial amount of data please add DuckDB views a valid data source format, I had the same issue with Delta_rs writer, arrow us cool and all but nothing beat native.

djouallah avatar Feb 10 '24 11:02 djouallah

@djouallah Thanks for reporting, could you provide some more specifics on the tests you are running which are exceeding memory? If you're able to share queries, the scale of the tables involved, and the hardware specs of what you're running on, that would be helpful context.

Also sorry I'm not familiar with DuckDB views but at first glance they look like typical logical views. Iceberg has defined a spec for views which defines common metadata which can be used across engines (see view spec). There's a Java implementation of the spec, with spark engine integration slated for the Java 1.5 release, and Trino as well.

I think if there was an Iceberg view representation which referenced the DuckDB view/queries you wanted then you could get what you wanted by first getting the representation for duckdb and then performing the computation on the view which could execute against DuckDB natively in your own code.

That would require Python support for Iceberg views. Before all that though, I encourage if you can share all the test details you can so that way we can make sure we're doing things as intelligently as possible with our current Arrow machinery.

amogh-jahagirdar avatar Feb 12 '24 05:02 amogh-jahagirdar

here is a reproducible example, in real life I loaded 2100 files which generate an OOM. https://colab.research.google.com/drive/1yultO4Wc94kJMhnTzwa5Gcrfp8VeDubg#scrollTo=TZ29HGpXx6BN for duckdb views, I mean a way to get data from duckdb for Python Iceberg without necessarily materialized it. think of it as arrow dataset, not table

djouallah avatar Feb 12 '24 05:02 djouallah

Sounds like the ask here is for similar functionality in duckdb as was implemented in polars scan_iceberg. This relates also the previously discussed PyArrow Dataset protocol -- not sure if that's still under consideration?

I know there's a c++ duckdb iceberg extension but that seems like it's not being actively developed; having a lazy bridge to iceberg in duckdb via pyiceberg would thus still be really useful.

corleyma avatar Feb 29 '24 07:02 corleyma