
PyIceberg Cookbook

Open · kevinjqliu opened this issue 1 year ago

Feature Request / Improvement

It was brought up at the recent community sync that we should start a cookbook capturing different use cases with PyIceberg, similar to the Tabular Iceberg cookbook.

Starting this issue to track the creation of the cookbook and, more importantly, to collect the items people would like to see included in it.

Feel free to add suggestions below.

kevinjqliu avatar Sep 24 '24 17:09 kevinjqliu

Copying over from community sync

Cookbook suggestions

  • Support for incremental processing with "change table" (link)
  • Create a table like another table
  • Get data file references between two given snapshot ids or timestamps
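One way to approach the last suggestion is to walk the snapshot parent chain between the two given snapshot ids. Below is a catalog-free sketch of that walk, using plain dicts to stand in for PyIceberg `Snapshot` objects (field names mirror Iceberg snapshot metadata; with a real table you would start from `table.snapshots()`):

```python
# Sketch: collect the snapshot ids on the parent chain between two snapshots.
# Plain dicts stand in for PyIceberg Snapshot objects.

def snapshots_between(snapshots, from_id, to_id):
    """Return snapshot ids from just after `from_id` up to `to_id`, oldest first."""
    by_id = {s["snapshot_id"]: s for s in snapshots}
    chain = []
    current = to_id
    while current is not None and current != from_id:
        snap = by_id.get(current)
        if snap is None:
            raise ValueError(f"snapshot {current} not found in table metadata")
        chain.append(current)
        current = snap.get("parent_snapshot_id")
    if current != from_id:
        raise ValueError(f"{from_id} is not an ancestor of {to_id}")
    return list(reversed(chain))

history = [
    {"snapshot_id": 1, "parent_snapshot_id": None},
    {"snapshot_id": 2, "parent_snapshot_id": 1},
    {"snapshot_id": 3, "parent_snapshot_id": 2},
]
print(snapshots_between(history, 1, 3))  # -> [2, 3]
```

Once the chain is known, the data files added or removed by each snapshot in it can be read from that snapshot's manifests.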

kevinjqliu avatar Sep 24 '24 17:09 kevinjqliu

@kevinjqliu are you accepting contributions for this cookbook yet? Happy to help if so!

shiv-io avatar Oct 19 '24 16:10 shiv-io

Hi @shiv-io, yes, we're accepting contributions. We currently don't have a page set up for the cookbook yet.

kevinjqliu avatar Oct 19 '24 18:10 kevinjqliu

Hey! I'm creating a PoC using PyIceberg for a project. I'm quite interested in incremental processing.

For this, what I've used before were MERGE operations to update the table (I was using Delta with Spark at the time) with data from a DataFrame.

Is this possible yet? Something similar would be overwrite + overwrite_filter, but I can't really use that with a DataFrame; I'd have to pass the filter as a string, right? And in that case, an IN clause with thousands of IDs would degrade performance.

francocalvo avatar Nov 06 '24 13:11 francocalvo

hey @francocalvo, the MERGE operation is not yet supported (https://github.com/apache/iceberg-python/issues/402). For writes, PyIceberg currently supports append and overwrite. I think overwrite + overwrite_filter gets you close to the MERGE use case.

but I can't really use that with a DataFrame, I'd have to pass it as a string, right?

Writes work with PyArrow tables and DataFrames. I don't think you need to pass the data as a string.

And in that case, a IN clause with thousands of IDs would deteriorate performance

It depends on the exact logic, but we do apply optimizations such as filter pushdown to speed up reads and writes.
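To make the overwrite + overwrite_filter semantics concrete, here is a catalog-free sketch in plain Python: rows matching the filter are dropped and the new data is appended, which is what an upsert keyed on id amounts to. The PyIceberg analogue noted in the comments (`tbl.overwrite(new_rows, overwrite_filter=In("id", ids))` on a `pyarrow.Table`) is an assumption about how one would wire this up, not tested code:

```python
# Catalog-free emulation of overwrite + overwrite_filter semantics.
# With PyIceberg this would be roughly (assumed, untested here):
#     from pyiceberg.expressions import In
#     tbl.overwrite(new_rows, overwrite_filter=In("id", ids))
# where new_rows is a pyarrow.Table. Lists of dicts stand in for tables.

def overwrite_with_filter(table_rows, new_rows, match_ids):
    """Drop every row whose id is in match_ids, then append new_rows."""
    kept = [row for row in table_rows if row["id"] not in match_ids]
    return kept + list(new_rows)

current = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
updates = [{"id": 2, "v": "b2"}, {"id": 4, "v": "d"}]
result = overwrite_with_filter(current, updates, {row["id"] for row in updates})
# ids 1 and 3 are kept, id 2 is replaced, id 4 is appended
print(result)
```

Note that the filter here is an expression over a column, not a row-by-row join, which is why a large id set shows up as an IN clause.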

kevinjqliu avatar Nov 06 '24 17:11 kevinjqliu

Thank you for the prompt answer!

Writes work with PyArrow tables and DataFrames. I don't think you need to pass the data as a string.

Yes, what I mean is the case where I need to update an Iceberg table using an Arrow table. Elsewhere I've used a MERGE with a WHEN MATCHED UPDATE clause, which allowed me to 'soft-delete' old versions (it's an SCD Type 2 table). In some cases I need to update 10k+ rows in one go, matched on an ID. Reading the code, I see that I can write with Arrow tables, but not create filters from them.
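For reference, the SCD Type 2 step described above can be sketched without any engine at all: close out the current version of each matched id, then append the new version as current. Plain dicts stand in for rows, and the `is_current` / `valid_from` / `valid_to` column names are the usual SCD2 bookkeeping columns, chosen here for illustration:

```python
# SCD Type 2 upsert sketch: soft-delete the current version of each matched
# id by closing its validity window, then append the new version as current.
# Plain dicts stand in for table rows; column names are illustrative.

def scd2_upsert(rows, updates, now):
    updated_ids = {u["id"] for u in updates}
    out = []
    for row in rows:
        if row["is_current"] and row["id"] in updated_ids:
            # soft-delete: keep the row but close its validity window
            out.append({**row, "is_current": False, "valid_to": now})
        else:
            out.append(row)
    for u in updates:
        out.append({**u, "is_current": True, "valid_from": now, "valid_to": None})
    return out

rows = [{"id": 1, "v": "a", "is_current": True, "valid_from": 0, "valid_to": None}]
rows = scd2_upsert(rows, [{"id": 1, "v": "a2"}], now=5)
print(rows)  # old version closed at t=5, new version current
```

The "close the window" step is exactly the part that a filtered overwrite (or a future MERGE) would express as a single table operation instead of a row-by-row rewrite.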

In any case, I'm glad this exists and hope the cookbook creates a good starting point for people that are trying this out.

francocalvo avatar Nov 07 '24 13:11 francocalvo

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar May 07 '25 00:05 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar May 21 '25 00:05 github-actions[bot]