PyIceberg Cookbook
Feature Request / Improvement
It was brought up at the recent community sync that we should start a cookbook to capture different use cases with PyIceberg, similar to the Tabular Iceberg cookbook.
Starting this issue to track the creation of the cookbook and, more importantly, the items people would like to see included in it.
Feel free to add suggestions below.
Copying over from community sync
Cookbook suggestions
- Support for incremental processing with "change table" (link)
- Create a table like another table
- Get data file references between two given snapshot ids or timestamps
@kevinjqliu are you accepting contributions for this cookbook yet? Happy to help if so!
Hi @shiv-io, yes, we're accepting contributions. We don't have a page set up for the cookbook yet.
Hey! I'm creating a PoC using PyIceberg for a project. I'm quite interested in incremental processing.
For this, what I've used before is MERGE operations to update the table with data from a DataFrame (I was using Delta with Spark at the time).
Is this possible yet? Something similar would be overwrite + overwrite_filter, but I can't really use that with a DataFrame, I'd have to pass the filter as a string, right? And in that case, an IN clause with thousands of IDs would degrade performance.
hey @francocalvo
the MERGE operation is not yet supported (https://github.com/apache/iceberg-python/issues/402)
For writes, PyIceberg currently supports append and overwrite. I think overwrite + overwrite_filter gets you close to the MERGE use case.
but I can't really use that with a DataFrame, I'd have to pass it as a string, right?
Writes work with PyArrow tables and dataframes. I don't think you need to pass it as a string.
And in that case, an IN clause with thousands of IDs would degrade performance
It depends on the exact logic, but we do some optimizations, such as filter pushdown, to speed up reads and writes.
Thank you for the prompt answer!
Writes work with PyArrow tables and dataframes. I don't think you need to pass it as a string.
Yes, what I mean is when I need to update an Iceberg table using an Arrow table. In other cases I used a MERGE with a WHEN MATCHED UPDATE clause. This allowed me to 'soft-delete' old versions (it's an SCD Type 2 table). In some cases I need to update 10k+ rows in one go, matching them on an ID. Reading the code, I see that I can write with Arrow tables, but not derive filters from them.
In any case, I'm glad this exists and hope the cookbook creates a good starting point for people that are trying this out.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'