iceberg-python

Check write snapshot compatibility

Open Fokko opened this issue 9 months ago • 6 comments

Feature Request / Improvement

Java and Python have a different approach here. I don't have all the historical context, but prior to Iceberg V2 tables, there was no such thing as operations:

Image

I think this is a good thing to validate against.

This should happen in the _commit method of the _SnapshotProducer. Similar to Java:

  • We should track what the current snapshot was when the table was initially loaded ([startingSnapshotId](https://github.com/apache/iceberg/blob/bcbbd0344623ffea5b092e2de5debb0bc12892a1/core/src/main/java/org/apache/iceberg/BaseReplacePartitions.java#L30) in Java).
  • We refresh the table, so we have the latest snapshots. We check whether any snapshots were added between the startingSnapshotId and the current-snapshot-id. If so, we call _validate() to check for conflicts.
  • Then we write out the manifest-list.
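The second step above can be sketched as a walk over parent pointers. This is only a minimal illustration under assumptions: `Snapshot` is a simplified stand-in rather than the pyiceberg class, and `snapshots_since` is a hypothetical helper name, not an existing function in the Python client.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Simplified stand-in for a table snapshot (hypothetical, not the pyiceberg class)."""

    snapshot_id: int
    parent_id: Optional[int]
    operation: str


def snapshots_since(
    snapshots: Dict[int, Snapshot],
    starting_snapshot_id: Optional[int],
    current_snapshot_id: Optional[int],
) -> List[Snapshot]:
    """Return the snapshots committed after `starting_snapshot_id`, newest first.

    Walks parent pointers from the refreshed current snapshot back towards the
    snapshot that was current when the table was loaded.
    """
    added: List[Snapshot] = []
    cursor = current_snapshot_id
    while cursor is not None and cursor != starting_snapshot_id:
        snapshot = snapshots[cursor]
        added.append(snapshot)
        cursor = snapshot.parent_id
    if cursor != starting_snapshot_id:
        # We reached the root without meeting the starting snapshot, so it is
        # not an ancestor of the current snapshot.
        raise ValueError(f"Snapshot {starting_snapshot_id} is not an ancestor of {current_snapshot_id}")
    return added
```

An empty result means nothing was committed concurrently and validation can be skipped. A non-empty result is what a `_validate()` step would inspect for conflicts.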

There's also a small section on conflict resolution.

- When doing an `Append`: Adding new data
  - All okay: `{Append,Replace,Overwrite,Delete}` don't affect the operation, and we can just append
- When doing a `Replace`: Replacing existing data (e.g. compaction)
  - Ok: Append
  - Not ok: Replace, Overwrite, Delete. We should fail, and later we can see if there is any overlap (e.g. compare if they touch the same partitions).
- When doing an `Overwrite`: Adding and deleting data
  - Not ok: Append, Replace, Overwrite, Delete. We should fail, and later we can see if there is any overlap (e.g. compare if they touch the same partitions).
- When doing a `Delete`
  - Not ok: Append, Replace, Overwrite, Delete. We should fail, and later we can see if there is any overlap (e.g. compare if they touch the same partitions/predicate). We should also take into account the difference between merge-on-read (MoR) and copy-on-write (CoW).
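The rules above can be written down as a lookup table, which keeps the "very simple cases" easy to extend later. This is a rough sketch, not the eventual implementation: the `Operation` enum here is a local stand-in, and `ALLOWED_CONCURRENT`, `CommitConflict`, and `validate` are hypothetical names chosen for illustration.

```python
from enum import Enum
from typing import Dict, FrozenSet, Iterable


class Operation(Enum):
    """Local stand-in for the snapshot operation types discussed above."""

    APPEND = "append"
    REPLACE = "replace"
    OVERWRITE = "overwrite"
    DELETE = "delete"


# For each operation being committed, the concurrent operations it tolerates.
ALLOWED_CONCURRENT: Dict[Operation, FrozenSet[Operation]] = {
    Operation.APPEND: frozenset(Operation),            # anything is fine
    Operation.REPLACE: frozenset({Operation.APPEND}),  # only appends are fine
    Operation.OVERWRITE: frozenset(),                  # fail on any concurrent commit
    Operation.DELETE: frozenset(),                     # fail on any concurrent commit
}


class CommitConflict(Exception):
    """Raised when a concurrent snapshot is incompatible with the pending commit."""


def validate(committing: Operation, concurrent: Iterable[Operation]) -> None:
    """Fail if any concurrently committed operation conflicts with `committing`."""
    allowed = ALLOWED_CONCURRENT[committing]
    for operation in concurrent:
        if operation not in allowed:
            raise CommitConflict(
                f"Cannot commit {committing.value}: concurrent {operation.value} found"
            )
```

Finer-grained checks, such as comparing the partitions touched, would replace the blanket failure for the operations currently mapped to an empty set.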

Let's only do the very simple cases at first, and then add the others one by one to keep each PR a reasonable size.

Once we have this in place, we can also do automatic retries: https://github.com/apache/iceberg-python/issues/269

Fokko avatar Feb 18 '25 10:02 Fokko

Let me know if anyone is interested in contributing this, otherwise I'll take a stab at it myself 🤗

Fokko avatar Feb 19 '25 20:02 Fokko

hey @Fokko, i'd be interested in contributing this if you haven't started already!

kaushiksrini avatar Feb 20 '25 02:02 kaushiksrini

@kaushiksrini I haven't, feel free to pick this up 👍

Fokko avatar Feb 21 '25 11:02 Fokko

@kaushiksrini Gentle ping, updates on this? I think a lot of folks would benefit from having this. If you don't have the time, I'm also happy to take a stab at it

Fokko avatar Mar 03 '25 08:03 Fokko

Hey @Fokko, actively working on this - should have a PR out soon. Had a few questions:

  1. From the _SnapshotProducer class, what function should I call to refresh the table and get the latest snapshot available?
  2. To fetch snapshots between two IDs, I see there is a utility function in Iceberg that returns a list. I couldn't find it in the Python client - could you point me to where this function would exist?

Thanks!

kaushiksrini avatar Mar 03 '25 13:03 kaushiksrini

Thanks @kaushiksrini for picking this up. My apologies, I missed this comment:

> From the _SnapshotProducer class, what function should I call to refresh the table and get the latest snapshot available?

Looking at the PR, you already found this :)

> To fetch snapshots between two IDs, I see there is a utility function in Iceberg that returns a list. I couldn't find it in the Python client - could you point me to where this function would exist?

I don't think we have this one; it can be added to snapshots.py.

Fokko avatar Mar 20 '25 19:03 Fokko

Let me close this PR in favor of #819 since they address the same issue.

Fokko avatar Jun 24 '25 16:06 Fokko