iceberg-python
Support Snapshot Management Operations
Feature Request / Improvement
The following operations are supported in Spark:
- rollback_to_snapshot (set_ref_snapshot)
- rollback_to_timestamp (set_ref_snapshot)
- set_current_snapshot (set_ref_snapshot)
- cherrypick_snapshot
- publish_changes
- fast_forward
set_ref_snapshot support will be introduced, but it would also be nice to add snapshot migration operations that generate new snapshots from existing ones.
@syun64 I have a few questions about the operations and I couldn't find more info in the docs. Apologies if these have been answered elsewhere.
- How does rollback_to_timestamp use set_ref_snapshot()? In the rollback_to_timestamp documentation, inputs are Table and timestamp, and I can't find a snapshot_by_timestamp() api to get the snapshot_id.
- for cherrypick_snapshot and publish_changes (docs), wouldn't we need an add_snapshot() table api? I noticed both add_snapshot() and set_ref_snapshot() were removed in the same PR. Do we bring back add_snapshot() as well?
- How does rollback_to_timestamp use set_ref_snapshot()? In the rollback_to_timestamp documentation, inputs are Table and timestamp, and I can't find a snapshot_by_timestamp() api to get the snapshot_id.
- Yeah I think it would be helpful to introduce a snapshot_by_timestamp utility function to get the snapshot - just like you mentioned, that would help recover feature parity with the existing Java API
I don't think add_snapshot needs to be an API because the function is incredibly simple, in that it just adds an AddSnapshotUpdate table update and an AssertTableUUID requirement. I think instead, functions like cherrypick_snapshot and publish_changes should be separate APIs that build the new snapshot and then make a commit with the updated snapshot. WDYT?
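To make the point in the comment above concrete, here is a minimal sketch of how small such a helper would be. The classes below are simplified stand-ins defined only for this illustration (they borrow the names AddSnapshotUpdate and AssertTableUUID from the comment but are not the library's classes), and the add_snapshot signature is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


# Simplified stand-ins for illustration only; not the real library classes.
@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int


@dataclass(frozen=True)
class AddSnapshotUpdate:
    snapshot: Snapshot


@dataclass(frozen=True)
class AssertTableUUID:
    uuid: str


def add_snapshot(table_uuid: str, snapshot: Snapshot) -> Tuple[tuple, tuple]:
    """Return the (updates, requirements) pair that a commit would carry."""
    updates = (AddSnapshotUpdate(snapshot=snapshot),)
    requirements = (AssertTableUUID(uuid=table_uuid),)
    return updates, requirements


# Hypothetical usage: one update plus one requirement, nothing more.
updates, requirements = add_snapshot("table-uuid-1234", Snapshot(snapshot_id=42))
```

Since the helper only pairs one update with one requirement, the interesting work would sit in the higher-level operations that build the snapshot in the first place.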
Yeah I think it would be helpful to introduce a snapshot_by_timestamp utility function to get the snapshot - just like you mentioned, that would help recover feature parity with the existing Java API
Yeah, looking at the Java api, something like findLatestAncestorOlderThan makes sense.
WDYT?
Sounds good to me!
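For readers following along, a snapshot_by_timestamp-style lookup could be sketched roughly as below. This is only an illustration under stated assumptions: the Snapshot dataclass, the dictionary of snapshots keyed by id, and the function name latest_ancestor_older_than are all hypothetical and not part of any existing API. The idea is to walk the parent chain from the current snapshot and keep the newest ancestor committed strictly before the given timestamp.

```python
from dataclasses import dataclass
from typing import Dict, Optional


# Hypothetical, simplified snapshot record used only for this sketch.
@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    timestamp_ms: int


def latest_ancestor_older_than(
    snapshots: Dict[int, Snapshot],
    current_id: Optional[int],
    timestamp_ms: int,
) -> Optional[Snapshot]:
    """Return the newest ancestor of the current snapshot committed
    strictly before timestamp_ms, or None if there is none."""
    best: Optional[Snapshot] = None
    snapshot_id = current_id
    while snapshot_id is not None:
        snap = snapshots.get(snapshot_id)
        if snap is None:
            break  # the parent is no longer tracked (e.g. expired)
        if snap.timestamp_ms < timestamp_ms and (
            best is None or snap.timestamp_ms > best.timestamp_ms
        ):
            best = snap
        snapshot_id = snap.parent_id
    return best


# Hypothetical linear history: 1 <- 2 <- 3 (current).
history = {
    1: Snapshot(1, None, 1000),
    2: Snapshot(2, 1, 2000),
    3: Snapshot(3, 2, 3000),
}
target = latest_ancestor_older_than(history, current_id=3, timestamp_ms=2500)
```

A rollback_to_timestamp operation could then pass the returned snapshot's id to whatever sets the branch reference.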
Hi @syun64 @chinmay-bhat. Thank you for driving these features. I saw some PRs raised by @chinmay-bhat: #728, #748, #750, #758. Those are great!
I would like to discuss what APIs we want to expose to users. I am actually leaning toward removing set_ref_snapshot from the public API. The reason is that set_ref_snapshot itself requires too many details from users and is thus error-prone. It should sit behind APIs like rollback_to_snapshot, create_tag, and cherryPick, which have more specific functionality and are simpler to use.
We may also want to group these APIs into a class (e.g. ManageSnapshots), like the Java implementation. So the resulting user experience with these APIs will be
    with table.transaction() as txn:
        # ... other updates
        with txn.manage_snapshots() as ms:
            ms.createTag("Tag_A", 0)
            ms.createTag("Tag_B", 1)
We could also keep table.manage_snapshots() for convenience, similar to the usage of UpdateSchema.
What do you think about this? Appreciate any thoughts and suggestions!
Also kindly looping @Fokko to this thread.
@HonahX thank you for your response.
I agree that we should hide set_ref_snapshot from the public API. I also like the idea of creating a ManageSnapshots inner class in Transaction to organise the APIs, while exposing them through Transaction and Table.
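As a rough illustration of the nested context-manager shape proposed above, here is a minimal, self-contained sketch. The Transaction and ManageSnapshots classes below are hypothetical stand-ins written for this example (not the library's implementation), ManageSnapshots is a module-level class rather than an inner class for brevity, and create_tag uses snake_case as an assumed Python spelling of createTag. Staged operations are handed to the enclosing transaction only when the inner block exits without an exception.

```python
from typing import List, Tuple

Operation = Tuple[str, str, int]


class Transaction:
    """Hypothetical stand-in that only collects pending changes."""

    def __init__(self) -> None:
        self.pending: List[Operation] = []
        self.committed = False

    def manage_snapshots(self) -> "ManageSnapshots":
        return ManageSnapshots(self)

    def __enter__(self) -> "Transaction":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Mark as committed only on a clean exit; exceptions propagate.
        if exc_type is None:
            self.committed = True


class ManageSnapshots:
    """Hypothetical grouping of snapshot operations behind one object."""

    def __init__(self, txn: Transaction) -> None:
        self._txn = txn
        self._staged: List[Operation] = []

    def create_tag(self, name: str, snapshot_id: int) -> "ManageSnapshots":
        self._staged.append(("create_tag", name, snapshot_id))
        return self

    def __enter__(self) -> "ManageSnapshots":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Hand staged operations to the transaction only on a clean exit.
        if exc_type is None:
            self._txn.pending.extend(self._staged)


with Transaction() as txn:
    with txn.manage_snapshots() as ms:
        ms.create_tag("Tag_A", 0)
        ms.create_tag("Tag_B", 1)
```

Keeping the low-level reference update private and exposing only named operations like this means callers never have to supply the reference type or branch details themselves.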