iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

Implement Snapshot validation API for commits

Open aiborodin opened this issue 1 month ago • 6 comments

This change implements the ability to add custom Snapshot validations to the existing SnapshotUpdate API.

It focuses on reusing existing validation APIs in SnapshotProducer and removing code duplication, while providing enough flexibility for clients, such as Kafka Connect and Flink.

It is an alternative solution to https://github.com/apache/iceberg/pull/14509.

Why?

Custom Snapshot validation is necessary for non-idempotent table update operations, which rely on the existing state for correctness and exactly-once delivery. Applications like Flink and Kafka Connect use snapshot properties to store their idempotence keys, which identify the base state during recovery. Due to the nature of concurrent commits in Iceberg, these applications need the ability to check information of the new base snapshots to identify idempotence violations. This change addresses this problem and allows clients to implement custom idempotence validations.

How?

This change achieves the following:

  1. Add the new validateWith(Consumer<Snapshot> snapshotValidator) method to the SnapshotUpdate to allow custom Snapshot validations.
  2. Move the existing duplicate definitions of validateFromSnapshot(long snapshotId) from child interfaces (OverwriteFiles, ReplacePartitions, and RowDelta, etc.) to the parent SnapshotUpdate and remove duplicate definitions.
  • This method allows configuring the starting Snapshot for validation, which is used by the Kafka Connect PR (https://github.com/apache/iceberg/pull/14517) and other validations.
  1. Implement snapshot validation in the base class SnapshotProducer::validate(TableMetadata currentMetadata, Snapshot snapshot), allowing child classes to inherit the validation functionality.

This approach results in maximised code reuse and aligns with the existing validation functionality.

Impact?

This change is used as a parent in the following PRs:

  1. Implement commit validations for Kafka connect: https://github.com/apache/iceberg/pull/14515
  2. Fix commit duplication issue in Flink: https://github.com/apache/iceberg/pull/14517

aiborodin avatar Nov 06 '25 07:11 aiborodin

@rdblue What do you think?

aiborodin avatar Nov 06 '25 22:11 aiborodin

I just thought of another issue with this approach. This use of startingSnapshotId conflicts with other uses. Here, it determines the starting point of validation, which is where the table state was when the transaction started (because KC determined what to commit and the current offset from the current snapshot). In other places, it is used for the table version that was actually read -- which may not be current when the operation starts. I think that difference could lead to needing two separate values when combining history validation with other validations that use startingSnapshotId.

rdblue avatar Nov 07 '25 00:11 rdblue

I think that difference could lead to needing two separate values when combining history validation with other validations that use startingSnapshotId.

Wouldn't both the history validation and other validations want to check the same set of parent snapshots? I can't think of any use case where someone would want to validate different ranges of new base snapshots, and it feels like in both cases we should have the same API entry point for setting the starting snapshot. What do you think?

aiborodin avatar Nov 07 '25 04:11 aiborodin

I think that the starting snapshot ID is often based on what an operation read. This validation is likely going to be based on the current version when the transaction started. Since those are different, I would not reuse starting snapshot ID.

rdblue avatar Nov 07 '25 17:11 rdblue

This validation is likely going to be based on the current version when the transaction started.

But would it be reasonable for the operation that has read a different version (not the current one) to run the custom validation from that read version? Why would it want to run the custom validation from the current instead of the read version if it based all its update assumptions on that read version?

aiborodin avatar Nov 10 '25 05:11 aiborodin

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Dec 11 '25 00:12 github-actions[bot]

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Dec 19 '25 00:12 github-actions[bot]