[Feature] [core] Iceberg: createMetadataWithoutBase should include full history of Paimon snapshots
Search before asking
- [x] I searched in the issues and found nothing similar.
Motivation
Code path: https://github.com/apache/paimon/blob/ff02f6bf3ceccf8dcac38bc58cf6db390509bd46/paimon-core/src/main/java/org/apache/paimon/iceberg/IcebergCommitCallback.java#L305C49-L305C59
createMetadataWithoutBase is called under certain conditions, such as:
- enabling Iceberg compatibility for the first time
- committing Iceberg metadata after a previous commit failed in the Iceberg layer (for many possible reasons)
In some cases the Paimon table will have many snapshots already (e.g. snapshots 1 to 1000) and after calling createMetadataWithoutBase, only the latest Paimon commit will be synced to Iceberg.
By syncing the whole Paimon history to Iceberg, the Iceberg compatibility feature becomes suitable for production use cases that require Iceberg time travel.
My use case is this:
- Sync MySQL tables to Paimon using Flink CDC
- Tag daily Paimon snapshots automatically
- Iceberg readers read daily snapshots
When a failure happens in the Iceberg committer, the metadata needs to be recreated on the next commit, and only the latest Paimon snapshot is included. The daily snapshots that the Iceberg readers rely on are then missing from Iceberg and cannot be recovered.
Solution
A simple solution may exist. Assume snapshot range [x, y]:
If creating metadata without base:
- createMetadataWithoutBase for the earliest snapshot, x
- for (i = x + 1; i <= y; i++) call createMetadataWithBase(i), since x is already covered by the first step
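The replay order above can be sketched as follows. This is only an illustration of the proposed control flow, not the actual IcebergCommitCallback code: the `MetadataWriter` interface, its method names, and the `trace` helper are hypothetical stand-ins, and the real createMetadataWithoutBase / createMetadataWithBase methods take different arguments.

```java
import java.util.ArrayList;
import java.util.List;

public class IcebergHistoryReplaySketch {

    /** Hypothetical stand-in for the two metadata code paths discussed in this issue. */
    interface MetadataWriter {
        void createWithoutBase(long snapshotId);

        void createWithBase(long snapshotId);
    }

    /**
     * Replays the Paimon snapshot range [earliest, latest]: the earliest snapshot
     * is written without a base, every later snapshot is written on top of the
     * metadata produced for its predecessor.
     */
    static void replay(long earliest, long latest, MetadataWriter writer) {
        if (earliest > latest) {
            throw new IllegalArgumentException("empty snapshot range");
        }
        writer.createWithoutBase(earliest);
        // Start at earliest + 1 because the earliest snapshot is already synced above.
        for (long id = earliest + 1; id <= latest; id++) {
            writer.createWithBase(id);
        }
    }

    /** Records the calls made by replay so the ordering can be inspected. */
    static List<String> trace(long earliest, long latest) {
        List<String> calls = new ArrayList<>();
        replay(earliest, latest, new MetadataWriter() {
            @Override
            public void createWithoutBase(long id) {
                calls.add("without-base:" + id);
            }

            @Override
            public void createWithBase(long id) {
                calls.add("with-base:" + id);
            }
        });
        return calls;
    }

    public static void main(String[] args) {
        // Prints [without-base:1, with-base:2, with-base:3]
        System.out.println(trace(1, 3));
    }
}
```

With a single existing snapshot (x == y) the loop body never runs, so the behavior would match what happens today.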
Anything else?
Consider making this feature opt-in via configuration, as syncing many Paimon snapshots to Iceberg may be costly and could exceed the Flink checkpoint timeout.
Are you willing to submit a PR?
- [x] I'm willing to submit a PR!
@LsomeYeah what do you think about this issue and the suggested solution?
@nickdelnano I think this case is reasonable, and I also prefer adding a new option.
I investigated this a bit. Syncing all Paimon snapshots to Iceberg looks possible. I realized I have another requirement in my use case around Paimon tags and Iceberg compatibility.
More details on my use case:
- Sync MySQL tables to Paimon using Flink CDC
- Tag daily Paimon snapshots automatically
- Iceberg readers read daily snapshots
Table configuration
snapshot.time-retained: 7d
tag.num-retained-max: 100
tag.automatic-creation: process-time
tag.creation-period: daily
Once snapshot.time-retained has elapsed, snapshots are expired from Paimon, but the daily tags still exist. I need all tags to be available in Iceberg. IcebergCommitCallback only considers snapshots that still exist in the Paimon table, so this does not work. I will look into this next.