Core: Allow SnapshotProducer to skip uncommitted manifest cleanup after commit
Skips FastAppend manifest cleanup after successful commit if no retries have occurred, as no orphaned manifests could exist if no retries have occurred. This speeds up the happy path of commits by removing 2 unnecessary reads:
- table metadata READ
- manifest list READ
Context from slack thread: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1718381807647999
We are ingesting streaming data using a java service that does iceberg FastAppend We noticed about ~20% (YMMV) of the fastappend commit time for our usecase is spent on nonrequired cleanup operations, specifically this bit which FastAppend inherits from SnapshotProducer: https://github.com/apache/iceberg/blob/apache-iceberg-1.5.2/core/src/main/java/org/apache/iceberg/SnapshotProducer.java#L422-L439
Testing:
- I manually tested this by running
TestFastAppend.testRecoveryWithManifestListand verifying the cleanup bits are only run when a retry occurs.
Notes:
- we do not skip cleanup operations on commit failures (see:
cleanAll()) - diff best viewed with
?w=1: https://github.com/apache/iceberg/pull/10523/files?w=1
Found a problem with the approach.
This assumption is incorrect for all SnapshotProducer
no orphaned manifests could exist if no retries have occurred.
This is incorrect for MergingSnapshotProducer which merges manifests by writing the unmerged manifests, creating a new merged manifest, and marking the unmerged manifests for deletion. You can see this in MergingSnapshotProducer.apply()
See new commit for new approach. It requires SnapshotProducer to opt-in to this optimization with a method like protected boolean canSkipCleanupAfterCommitSuccess() which defaults to false. Suggestion is only latency sensitive operations like FastAppend would override this method.
This is incorrect for MergingSnapshotProducer which merges manifests by writing the unmerged manifests, creating a new merged manifest, and marking the unmerged manifests for deletion. You can see this in MergingSnapshotProducer.apply()
@grantatspothero Thank you for finding this, that's a good catch, I forgot about the merging manifest case. Did we have tests which fail with the original assumption implementation (due to not cleaning up) or is this something you got from reading the code? If we don't have tests which assert that the old manifests are removed after the merging of manifests, I'd advocate for adding tests for that case.
Also adding @rdblue for his input since he's had thoughts on eager cleanups, to make sure we're not missing anything here.
Tests caught this issue, was helpful 🙌 (one example: TestHadoopCommits.testMergeAppend)
Sorry for the delay on reviewing this @grantatspothero I'm taking a look with fresh eyes on the latest updates since the approach is different now
Thanks @grantatspothero the overall approach makes sense and this time it is closely dependent on the internal state of FastAppend which combined with the new tests should make it less brittle; if someone goes ahead and changes the logic, tests would fail. Just had some cleanups I think we should get in.
A question:
I'm not exactly sure how the "table metadata READ" and "manifest list READ" translate into calls to the underlying object store. Does "manifest list READ" result in a request to list all objects matching a particular prefix? And if so, could having many manifest files result in this operation becoming slow? In that case, I assume that rewriting manifests could help such a situation?
If not, what makes those operations slow? And how slow - approximately - are they, in absolute terms?
A question:
I'm not exactly sure how the "table metadata READ" and "manifest list READ" translate into calls to the underlying object store. Does "manifest list READ" result in a request to list all objects matching a particular prefix? And if so, could having many manifest files result in this operation becoming slow? In that case, I assume that rewriting manifests could help such a situation?
If not, what makes those operations slow? And how slow - approximately - are they, in absolute terms?
"table metadata READ" and "manifest list READ" are both single object storage GETs. so 2 extra network requests that are not needed to actually perform the commit.
Regarding how that affects total runtime, see the PR description:
We are ingesting streaming data using a java service that does iceberg FastAppend
We noticed about ~20% (YMMV) of the fastappend commit time for our usecase is spent on nonrequired cleanup operations, specifically this bit which FastAppend inherits from SnapshotProducer:
The extra network requests are definitely noticeable for fast appends of small files. For our usecase, the iceberg metadata files are large because there are lots of unexpired snapshots so fetching a large metadata file from s3 is slow and that exacerbates the problem.
But if you have small metadata files and are not using FastAppend then you probably do not care much about this optimization.
Thanks, @grantatspothero!