iceberg Core: checkpoint validation in BaseOverwriteFiles

Feature Request / Improvement

Request

The SnapshotProducer API provides the capability to validate that the snapshots in the latest table metadata don't introduce changes that conflict with the new snapshot being committed. BaseOverwriteFiles implements custom validation that validates only snapshots from a supplied startingSnapshotId. When the validation fails and the commit is retried, BaseOverwriteFiles::validate will validate the newly landed snapshots from the base metadata and then re-validate snapshots that it previously validated.

I'm requesting we checkpoint the validation in BaseOverwriteFiles so that, on commit retry, we can avoid revalidating snapshots that have been validated on the previous commit attempt. The change itself amounts to setting startingSnapshotId to the latest snapshot of the base table metadata after successful validation [code].

Motivation

We understand that iceberg is not optimized for a high frequency of commits. However, we have observed that workloads users spin up backfill jobs with ~10-100 workers, with each worker committing in parallel, frequently fail. If the total number of writes across all workers is 1k+, its possible for conflicts to cause commit retries. Even with an aggressive retry policy, writes can fail because on each retry, the validation step needs to revalidate all of the snapshots validated in the last attempt and validate the new snapshots that landed during the backoff period. The increasing duration of the validation step means that other writers are more likely to land commits as the number of retries increases.

This creates a bimodal distribution of commit latencies where commits either succeed relatively quickly or get stuck in an increasingly expensive validation cycle that eventually exhausts all available retries.

Most of these users are not sensitive to the latency of their backfill jobs and just don't want the jobs to fail (i.e. the latency of effectively serializing all of their commits does not matter). We've created a patched BaseOverwriteFiles implementation that performs checkpointing and have seen marked improvement in the failure rate of parallel backfill-style workloads. We want to eventually migrate some of these users to a compute engine that stages writes and performs a single commit.

--

If this change looks reasonable, I'm happy to file a PR that implements checkpointing for BaseOvewriteFiles and other SnapshotProducer subclasses which implement validation from a startingSnapshotId.

Query engine

None

Feb 12 '24 21:02 hrishisd

@hrishisd Thanks for reporting this, all the details provided make sense. The solution for keeping track of the last validated snapshot so retries are more efficient makes logical sense to me, but it would be great to see the code for that to see the exact mechanics of how it's done. If you're able to raise a PR, we can review it!

Feb 13 '24 23:02 amogh-jahagirdar

Sounds good, I'll give it a shot this weekend

Feb 15 '24 02:02 hrishisd

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

Oct 18 '24 00:10 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

Nov 01 '24 00:11 github-actions[bot]