Support commit retries
Feature Request / Improvement
Within Iceberg, when a commit fails because of a concurrent operation, we can retry the operation by loading the latest version of the snapshot and re-applying the operation.
A few suggestions on this feature: it would be good to have control over the number of retries and the retry strategy. After trying out a few retry libraries, I found tenacity to be one of the most complete, because it supports different retry options.
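To make the two knobs concrete (number of retries and wait strategy), here is a minimal standard-library sketch. It is not PyIceberg's or tenacity's API: CommitFailedError, exponential_waits, and retry are hypothetical names used only for illustration. tenacity exposes comparable options as composable objects such as stop_after_attempt and wait_exponential.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class CommitFailedError(Exception):
    """Hypothetical stand-in for the error raised on a conflicting commit."""


def exponential_waits(min_wait: float, max_wait: float, jitter: float = 0.0) -> Iterator[float]:
    """Yield waits that double from min_wait, capped at max_wait, plus optional jitter."""
    wait = min_wait
    while True:
        yield min(wait, max_wait) + random.uniform(0, jitter)
        wait *= 2


def retry(
    operation: Callable[[], T],
    max_retries: int,
    waits: Iterator[float],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying up to max_retries times on CommitFailedError."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except CommitFailedError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last failure
            sleep(next(waits))  # the wait strategy decides how long to back off
    raise AssertionError("unreachable")
```

The wait strategy is passed in as an iterator, so a fixed delay or a different backoff curve can be swapped in without touching the retry loop.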
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'
mark not stale
As a workaround, to manually retry commits, update the table metadata by using
table = table.refresh()
before calling commit() again
I have also experienced being unable to write to tables in highly distributed environments. Refreshing the table on its own, even combined with some retry logic, did not work. The solution we found to work involved:
- Refreshing the table.
- Creating a new transaction from the refreshed table.
- Generating new snapshots for the data files involved in the previous transaction.
- Trying to commit again; if that fails, going back to step 1.
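The steps above can be sketched roughly as follows. This is not the code used in that setup; it only assumes a table object with refresh() and transaction() methods and a transaction with commit_transaction(), modeled on PyIceberg's Table and Transaction. The helper name commit_with_refresh and its parameters are hypothetical.

```python
import time
from typing import Any, Callable, Optional, Type


def commit_with_refresh(
    table: Any,
    stage_changes: Callable[[Any], None],
    conflict_error: Type[Exception],
    max_attempts: int = 5,
    wait_seconds: float = 0.5,
) -> Any:
    """Re-run the whole refresh/stage/commit cycle until the commit succeeds."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        table = table.refresh()      # step 1: load the latest table metadata
        tx = table.transaction()     # step 2: new transaction on the refreshed table
        stage_changes(tx)            # step 3: re-stage the data files, producing new snapshots
        try:
            tx.commit_transaction()  # step 4: try to commit
            return table
        except conflict_error as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                time.sleep(wait_seconds)  # brief pause before starting over at step 1
    raise RuntimeError(f"commit failed after {max_attempts} attempts") from last_error
```

With PyIceberg this might be called with conflict_error set to pyiceberg.exceptions.CommitFailedException and a stage_changes callback such as lambda tx: tx.append(df), where df is the Arrow table being written.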
@maxlucuta Yes, that's what I have been using. It would be nice if there were an internal retry for the commit so the client application doesn't have to be polluted with retry logic.
It would be difficult to handle the local table metadata object if the retry were pushed to the library level. The commit would succeed, but the metadata object would point to a whole different snapshot.
Also interested in such a feature!
@maxlucuta would you be able to share what that code looks like? Really interested to try that strategy out.
I have implemented commit retry based on the Java implementation, including:
- Configurable commit retry settings via table properties
- Logic to reload the latest table metadata on CommitFailedException and rebuild snapshots and requirements without regenerating data files
- Retry handling for both autocommit and transaction code paths
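As an illustration of the first point, a sketch of reading retry settings from table properties could look like the block below. It assumes the property names and defaults match the Java implementation's commit.retry.* properties; the names actually used in the PR may differ, and CommitRetrySettings is a hypothetical class, not part of PyIceberg.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CommitRetrySettings:
    """Hypothetical container; defaults assumed to mirror the Java implementation."""

    num_retries: int = 4
    min_wait_ms: int = 100
    max_wait_ms: int = 60_000
    total_timeout_ms: int = 1_800_000

    @classmethod
    def from_properties(cls, properties: Mapping[str, str]) -> "CommitRetrySettings":
        """Read settings from table properties, falling back to the defaults."""
        defaults = cls()

        def read(key: str, default: int) -> int:
            # Table properties are stored as strings, so convert explicitly.
            return int(properties.get(key, default))

        return cls(
            num_retries=read("commit.retry.num-retries", defaults.num_retries),
            min_wait_ms=read("commit.retry.min-wait-ms", defaults.min_wait_ms),
            max_wait_ms=read("commit.retry.max-wait-ms", defaults.max_wait_ms),
            total_timeout_ms=read("commit.retry.total-timeout-ms", defaults.total_timeout_ms),
        )
```

Keeping the settings in table properties means each table can tune its own retry behavior without code changes in the writer.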
Looking at the current PR #330, it seems some time has passed since the last update, so I’d like to open a new PR from my branch, referencing that work.
Would this be an acceptable way to proceed?
@Fokko I have created a draft PR #2794 that implements the full set of changes. However, the diff is quite large, so I plan to split it into a series of smaller PRs as follows:
- Retry utilities
  - Add RetryConfig, run_with_retry, and run_with_suppressed_failure helpers.
  - Unit tests for the retry utilities.
- Transaction / Update refactoring (no behavior change)
  - Introduce _working_metadata, _pending_updates, and _StaticUpdate to Transaction.
  - Add abstract _reset_state method and TableRequirement.key() to enable replay.
  - All existing tests should pass with no behavior change.
- Transaction-level commit retry for Update* class
  - Add retry-related properties to TableProperties.
  - Implement _commit_with_retry() and _reapply_updates() in Transaction.
  - Implement _reset_state() and _operations for each Update* class (e.g., UpdateSchema, UpdateSpec) in order to maintain and refresh the operation states.
- SnapshotProducer retry
  - Refactor _SnapshotProducer to separate apply() and _build_updates().
  - Implement retry logic for autocommit operations.
  - Add _refresh_state() and _reset_state() for snapshot-specific state.
This will likely result in 5–6 PRs. If this overall direction looks reasonable, I would like to create the first PR for the retry utilities.