iceberg-python

Support commit retries

Open Fokko opened this issue 1 year ago • 4 comments

Feature Request / Improvement

Within Iceberg, when a commit fails because of a concurrent operation, we can retry by loading the latest version of the snapshot and reapplying the operation.

Fokko avatar Jan 16 '24 11:01 Fokko

A few suggestions on this feature: it would be good to have control over the number of retries and the retry strategy. After trying out a few retry libraries, I found tenacity to be one of the most complete, because it offers several different options for retrying.
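To illustrate the two knobs mentioned here (retry count and retry strategy), below is a minimal standard-library sketch. `RetryPolicy` and `call_with_retry` are hypothetical names invented for this example; they are not part of PyIceberg or tenacity. tenacity exposes comparable options through its stop and wait strategies.

```python
import random
import time
from dataclasses import dataclass
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry settings; not an existing PyIceberg API."""

    max_attempts: int = 4
    min_wait_s: float = 0.1
    max_wait_s: float = 60.0
    jitter: bool = True

    def wait_for(self, attempt: int) -> float:
        """Exponential backoff: min_wait_s * 2**(attempt - 1), capped at max_wait_s."""
        wait = min(self.min_wait_s * 2 ** (attempt - 1), self.max_wait_s)
        # Optional jitter spreads out writers that failed at the same moment.
        return random.uniform(0, wait) if self.jitter else wait


def call_with_retry(
    fn: Callable[[], T],
    policy: RetryPolicy,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds or the policy's attempts are exhausted."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == policy.max_attempts:
                raise  # out of attempts: surface the last error
            sleep(policy.wait_for(attempt))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays; in an application the default `time.sleep` would be used.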

nicor88 avatar Jan 16 '24 17:01 nicor88

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] avatar Jul 15 '24 00:07 github-actions[bot]

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

github-actions[bot] avatar Jul 29 '24 00:07 github-actions[bot]

mark not stale

sungwy avatar Aug 21 '24 15:08 sungwy

As a workaround, to manually retry commits, update the table metadata by using

table = table.refresh()

before calling commit() again

kevinjqliu avatar Oct 09 '24 17:10 kevinjqliu

We have also experienced not being able to write to tables in highly distributed environments. Refreshing the table in isolation, even with some added retry logic, did not work. The solution that worked for us involved:

  1. Refreshing the table.
  2. Creating a new transaction from the refreshed table.
  3. Generating new snapshots for the data files involved in the previous transaction.
  4. Trying to commit again; if that fails, going back to step 1.
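The four steps above could be sketched roughly as follows. This is not the code used by the commenter, only a minimal illustration; `commit_with_refresh` and `apply_changes` are hypothetical names, and the table is assumed to expose PyIceberg-style `refresh()`, `transaction()`, and `commit_transaction()` methods.

```python
from typing import Any, Callable, Type


def commit_with_refresh(
    table: Any,
    apply_changes: Callable[[Any], None],
    conflict_error: Type[BaseException],
    max_attempts: int = 5,
) -> int:
    """Rebuild the transaction from fresh metadata on every attempt.

    Returns the number of the attempt that succeeded.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        table.refresh()              # step 1: load the latest table metadata
        txn = table.transaction()    # step 2: new transaction on the refreshed table
        apply_changes(txn)           # step 3: re-stage the data, producing new snapshots
        try:
            txn.commit_transaction()  # step 4: try to commit
            return attempt
        except conflict_error:
            if attempt == max_attempts:
                raise                # give up and surface the conflict
    raise AssertionError("unreachable")
```

With PyIceberg this might be called as `commit_with_refresh(table, lambda txn: txn.append(df), CommitFailedException)`, assuming `CommitFailedException` is imported from `pyiceberg.exceptions` and `df` is the Arrow table being written. A real deployment would probably also want a backoff between attempts.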

maxlucuta avatar Oct 09 '24 18:10 maxlucuta

@maxlucuta Yes, that's what I have been using. It would be nice if there were an internal retry for the commit, so the client application doesn't have to be polluted with retry logic.

Handling the local table metadata object would be difficult if the retry were pushed to the library level: the commit would succeed, but the metadata object would point to a completely different snapshot.

mark-major avatar Oct 10 '24 13:10 mark-major

Also interested in such a feature!

marieamelie94 avatar Jan 16 '25 10:01 marieamelie94

@maxlucuta would you be able to share what that code looks like? Really interested to try that strategy out.

potatochipcoconut avatar Jun 21 '25 01:06 potatochipcoconut

I have implemented commit retry based on the Java implementation, including:

  • Configurable commit retry settings via table properties
  • Logic to reload the latest table metadata on CommitFailedException and rebuild snapshots and requirements without Datafile regeneration
  • Retry handling for both autocommit and transaction code paths
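For context, the Java implementation reads its retry settings from the `commit.retry.*` table properties. The sketch below shows one way such properties could be parsed in Python; the property names and defaults follow the Iceberg table configuration documentation, while `CommitRetrySettings` is a hypothetical class for illustration and not necessarily what the branch implements.

```python
from dataclasses import dataclass
from typing import Mapping

# Property names as documented for Iceberg tables (Java implementation).
COMMIT_NUM_RETRIES = "commit.retry.num-retries"
COMMIT_MIN_WAIT_MS = "commit.retry.min-wait-ms"
COMMIT_MAX_WAIT_MS = "commit.retry.max-wait-ms"
COMMIT_TOTAL_TIMEOUT_MS = "commit.retry.total-timeout-ms"


@dataclass(frozen=True)
class CommitRetrySettings:
    """Hypothetical holder for retry settings; defaults mirror the documented ones."""

    num_retries: int = 4
    min_wait_ms: int = 100
    max_wait_ms: int = 60_000
    total_timeout_ms: int = 1_800_000

    @classmethod
    def from_properties(cls, props: Mapping[str, str]) -> "CommitRetrySettings":
        """Read settings from table properties, falling back to the defaults."""
        d = cls()
        return cls(
            num_retries=int(props.get(COMMIT_NUM_RETRIES, d.num_retries)),
            min_wait_ms=int(props.get(COMMIT_MIN_WAIT_MS, d.min_wait_ms)),
            max_wait_ms=int(props.get(COMMIT_MAX_WAIT_MS, d.max_wait_ms)),
            total_timeout_ms=int(props.get(COMMIT_TOTAL_TIMEOUT_MS, d.total_timeout_ms)),
        )
```

Table properties are stored as strings, which is why every value goes through `int()` before use.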

Looking at the current PR #330, it seems some time has passed since the last update, so I’d like to open a new PR from my branch, referencing that work.

Would this be an acceptable way to proceed?

KazuhitoT avatar Nov 27 '25 12:11 KazuhitoT

@Fokko I have created a draft PR #2794 that implements the full set of changes. However, the diff is quite large, so I plan to split it into a series of smaller PRs as follows:

  1. Retry utilities
    • Add RetryConfig, run_with_retry, and run_with_suppressed_failure helpers.
    • Unit tests for the retry utilities.
  2. Transaction / Update refactoring (no behavior change)
    • Introduce _working_metadata, _pending_updates, and _StaticUpdate to Transaction.
    • Add abstract _reset_state method and TableRequirement.key() to enable replay.
    • All existing tests should pass with no behavior change.
  3. Transaction-level commit retry for Update* classes
    • Add retry-related properties to TableProperties.
    • Implement _commit_with_retry() and _reapply_updates() in Transaction.
    • Implement _reset_state() and _operations for each Update* class (e.g., UpdateSchema, UpdateSpec) in order to maintain and refresh the operation states.
  4. SnapshotProducer retry
    • Refactor _SnapshotProducer to separate apply() and _build_updates().
    • Implement retry logic for autocommit operations.
    • Add _refresh_state() and _reset_state() for snapshot-specific state.
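The replay idea behind steps 2 and 3 (keep the pending updates, reload the metadata on conflict, reapply) can be illustrated with a toy model. Everything below is a hypothetical stand-in written for this illustration: `ToyCatalog`, `ToyTransaction`, and `ConflictError` are not PyIceberg classes and do not reflect the draft PR's actual code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Metadata = Dict[str, int]                 # toy stand-in for table metadata
Update = Callable[[Metadata], Metadata]   # a pending change, replayable on any base


class ConflictError(Exception):
    """Stand-in for a failed commit caused by a concurrent writer."""


@dataclass
class ToyCatalog:
    """In-memory catalog with compare-and-swap on a version counter."""

    metadata: Metadata = field(default_factory=lambda: {"version": 0, "rows": 0})

    def load(self) -> Metadata:
        return dict(self.metadata)

    def commit(self, base_version: int, new: Metadata) -> None:
        if base_version != self.metadata["version"]:
            raise ConflictError("table changed since it was loaded")
        self.metadata = {**new, "version": base_version + 1}


@dataclass
class ToyTransaction:
    catalog: ToyCatalog
    pending: List[Update] = field(default_factory=list)

    def __post_init__(self) -> None:
        self.base = self.catalog.load()

    def add(self, update: Update) -> "ToyTransaction":
        self.pending.append(update)
        return self

    def commit(self, max_attempts: int = 4) -> Metadata:
        for attempt in range(1, max_attempts + 1):
            working = dict(self.base)
            for update in self.pending:   # replay every pending update on the base
                working = update(working)
            try:
                self.catalog.commit(self.base["version"], working)
                return self.catalog.load()
            except ConflictError:
                if attempt == max_attempts:
                    raise
                self.base = self.catalog.load()   # reload metadata, then replay
        raise ValueError("max_attempts must be at least 1")
```

Returning the freshly loaded metadata from `commit()` is one way to keep the caller's local view in sync with what was actually committed after a retry.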

This will likely result in 5–6 PRs. If this overall direction looks reasonable, I would like to create the first PR for the retry utilities.

KazuhitoT avatar Dec 08 '25 12:12 KazuhitoT