iceberg-rust icon indicating copy to clipboard operation
iceberg-rust copied to clipboard

Support commit retrie

Open ZENOTME opened this issue 9 months ago • 11 comments

I would like to separate this task into multiple steps:

  1. [ ] Identify the RetryableCommitError type. We can introduce a new ErrorKind::RetryableCommitError to abstract kinds of catalog errors.
  2. [ ] Support to store the update actions and reapply them to the table when the commit fails.
  3. [ ] Add retry commit, this requires a retry library. About the retry library, personally, I think https://github.com/Xuanwo/backon can be a good candidate. Its maintainer is @Xuanwo. (Thanks for this great job!)

Welcome more suggestions and elaborations. cc @Fokko @Xuanwo @liurenjie1024 @sdd

ZENOTME avatar Feb 12 '25 05:02 ZENOTME

Thanks @ZENOTME for raising this. The core part of commit of conflict detection for different isolation levels, which is quite hard to implement correctly. Retry itself is not a big problem, it's just a reload of table metadata and do conflict detection. I'm not familiar with backon, so not sure if it's a fit here.

liurenjie1024 avatar Feb 12 '25 10:02 liurenjie1024

Thanks @ZENOTME for raising this. The core part of commit of conflict detection for different isolation levels, which is quite hard to implement correctly. Retry itself is not a big problem, it's just a reload of table metadata and do conflict detection. I'm not familiar with backon, so not sure if it's a fit here.

Thanks @liurenjie1024. I will take some effort to investigate more about conflict detection.

ZENOTME avatar Feb 12 '25 12:02 ZENOTME

Hi, I take some time to figure out the whole commit process. The whole commit phase can be described as follows:

  1. load current metadata from the catalog
  2. create UpdateAction and apply them to the metadata
    • When applying, there is the conflict detection process based on the current local metadata load at step 1
    • The conflict detection process is specific for the Update Action type. e.g. FastAppend just appends data files so it doesn't have conflict detection.
    • If conflict detection in the apply function fails, it means that the table has some conflict and we can't commit. This process abort
  3. If conflict detection passes, we can send the commit message to the catalog.
    • If the commit fails and the catalog returns CommitFailedException, which means that we commit with stale metadata we can jump back to step 1 and try to commit again.
  4. If commit success, then done.

For now, we only support FastAppend. So we can complete the whole process based on the FastAppend first and complete conflict detection when we introduce other update actions. How do you think? @liurenjie1024

ZENOTME avatar Feb 12 '25 17:02 ZENOTME

For now, we only support FastAppend. So we can complete the whole process based on the FastAppend first and complete conflict detection when we introduce other update actions.

This sounds reasonable to me.

liurenjie1024 avatar Feb 13 '25 08:02 liurenjie1024

For now, we only support FastAppend. So we can complete the whole process based on the FastAppend first and complete conflict detection when we introduce other update actions.

This sounds reasonable to me.

Let's move. I will work on this later

ZENOTME avatar Feb 13 '25 09:02 ZENOTME

@ZENOTME Are you currently working on this? I'm looing to make some progress on this

jonathanc-n avatar Mar 18 '25 22:03 jonathanc-n

  1. We can introduce a new ErrorKind::RetryableCommitError to abstract kinds of catalog errors.

Hi, I believe it should not be an extra kind. Instead, we can add a retryable field in error.

Xuanwo avatar Mar 19 '25 04:03 Xuanwo

@ZENOTME Are you currently working on this? I'm looing to make some progress on this

Hi @jonathanc-n, for the transaction part, we better waiting for #949 to merge first because it involves interface change. Maybe you can on the retryable error first.

ZENOTME avatar Mar 19 '25 05:03 ZENOTME

  1. We can introduce a new ErrorKind::RetryableCommitError to abstract kinds of catalog errors.

Hi, I believe it should not be an extra kind. Instead, we can add a retryable field in error.

LGTM!

ZENOTME avatar Mar 19 '25 05:03 ZENOTME

  1. create UpdateAction and apply them to the metadata

    • When applying, there is the conflict detection process based on the current local metadata load at step 1
    • The conflict detection process is specific for the Update Action type. e.g. FastAppend just appends data files so it doesn't have conflict detection.
    • If conflict detection in the apply function fails, it means that the table has some conflict and we can't commit. This process abort

I'm a bit confused about this part. First of all should we do a operation match and not have to append requirements if there are only append operations.

What do you think about a metadata check (a check if new metadata was already committed, if table already exists, etc.) before we append updates or requirements.

The process would be to:

  1. Call refresh on the table metadata in SnapshotProduceAction
    • This would mean that we would need to pass in a reference of Catalog to TableBuilder to be able to perform the refresh.
    • SnapshotProduceAction can use its transaction instance to use its referenced Table to call the refresh.
  2. Perform the metadata check before appending the updates and requirements to the transaction. This will check the metadata that was used during the snapshot produce process against the current refreshed metadata.
  3. If it fails we can call a CommitExceptionError and have it be retried. Update and requirements never get appended to the Transaction so we do not need to connect to catalog.

Any alternate suggestions or fixes? cc @liurenjie1024 @Fokko @Xuanwo @sdd @ZENOTME @c-thiel

jonathanc-n avatar Mar 29 '25 22:03 jonathanc-n

Sorry for replying late.

I'm a bit confused about this part. First of all should we do a operation match and not have to append requirements if there are only append operations.

I think we still need an append requirement for the append operation. It just doesn't need the conflict detection, but it still needs a requirement. According to my understanding, the requirement is to check whether there are concurrent snapshot updates. It uses TableRequirement::UuidMatch and TableRequirement::RefSnapshotIdMatch for this.

If the catalog identifies that uuid didn't match, it means that there are concurrent updates. The catalog returns CommitExceptionError. and we should load new metadata and do the conflict detection. If there is conflict then the commit will fail. Otherwise, we can reapply(generate a new update and requirement) and re commit again.

If it fails we can call a CommitExceptionError and have it be retried.

If metadata check fail, I think we should abort because it means that not also there are concurrent update but also conflict. As do in java: https://github.com/apache/iceberg/blob/6d8653bb2bbcecdbe38e98da60ef5578f083e933/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L512

ZENOTME avatar Mar 31 '25 15:03 ZENOTME