iceberg-rust
Support commit retry
I would like to separate this task into multiple steps:
- [ ] Identify the `RetryableCommitError` type. We can introduce a new `ErrorKind::RetryableCommitError` to abstract these kinds of catalog errors.
- [ ] Support storing the update actions and reapplying them to the table when the commit fails.
- [ ] Add commit retry; this requires a retry library. Personally, I think https://github.com/Xuanwo/backon could be a good candidate. Its maintainer is @Xuanwo. (Thanks for this great work!)

More suggestions and elaborations are welcome. cc @Fokko @Xuanwo @liurenjie1024 @sdd
Thanks @ZENOTME for raising this. The core part of commit is conflict detection for different isolation levels, which is quite hard to implement correctly. Retry itself is not a big problem; it's just reloading the table metadata and doing conflict detection again. I'm not familiar with backon, so I'm not sure if it's a fit here.
Thanks @liurenjie1024. I will spend some time investigating conflict detection further.
Hi, I took some time to figure out the whole commit process. The commit phase can be described as follows:
- Load the current metadata from the catalog.
- Create `UpdateAction`s and apply them to the metadata.
- When applying, conflict detection runs against the local metadata loaded in step 1.
- The conflict detection process is specific to the update action type, e.g. `FastAppend` just appends data files, so it doesn't need conflict detection.
- If conflict detection in the apply function fails, it means the table has a conflict and we can't commit. The process aborts.
- If conflict detection passes, we send the commit request to the catalog.
- If the commit fails and the catalog returns `CommitFailedException`, it means we committed with stale metadata; we can jump back to step 1 and try to commit again.
- If the commit succeeds, we're done.

For now, we only support `FastAppend`, so we can complete the whole process based on `FastAppend` first and add conflict detection when we introduce other update actions. What do you think? @liurenjie1024
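The loop above can be sketched with a mock catalog. This is a minimal illustration of the described steps, not iceberg-rust's actual API: `MockCatalog`, `apply_fast_append`, and `CommitError` are hypothetical names, and the version counter stands in for the catalog's real stale-metadata detection.

```rust
#[derive(Clone, Debug, PartialEq)]
struct TableMetadata {
    version: u64,
    snapshots: Vec<u64>,
}

#[derive(Debug)]
enum CommitError {
    /// The catalog saw newer metadata than ours (CommitFailedException):
    /// reload and retry.
    Stale,
    /// Conflict detection failed: abort.
    Conflict,
}

struct MockCatalog {
    current: TableMetadata,
}

impl MockCatalog {
    // Step 1: load the current metadata from the catalog.
    fn load(&self) -> TableMetadata {
        self.current.clone()
    }

    // Commit: fails with `Stale` if the base version has moved.
    fn commit(&mut self, base_version: u64, new: TableMetadata) -> Result<(), CommitError> {
        if base_version != self.current.version {
            return Err(CommitError::Stale);
        }
        self.current = new;
        Ok(())
    }
}

// Steps 2-3: apply the update action (here, a fast append) to local metadata.
// Fast append has no conflict detection, so this never returns `Conflict`.
fn apply_fast_append(base: &TableMetadata, snapshot_id: u64) -> Result<TableMetadata, CommitError> {
    let mut next = base.clone();
    next.snapshots.push(snapshot_id);
    next.version += 1;
    Ok(next)
}

fn commit_with_retry(
    catalog: &mut MockCatalog,
    snapshot_id: u64,
    max_retries: usize,
) -> Result<(), CommitError> {
    for _ in 0..=max_retries {
        let base = catalog.load();                         // step 1
        let next = apply_fast_append(&base, snapshot_id)?; // steps 2-3
        match catalog.commit(base.version, next) {
            Ok(()) => return Ok(()),                 // committed: done
            Err(CommitError::Stale) => continue,     // stale: back to step 1
            Err(e) => return Err(e),                 // conflict: abort
        }
    }
    Err(CommitError::Stale)
}

fn main() {
    let mut catalog = MockCatalog {
        current: TableMetadata { version: 1, snapshots: vec![] },
    };
    commit_with_retry(&mut catalog, 42, 3).unwrap();
    assert_eq!(catalog.current.snapshots, vec![42]);
    println!("committed at version {}", catalog.current.version);
}
```

The key design point is that the retry loop re-runs the *apply* step against freshly loaded metadata on every attempt, which is why the update actions need to be stored and reapplicable.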
> For now, we only support `FastAppend`, so we can complete the whole process based on `FastAppend` first and add conflict detection when we introduce other update actions.

This sounds reasonable to me.
Let's move forward. I'll work on this later.
@ZENOTME Are you currently working on this? I'm looking to make some progress on this.
> We can introduce a new `ErrorKind::RetryableCommitError` to abstract kinds of catalog errors.

Hi, I believe it should not be an extra kind. Instead, we can add a `retryable` field to the error.
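A sketch of that suggestion: keep the existing error kinds and carry a `retryable` flag on the error itself. The field and method names below are illustrative stand-ins, not the actual iceberg-rust `Error` type.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorKind {
    DataInvalid,
    Unexpected,
    CatalogCommitFailed,
}

#[derive(Debug)]
struct Error {
    kind: ErrorKind,
    message: String,
    retryable: bool,
}

impl Error {
    fn new(kind: ErrorKind, message: impl Into<String>) -> Self {
        Error { kind, message: message.into(), retryable: false }
    }

    /// Builder-style setter, so a catalog implementation can mark a
    /// commit failure as retryable without needing a dedicated kind.
    fn with_retryable(mut self, retryable: bool) -> Self {
        self.retryable = retryable;
        self
    }

    fn retryable(&self) -> bool {
        self.retryable
    }
}

fn main() {
    // A stale-metadata commit failure is retryable...
    let stale = Error::new(ErrorKind::CatalogCommitFailed, "stale metadata")
        .with_retryable(true);
    assert!(stale.retryable());

    // ...while an ordinary validation error is not.
    let bad = Error::new(ErrorKind::DataInvalid, "bad schema");
    assert!(!bad.retryable());
}
```

The advantage over a new `ErrorKind` is that any existing kind can be marked retryable, so the retry loop only checks one flag instead of matching on specific kinds.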
> @ZENOTME Are you currently working on this? I'm looking to make some progress on this.
Hi @jonathanc-n, for the transaction part, we'd better wait for #949 to merge first because it involves interface changes. Maybe you can work on the retryable error first.
> We can introduce a new `ErrorKind::RetryableCommitError` to abstract kinds of catalog errors.
>
> Hi, I believe it should not be an extra kind. Instead, we can add a `retryable` field to the error.
LGTM!
> - Create `UpdateAction`s and apply them to the metadata.
> - When applying, conflict detection runs against the local metadata loaded in step 1.
> - The conflict detection process is specific to the update action type, e.g. `FastAppend` just appends data files, so it doesn't need conflict detection.
> - If conflict detection in the apply function fails, it means the table has a conflict and we can't commit. The process aborts.
I'm a bit confused about this part. First of all, should we do an operation match and not have to append requirements if there are only append operations?
What do you think about a metadata check (a check whether new metadata was already committed, whether the table already exists, etc.) before we append updates or requirements?
The process would be:
- Call refresh on the table metadata in `SnapshotProduceAction`.
  - This would mean that we would need to pass a reference to the `Catalog` into `TableBuilder` to be able to perform the refresh. `SnapshotProduceAction` can use its transaction instance's referenced `Table` to call the refresh.
- Perform the metadata check before appending the updates and requirements to the transaction. This checks the metadata that was used during the snapshot produce process against the current refreshed metadata.
- If it fails, we can raise a `CommitExceptionError` and have it retried. Updates and requirements never get appended to the `Transaction`, so we don't need to connect to the catalog.
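The proposed pre-append check could look roughly like this: compare the metadata captured when the snapshot-produce action started against a freshly refreshed copy, and only append updates and requirements if nothing has moved. `MetadataSnapshot` and `StaleMetadata` are hypothetical names for illustration.

```rust
#[derive(Clone, Debug, PartialEq)]
struct MetadataSnapshot {
    table_uuid: String,
    current_snapshot_id: Option<i64>,
}

#[derive(Debug, PartialEq)]
struct StaleMetadata;

/// Returns Ok(()) if the refreshed metadata still matches what the
/// snapshot produce process was based on; otherwise the caller retries
/// the produce step before anything is appended to the transaction.
fn check_metadata(
    base: &MetadataSnapshot,
    refreshed: &MetadataSnapshot,
) -> Result<(), StaleMetadata> {
    if base.table_uuid != refreshed.table_uuid
        || base.current_snapshot_id != refreshed.current_snapshot_id
    {
        return Err(StaleMetadata);
    }
    Ok(())
}

fn main() {
    let base = MetadataSnapshot { table_uuid: "t-1".into(), current_snapshot_id: Some(10) };

    // Nothing changed between produce and check: proceed to append.
    let same = base.clone();
    assert!(check_metadata(&base, &same).is_ok());

    // A concurrent commit advanced the snapshot: stale, retry the produce.
    let moved = MetadataSnapshot { table_uuid: "t-1".into(), current_snapshot_id: Some(11) };
    assert_eq!(check_metadata(&base, &moved), Err(StaleMetadata));
}
```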
Any alternate suggestions or fixes? cc @liurenjie1024 @Fokko @Xuanwo @sdd @ZENOTME @c-thiel
Sorry for replying late.
> I'm a bit confused about this part. First of all, should we do an operation match and not have to append requirements if there are only append operations?
I think we still need an append requirement for the append operation. It just doesn't need conflict detection, but it still needs a requirement. According to my understanding, the requirement is how we check whether there are concurrent snapshot updates; it uses `TableRequirement::UuidMatch` and `TableRequirement::RefSnapshotIdMatch` for this.
If the catalog identifies that the uuid doesn't match, it means there are concurrent updates, and the catalog returns `CommitExceptionError`. We should then load the new metadata and do conflict detection. If there is a conflict, the commit fails; otherwise, we can reapply (generate a new update and requirement) and commit again.
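To make that concrete, here is a sketch of how the catalog side might evaluate those two requirements. The variant shapes loosely mirror `TableRequirement::UuidMatch` and `TableRequirement::RefSnapshotIdMatch`, but field names and the error type are simplified for illustration.

```rust
use std::collections::HashMap;

enum TableRequirement {
    UuidMatch { uuid: String },
    RefSnapshotIdMatch { r#ref: String, snapshot_id: Option<i64> },
}

struct TableMetadata {
    uuid: String,
    refs: HashMap<String, i64>, // branch/tag name -> current snapshot id
}

/// `Err` here plays the role of the commit-failed exception: a concurrent
/// update changed what the requirement pinned, so the client reloads the
/// metadata, runs conflict detection, and retries.
fn check(req: &TableRequirement, meta: &TableMetadata) -> Result<(), String> {
    match req {
        TableRequirement::UuidMatch { uuid } => {
            if &meta.uuid != uuid {
                return Err(format!("table uuid changed: {}", meta.uuid));
            }
        }
        TableRequirement::RefSnapshotIdMatch { r#ref, snapshot_id } => {
            if meta.refs.get(r#ref).copied() != *snapshot_id {
                return Err(format!("ref {} moved", r#ref));
            }
        }
    }
    Ok(())
}

fn main() {
    let meta = TableMetadata {
        uuid: "abc".into(),
        refs: HashMap::from([("main".into(), 5)]),
    };

    // Requirement built against the metadata we loaded: passes.
    let req = TableRequirement::RefSnapshotIdMatch { r#ref: "main".into(), snapshot_id: Some(5) };
    assert!(check(&req, &meta).is_ok());

    // A concurrent commit moved `main` to 6 in the catalog: commit rejected.
    let moved = TableMetadata { uuid: "abc".into(), refs: HashMap::from([("main".into(), 6)]) };
    assert!(check(&req, &moved).is_err());
}
```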
> If it fails we can raise a `CommitExceptionError` and have it retried.
If the metadata check fails, I think we should abort, because it means there is not only a concurrent update but also a conflict, as done in Java: https://github.com/apache/iceberg/blob/6d8653bb2bbcecdbe38e98da60ef5578f083e933/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L512