Core: HadoopTable needs to skip file cleanup after task failure under some boundary conditions.
Fix: https://github.com/apache/iceberg/pull/9333 (addresses https://github.com/apache/iceberg/issues/9327)
Design doc (by RussellSpitzer): https://github.com/apache/iceberg/pull/9546#issuecomment-1919357852
Based on our discussions last night on Slack, I believe that we are basically in the following situation:

commit() {
  // Failure here means we throw CommitFailedException, everything is broken, no commit
  Rename to N+1 Metadata.json
  // Exceptions here are where we have a Commit State Unknown, because we can't be sure
  // whether or not the rename actually succeeded
  // Failure after the rename here means all clients will still read the wrong hint out of version hint
  // and not be able to see the n+1 metadata json until they attempt to commit. This is a Java-Impl
  // bug imho but does not violate spec, as they will not break the history although the commit will
  // remain invisible until a subsequent commit. This will be painful for some operations like COW
  // that are heavy, but not a violation of spec. Commit has succeeded though, so no exceptions
  // should be thrown.
  Write new version Hint
  // Failure here is completely fine, commit has succeeded.
  // We successfully wrote a new version file, so even this client will read the new state
  Remove old metadata
  // Failure here is also fine, commit has succeeded, we just didn't clean up nicely
}

So this basically means we need to ignore all exceptions that occur after the "rename", and exceptions that occur "during" the rename have the possibility of being Commit State Unknown if we can't check whether the new metadata exists.
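The failure windows above can be sketched in Java. This is a minimal illustration, not Iceberg's actual HadoopTableOperations code: the class, method, and exception names are stand-ins, and java.nio's atomic move stands in for Hadoop's fs.rename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CommitSketch {
  static class CommitFailedException extends RuntimeException {}

  /** Commits tempMetadata as nextMetadata; throws only in the pre/during-rename windows. */
  static void commit(Path tempMetadata, Path nextMetadata, Path versionHint) {
    try {
      // Window 1: the rename itself. A failure here is ambiguous.
      Files.move(tempMetadata, nextMetadata, StandardCopyOption.ATOMIC_MOVE);
    } catch (IOException e) {
      if (!Files.exists(nextMetadata)) {
        // The n+1 file never appeared: the commit did not happen.
        throw new CommitFailedException();
      }
      // n+1 exists, so the rename actually landed: fall through as success.
      // (If the existence check itself failed for a reason other than
      // file-not-found, this is where CommitStateUnknown would be thrown.)
    }
    // Windows 2 and 3: the commit is already durable, so swallow everything.
    try {
      Files.writeString(versionHint, nextMetadata.getFileName().toString());
      // Removing old metadata would go here; its failures are equally ignorable.
    } catch (IOException e) {
      // Deliberately ignored: only hint writing / cleanup failed, not the commit.
    }
  }
}
```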
This means that the following tests need to pass
// For all tests assume we start with a table at n-metadata.json
- During the commit, renameFile throws an exception but the n+1 metadata.json is written:
  a. An Unknown State Exception must be thrown if we can't find n+1-metadata.json for any reason other than File Not Found, or Commit Succeeded if we can
  b. A subsequent commit must produce n+2-metadata.json
- During the commit, renameFile throws an exception and the n+1 metadata.json is not written:
  a. An Unknown State Exception must be thrown if we can't find n+1-metadata.json for any reason other than File Not Found, Commit Failed otherwise
  b. A subsequent commit must produce n+1-metadata.json
- Any exceptions are thrown after renameFile:
  a. All exceptions should be caught
  b. Commit Succeeded should be returned
  c. Subsequent commits should produce n+2-metadata.json
I think that covers everything, we can always split out the 1.a and 2.a into different cases where the check for n+1 fails or succeeds.
Just realized there is a concurrency issue as well and we can't just check if n+1-metadata.json exists since another process may have made this file while we were throwing an exception. So if we wanted to really check we have to check that n+1-metadata.json is the same as the metadata we were trying to commit
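One way to make that stricter check concrete is to compare the bytes of the n+1 file with the metadata this process attempted to commit, rather than just testing existence. A hypothetical sketch (isOurCommit is not an Iceberg API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class CommitCheck {
  /** True only if the committed file matches the metadata THIS process wrote. */
  static boolean isOurCommit(Path committedMetadata, byte[] attemptedBytes) throws IOException {
    // Existence alone is ambiguous under concurrency: another process may have
    // created n+1-metadata.json while our rename was failing.
    return Files.exists(committedMetadata)
        && Arrays.equals(Files.readAllBytes(committedMetadata), attemptedBytes);
  }
}
```

In practice a checksum or the metadata's embedded UUID could replace the full byte comparison, but the principle is the same: identity, not existence.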
The version hint issue we should probably move to another issue. I think it's a good idea for each committer to proactively remove version-hint, since 99% of the time it's about to write a new one that will point to its own new file. Then in the 1% of cases where our rename succeeds but our new version hint write fails, a client will still see the newest commit. And in the 0.001% of cases where the whole operation breaks, clients will still work but with slightly worse performance.
Based on discussions with other community members, the current commit process looks something like this:

```mermaid
graph TD
A[commit-start] --> B{try-get-lock}
B --> |success| C{check-version-file-exists}
B --> |fail| D[commit-fail]
C --> |already-exists| D[commit-fail]
C --> |not-exists| E{useObjectStore?}
E --> |yes| F{commit-new-version}
E --> |no| G{checkCurrentVersionIsLatest}
F --> |success| F1{useObjectStore}
G --> |yes| F{commit-new-version}
G --> |no| D[commit-fail]
F1 --> |yes| F11(clean-old-metadata)
F1 --> |no| F12{this commit is dirty commit?}
F12 --> |yes| F121[clean dirty commit]
F121 --> D[commit-fail]
F12 --> |no| F11(clean-old-metadata)
F11 --> F2(commit-success)
F --> |fail| D1{recheck}
D1 --> |check-success-and-commit-success| F1{useObjectStore}
D1 --> |check-success-and-commit-fail| D[commit-fail]
D1 --> |check-fail| D19(commit-state-unknown)
D --> END01[UNLOCK]
F2 --> END01[UNLOCK]
D19 --> END01[UNLOCK]
END01 --> END[END]
```
@RussellSpitzer Hello, can you check this? Thanks.
Needs re-formatting; remember to run `gradle build -x test -x integrationTest` to check your code style and static analysis.

The general fix seems good to me. On reading the code and seeing how the tests perform, I do think there are actually some spec issues in HadoopTableOperations. I don't think we should fix them in this PR, but probably immediately afterwards.

Tests seem to have some extraneous bits we need to clean up, but look good as well.
Thank you, sir.
@RussellSpitzer Hello sir, I resubmitted the code. Please check my implementation of this version.
@RussellSpitzer Hello sir, I've optimised the HadoopCatalog implementation, and I now believe that its execution behaviour is basically spec compliant. We don't need the CommitStateUnknownException anymore, and its logic is now simpler too. However, I have changed some of the code behaviour and the definition of how a commit succeeds. If you have the time, please review this version of my code; together we can evaluate whether the current implementation makes sense.
I am currently changing the order of execution of the metadata operations from writeMeta -> deleteHint -> writeHint to deleteHint -> writeMeta -> writeHint. The reasons for this are:
- In most cases, we can tell whether another client is executing in parallel by querying for the existence of the versionHint, so we can stop them as early as possible.
- Under the previous ordering, if the metadata is written but the versionHint modification fails, the version in the versionHint falls behind the real metadata version, and at that point the table can no longer be committed.
Note that in dropOldVersionHint and writeNewVersionMeta, we use the atomic operations fs.rename and fs.delete to solve the concurrency problem: only one client can succeed.
```
findOldVersionHint-------------------->isFirstRun------->JobFail
|    NOT EXISTS                         |          NO
|    yes                                |
|                                       |YES
↓            NO                         |
checkNextVersionIsLatest----->JobFail   ↓
|                              nextVersionIsLatest------->JobFail
|    yes                       |                     NO
↓                NO            |YES
dropOldVersionHint----------->JobFail   |
|                                       |
|<----------------------------------|
|    yes
↓
writeNewVersionMeta-------------------->JobFail
|                              NO
|
|    yes
↓
writeNewVersionHint--------------->|
|                  NO              |
|                                  |
|    yes                           |
↓                                  |
cleanOldMeta---------------------->|
|                  NO              |
|                                  |
|    yes                           |
↓                                  |
SUCCESS<-----------------------|
```
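The delete-based gate at the top of this flow can be sketched as follows. This is a hypothetical illustration, not the PR's code: HintGate and tryCommit are invented names, and java.nio stands in for Hadoop's FileSystem. Files.deleteIfExists models the atomic fs.delete, which returns true for at most one of two racing callers.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class HintGate {
  /** deleteHint -> writeMeta -> writeHint; returns false if another client won the race. */
  static boolean tryCommit(Path versionHint, Path newMetadata, String metadataName)
      throws IOException {
    // Atomic delete of the hint acts as the gate: only one racing caller
    // observes true; the loser backs off immediately.
    if (!Files.deleteIfExists(versionHint)) {
      return false;
    }
    Files.writeString(newMetadata, "{}");          // writeMeta (placeholder content)
    Files.writeString(versionHint, metadataName);  // writeHint pointing at the new meta
    return true;
  }
}
```

The first-run case (no hint file yet) would need the separate isFirstRun branch shown in the diagram, since a missing file also makes the delete return false.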
If the user's fs.rename call takes too long (slow file system response) and the metadata TTL is set too short, it is possible to successfully commit an older version of the metadata. Suppose client A calls fs.rename to commit v2, but because of the slow file system the operation takes 30 seconds to complete. Meanwhile, another client B is committing every second, and the TTL for this table is only 2 versions. By the time client A's call completes, the maximum version has rolled over to v60 and only two versions of metadata remain, v60 and v59, so fs.rename(v2) succeeds because v2 no longer exists and there is no conflict. The final result is that the table retains three versions of metadata: v60, v59, and v2. When this happens, is it acceptable? @RussellSpitzer
I added exception catching around the fs.rename calls. Once the fs.rename call has succeeded, I swallow all subsequent exceptions.
@RussellSpitzer Hello Sir, I have tried to limit the scope of the PR changes to what I consider to be the minimum. If there is still a need to further reduce the scope of the modification, I think we need to discuss it. I have not initiated a request-review at this time. If you would like me to initiate a request-review, please let me know.
@pvary Hi, can you check this?
@nastra Hi, if you have time, can you check this? Thanks.