Update version-hint.txt atomically
We discussed in #1465 that it would be good to find a way to update the version-hint.txt atomically.
The version file should never be corrupt, and we should ensure that by changing how we write it. Since this is for HDFS, creating the file and then atomically renaming it, so that the entire file is written before it becomes the current version hint, makes sense to me. That way we don't get dirty reads.
Checked a few things and here is what I have found (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/filesystem.html#boolean_rename.28Path_src.2C_Path_d.29)
Destination exists and is a file
Renaming a file atop an existing file is specified as failing, raising an exception.
- Local FileSystem : the rename succeeds; the destination file is replaced by the source file.
- HDFS : The rename fails, no exception is raised. Instead the method call simply returns false.
Based on this I think our best solution would be:
1. Create a new file
2. Delete old version-hint.txt
3. Move the new file to version-hint.txt
This would mean that between steps 2 and 3 there is a short window when the fallback listing added in #1465 kicks in and uses the directory listing to get the version; alternatively, we could add retry logic around reading version-hint.txt until the new file is finally in place.
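To make the proposed steps concrete, here is a minimal sketch using the Hadoop FileSystem API. The class and method names are purely illustrative - this is not the actual HadoopTableOperations code, just an assumption of how the three steps could be wired together:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only - not the real HadoopTableOperations implementation
class VersionHintWriteSketch {

  static void writeVersionHint(FileSystem fs, Path metadataDir, int version) throws IOException {
    Path newHint = new Path(metadataDir, "version-hint-" + version + ".txt");
    Path versionHint = new Path(metadataDir, "version-hint.txt");

    // 1. Create a new file containing the version; create(overwrite=false) fails if it already exists
    try (FSDataOutputStream out = fs.create(newHint, false)) {
      out.write(String.valueOf(version).getBytes(StandardCharsets.UTF_8));
    }

    // 2. Delete the old version-hint.txt
    fs.delete(versionHint, false);

    // 3. Move the new file to version-hint.txt; on HDFS the rename is atomic, so readers see
    //    either no hint (and fall back to listing) or a fully written hint, never a partial one
    if (!fs.rename(newHint, versionHint)) {
      // on HDFS a failed rename returns false instead of throwing; it's okay to give up here
      fs.delete(newHint, false);
    }
  }
}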
Your thoughts @jacques-n, @fbocse, @rdblue?
Thanks, Peter
I think deleting the version hint and renaming a new one in its place is a good idea. I don't think that there is an atomic operation that could do this. If there was, we would use it to maintain a pointer to the version instead of just a hint. We have to delete the file, but at least a rename would ensure that readers don't use the file when it is half written. And if the rename fails, it's okay to give up.
@pvary thank you for following up on this - I think that the solution you are considering reproduces the behaviour we are seeing for the server-side implementation of create(overwrite=true) used by Iceberg to overwrite the version-hint.txt file. I don't think that in our particular case this is necessarily going to be a benefit at the "expense" of a couple more HDFS requests.
However, your thoughts got me thinking too, especially with the new fallback behaviour in place. I wouldn't necessarily advocate using more APIs than necessary to implement optimistic locking, because we're also exposing more edge cases. I was thinking about this one in particular, accounting for your proposal:
-1. writer reads version-hint.txt (with content 17)
0. writer create(overwrite=false) v18.metadata.json
1. writer create(overwrite=false) file version-hint-18.txt (with content 18)
2. writer deletes version-hint.txt (with content 17)
3. writer moves version-hint-18.txt to version-hint.txt (with content 18)
If between steps 2 and 3 a NEW writer attempts to commit, it will fail to load version-hint.txt and will fall back to directory listing instead - it then finds v18.metadata.json as the greatest version, writes v19.metadata.json, and commits its new version to the version-hint.txt file.
If it so happens that the NEW writer also manages to complete its own steps 2 and 3 (unlikely but not impossible - think GC pause?) before the initial writer does, we might end up with version 19 being overwritten by version 18. This is a classic dead-lock.
This basically "locks" the table for any subsequent writers - because new writers will find version-hint.txt, load value 18, and increment it to resolve the version they should write a new metadata file for, but they will fail to do so since there already is a v19.metadata.json file.
I think that this behaviour is present today should the client exit unexpectedly right before replacing version-hint.txt. But on top of that edge case, this implementation adds an edge case of a different category: it relates to the distributed nature of writers attempting to replace the same resource.
Also, looking at your suggestion, if any writer exits unexpectedly before step 3, I assume the version-hint.txt file is never going to be materialized and all subsequent writers will fall back to directory listing for version resolution, right? I assume that resolving the version that way doesn't also involve writing it back to the version-hint.txt file.
My suggestion is that we also extract the current implementations of HadoopTableOperations:writeVersionHint and HadoopTableOperations:readVersionHint into something we can override with APIs that provide the right atomicity and consistency guarantees. Say dir listing gives read-after-write guarantees and we also get atomic create(overwrite=false) guarantees - we can implement version resolution on top of those: each new version is a new version file, and every writer attempts to delete versions older than, say, the last 10 so we keep the dir listing in constant time.
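Just to illustrate the shape of that idea, here is a rough, hypothetical sketch of version resolution built on atomic create(overwrite=false) plus listing - the file naming and method names are made up for illustration, not a concrete API proposal:

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of the "one file per version" idea described above
class VersionFileSketch {

  // readers resolve the current version by listing version-* files and taking the maximum
  static int readCurrentVersion(FileSystem fs, Path metadataDir) throws IOException {
    int max = 0;
    for (FileStatus status : fs.listStatus(metadataDir)) {
      String name = status.getPath().getName();
      if (name.startsWith("version-")) {
        String suffix = name.substring("version-".length());
        if (!suffix.isEmpty() && suffix.chars().allMatch(Character::isDigit)) {
          max = Math.max(max, Integer.parseInt(suffix));
        }
      }
    }
    return max;
  }

  // writers commit a new version by atomically creating its version file;
  // create(overwrite=false) fails if another writer already claimed this version
  static void commitVersion(FileSystem fs, Path metadataDir, int newVersion) throws IOException {
    try (FSDataOutputStream out = fs.create(new Path(metadataDir, "version-" + newVersion), false)) {
      out.write(String.valueOf(newVersion).getBytes(StandardCharsets.UTF_8));
    }
    // best-effort cleanup: keep only the last 10 version files so the listing stays cheap
    for (int old = newVersion - 10; old > 0; old--) {
      if (!fs.delete(new Path(metadataDir, "version-" + old), false)) {
        break; // already cleaned up by an earlier writer
      }
    }
  }
}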
Thanks @fbocse for the review!
@pvary thank you for following up on this
Actually @lcspinter is the one who is working on this 😄
This basically "locks" the table for any subsequent writers - because new writers will find version-hint.txt, load value 18, and increment it to resolve the version they should write a new metadata file for, but they will fail to do so since there already is a v19.metadata.json file.
I think this should not be a problem, as we use version-hint.txt only as a hint, and in HadoopTableOperations.refresh() we iterate through the possible metadata locations and stop only when no newer snapshot is available:
// starting from the hinted version, keep advancing while a newer metadata file exists
Path nextMetadataFile = getMetadataFile(ver + 1);
while (nextMetadataFile != null) {
ver += 1;
metadataFile = nextMetadataFile;
nextMetadataFile = getMetadataFile(ver + 1);
}
I hope this loop solves the dead-lock problem you mentioned above. What do you think?
Thanks, Peter
@pvary ah yes, you are right - based on that code snippet there's no dead-lock on account of out-of-order writes from a committer to the metadata version files (snapshots) or the version-hint text file. Regardless of the version hint, the code will try to detect the most recent snapshot by incrementally traversing the snapshot files, right? So the worst case is that the version hint is behind the most recent snapshot. Readers still get to the most recent snapshot based on the snippet you've shared; they just have to pay the cost of traversing the snapshot list (and this can be reduced by a subsequent writer aligning the version hint with the most recent version). LGTM
I notice that https://iceberg.apache.org/spec/ doesn't mention the version-hint.txt file. Should the spec outline when a reader/writer should use the file?
@WilliamWhispell, that file should be entirely optional, which is why it isn't in the spec. It is merely a pointer to help avoid lots of work finding the current version.
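For anyone new to the thread, the listing fallback that makes the hint optional boils down to scanning the metadata directory for the highest vN.metadata.json. Here is a purely illustrative sketch of that (this is not the actual fallback code added in #1465):

import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch of version resolution by listing, not the actual Iceberg fallback code
class ListingFallbackSketch {

  static int findVersionByListing(FileSystem fs, Path metadataDir) throws IOException {
    int maxVersion = 0;
    for (FileStatus status : fs.listStatus(metadataDir)) {
      String name = status.getPath().getName(); // e.g. v18.metadata.json
      if (name.startsWith("v") && name.endsWith(".metadata.json")) {
        String ver = name.substring(1, name.length() - ".metadata.json".length());
        try {
          maxVersion = Math.max(maxVersion, Integer.parseInt(ver));
        } catch (NumberFormatException e) {
          // skip files that don't match the vN.metadata.json pattern
        }
      }
    }
    return maxVersion;
  }
}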
Hey, I know this is not the same issue but I'm getting "WARN HadoopTableOperations: Error reading version hint file ../../../../warehouse/catalog/my-catalog/metadata/version-hint.text" and further "py4j.protocol.Py4JJavaError: An error occurred while calling o40.table. : java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z"
Does anyone know anything related?
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.