alluxio icon indicating copy to clipboard operation
alluxio copied to clipboard

Force metadata sync when data read fails due to out-of-date length

Open JiamingMai opened this issue 2 years ago • 3 comments

What changes are proposed in this pull request?

Notify master to invalidate the metadata when there is an out of range reading exception in client.

Why are the changes needed?

Read method can be failed if the metadata in Alluxio Master is inconsistent with that in UFS. This may happen when the UFS file is updated (to a shorter length) bypassing Alluxio. For example, the in-UFS file is 200KB and Alluxio master loads the metadata by a metadata sync and provides that to the Alluxio client. Before the client reads that file with an Alluxio worker, the in-UFS file is replaced to be 100KB only. Then the Alluxio worker will fail to read the 100KB-200KB part and that IOException will be propagated to the Alluxio client. This happens very often when the compute application opens an AlluxioFileInStream then wait a long time before really reading that stream, because when the client reads from an Alluxio worker, the data transmission doesn't validate the metadata again.

There's nothing we can do when such a failure is met, if it happens in the middle of a stream, we cannot rollback the stream or even tell if what we have read is correct at all. The mitigation is to give up on this stream but force the next time this file will be metadata-synced. We do that by the client telling the master this file should be metadata synced, by invalidating the current metadata.

Does this PR introduce any user facing changes?

No, it doesn't.

JiamingMai avatar Sep 02 '22 09:09 JiamingMai

Automated checks report:

  • Commits associated with Github account: PASS
  • PR title follows the conventions: FAIL
    • The title of the PR does not pass all the checks. Please fix the following issues:
      • Title is too long (98 characters). Must be at most 72 characters.
      • First word of title ("Notify") is not an imperative verb. Please use one of the valid words

Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.

alluxio-bot avatar Sep 02 '22 09:09 alluxio-bot

Automated checks report:

  • Commits associated with Github account: PASS
  • PR title follows the conventions: PASS

All checks passed!

alluxio-bot avatar Sep 03 '22 12:09 alluxio-bot

The code here is WIP. We verified this works in a manual repro. The next step will be to add UTs and format the code.

jiacheliu3 avatar Sep 06 '22 15:09 jiacheliu3

high level comment, why do we only force sync when the length has changed to be greater? why do we not also sync when the length changed shorter? is it because it is harder to detect? (no exception will be thrown when reading? ) @jiacheliu3 @JiamingMai

yuzhu avatar Sep 23 '22 12:09 yuzhu

high level comment, why do we only force sync when the length has changed to be greater? why do we not also sync when the length changed shorter? is it because it is harder to detect? (no exception will be thrown when reading? ) @jiacheliu3 @JiamingMai

@yuzhu In fact this PR only solves the problem when the in-UFS file has been truncated (replaced with shorter length) and there's no cache(because if there's cache, the outdated cache will be read instead of UFS file). This change does NOT solve the problem where the file becomes longer or the same length, which is logged in a separate issue: https://github.com/Alluxio/alluxio/issues/16225

jiacheliu3 avatar Sep 26 '22 03:09 jiacheliu3

alluxio-bot, merge this please

jiacheliu3 avatar Sep 26 '22 03:09 jiacheliu3