alluxio
alluxio copied to clipboard
Force metadata sync when data read fails due to out-of-date length
What changes are proposed in this pull request?
Notify master to invalidate the metadata when there is an out of range reading exception in client.
Why are the changes needed?
Read method can be failed if the metadata in Alluxio Master is inconsistent with that in UFS. This may happen when the UFS file is updated (to a shorter length) bypassing Alluxio. For example, the in-UFS file is 200KB and Alluxio master loads the metadata by a metadata sync and provides that to the Alluxio client. Before the client reads that file with an Alluxio worker, the in-UFS file is replaced to be 100KB only. Then the Alluxio worker will fail to read the 100KB-200KB part and that IOException will be propagated to the Alluxio client. This happens very often when the compute application opens an AlluxioFileInStream then wait a long time before really reading that stream, because when the client reads from an Alluxio worker, the data transmission doesn't validate the metadata again.
There's nothing we can do when such a failure is met, if it happens in the middle of a stream, we cannot rollback the stream or even tell if what we have read is correct at all. The mitigation is to give up on this stream but force the next time this file will be metadata-synced. We do that by the client telling the master this file should be metadata synced, by invalidating the current metadata.
Does this PR introduce any user facing changes?
No, it doesn't.
Automated checks report:
- Commits associated with Github account: PASS
- PR title follows the conventions: FAIL
- The title of the PR does not pass all the checks. Please fix the following issues:
- Title is too long (98 characters). Must be at most 72 characters.
- First word of title ("Notify") is not an imperative verb. Please use one of the valid words
- The title of the PR does not pass all the checks. Please fix the following issues:
Some checks failed. Please fix the reported issues and reply 'alluxio-bot, check this please' to re-run checks.
Automated checks report:
- Commits associated with Github account: PASS
- PR title follows the conventions: PASS
All checks passed!
The code here is WIP. We verified this works in a manual repro. The next step will be to add UTs and format the code.
high level comment, why do we only force sync when the length has changed to be greater? why do we not also sync when the length changed shorter? is it because it is harder to detect? (no exception will be thrown when reading? ) @jiacheliu3 @JiamingMai
high level comment, why do we only force sync when the length has changed to be greater? why do we not also sync when the length changed shorter? is it because it is harder to detect? (no exception will be thrown when reading? ) @jiacheliu3 @JiamingMai
@yuzhu In fact this PR only solves the problem when the in-UFS file has been truncated (replaced with shorter length) and there's no cache(because if there's cache, the outdated cache will be read instead of UFS file). This change does NOT solve the problem where the file becomes longer or the same length, which is logged in a separate issue: https://github.com/Alluxio/alluxio/issues/16225
alluxio-bot, merge this please