flink
flink copied to clipboard
[FLINK-33879] Avoids the potential hang of Hybrid Shuffle during redistribution
What is the purpose of the change
This PR is to avoid the potential hang of Hybrid Shuffle during redistribution. The details about how a hang happens, please see issue.
Brief change log
- Adds an interface
ensureCapacityto TieredStorageMemoryManager to help reserve enough buffers. - Adds a variable
definitelyRecycledto indicate that the buffers of a tier can be definitely recycled even if there are no readers. - Reserves buffers every time a non-definitelyRecycled tier receives a buffer.
Verifying this change
This change added tests and can be verified as follows:
- Added test that validates that the ResultPartition would not hang even if most buffers are occupied by the memory tier and redistribution happens.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with
@Public(Evolving): (no) - The serializers: (no)
- The runtime per-record code paths (performance sensitive): (yes)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
- The S3 file system connector: (no)
Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
CI report:
- 3556bc644e85bb1d25a1af537760ad97f1df0d00 Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:@flinkbot run azurere-run the last Azure build
@TanYuxin-tyx Could you help review this PR?
@xintongsong Could you take a look at this PR?
@reswqa Thanks for the review, I've updated the PR, please take a look.
@reswqa Seems that the CI failed but the cause is related to python rather than this PR. Could you help merge it?