flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-33879] Avoids the potential hang of Hybrid Shuffle during redistribution

Open jiangxin369 opened this issue 2 years ago • 5 comments

What is the purpose of the change

This PR is to avoid the potential hang of Hybrid Shuffle during redistribution. The details about how a hang happens, please see issue.

Brief change log

  • Adds an interface ensureCapacity to TieredStorageMemoryManager to help reserve enough buffers.
  • Adds a variable definitelyRecycled to indicate that the buffers of a tier can be definitely recycled even if there are no readers.
  • Reserves buffers every time a non-definitelyRecycled tier receives a buffer.

Verifying this change

This change added tests and can be verified as follows:

  • Added test that validates that the ResultPartition would not hang even if most buffers are occupied by the memory tier and redistribution happens.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (yes)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

jiangxin369 avatar Dec 19 '23 10:12 jiangxin369

CI report:

  • 3556bc644e85bb1d25a1af537760ad97f1df0d00 Azure: SUCCESS
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar Dec 19 '23 10:12 flinkbot

@TanYuxin-tyx Could you help review this PR?

jiangxin369 avatar Dec 21 '23 01:12 jiangxin369

@xintongsong Could you take a look at this PR?

jiangxin369 avatar Dec 28 '23 02:12 jiangxin369

@reswqa Thanks for the review, I've updated the PR, please take a look.

jiangxin369 avatar Jan 11 '24 07:01 jiangxin369

@reswqa Seems that the CI failed but the cause is related to python rather than this PR. Could you help merge it?

jiangxin369 avatar Jan 12 '24 01:01 jiangxin369