[FLINK-33668] Decoupling Shuffle network memory and job topology

Open jiangxin369 opened this issue 2 years ago • 2 comments

What is the purpose of the change

This pull request decouples shuffle network memory from the job topology, which resolves FLINK-33668.

Brief change log

  • Redefines the redistribution logic of the NetworkBufferPool.
  • Makes the shuffle read and write components work with a small buffer pool.
  • Adds a config to enable this feature.
  • Avoids a potential deadlock. The deadlock occurs when the global buffer pool does not have enough segments while an InputGate is reserving segments, so the InputGate has to wait for another LocalBufferPool (LBP) to be destroyed. Destroying an LBP triggers a redistribution across all LBPs, which requires the same lock that is held while reserving segments, so neither side can make progress.
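The deadlock pattern described in the last item can be illustrated with a minimal, self-contained sketch. This is not Flink's code and not necessarily how this PR fixes the problem: the class `ReserveWithoutDeadlockSketch` and its methods `reserve` and `destroyPool` are hypothetical names chosen for illustration. The sketch shows one common way to avoid the cycle, assuming a single lock guards both reservation and redistribution: the reserving thread waits on a `Condition`, which releases the lock while waiting, so the thread that destroys a pool can take the lock and return segments.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration only; not Flink's NetworkBufferPool. */
class ReserveWithoutDeadlockSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition segmentsAvailable = lock.newCondition();
    private int freeSegments;

    ReserveWithoutDeadlockSketch(int freeSegments) {
        this.freeSegments = freeSegments;
    }

    /** Tries to reserve {@code count} segments; returns false on timeout or interrupt. */
    boolean reserve(int count, long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (freeSegments < count) {
                if (nanos <= 0) {
                    return false;
                }
                // awaitNanos releases the lock while waiting, so destroyPool()
                // can acquire it. Spinning or sleeping here while holding the
                // lock would reproduce the deadlock described above.
                nanos = segmentsAvailable.awaitNanos(nanos);
            }
            freeSegments -= count;
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    /** Simulates destroying a local pool: its segments return to the global pool. */
    void destroyPool(int returnedSegments) {
        lock.lock();
        try {
            freeSegments += returnedSegments;
            // A redistribution over the remaining pools would run here,
            // under the same lock that reserve() uses.
            segmentsAvailable.signalAll();
        } finally {
            lock.unlock();
        }
    }

    int freeSegments() {
        lock.lock();
        try {
            return freeSegments;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ReserveWithoutDeadlockSketch pool = new ReserveWithoutDeadlockSketch(0);
        Thread reserver = new Thread(() ->
                System.out.println("reserved: " + pool.reserve(2, 5_000)));
        reserver.start();
        pool.destroyPool(2); // must not block on the waiting reserver
        reserver.join();
    }
}
```

The design point is only that waiting for segments and destroying a pool must not require the same lock to be held at the same time; whether that is achieved with a condition variable, by releasing the lock before waiting, or by a separate lock is an implementation choice.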

Verifying this change

This change adds tests and can be verified as follows:

  • Buffer pool redistribution allocates the expected number of buffers to each pool according to its weight, and each allocation stays between the pool's min and max values.
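The property being verified can be sketched as follows. This is a hypothetical, simplified model and not the NetworkBufferPool implementation: the class `WeightedRedistributionSketch`, the method `redistribute`, and the exact rounding rules are assumptions made for illustration. Each pool first receives its minimum, and the remaining buffers are split in proportion to the weights, capped at each pool's maximum.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not Flink's redistribution logic. */
class WeightedRedistributionSketch {

    /** Returns the number of buffers per pool, each within [min[i], max[i]]. */
    static int[] redistribute(int total, int[] min, int[] max, int[] weight) {
        int n = min.length;
        int[] result = new int[n];
        int remaining = total;
        // Every pool is guaranteed its minimum first.
        for (int i = 0; i < n; i++) {
            result[i] = min[i];
            remaining -= min[i];
        }
        if (remaining < 0) {
            throw new IllegalArgumentException("total is smaller than the sum of minimums");
        }
        // Hand out the remainder by weight among pools that are still below max.
        while (remaining > 0) {
            long activeWeight = 0;
            for (int i = 0; i < n; i++) {
                if (result[i] < max[i]) {
                    activeWeight += weight[i];
                }
            }
            if (activeWeight == 0) {
                break; // every weighted pool has reached its maximum
            }
            int handedOut = 0;
            for (int i = 0; i < n; i++) {
                if (result[i] >= max[i]) {
                    continue;
                }
                int share = (int) ((long) remaining * weight[i] / activeWeight);
                int grant = Math.min(share, max[i] - result[i]);
                result[i] += grant;
                handedOut += grant;
            }
            if (handedOut == 0) {
                // Rounding left every share at zero; give out single buffers.
                for (int i = 0; i < n && remaining > 0; i++) {
                    if (result[i] < max[i] && weight[i] > 0) {
                        result[i]++;
                        remaining--;
                    }
                }
            } else {
                remaining -= handedOut;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[] buffers = redistribute(10, new int[] {1, 1}, new int[] {10, 10}, new int[] {1, 3});
        System.out.println(Arrays.toString(buffers)); // prints [3, 7]
    }
}
```

With 10 buffers, minimums of 1 and weights 1:3, the 8 buffers left after the minimums are split 2 and 6, giving 3 and 7. If one pool's maximum is lower than its weighted share, the excess goes to the pools that still have room.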

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (yes)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

jiangxin369, Dec 01 '23 10:12

CI report:

  • 0fb53b38d35d7ddbf992d217ed85561008b48f54 UNKNOWN
  • eb152c2113f394f7ed7b5489e9dcb4327dadcc4f Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure (re-runs the last Azure build)

flinkbot, Dec 01 '23 10:12

The CI has some failures in the flink-table-planner module, but I cannot reproduce them locally. I suspect they are caused by some unexpected configuration. I'll keep troubleshooting, but this should not block the review.

jiangxin369, Jan 17 '24 01:01