[FLINK-33668] Decoupling Shuffle network memory and job topology
What is the purpose of the change
This pull request decouples shuffle network memory from the job topology, which resolves FLINK-33668.
Brief change log
- Redefines the redistribution logic of the NetworkBufferPool.
- Makes the shuffle read and write components work with a small buffer pool.
- Adds a config to enable this feature.
- Avoids a potential deadlock: when there are not enough segments in the global buffer pool, an InputGate reserving segments must wait for another LocalBufferPool (LBP) to be destroyed; destroying an LBP triggers a redistribution over all LBPs, which requires the same lock held while reserving segments, so the two operations deadlock (see the sketch after this list).
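
For illustration, here is a minimal sketch of that lock-ordering pattern, assuming a single lock guards both segment reservation and redistribution. All class and method names are hypothetical stand-ins, not the actual Flink internals:

```java
// Minimal sketch of the deadlock described above. Hypothetical names only.
final class GlobalBufferPool {
    private final Object lock = new Object();
    private int availableSegments;

    /** Called by an InputGate that must reserve n segments up front. */
    void reserveSegments(int n) {
        synchronized (lock) {
            while (availableSegments < n) {
                // Blocks while HOLDING the lock, expecting some local buffer
                // pool to be destroyed and return its segments...
            }
            availableSegments -= n;
        }
    }

    /** Called when a local buffer pool is destroyed. */
    void destroyLocalPool(int returnedSegments) {
        synchronized (lock) { // ...but this path needs the same lock to run
            availableSegments += returnedSegments; // the redistribution, so it
            redistributeOverAllPools();            // can never proceed: deadlock.
        }
    }

    private void redistributeOverAllPools() { /* elided */ }
}
```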
Verifying this change
This change added tests and can be verified as follows:
- Added tests verifying that buffer pool redistribution allocates each pool a number of buffers proportional to its weight, clamped between the pool's configured min and max values (see the sketch below).
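
As a rough illustration of the invariant those tests check, the following sketch distributes the available global buffers by weight and clamps each pool's share to its [min, max] range. It is a simplification under assumed semantics (leftovers from clamped pools are not re-assigned here), and the names are hypothetical, not the actual NetworkBufferPool API:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of weight-based redistribution with [min, max] clamping.
// PoolSpec and redistribute() are illustrative names, not Flink's API.
final class RedistributionSketch {
    record PoolSpec(int min, int max, int weight) {}

    /** Split {@code available} buffers across pools by weight, clamped to [min, max]. */
    static List<Integer> redistribute(List<PoolSpec> pools, int available) {
        int totalWeight = pools.stream().mapToInt(PoolSpec::weight).sum();
        List<Integer> sizes = new ArrayList<>();
        for (PoolSpec p : pools) {
            int share = totalWeight == 0 ? 0 : (int) ((long) available * p.weight / totalWeight);
            // Clamp: a pool never drops below its guaranteed minimum and
            // never exceeds its maximum, regardless of its weighted share.
            sizes.add(Math.max(p.min, Math.min(p.max, share)));
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<PoolSpec> pools = List.of(new PoolSpec(2, 8, 1), new PoolSpec(2, 100, 3));
        System.out.println(redistribute(pools, 40)); // [8, 30]: first pool clamped to its max
    }
}
```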
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (yes)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
- The S3 file system connector: (no)
Documentation
- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)
CI report:
- 0fb53b38d35d7ddbf992d217ed85561008b48f54 UNKNOWN
- eb152c2113f394f7ed7b5489e9dcb4327dadcc4f Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot run azure: re-run the last Azure build
The CI has some failures in the flink-table-planner module, but I cannot reproduce them locally. I suspect they were caused by some unexpected configuration. I'll keep troubleshooting, but this should not block the review.