[FLINK-33386][runtime] Support tasks balancing at slot level for Default Scheduler
What is the purpose of the change
- Support tasks balancing at slot level for Default Scheduler
Brief change log
- Introduce BalancedPreferredSlotSharingStrategy to support task balancing at slot level.
- Expose a configuration option to enable task balancing at slot level for the Default Scheduler.
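To illustrate what balancing at slot level means, here is a minimal, self-contained sketch. It is not the code of BalancedPreferredSlotSharingStrategy; the class name `SlotBalanceSketch`, the method `assign`, and the greedy "least-loaded slot" rule are all hypothetical and chosen only for illustration. It assumes the number of slots equals the maximum vertex parallelism and that a slot must not hold two subtasks of the same vertex.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of Flink. */
public class SlotBalanceSketch {

    /**
     * Greedily assigns every subtask to the least-loaded slot that does not
     * yet hold a subtask of the same vertex. Ties go to the lowest slot index.
     */
    static List<List<String>> assign(Map<String, Integer> parallelismByVertex) {
        // Assumption: one slot per subtask of the widest vertex.
        int numSlots = 0;
        for (int p : parallelismByVertex.values()) {
            numSlots = Math.max(numSlots, p);
        }
        List<List<String>> slots = new ArrayList<>();
        for (int i = 0; i < numSlots; i++) {
            slots.add(new ArrayList<>());
        }

        for (Map.Entry<String, Integer> vertex : parallelismByVertex.entrySet()) {
            boolean[] used = new boolean[numSlots]; // slots already holding this vertex
            for (int subtask = 0; subtask < vertex.getValue(); subtask++) {
                int best = -1;
                for (int s = 0; s < numSlots; s++) {
                    if (!used[s]
                            && (best < 0 || slots.get(s).size() < slots.get(best).size())) {
                        best = s;
                    }
                }
                used[best] = true;
                slots.get(best).add(vertex.getKey() + "[" + subtask + "]");
            }
        }
        return slots;
    }

    public static void main(String[] args) {
        Map<String, Integer> job = new LinkedHashMap<>();
        job.put("source", 2);
        job.put("map", 4);
        job.put("sink", 2);
        // Print one line per slot with the subtasks placed in it.
        for (List<String> slot : assign(job)) {
            System.out.println(slot);
        }
    }
}
```

With the example job above (parallelism 2, 4, 2), each of the four slots ends up with two tasks. Placing subtasks purely by subtask index would instead put three tasks in each of the first two slots and one task in each of the remaining two.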
Verifying this change
This change added tests and can be verified as follows:
org.apache.flink.runtime.scheduler.BalancedPreferredSlotSharingStrategyTest
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (yes / no)
- The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
- The serializers: (yes / no / don't know)
- The runtime per-record code paths (performance sensitive): (yes / no / don't know)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
- The S3 file system connector: (yes / no / don't know)
Documentation
- Does this pull request introduce a new feature? (yes / no)
- If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
CI report:
- 50f3dd8ed507a3ad36df53f013f900332b0d0b96 Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot run azure: re-run the last Azure build
Thank you very much for your comments, @KarmaGYZ @1996fanrui. I updated the PR based on them. Have a great weekend~ :)
Hi @KarmaGYZ, thanks for your thorough review!
I think this PR contains two components. The first is a supplement to FLINK-33448. The second is part of the TASKS strategy. I think we could split it into two separate commits.
Splitting it makes sense; it's clearer.
It would be better to include FLINK-33388 and introduce the TASKS strategy.
Would you mind if we keep them in multiple PRs? I'm afraid a single PR with many commits and changes would be hard to review. Of course, a single PR is also acceptable to me.
Hi @KarmaGYZ @1996fanrui, thank you very much for your patient review comments. I updated the PR based on them. PTAL when you have time. Have a nice weekend~
The waiting mechanism is ready for review. @KarmaGYZ @1996fanrui, would you help take a look when you have time? Thank you very much~ The test verification part will be refactored after the external JUnit 5 migration.
Thank you @1996fanrui @KarmaGYZ very much for the review
I have re-evaluated where the waiting mechanisms should be implemented, based on @KarmaGYZ's offline suggestions.
If the two waiting mechanisms are placed in DeclarativeSlotPool, the information to maintain would be more precise and concise.
- The maintenance of reserve/freeslot/resource profiles should be simpler and more intuitive.
If we can reach an agreement on it, I would like to confirm again: do we still use mainThreadExecutor to implement the timeout check of the waiting mechanism? If so, this may require changing the create method of DeclarativeSlotPoolFactory.
Please let me know your opinions.
@RocMarshal Just curious about the progress: is this PR still waiting for some comments to be addressed before it can be merged?
This PR is in progress now. We plan to merge it after the complete Task Balancing feature is implemented.
👋 Hi, are there any updates or progress on this work as part of FLIP-370?
Thanks for your attention. It's still in progress. I will update it in the next few days.