[FLINK-35553][runtime] Wires up the RescaleManager with the CheckpointLifecycleListener interface
PR Chain
- FLINK-35550: https://github.com/apache/flink/pull/24909
- FLINK-35551: https://github.com/apache/flink/pull/24910
- FLINK-35552: https://github.com/apache/flink/pull/24911
- ⭐ FLINK-35553: https://github.com/apache/flink/pull/24912
What is the purpose of the change
Synchronizes rescaling with checkpoint completion, so that a rescale restarts from a recent checkpoint and recovery is faster.
Brief change log
- Introduced new CheckpointLifecycleListener that allows the AdaptiveScheduler to monitor checkpoint completion
- RescaleManager.Context.onTrigger will be called if a checkpoint was completed or if a configured number of consecutive failed checkpoints occurred (new configuration parameter: jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count)
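The trigger rule described above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the class and interface names below (RescaleTriggerSketch, CheckpointListener, RescaleOnCheckpoint) are hypothetical stand-ins, and the choices to reset the failure counter on a completed checkpoint and after the threshold is reached are assumptions made for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; all names here are hypothetical, not Flink's actual classes. */
public class RescaleTriggerSketch {

    /** Hypothetical stand-in for a checkpoint lifecycle listener. */
    interface CheckpointListener {
        void onCompletedCheckpoint();

        void onFailedCheckpoint();
    }

    /** Fires a trigger on each completed checkpoint or after N consecutive failures. */
    static final class RescaleOnCheckpoint implements CheckpointListener {
        // Stands in for the value of
        // jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count.
        private final int failedCheckpointThreshold;
        // Stands in for the onTrigger callback mentioned in the change log.
        private final Runnable onTrigger;
        private int consecutiveFailures;

        RescaleOnCheckpoint(int failedCheckpointThreshold, Runnable onTrigger) {
            if (failedCheckpointThreshold < 1) {
                throw new IllegalArgumentException("threshold must be at least 1");
            }
            this.failedCheckpointThreshold = failedCheckpointThreshold;
            this.onTrigger = onTrigger;
        }

        @Override
        public void onCompletedCheckpoint() {
            // Assumption: a completed checkpoint resets the failure streak.
            consecutiveFailures = 0;
            onTrigger.run();
        }

        @Override
        public void onFailedCheckpoint() {
            consecutiveFailures++;
            if (consecutiveFailures >= failedCheckpointThreshold) {
                // Assumption: the streak starts over once the trigger has fired.
                consecutiveFailures = 0;
                onTrigger.run();
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger triggers = new AtomicInteger();
        RescaleOnCheckpoint listener = new RescaleOnCheckpoint(2, triggers::incrementAndGet);

        listener.onFailedCheckpoint();    // one failure, below the threshold: no trigger
        listener.onCompletedCheckpoint(); // completed checkpoint: trigger, streak reset
        listener.onFailedCheckpoint();    // one failure again
        listener.onFailedCheckpoint();    // second consecutive failure: trigger

        System.out.println(triggers.get()); // prints 2
    }
}
```

In the sketch, the failure threshold is what keeps a job whose checkpoints keep failing from never getting a chance to rescale.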
Verifying this change
Additional tests were added to check the trigger behavior in ExecutingTest.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with @Public(Evolving): no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? configuration docs
CI report:
- 236bd38329ceaed5495b158b8891dcc292afc9c1 Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot run azure: re-run the last Azure build
@ztison I addressed your comments. PTAL
I rebased onto the most recent version of the base PR #24911.
I added another commit to PR #24911 where I introduce CheckpointStatsTracker as an interface. This allows us to get rid of CachingSupplier and makes it easier to test the main thread execution of the AdaptiveScheduler (https://github.com/apache/flink/pull/24912/commits/2bc1afc2bfefa0a02d9fcdb82d6d3006bf935e53) in this PR.
I addressed the last two comments and rebased the branch onto the most recent version of parent PR #24911. That way we also have the CI debug commit included.
The most recent force-push rebased the branch onto the most recent version of base PR #24911 to prepare for the final commit reorg in this PR.
The CI failure doesn't seem to be related to this PR. I created FLINK-35748 to cover the topic.
Rebased to master after base PR #24911 was merged.