[FLINK-35553][runtime] Wires up the RescaleManager with the CheckpointLifecycleListener interface
PR Chain
- FLINK-35550: https://github.com/apache/flink/pull/24909
- FLINK-35551: https://github.com/apache/flink/pull/24910
- FLINK-35552: https://github.com/apache/flink/pull/24911
- ⭐ FLINK-35553: https://github.com/apache/flink/pull/24912
What is the purpose of the change
Synchronize rescaling with checkpoint creation so that the job restarts from a fresh checkpoint and recovers faster.
Brief change log
- Introduced a new `CheckpointLifecycleListener` that allows the `AdaptiveScheduler` to monitor checkpoint completion
- `RescaleManager.Context.onTrigger` will be called if a checkpoint completed or if a configured number of consecutive checkpoints failed (new configuration parameter: `jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count`)
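The trigger rule from the change log can be sketched roughly as follows. This is a standalone illustration and not the code in this PR: the class `RescaleOnCheckpointSketch`, the listener method names `onCompletedCheckpoint`/`onFailedCheckpoint`, and the choice to reset the failure counter after a trigger are all assumptions made for the example. Only the idea (fire on a completed checkpoint, or after the configured number of consecutive failures) comes from the description above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the trigger rule; none of these types are Flink's real classes. */
public class RescaleOnCheckpointSketch {

    /** Stand-in for the listener described above; the method names are assumptions. */
    interface CheckpointLifecycleListener {
        void onCompletedCheckpoint();

        void onFailedCheckpoint();
    }

    /** Fires the callback on every completed checkpoint or after N consecutive failures. */
    static final class RescaleTrigger implements CheckpointLifecycleListener {
        // Plays the role of jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count.
        private final int rescaleOnFailedCheckpointsCount;
        // Plays the role of RescaleManager.Context.onTrigger.
        private final Runnable onTrigger;
        private int consecutiveFailures;

        RescaleTrigger(int rescaleOnFailedCheckpointsCount, Runnable onTrigger) {
            if (rescaleOnFailedCheckpointsCount < 1) {
                throw new IllegalArgumentException("threshold must be at least 1");
            }
            this.rescaleOnFailedCheckpointsCount = rescaleOnFailedCheckpointsCount;
            this.onTrigger = onTrigger;
        }

        @Override
        public void onCompletedCheckpoint() {
            // A completed checkpoint breaks any failure streak and triggers immediately.
            consecutiveFailures = 0;
            onTrigger.run();
        }

        @Override
        public void onFailedCheckpoint() {
            consecutiveFailures++;
            if (consecutiveFailures >= rescaleOnFailedCheckpointsCount) {
                // Assumption: the streak starts over once the trigger has fired.
                consecutiveFailures = 0;
                onTrigger.run();
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger triggers = new AtomicInteger();
        RescaleTrigger trigger = new RescaleTrigger(2, triggers::incrementAndGet);

        trigger.onFailedCheckpoint();    // one failure, below the threshold
        trigger.onCompletedCheckpoint(); // completion resets the streak and triggers
        trigger.onFailedCheckpoint();
        trigger.onFailedCheckpoint();    // second consecutive failure triggers

        System.out.println("triggers fired: " + triggers.get());
    }
}
```

Falling back to a failure count keeps a pending rescale from waiting indefinitely when checkpoints keep failing, at the cost of restarting from an older checkpoint.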
Verifying this change
Additional tests were added to `ExecutingTest` to check the trigger behavior.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? configuration docs