[FLINK-35553][runtime] Wires up the RescaleManager with the CheckpointLifecycleListener interface

Open XComp opened this issue 8 months ago • 5 comments

PR Chain

  • FLINK-35550: https://github.com/apache/flink/pull/24909
  • FLINK-35551: https://github.com/apache/flink/pull/24910
  • FLINK-35552: https://github.com/apache/flink/pull/24911
  • ⭐ FLINK-35553: https://github.com/apache/flink/pull/24912

What is the purpose of the change

Synchronize rescaling with checkpoint creation so that a rescale restarts from a recent checkpoint and recovery is faster.

Brief change log

  • Introduced a new CheckpointLifecycleListener that allows the AdaptiveScheduler to monitor checkpoint completion
  • RescaleManager.Context.onTrigger is called when a checkpoint completes or when a configured number of consecutive checkpoints have failed (new configuration parameter: jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count)
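
The trigger rule in the change log can be illustrated with a minimal, standalone sketch. It is not the code from this PR: the listener method names (onCompletedCheckpoint, onFailedCheckpoint), the CountingRescaleTrigger class, and the choice to reset the failure counter after firing are all assumptions made for illustration. The Runnable stands in for RescaleManager.Context.onTrigger, and the constructor argument stands in for the value of jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count.

```java
/** Hypothetical stand-in for the listener interface; the real signatures may differ. */
interface CheckpointLifecycleListener {
    void onCompletedCheckpoint();

    void onFailedCheckpoint();
}

/**
 * Illustrative only: fires the trigger on every completed checkpoint, or once
 * the configured number of consecutive checkpoint failures has been reached.
 */
final class CountingRescaleTrigger implements CheckpointLifecycleListener {
    private final int maxConsecutiveFailures;
    private final Runnable onTrigger; // stands in for RescaleManager.Context.onTrigger
    private int consecutiveFailures;

    CountingRescaleTrigger(int maxConsecutiveFailures, Runnable onTrigger) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        this.onTrigger = onTrigger;
    }

    @Override
    public void onCompletedCheckpoint() {
        // A successful checkpoint ends any failure streak and triggers immediately.
        consecutiveFailures = 0;
        onTrigger.run();
    }

    @Override
    public void onFailedCheckpoint() {
        consecutiveFailures++;
        if (consecutiveFailures >= maxConsecutiveFailures) {
            // Assumption: the counter restarts after the trigger has fired.
            consecutiveFailures = 0;
            onTrigger.run();
        }
    }
}

final class RescaleTriggerDemo {
    public static void main(String[] args) {
        CountingRescaleTrigger trigger =
                new CountingRescaleTrigger(2, () -> System.out.println("onTrigger called"));
        trigger.onFailedCheckpoint();    // first failure: below the threshold
        trigger.onFailedCheckpoint();    // second consecutive failure: fires
        trigger.onCompletedCheckpoint(); // completed checkpoint: fires
    }
}
```

With a threshold of 2, a single failed checkpoint does nothing, the second consecutive failure fires the trigger, and any completed checkpoint fires it and clears the streak.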

Verifying this change

Additional tests were added to check the trigger behavior in ExecutingTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? configuration docs

XComp · Jun 07 '24 11:06