[FLINK-35553][runtime] Wires up the RescaleManager with the CheckpointLifecycleListener interface

Open XComp opened this issue 1 year ago • 5 comments

PR Chain

  • FLINK-35550: https://github.com/apache/flink/pull/24909
  • FLINK-35551: https://github.com/apache/flink/pull/24910
  • FLINK-35552: https://github.com/apache/flink/pull/24911
  • ⭐ FLINK-35553: https://github.com/apache/flink/pull/24912

What is the purpose of the change

Synchronizes rescaling with checkpoint creation for faster recovery.

Brief change log

  • Introduced a new CheckpointLifecycleListener that allows the AdaptiveScheduler to monitor checkpoint completion
  • RescaleManager.Context.onTrigger will be called when a checkpoint completes or when a configured number of consecutive checkpoints have failed (new configuration parameter: jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count)
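
The trigger rule in the change log can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the CheckpointListener interface, the CountingTrigger class, and their method names are hypothetical stand-ins, and resetting the failure counter after each trigger is an assumption.

```java
public class RescaleTriggerSketch {

    /** Hypothetical stand-in for a checkpoint lifecycle listener (not Flink's interface). */
    interface CheckpointListener {
        void onCompletedCheckpoint();

        void onFailedCheckpoint();
    }

    /**
     * Fires the callback on every completed checkpoint, or once the configured
     * number of checkpoints have failed in a row.
     */
    static final class CountingTrigger implements CheckpointListener {
        private final int failedCheckpointsThreshold;
        private final Runnable onTrigger;
        private int failedInARow = 0;

        CountingTrigger(int failedCheckpointsThreshold, Runnable onTrigger) {
            if (failedCheckpointsThreshold < 1) {
                throw new IllegalArgumentException("threshold must be at least 1");
            }
            this.failedCheckpointsThreshold = failedCheckpointsThreshold;
            this.onTrigger = onTrigger;
        }

        @Override
        public void onCompletedCheckpoint() {
            // A completed checkpoint always triggers and clears the failure streak.
            failedInARow = 0;
            onTrigger.run();
        }

        @Override
        public void onFailedCheckpoint() {
            failedInARow++;
            if (failedInARow >= failedCheckpointsThreshold) {
                // Assumption: the streak starts over after a trigger.
                failedInARow = 0;
                onTrigger.run();
            }
        }
    }

    public static void main(String[] args) {
        int[] triggers = {0};
        CountingTrigger trigger = new CountingTrigger(2, () -> triggers[0]++);

        trigger.onFailedCheckpoint();    // first failure, below the threshold
        trigger.onFailedCheckpoint();    // second failure in a row, triggers
        trigger.onCompletedCheckpoint(); // completion, triggers
        trigger.onFailedCheckpoint();    // streak restarted, below the threshold

        System.out.println("onTrigger calls: " + triggers[0]);
    }
}
```

In the sketch, the threshold plays the role of jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count; how the actual implementation tracks and resets the count may differ.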

Verifying this change

Additional tests were added to check the trigger behavior in ExecutingTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? configuration docs

XComp avatar Jun 07 '24 11:06 XComp

CI report:

  • 236bd38329ceaed5495b158b8891dcc292afc9c1 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

flinkbot avatar Jun 07 '24 11:06 flinkbot

@ztison I addressed your comments. PTAL

XComp avatar Jun 24 '24 17:06 XComp

I rebased onto the most recent version of the base PR #24911.

XComp avatar Jun 25 '24 10:06 XComp

I added another commit to PR #24911 where I introduce CheckpointStatsTracker as an interface. This allows us to get rid of CachingSupplier and makes it easier to test the main thread execution of the AdaptiveScheduler (https://github.com/apache/flink/pull/24912/commits/2bc1afc2bfefa0a02d9fcdb82d6d3006bf935e53) in this PR.

XComp avatar Jun 25 '24 19:06 XComp

I addressed the last two comments and rebased the branch onto the most recent version of parent PR #24911. That way we also have the CI debug commit included.

XComp avatar Jun 28 '24 14:06 XComp

The most recent force-push rebased this branch onto the most recent version of base PR #24911 to prepare for the final commit reorg in this PR.

XComp avatar Jul 02 '24 10:07 XComp

The CI failure doesn't seem to be related to this PR. I created FLINK-35748 to cover the topic.

XComp avatar Jul 03 '24 07:07 XComp

Rebased to master after base PR #24911 was merged.

XComp avatar Jul 03 '24 12:07 XComp

  • Azure CI succeeded with AdaptiveScheduler enabled
  • GHA CI failed with an unrelated error: FLINK-34227

I'm gonna prepare the PR for merging. 👍

XComp avatar Jul 03 '24 16:07 XComp