
[FLINK-35552][runtime] Moves CheckpointStatsTracker out of DefaultExecutionGraphFactory into Scheduler

Open XComp opened this issue 1 year ago • 7 comments

PR Chain

  • FLINK-35550: https://github.com/apache/flink/pull/24909
  • FLINK-35551: https://github.com/apache/flink/pull/24910
  • ⭐ FLINK-35552: https://github.com/apache/flink/pull/24911
  • FLINK-35553: https://github.com/apache/flink/pull/24912

What is the purpose of the change

The AdaptiveScheduler needs access to the CheckpointStatsTracker to monitor checkpoint-related events.

Brief change log

  • Refactors CheckpointStatsTracker constructor to not rely on the total subtask count anymore when initializing the tracker
  • Moves CheckpointStatsTracker ownership from DefaultExecutionGraphFactory to the scheduler implementations
  • Makes CheckpointStatsTracker an "implementation detail" of the execution graph that is not exposed through the API
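
The ownership change described in the list above can be illustrated with a minimal sketch. All types here (StatsTracker, SimpleStatsTracker, SketchExecutionGraph, SketchGraphFactory, SketchScheduler) are hypothetical, simplified stand-ins and not Flink's actual classes or signatures. The sketch only assumes what the change log states: the tracker is constructed without a total subtask count, the scheduler creates and owns it, and the factory receives it instead of creating it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a checkpoint stats tracker (not Flink's real interface).
interface StatsTracker {
    void reportCompletedCheckpoint(long checkpointId);

    int completedCheckpoints();
}

// Assumption: the tracker can be created without knowing the total subtask count.
final class SimpleStatsTracker implements StatsTracker {
    private final List<Long> completed = new ArrayList<>();

    @Override
    public void reportCompletedCheckpoint(long checkpointId) {
        completed.add(checkpointId);
    }

    @Override
    public int completedCheckpoints() {
        return completed.size();
    }
}

// Stand-in for an execution graph: it uses the tracker internally
// but has no getter that would expose it to callers.
final class SketchExecutionGraph {
    private final StatsTracker tracker;

    SketchExecutionGraph(StatsTracker tracker) {
        this.tracker = tracker;
    }

    void completeCheckpoint(long checkpointId) {
        tracker.reportCompletedCheckpoint(checkpointId);
    }
}

// The factory no longer creates the tracker; it receives one from its caller.
final class SketchGraphFactory {
    SketchExecutionGraph create(StatsTracker tracker) {
        return new SketchExecutionGraph(tracker);
    }
}

// The scheduler owns the tracker, so it can observe checkpoint events directly.
final class SketchScheduler {
    private final StatsTracker tracker = new SimpleStatsTracker();
    private final SketchGraphFactory factory = new SketchGraphFactory();
    private SketchExecutionGraph graph = factory.create(tracker);

    // Builds a fresh graph while reusing the scheduler-owned tracker.
    void restartWithNewGraph() {
        graph = factory.create(tracker);
    }

    void completeCheckpoint(long checkpointId) {
        graph.completeCheckpoint(checkpointId);
    }

    int observedCheckpoints() {
        return tracker.completedCheckpoints();
    }
}
```

In this sketch, the scheduler keeps seeing checkpoint events even after it replaces the graph, because the tracker's lifetime is tied to the scheduler rather than to a graph created by the factory. Whether the real implementation behaves the same way in every scheduler is not something this sketch claims.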

Verifying this change

  • Existing tests cover the change.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

XComp avatar Jun 07 '24 11:06 XComp

CI report:

  • f52650d9a1586e971d5736a0ac69dc4a06f03bc4 UNKNOWN
  • 3764e627199e89cbb72aaaf9bc47e8fee7097704 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

flinkbot avatar Jun 07 '24 11:06 flinkbot

@flinkbot run azure

XComp avatar Jun 20 '24 06:06 XComp

Force-pushed a rebase onto the most recent version of the base PR #24910:

$ git rebase --onto=FLINK-35551 6af9560d62f963fc8d85d26807627aa452932fcf

XComp avatar Jun 24 '24 12:06 XComp

@flinkbot run azure

XComp avatar Jun 25 '24 06:06 XComp

I rebased the branch onto master after the base PR #24910 was merged.

XComp avatar Jun 27 '24 20:06 XComp

The observed CI failures are documented in:

  • FLINK-25453 for the SqlGatewayE2ECase.testMaterializedTableInFullMode
  • FLINK-35722 for the CoordinatorEventsToStreamOperatorRecipientExactlyOnceITCase.testCheckpoint

XComp avatar Jun 28 '24 13:06 XComp

@flinkbot run azure

XComp avatar Jul 01 '24 07:07 XComp

CI with AdaptiveScheduler enabled was successful. I'm gonna go ahead and prepare this PR to be merged (i.e. remove the DO-NOT-MERGE commit).

XComp avatar Jul 03 '24 05:07 XComp