[FLINK-34518][runtime] Fixes AdaptiveScheduler#suspend bug when the job is suspended during Restarting phase

Open XComp opened this issue 11 months ago • 1 comments

What is the purpose of the change

See comment in FLINK-34518 for more details.

Brief change log

  • Overwrites the ExecutionGraph's state when the job is suspended in the Restarting phase. The actual state might be CANCELLED, which is a globally-terminal state and can therefore trigger an HA data cleanup. We don't want that when restarting the job. Cancelling the ExecutionGraph is an implementation detail of the Restarting state and shouldn't be exposed.
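
To illustrate the idea only: the snippet below is a simplified, self-contained sketch and not the code from this PR. The class, the reduced `JobStatus` enum, and the helper methods `statusReportedOnSuspend` and `wouldCleanUpHaData` are hypothetical stand-ins. It assumes that HA data is cleaned up whenever the reported status is globally terminal, as described above.

```java
public class SuspendDuringRestartSketch {

    /** Reduced stand-in for a job status enum; only the states needed here. */
    enum JobStatus {
        RUNNING, RESTARTING, CANCELLED, SUSPENDED;

        /** In this sketch, CANCELLED is the only globally-terminal state. */
        boolean isGloballyTerminal() {
            return this == CANCELLED;
        }
    }

    /**
     * Hypothetical helper: picks the status to report when the job is suspended.
     * While restarting, the ExecutionGraph's own state (possibly CANCELLED) is
     * overwritten with SUSPENDED so that the internal cancellation does not leak.
     */
    static JobStatus statusReportedOnSuspend(
            JobStatus schedulerState, JobStatus executionGraphState) {
        if (schedulerState == JobStatus.RESTARTING) {
            return JobStatus.SUSPENDED;
        }
        return executionGraphState;
    }

    /** Assumption of this sketch: HA data is removed for globally-terminal states. */
    static boolean wouldCleanUpHaData(JobStatus reported) {
        return reported.isGloballyTerminal();
    }

    public static void main(String[] args) {
        // Without the overwrite, the raw ExecutionGraph state would be reported.
        JobStatus raw = JobStatus.CANCELLED;
        // With the overwrite, suspending during a restart reports SUSPENDED.
        JobStatus fixed = statusReportedOnSuspend(JobStatus.RESTARTING, raw);

        System.out.println("raw state triggers cleanup: " + wouldCleanUpHaData(raw));
        System.out.println("overwritten state triggers cleanup: " + wouldCleanUpHaData(fixed));
    }
}
```

The point of the sketch is that the decision is based on the scheduler's phase, not on whatever state the ExecutionGraph happens to be in at the moment of suspension.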

Verifying this change

  • Added a unit test to cover this scenario

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

XComp, Feb 26 '24 13:02

CI report:

  • 3176a981a1692542fef60ebad55d1b80e60c8d60 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

flinkbot, Feb 26 '24 13:02