[FLINK-33892][runtime] Support Job Recovery from JobMaster Failures for Batch Jobs.

Open • JunRuiLee opened this issue 9 months ago • 1 comment

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

Support Job Recovery from JobMaster Failures for Batch Jobs.

Verifying this change

This change adds tests and can be verified by running BatchJobRecoveryTest and JMFailoverITCase.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

JunRuiLee • May 11 '24 07:05

CI report:

  • 4c8b906988f8ff112d01e10f87cbd40c28d8ff60 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

flinkbot • May 11 '24 07:05

Thanks @zhuzhurk for reviewing. I've updated this PR accordingly. PTAL.

JunRuiLee • May 23 '24 08:05

Thanks @zhuzhurk for the thorough review. I have refactored the BatchJobRecoveryTest and JMFailoverITCase based on your comments. PTAL.

JunRuiLee • May 25 '24 02:05