flink
[FLINK-33892][runtime] Support Job Recovery from JobMaster Failures for Batch Jobs.
What is the purpose of the change
(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)
Brief change log
Support Job Recovery from JobMaster Failures for Batch Jobs.
Verifying this change
This change adds tests and can be verified by running BatchJobRecoveryTest and JMFailoverITCase.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (yes / no)
- The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
- The serializers: (yes / no / don't know)
- The runtime per-record code paths (performance sensitive): (yes / no / don't know)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
- The S3 file system connector: (yes / no / don't know)
Documentation
- Does this pull request introduce a new feature? (yes / no)
- If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)
CI report:
- 4c8b906988f8ff112d01e10f87cbd40c28d8ff60 Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:
- @flinkbot run azure: re-run the last Azure build
Thanks @zhuzhurk for reviewing. I've updated this PR accordingly. PTAL.
Thanks @zhuzhurk for the thorough review. I have refactored the BatchJobRecoveryTest and JMFailoverITCase based on your comments. PTAL.