[FLINK-27101][checkpointing][rest] Add restful API to trigger checkpoints
What is the purpose of the change
This pull request adds a restful API for one-time checkpointing triggering. This allows user to break incremental checkpoint chain by triggering a manual full checkpoint. Restful API delegates the scheduling to user.
Brief change log
- The checkpoint trigger restful API is async, similar to savepoint restful API
- The checkpoint trigger API takes a CheckpointBackupType and will only trigger checkpoint once
- A new public triggerCheckpoint method taking a CheckpointBackupType is added to CheckpointCoordinator
Verifying this change
This change added tests and can be verified as follows:
- Added unit tests to validate CheckpointTriggerHandler and CheckpointTriggerStatusHandler restful handlers
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
@Public(Evolving): yes - The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? generated docs for restful API
CI report:
- 7c84a40a9c328cb8d9af188fabb0995d88dad189 UNKNOWN
- 8f1487914cc12b81ad2cc884af50fd2151b5fa4b Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:@flinkbot run azurere-run the last Azure build
Thanks a lot for the contribution @leletan . It looks mostly good to me, I've left a couple of comments. It would be great if someone else (maybe @zentol ) could also take a look at the REST API changes.
Also there seems to be a build failure.
If yes, how is the feature documented? [WIP]
Let's not forget about this :)
Thanks for the super fast response, @pnowojski !
Do we have a standard convention/process about restful API feature?
- The only thing I found so far is this README. I am still a little bit confused after reading the README. Do we have an example for that?
- Any restful API documentation other than above we need to add?
Also I am wondering if this change would need any further test coverages? If yes, where would be the best place I should start looking?
re documentation: You basically just run mvn package -Dgenerate-rest-docs -pl flink-docs -am -nsu -DskipTests locally.
@pnowojski @zentol Addressed the code comments (at least I thought I did) and generated the docs. Could you please take a look whenever you have time? Thanks!
API was modified in a compatible way, but the snapshot was not updated. To update the snapshot, re-run this test with -Dgenerate-rest-snapshot being set.
Run "mvn verify -Dgenerate-rest-snapshot -pl flink-runtime-web -nsu", or RuntimeRestAPIStabilityTest with the property being set.
Essentially there are 2 files we generate: the documentation (with my last comment) and a snapshot for comparing compatibility (we cant use the docs because we don't persist everything we need IIRC).
Hi folks, @pnowojski @stevenzwu @zentol Thanks for your review. I believe I addressed all comments (including ones about docs and tests), except the pending naming of the old CheckpointType class name.
Personally I don't have any strong opinion about the class name so I just marked it as @Internal for now. I am happy to update it after we reach a consensus.
Meanwhile please let me know if everything else looks good to you.
Thanks!
Thanks again folks. All comments should be address, please let me know if there is anything else I need to do for this PR.
@leletan I think we probably should make the CheckpointType and SavepointType renaming as a separate PR. merge that one first then come back on this one. it gets difficult to review the diff when the two changes are mixed together. touching so many files.
@stevenzwu All your comments should be addressed. Please let me know if there is anything missing. Thanks!
@pnowojski Your comments and change requests should all be addressed. Please let me know if I miss anything. Thanks!
As per discussion https://github.com/apache/flink/pull/20852#issuecomment-1260402519, will address the internal class renaming in a follow up PR.
@leletan I thought we will come back to this PR after the internal class renaming is merged, unless @pnowojski thinks we can proceed with this first. Please create a jira for the renaming.
@flinkbot run azure
Thanks, @pnowojski and @stevenzwu! Combing the feedbacks and here is my plan:
- fix the build
- rebase all the changes so far into one commit.
- do a follow up commit for the class renaming in this PR
Let me know if you have concern about above.
@flinkbot run azure
Hi folks, @pnowojski @stevenzwu @zentol Thanks for your review again. I believe I addressed all comments, except the pending naming of the old CheckpointType class name (will add in a later commit to make the code easier)
One big change is, I added one more CheckpointType as CONFIGURED, to represent using the existing internal CheckpointSnapshotType as configured by flink application to triggerCheckpoint, replacing previous legacy minicluster case represented by null.
Please let me know if there is anything else I am missing.
@zentol I see I have a lot of style issues and thanks for pointing them out. I am wondering if you have a doc / pointer for those style rules (in addition to the auto style check and spotless) so I won't make those mistakes in the future.
@zentol Thanks for the detailed feedbacks, really appreciated them! I should have addressed all the comments in this round, including filing this radar. Please take another look when you have a chance.
@flinkbot run azure