flink
flink copied to clipboard
[FLINK-27101][checkpointing][rest] Add restful API to trigger checkpoints
What is the purpose of the change
This pull request adds a restful API for one-time checkpointing triggering. This allows user to break incremental checkpoint chain by triggering a manual full checkpoint. Restful API delegates the scheduling to user.
Brief change log
- The checkpoint trigger restful API is async, similar to savepoint restful API
- The checkpoint trigger API takes a CheckpointBackupType and will only trigger checkpoint once
- A new public triggerCheckpoint method taking a CheckpointBackupType is added to CheckpointCoordinator
Verifying this change
This change added tests and can be verified as follows:
- Added unit tests to validate CheckpointTriggerHandler and CheckpointTriggerStatusHandler restful handlers
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
@Public(Evolving)
: yes - The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? yes
- If yes, how is the feature documented? generated docs for restful API
CI report:
- 7c84a40a9c328cb8d9af188fabb0995d88dad189 UNKNOWN
- 8f1487914cc12b81ad2cc884af50fd2151b5fa4b Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:-
@flinkbot run azure
re-run the last Azure build
Thanks a lot for the contribution @leletan . It looks mostly good to me, I've left a couple of comments. It would be great if someone else (maybe @zentol ) could also take a look at the REST API changes.
Also there seems to be a build failure.
If yes, how is the feature documented? [WIP]
Let's not forget about this :)
Thanks for the super fast response, @pnowojski !
Do we have a standard convention/process about restful API feature?
- The only thing I found so far is this README. I am still a little bit confused after reading the README. Do we have an example for that?
- Any restful API documentation other than above we need to add?
Also I am wondering if this change would need any further test coverages? If yes, where would be the best place I should start looking?
re documentation: You basically just run mvn package -Dgenerate-rest-docs -pl flink-docs -am -nsu -DskipTests
locally.
@pnowojski @zentol Addressed the code comments (at least I thought I did) and generated the docs. Could you please take a look whenever you have time? Thanks!
API was modified in a compatible way, but the snapshot was not updated. To update the snapshot, re-run this test with -Dgenerate-rest-snapshot being set.
Run "mvn verify -Dgenerate-rest-snapshot -pl flink-runtime-web -nsu", or RuntimeRestAPIStabilityTest
with the property being set.
Essentially there are 2 files we generate: the documentation (with my last comment) and a snapshot for comparing compatibility (we cant use the docs because we don't persist everything we need IIRC).
Hi folks, @pnowojski @stevenzwu @zentol Thanks for your review. I believe I addressed all comments (including ones about docs and tests), except the pending naming of the old CheckpointType class name.
Personally I don't have any strong opinion about the class name so I just marked it as @Internal
for now. I am happy to update it after we reach a consensus.
Meanwhile please let me know if everything else looks good to you.
Thanks!
Thanks again folks. All comments should be address, please let me know if there is anything else I need to do for this PR.
@leletan I think we probably should make the CheckpointType
and SavepointType
renaming as a separate PR. merge that one first then come back on this one. it gets difficult to review the diff when the two changes are mixed together. touching so many files.
@stevenzwu All your comments should be addressed. Please let me know if there is anything missing. Thanks!
@pnowojski Your comments and change requests should all be addressed. Please let me know if I miss anything. Thanks!
As per discussion https://github.com/apache/flink/pull/20852#issuecomment-1260402519, will address the internal class renaming in a follow up PR.
@leletan I thought we will come back to this PR after the internal class renaming is merged, unless @pnowojski thinks we can proceed with this first. Please create a jira for the renaming.
@flinkbot run azure
Thanks, @pnowojski and @stevenzwu! Combing the feedbacks and here is my plan:
- fix the build
- rebase all the changes so far into one commit.
- do a follow up commit for the class renaming in this PR
Let me know if you have concern about above.
@flinkbot run azure
Hi folks, @pnowojski @stevenzwu @zentol Thanks for your review again. I believe I addressed all comments, except the pending naming of the old CheckpointType class name (will add in a later commit to make the code easier)
One big change is, I added one more CheckpointType as CONFIGURED
, to represent using the existing internal CheckpointSnapshotType as configured by flink application to triggerCheckpoint, replacing previous legacy minicluster case represented by null
.
Please let me know if there is anything else I am missing.
@zentol I see I have a lot of style issues and thanks for pointing them out. I am wondering if you have a doc / pointer for those style rules (in addition to the auto style check and spotless) so I won't make those mistakes in the future.
@zentol Thanks for the detailed feedbacks, really appreciated them! I should have addressed all the comments in this round, including filing this radar. Please take another look when you have a chance.
@flinkbot run azure