flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-27101][checkpointing][rest] Add restful API to trigger checkpoints

Open leletan opened this issue 2 years ago • 6 comments

What is the purpose of the change

This pull request adds a restful API for one-time checkpointing triggering. This allows user to break incremental checkpoint chain by triggering a manual full checkpoint. Restful API delegates the scheduling to user.

Brief change log

  • The checkpoint trigger restful API is async, similar to savepoint restful API
  • The checkpoint trigger API takes a CheckpointBackupType and will only trigger checkpoint once
  • A new public triggerCheckpoint method taking a CheckpointBackupType is added to CheckpointCoordinator

Verifying this change

This change added tests and can be verified as follows:

  • Added unit tests to validate CheckpointTriggerHandler and CheckpointTriggerStatusHandler restful handlers

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? generated docs for restful API

leletan avatar Sep 19 '22 07:09 leletan

CI report:

  • 7c84a40a9c328cb8d9af188fabb0995d88dad189 UNKNOWN
  • 8f1487914cc12b81ad2cc884af50fd2151b5fa4b Azure: SUCCESS
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar Sep 19 '22 07:09 flinkbot

Thanks a lot for the contribution @leletan . It looks mostly good to me, I've left a couple of comments. It would be great if someone else (maybe @zentol ) could also take a look at the REST API changes.

Also there seems to be a build failure.

If yes, how is the feature documented? [WIP]

Let's not forget about this :)

Thanks for the super fast response, @pnowojski !

Do we have a standard convention/process about restful API feature?

  • The only thing I found so far is this README. I am still a little bit confused after reading the README. Do we have an example for that?
  • Any restful API documentation other than above we need to add?

leletan avatar Sep 20 '22 04:09 leletan

Also I am wondering if this change would need any further test coverages? If yes, where would be the best place I should start looking?

leletan avatar Sep 20 '22 04:09 leletan

re documentation: You basically just run mvn package -Dgenerate-rest-docs -pl flink-docs -am -nsu -DskipTests locally.

zentol avatar Sep 20 '22 07:09 zentol

@pnowojski @zentol Addressed the code comments (at least I thought I did) and generated the docs. Could you please take a look whenever you have time? Thanks!

leletan avatar Sep 22 '22 03:09 leletan

API was modified in a compatible way, but the snapshot was not updated. To update the snapshot, re-run this test with -Dgenerate-rest-snapshot being set.

Run "mvn verify -Dgenerate-rest-snapshot -pl flink-runtime-web -nsu", or RuntimeRestAPIStabilityTest with the property being set.

Essentially there are 2 files we generate: the documentation (with my last comment) and a snapshot for comparing compatibility (we cant use the docs because we don't persist everything we need IIRC).

zentol avatar Sep 22 '22 11:09 zentol

Hi folks, @pnowojski @stevenzwu @zentol Thanks for your review. I believe I addressed all comments (including ones about docs and tests), except the pending naming of the old CheckpointType class name. Personally I don't have any strong opinion about the class name so I just marked it as @Internal for now. I am happy to update it after we reach a consensus. Meanwhile please let me know if everything else looks good to you. Thanks!

leletan avatar Sep 27 '22 06:09 leletan

Thanks again folks. All comments should be address, please let me know if there is anything else I need to do for this PR.

leletan avatar Sep 28 '22 01:09 leletan

@leletan I think we probably should make the CheckpointType and SavepointType renaming as a separate PR. merge that one first then come back on this one. it gets difficult to review the diff when the two changes are mixed together. touching so many files.

stevenzwu avatar Sep 28 '22 05:09 stevenzwu

@stevenzwu All your comments should be addressed. Please let me know if there is anything missing. Thanks!

leletan avatar Sep 30 '22 19:09 leletan

@pnowojski Your comments and change requests should all be addressed. Please let me know if I miss anything. Thanks!

leletan avatar Sep 30 '22 19:09 leletan

As per discussion https://github.com/apache/flink/pull/20852#issuecomment-1260402519, will address the internal class renaming in a follow up PR.

leletan avatar Sep 30 '22 19:09 leletan

@leletan I thought we will come back to this PR after the internal class renaming is merged, unless @pnowojski thinks we can proceed with this first. Please create a jira for the renaming.

stevenzwu avatar Sep 30 '22 20:09 stevenzwu

@flinkbot run azure

leletan avatar Oct 03 '22 15:10 leletan

Thanks, @pnowojski and @stevenzwu! Combing the feedbacks and here is my plan:

  1. fix the build
  2. rebase all the changes so far into one commit.
  3. do a follow up commit for the class renaming in this PR

Let me know if you have concern about above.

leletan avatar Oct 03 '22 16:10 leletan

@flinkbot run azure

leletan avatar Oct 08 '22 07:10 leletan

Hi folks, @pnowojski @stevenzwu @zentol Thanks for your review again. I believe I addressed all comments, except the pending naming of the old CheckpointType class name (will add in a later commit to make the code easier) One big change is, I added one more CheckpointType as CONFIGURED, to represent using the existing internal CheckpointSnapshotType as configured by flink application to triggerCheckpoint, replacing previous legacy minicluster case represented by null.

Please let me know if there is anything else I am missing.

leletan avatar Oct 10 '22 17:10 leletan

@zentol I see I have a lot of style issues and thanks for pointing them out. I am wondering if you have a doc / pointer for those style rules (in addition to the auto style check and spotless) so I won't make those mistakes in the future.

leletan avatar Oct 11 '22 17:10 leletan

@zentol Thanks for the detailed feedbacks, really appreciated them! I should have addressed all the comments in this round, including filing this radar. Please take another look when you have a chance.

leletan avatar Oct 13 '22 02:10 leletan

@flinkbot run azure

zentol avatar Oct 13 '22 17:10 zentol