flink-deployer
                        Use Flink API to retrieve savepoint name
Thank you for this project.
This PR changes the code so that the Flink API endpoint /jobs/:jobid/savepoints/:triggerid is used to retrieve the name of the latest savepoint when updating a job.
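For illustration, here is a minimal sketch of reading the savepoint location from that endpoint. It assumes the response shape of the Flink monitoring API; the base URL, job ID, trigger ID and function name are placeholders, not the deployer's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// savepointStatus mirrors the relevant fields of the
// GET /jobs/:jobid/savepoints/:triggerid response.
type savepointStatus struct {
	Status struct {
		ID string `json:"id"` // "IN_PROGRESS" or "COMPLETED"
	} `json:"status"`
	Operation struct {
		Location     string          `json:"location"`
		FailureCause json.RawMessage `json:"failure-cause"`
	} `json:"operation"`
}

// savepointLocation polls the savepoint trigger status until it completes
// and returns the savepoint location (e.g. an s3:// or hdfs:// URI).
func savepointLocation(baseURL, jobID, triggerID string) (string, error) {
	url := fmt.Sprintf("%s/jobs/%s/savepoints/%s", baseURL, jobID, triggerID)
	for {
		resp, err := http.Get(url)
		if err != nil {
			return "", err
		}
		var status savepointStatus
		err = json.NewDecoder(resp.Body).Decode(&status)
		resp.Body.Close()
		if err != nil {
			return "", err
		}
		if status.Status.ID == "COMPLETED" {
			if status.Operation.Location == "" {
				return "", fmt.Errorf("savepoint failed: %s", status.Operation.FailureCause)
			}
			return status.Operation.Location, nil
		}
		time.Sleep(2 * time.Second) // still IN_PROGRESS; poll again
	}
}

func main() {
	// Placeholder values for illustration only.
	location, err := savepointLocation("http://localhost:8081", "<job-id>", "<trigger-id>")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("latest savepoint:", location)
}
```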
How I imagine the deployer API should be used:
| Intent | Deployer command(s) | Deployer behaviour |
|---|---|---|
| Update running job with new version (using an up-to-date savepoint) | `update` with `job-name-base` | Deployer cancels job and creates a savepoint in one operation. Deployer starts new job with savepoint. |
| Restart a job from an existing savepoint | `cancel` (not implemented yet) + `deploy` with `savepoint-path` | Deployer cancels the running job. Deployer starts a new job with the savepoint defined by a CLI argument. |
| Start a job without using a savepoint | `deploy` | Deployer starts new job without savepoint. |
I completely removed the logic that retrieved the most recent savepoint through the file system. Using the Flink API is a more flexible approach because you don't need to mount a volume containing the savepoints, which allows different storage backends for savepoints / checkpoints, such as blob stores like AWS S3 or Google Cloud Storage.
Also, the job-name-base mechanism didn't seem functional yet. It should work as advertised now.
Docker image: https://hub.docker.com/r/nicktriller/flink-deployer/
Codecov Report
Merging #22 into master will decrease coverage by 2.44%. The diff coverage is 95.23%.
@@            Coverage Diff            @@
##           master     #22      +/-   ##
=========================================
- Coverage   55.85%   53.4%   -2.45%     
=========================================
  Files          12      11       -1     
  Lines         478     440      -38     
=========================================
- Hits          267     235      -32     
+ Misses        172     168       -4     
+ Partials       39      37       -2
| Impacted Files | Coverage Δ |
|---|---|
| cmd/cli/main.go | 27.18% <0%> (-0.78%) ↓ |
| cmd/cli/operations/update_job.go | 88.73% <100%> (-0.61%) ↓ |
| cmd/cli/operations/deploy.go | 75% <100%> (-1.32%) ↓ |
| cmd/cli/flink/savepoint.go | 77.77% <100%> (+0.63%) ↑ |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Powered by Codecov. Last update 56a4d94...7b7556a.
Hi Nick,
Thanks for the contribution. Niels and I have had quite a few debates about how to properly implement savepoint retrieval. Referring to your table, option 2 would work if cancel returned the full savepoint path. If it doesn't, restarting from an existing savepoint would require the caller of the deploy action to know the full path to the savepoint, which means an external system would need to store the savepoint path and pass it to the deployer. That is something we didn't want to impose on our users right away.
The same problem applies to supporting a deploy with a savepoint path without cancelling first: there's no way of knowing the full path to the savepoint.
@mrooding I think Nick's intentions were correct: we should be retrieving the savepoint path from the Flink API. There are edge cases where two consecutive savepoints are triggered, or savepoints are triggered for different jobs (parallel invocations of flink-deployer, or one from flink-deployer and one from the Flink UI), in which simply retrieving the latest savepoint will not give the correct savepoint required to restart the job. Also, Nick's change supports every savepoint scheme that Flink supports now or will support in the future (S3, HDFS) without having to implement each scheme specifically, as in the other PR created by kerinin. Currently flink-deployer only supports the local file system, whereas most Flink deployments these days run on Kubernetes, where there is no local storage. What we need is to merge part of Nick's changes into the latest code base:
- Leave the deploy action alone
- The update action will not call terminate but will trigger a savepoint and cancel the job as part of that trigger (Nick's change)
- Monitoring the savepoint will return the savepoint location in the response (because we monitor by trigger ID; see the sketch below), so there is no chance of retrieving the wrong savepoint
- The update action need not call retrieveLatestSavepoint
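For reference, a minimal sketch of the trigger-and-cancel step under these assumptions: it POSTs to the Flink monitoring API's /jobs/:jobid/savepoints endpoint with cancel-job set to true and returns the trigger ID to monitor. The base URL, job ID, target directory and function name are placeholders, not the deployer's actual code:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// triggerSavepointAndCancel asks Flink to take a savepoint and cancel the
// job in one operation, returning the request (trigger) ID to monitor.
func triggerSavepointAndCancel(baseURL, jobID, targetDir string) (string, error) {
	body, err := json.Marshal(map[string]interface{}{
		"target-directory": targetDir, // e.g. s3://bucket/savepoints
		"cancel-job":       true,      // cancel as part of the savepoint trigger
	})
	if err != nil {
		return "", err
	}
	resp, err := http.Post(
		fmt.Sprintf("%s/jobs/%s/savepoints", baseURL, jobID),
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var trigger struct {
		RequestID string `json:"request-id"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&trigger); err != nil {
		return "", err
	}
	return trigger.RequestID, nil
}

func main() {
	// Placeholder values for illustration only. The returned trigger ID would
	// then be polled via GET /jobs/:jobid/savepoints/:triggerid, whose response
	// includes the savepoint location, as sketched earlier in this thread.
	triggerID, err := triggerSavepointAndCancel("http://localhost:8081", "<job-id>", "s3://bucket/savepoints")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("monitor trigger:", triggerID)
}
```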
What do you think? If Nick is busy I can do it.
Hi @joshuavijay! Thanks for your input. I haven't been working much with Flink in 2019, so the details of the relevant mechanisms aren't fresh in my mind. Feel free to reuse any parts that are useful to you.