Non-streaming jobs, Beam and Checkpointing

Open prkadalb opened this issue 5 years ago • 5 comments

Hello,

I'm trying to use the flinkk8soperator with Beam (with Flink being the runner).

The operator is able to launch the Job Manager and Task manager pods and can submit the job as well. It works fine for streaming applications.

However, when I try to run a batch application, it turns out that Beam does not enable checkpointing in Flink.

The k8s operator, however, assumes that checkpointing is turned on and throws an error because the checkpoint API returns an HTTP 404: https://github.com/lyft/flinkk8soperator/blob/f499e7f2ff5c2f7b2e84e458d08ffdb1df2d22b9/pkg/controller/flink/flink.go#L535

https://github.com/apache/flink/blob/7aafb248770070f0fc1bb2bd49d7bbffbb873699/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/checkpoints/CheckpointingStatisticsHandler.java#L94

https://github.com/apache/beam/blob/7b3a3fa6c9291692b56dbc358dfc075724b993b6/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkExecutionEnvironments.java#L223

Is it possible to let the operator know somehow that checkpoints are not enabled, and that a 404 error on the checkpoint API is not fatal?

Thanks!

prkadalb avatar Nov 19 '19 11:11 prkadalb

cc @tweise

anandswaminathan avatar Nov 21 '19 23:11 anandswaminathan

@anandswaminathan we should support jobs that don't enable checkpointing. There are also streaming use cases where it makes sense to not enable checkpointing.

tweise avatar Nov 22 '19 00:11 tweise

@tweise @prkadalb

We can definitely find a way to indicate that. I also believe there is a small bug in the deletion of Finished jobs.

What do you think is the best way for the operator to identify that a job is a batch job and that checkpointing is disabled? Also, if you have ideas, feel free to submit a PR. @mwylde and I would be happy to review.

anandswaminathan avatar Nov 22 '19 00:11 anandswaminathan

https://github.com/lyft/flinkk8soperator/issues/138

tweise avatar Dec 07 '19 01:12 tweise

Is this still an issue? We are using the operator with a Beam streaming job that does not have checkpointing enabled, and we also recently added the option to skip the savepoint during upgrade: https://github.com/lyft/flinkk8soperator/pull/184

tweise avatar Mar 21 '20 03:03 tweise