cloudflow

Restore Apache Flink job state from a given savepoint

Open Lockdain opened this issue 4 years ago • 0 comments

Is your feature request related to a problem? Please describe. Apache Flink has a great feature called "savepoints" (see here). In a nutshell, it lets a user manually trigger any stateful job to save its state to persistent storage (PVC, HDFS or S3).

The Flink K8s operator (provided by Lyft and used in Cloudflow) uses savepoints in many ways, for example during job rescaling (see Lyft's Flink on K8s state machine here).

A savepoint can also be triggered from the console. @MaxSbk has mentioned this case previously. For long-running stateful jobs, savepointing is a way to create historical snapshots during the job lifecycle. Imagine a user who wants to persist the state manually each time a new feature is released. This is useful because, if a bug is detected in production, it becomes possible to fix the faulty code, deploy it to the Flink cluster, and restore the job state from the savepoint made just before the faulty feature was released.
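As a rough illustration of triggering a savepoint without the console, the sketch below builds (but does not send) a request to Flink's REST endpoint `POST /jobs/<job-id>/savepoints`. The REST address, job ID, and bucket path are placeholder assumptions, and this is not something Cloudflow exposes today.

```python
import json
from urllib import request


def build_savepoint_request(rest_url, job_id, target_directory, cancel_job=False):
    """Build a POST request for Flink's savepoint trigger endpoint.

    The request is only constructed here; nothing is sent over the network.
    """
    url = f"{rest_url.rstrip('/')}/jobs/{job_id}/savepoints"
    payload = {
        "target-directory": target_directory,  # e.g. an HDFS or S3 URI
        "cancel-job": cancel_job,              # keep the job running by default
    }
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # All three values below are hypothetical placeholders.
    req = build_savepoint_request(
        "http://localhost:8081", "<job-id>", "s3://my-bucket/savepoints"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # Sending it would be: request.urlopen(req)
```

The trigger is asynchronous: Flink answers with a request ID, and the final savepoint path has to be polled for separately.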

We also believe that a PVC is not a good candidate for long-term persistent storage. HDFS or S3-compatible storage looks far more suitable and convenient.

So, in a nutshell, we request a feature that allows users to manually trigger recovery of a job from a savepoint.

Is your feature request related to a specific runtime of cloudflow or applicable to all runtimes? The feature relates to Apache Flink streamlets.

Describe the solution you'd like A great solution would be to support what Flink already provides, namely the ability to restore job state from a savepoint on any of the supported storage types.
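For reference, plain Flink restores from a savepoint through the `-s` option of `flink run` (with `-n` to allow non-restored state). The sketch below only assembles that command line. The helper name and the paths are hypothetical, and how Cloudflow would surface such an option is an open question.

```python
import shlex


def build_restore_command(savepoint_path, jar_path, allow_non_restored_state=False):
    """Assemble a `flink run` command that resumes a job from a savepoint."""
    cmd = ["flink", "run", "-s", savepoint_path]
    if allow_non_restored_state:
        # Skip savepoint state that no longer maps to an operator in the new job.
        cmd.append("-n")
    cmd.append(jar_path)
    return cmd


if __name__ == "__main__":
    # Both paths are hypothetical placeholders.
    cmd = build_restore_command("s3://my-bucket/savepoints/<savepoint-dir>", "my-job.jar")
    print(" ".join(shlex.quote(part) for part in cmd))
```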

Describe alternatives you've considered We are not aware of any alternatives.

Additional context No additional context provided.

Lockdain, Oct 29 '20 16:10