flinkk8soperator

Job Manager fails to start after container re-creation

Open Lockdain opened this issue 4 years ago • 0 comments

Hi, we use Lightbend Cloudflow, which internally uses the Lyft operator to manage Apache Flink based pipelines. A problem occurs when a job manager fails for any reason while the task managers remain unharmed.

Here is the quote from the Cloudflow Gitter:

https://gitter.im/lightbend/cloudflow?at=5f6b398f1c5b0d210ac736c0

When I recreate the jobmanager container of a streamlet in my pipeline (or it crashes for some reason), after starting, I see errors like this in the JobManager's logs:

2020-09-22T08:51:55.311Z ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [{}] - Exception occurred in REST handler: Job ed952687752d2a5b2c60d843d7e5605f8 not found
2020-09-22T08:52:25.485Z ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [{}] - Exception occurred in REST handler: Job ed952687752d2a5b2c60d843d7e5605f8 not found
2020-09-22T08:52:55.673Z ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [{}] - Exception occurred in REST handler: Job ed952687752d2a5b2c60d843d7e5605f8 not found

....
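
The errors above can be cross-checked against the JobManager's own view of its jobs. The following is only a minimal sketch, assuming the JobManager REST endpoint is reachable and that `GET /jobs` returns a JSON object with a `jobs` array of `{"id", "status"}` entries. The base URL in the usage note is a placeholder, and the helper names are hypothetical, not part of Cloudflow or the operator.

```python
import json
import urllib.request


def known_job_ids(jobs_payload: str) -> set:
    """Extract job IDs from the JSON body of the JobManager's GET /jobs response."""
    data = json.loads(jobs_payload)
    return {job["id"] for job in data.get("jobs", [])}


def fetch_jobs_payload(base_url: str, timeout: float = 5.0) -> str:
    """Fetch the raw GET /jobs response body from a JobManager REST endpoint."""
    with urllib.request.urlopen(f"{base_url}/jobs", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def job_is_known(base_url: str, job_id: str) -> bool:
    """Return True if the JobManager currently lists the given job ID."""
    return job_id in known_job_ids(fetch_jobs_payload(base_url))
```

For example, `job_is_known("http://localhost:8081", "<job id from the log>")` returning `False` after the restart would be consistent with the "not found" errors: the new JobManager process does not list the job that is still being polled for.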

The directory with checkpoints exists and contains binary data and metadata (http://joxi.ru/LmGPZgeTlvj9G2), but the jobmanager doesn't find it. The streamlet has the following settings: http://joxi.ru/eAOPkEYTkdegbr

In a prod environment we can't just redeploy the pipeline, because the data is important.

Lockdain, Oct 21 '20 21:10