flink-on-k8s-operator

Question: what is restartPolicy needed for?

Open karpoftea opened this issue 3 years ago • 2 comments

Hi, I've read the CRD docs but can't work out what "the job fails" means there, so I can't work out what restartPolicy is for. Could you explain why the built-in checkpointing mechanism plus HA is not enough to recover from a failure, so that restartPolicy=FromSavepointOnFailure is needed? If this property covers a completely different case, could you illustrate it with an example? Thanks!

karpoftea avatar Jan 15 '22 11:01 karpoftea

Hi Ilia, I just want to share my thoughts. I believe the two work at different levels:

  1. Checkpointing + HA (for a standalone cluster) is managed by Flink itself.
  2. restartPolicy is managed by this flink-on-k8s-operator (i.e. at the k8s level); the code can be found here.

AFAIK, option 1 should be enough if it is configured correctly, for example with 2 JMs and a ZooKeeper service. Option 2 is a good attempt to make use of k8s's capabilities. Judging by the git history, it may have been implemented quite early, when Flink's HA was not as mature.
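To make the two levels concrete, here is a toy Python sketch of the operator-level decision only. It is a simplified model based on my reading of the restartPolicy description in the CRD docs, not the operator's actual code; the function name, the return strings, and the example savepoint path are all hypothetical.

```python
from typing import Optional


def operator_decision(job_state: str, restart_policy: str,
                      savepoint: Optional[str]) -> str:
    """Toy model of the operator-level choice (hypothetical, not real operator code)."""
    if job_state != "Failed":
        # Failures that Flink recovers from by itself (checkpoints + HA)
        # never surface as a terminally failed job, so nothing happens here.
        return "no-op"
    if restart_policy == "FromSavepointOnFailure" and savepoint:
        # Assumed behaviour: resubmit the job from the recorded savepoint.
        return f"resubmit from {savepoint}"
    # restartPolicy=Never, or no savepoint recorded: the job stays failed.
    return "leave failed"
```

In this model, a JM pod that is deleted and recreated under HA never reaches the `Failed` branch, so restartPolicy only matters once Flink itself has given up on the job.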

Besides, it is worth mentioning that the Flink community has also done some work on k8s HA, like this. And since 1.12, Flink even supports native k8s HA. I am also interested in whether this operator can support such usage.

bgeng777 avatar Jan 17 '22 11:01 bgeng777

Thanks for sharing! I'm running Flink 1.14 with this operator, in per-job mode with 1 JM and k8s HA. I deleted the JM pod, k8s created a new one, and the job continued from where it had stopped. That led me to ask in which cases restartPolicy is actually useful. Maybe you are right and it applies to older versions of Flink, but it would be great to know for sure.

karpoftea avatar Jan 18 '22 13:01 karpoftea