flink-on-k8s-operator

flink-operator-controller-manager pods in CrashLoopBackOff status

Open yanghui16355 opened this issue 4 years ago • 6 comments

I've found that the flink operator manager pod occasionally goes into CrashLoopBackOff status in a couple of situations:

  1. When I update a running Flink pipeline, I get a timeout error from the operator, and the operator pod shows CrashLoopBackOff status. I have to manually delete the operator pod, after which a new one is created in Running status.
  2. The operator pod crashes and gets stuck every 6-12 hours, and pipelines deployed by the operator are impacted as well. Sometimes it recovers automatically, but sometimes I need to manually delete the operator pod to force it to be recreated.
  3. The Flink pipeline deployed by the operator redeploys every few hours, and sometimes the redeploy fails because the operator has crashed.

The status of the operator looks like the following:

    ITUS000040-MAC:kubectl huiyang$ kubectl get pods,svc -n flink-operator-system
    NAME                                                     READY   STATUS             RESTARTS   AGE
    pod/flink-operator-controller-manager-6886b99d68-2ktzd   1/2     CrashLoopBackOff   112        43h

As you can see, it has restarted many times since it was deployed, and one of its containers is crashing. I checked the operator pod's logs but didn't find any specific error. A couple of times I saw it in OOM status before the crash, so maybe it has a memory leak?

I have auto savepointing enabled; not sure whether that has an impact.

Can you suggest how to debug this issue?
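
One way to check the OOM suspicion is to look at why each container's previous run ended, and to read the logs from that run. A minimal sketch, assuming the namespace from the output above; the function names are made up for illustration, and the pod and container names are passed in as arguments:

```shell
# Print each container's name and the reason its previous run terminated
# (for example OOMKilled). The namespace defaults to the one used in this issue.
last_termination_reasons() {
  pod="$1"
  ns="${2:-flink-operator-system}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Show the logs from the previous (crashed) run of one container in the pod.
previous_logs() {
  pod="$1"
  container="$2"
  ns="${3:-flink-operator-system}"
  kubectl logs "$pod" -n "$ns" -c "$container" --previous
}
```

For example, `last_termination_reasons flink-operator-controller-manager-6886b99d68-2ktzd` would list the containers of the pod shown above.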

Thanks,

Hui

yanghui16355, Nov 30 '20 23:11

@functicons

yanghui16355, Dec 01 '20 22:12

I fixed it by increasing the resources for the operator manager pod. The initially allocated memory was only 20Mi, which looks too low for it to be stable. FYI @functicons

yanghui16355, Dec 04 '20 23:12

@yanghui16355 How did you increase it? I don't see the operator manager pod's own resources exposed in the CRD.

I only see that the job and task managers allow resources to be specified.

ckdarby, Jan 08 '21 16:01

@ckdarby I changed the operator's source code and built the image myself.

yanghui16355, Jan 28 '21 19:01

Fixed it by increasing the memory requests and limits to 128Mi and 256Mi respectively: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/0310df76d6e2128cd5d2bc51fae4e842d370c463/helm-chart/flink-operator/templates/flink-operator.yaml#L345-L351
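
For an operator that is already deployed, the same values could probably also be applied without editing the chart template, using `kubectl set resources`. This is only a sketch and not what was done above: the function name is hypothetical, the Deployment name is inferred from the pod name earlier in this issue, the container name is not given in the thread and has to be looked up, and a later Helm upgrade would likely overwrite the change.

```shell
# Set the memory request and limit on one container of the operator Deployment.
# Assumptions: the Deployment is named flink-operator-controller-manager
# (inferred from the pod name), and the caller supplies the container name.
set_operator_memory() {
  container="$1"
  ns="${2:-flink-operator-system}"
  kubectl set resources deployment flink-operator-controller-manager \
    -n "$ns" -c "$container" \
    --requests=memory=128Mi --limits=memory=256Mi
}
```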

pashtet04, Nov 01 '21 12:11

@pashtet04 I believe those values aren't exposed as part of the chart, but I'm using helm's --post-renderer to pipe the rendered output into kustomize, where we're able to change the manifests.
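
A rough sketch of that approach, assuming a `kustomize` binary on the PATH. Helm pipes the rendered manifests to the post-renderer on stdin and reads the result from stdout. The function names, the `all.yaml` file name, and the container index 0 are assumptions for illustration; the Deployment name is inferred from the pod name earlier in the thread and may differ depending on the release.

```shell
# Write a kustomization that patches the operator Deployment's memory settings.
# Assumption: the operator container is at index 0; check the rendered manifest.
# Note: "add" on an existing key replaces the whole resources block.
write_kustomization() {
  dir="$1"
  cat > "$dir/kustomization.yaml" <<'EOF'
resources:
  - all.yaml
patches:
  - target:
      kind: Deployment
      name: flink-operator-controller-manager
    patch: |-
      - op: add
        path: /spec/template/spec/containers/0/resources
        value:
          requests:
            memory: 128Mi
          limits:
            memory: 256Mi
EOF
}

# Post-renderer body: save Helm's rendered output from stdin, then let
# kustomize print the patched manifests on stdout.
post_render() {
  dir="$(mktemp -d)"
  cat > "$dir/all.yaml"
  write_kustomization "$dir"
  kustomize build "$dir"
}
```

Saved in an executable script that ends by calling `post_render`, this could then be passed to Helm via `--post-renderer` with the path to that script.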

ckdarby, Nov 01 '21 12:11