flink-on-k8s-operator
Flink Operator installation is failing
Hi Team, the Flink operator installation on IBM Cloud is failing with a CrashLoopBackOff error. Details are below:
$ k get all -n flink-operator-system
NAME                                                     READY   STATUS             RESTARTS   AGE
pod/cert-job-wtvwr                                       0/1     Completed          0          11m
pod/flink-operator-controller-manager-848b69b444-86bf2   1/2     CrashLoopBackOff   5          4m29s

NAME                                                        TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/flink-operator-controller-manager-metrics-service   ClusterIP   172.21.152.91    <none>        8443/TCP   26m
service/flink-operator-webhook-service                      ClusterIP   172.21.210.155   <none>        443/TCP    26m

NAME                                                READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/flink-operator-controller-manager   0/1     1            0           26m

NAME                                                           DESIRED   CURRENT   READY   AGE
replicaset.apps/flink-operator-controller-manager-848b69b444   1         1         0       26m

NAME                 COMPLETIONS   DURATION   AGE
job.batch/cert-job   1/1           3s         11m
$ k logs flink-operator-controller-manager-848b69b444-t5k2n -n flink-operator-system
Error from server (NotFound): pods "flink-operator-controller-manager-848b69b444-t5k2n" not found
[sumit@sumit flink]$ k logs flink-operator-controller-manager-848b69b444-86bf2 -n flink-operator-system --all-containers
I0729 12:43:04.599359 1 main.go:209] Generating self signed cert as no cert is provided
I0729 12:43:04.820946 1 main.go:242] Listening securely on 0.0.0.0:8443
W0729 12:49:19.067631 1 client_config.go:552] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2021-07-29T12:49:20.074Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2021-07-29T12:49:20.074Z INFO controller-runtime.builder Registering a mutating webhook {"GVK": "flinkoperator.k8s.io/v1beta1, Kind=FlinkCluster", "path": "/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster"}
2021-07-29T12:49:20.074Z INFO controller-runtime.webhook registering webhook {"path": "/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster"}
2021-07-29T12:49:20.074Z INFO controller-runtime.builder Registering a validating webhook {"GVK": "flinkoperator.k8s.io/v1beta1, Kind=FlinkCluster", "path": "/validate-flinkoperator-k8s-io-v1beta1-flinkcluster"}
2021-07-29T12:49:20.074Z INFO controller-runtime.webhook registering webhook {"path": "/validate-flinkoperator-k8s-io-v1beta1-flinkcluster"}
2021-07-29T12:49:20.074Z INFO setup Starting manager
2021-07-29T12:49:20.106Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
2021-07-29T12:49:20.106Z INFO controller-runtime.controller Starting EventSource {"controller": "flinkcluster", "source": "kind source: /, Kind="}
2021-07-29T12:49:20.106Z INFO controller-runtime.webhook.webhooks starting webhook server
2021-07-29T12:49:20.127Z INFO controller-runtime.certwatcher Updated current TLS certificate
2021-07-29T12:49:20.127Z INFO controller-runtime.webhook serving webhook server {"host": "", "port": 443}
2021-07-29T12:49:20.135Z INFO controller-runtime.certwatcher Starting certificate watcher
2021-07-29T12:49:20.307Z INFO controller-runtime.controller Starting EventSource {"controller": "flinkcluster", "source": "kind source: /, Kind="}
2021-07-29T12:49:21.208Z INFO controller-runtime.controller Starting EventSource {"controller": "flinkcluster", "source": "kind source: /, Kind="}
2021-07-29T12:49:21.309Z INFO controller-runtime.controller Starting EventSource {"controller": "flinkcluster", "source": "kind source: /, Kind="}
$ k describe pod flink-operator-controller-manager-848b69b444-86bf2 -n flink-operator-system
Name: flink-operator-controller-manager-848b69b444-86bf2
Namespace: flink-operator-system
Priority: 0
PriorityClassName: <none>
Node: 10.148.145.115/10.148.145.115
Start Time: Thu, 29 Jul 2021 18:13:02 +0530
Labels: app=flink-operator
control-plane=controller-manager
pod-template-hash=848b69b444
Annotations: cni.projectcalico.org/podIP: 172.30.149.218/32
cni.projectcalico.org/podIPs: 172.30.149.218/32
kubernetes.io/psp: ibm-privileged-psp
Status: Running
IP: 172.30.149.218
Controlled By: ReplicaSet/flink-operator-controller-manager-848b69b444
Containers:
kube-rbac-proxy:
Container ID: containerd://85ff7f6cf568dc376a0b248be8022e81bd3da48c83ea4461f050694c4a22acec
Image: gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0
Image ID: gcr.io/kubebuilder/kube-rbac-proxy@sha256:297896d96b827bbcb1abd696da1b2d81cab88359ac34cce0e8281f266b4e08de
Port: 8443/TCP
Host Port: 0/TCP
Args:
--secure-listen-address=0.0.0.0:8443
--upstream=http://127.0.0.1:8080/
--logtostderr=true
--v=10
State: Running
Started: Thu, 29 Jul 2021 18:13:04 +0530
Ready: True
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5c4cn (ro)
flink-operator:
Container ID: containerd://2880f9d0de68228476738c8ce01d6f82c36149709e60eb33b014b1aaad19a073
Image: gcr.io/flink-operator/flink-operator:latest
Image ID: gcr.io/flink-operator/flink-operator@sha256:af78aef1e6ca3e082f5d03b53db09fe0d31e21424ac87c9f0204b3739001d3cc
Port: 443/TCP
Host Port: 0/TCP
Command:
/flink-operator
Args:
--metrics-addr=127.0.0.1:8080
--watch-namespace=
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Started: Thu, 29 Jul 2021 18:16:32 +0530
Finished: Thu, 29 Jul 2021 18:16:36 +0530
Ready: False
Restart Count: 5
Limits:
cpu: 100m
memory: 30Mi
Requests:
cpu: 100m
memory: 20Mi
Environment: <none>
Mounts:
/tmp/k8s-webhook-server/serving-certs from cert (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-5c4cn (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
cert:
Type: Secret (a volume populated by a Secret)
SecretName: webhook-server-cert
Optional: false
default-token-5c4cn:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-5c4cn
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 600s
node.kubernetes.io/unreachable:NoExecute for 600s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m8s default-scheduler Successfully assigned flink-operator-system/flink-operator-controller-manager-848b69b444-86bf2 to 10.148.145.115
Normal Pulled 4m6s kubelet, 10.148.145.115 Container image "gcr.io/kubebuilder/kube-rbac-proxy:v0.4.0" already present on machine
Normal Created 4m6s kubelet, 10.148.145.115 Created container kube-rbac-proxy
Normal Started 4m6s kubelet, 10.148.145.115 Started container kube-rbac-proxy
Normal Pulled 4m6s kubelet, 10.148.145.115 Successfully pulled image "gcr.io/flink-operator/flink-operator:latest" in 233.095305ms
Normal Pulled 4m kubelet, 10.148.145.115 Successfully pulled image "gcr.io/flink-operator/flink-operator:latest" in 326.358312ms
Normal Pulled 3m42s kubelet, 10.148.145.115 Successfully pulled image "gcr.io/flink-operator/flink-operator:latest" in 241.20523ms
Normal Created 3m9s (x4 over 4m6s) kubelet, 10.148.145.115 Created container flink-operator
Normal Started 3m9s (x4 over 4m5s) kubelet, 10.148.145.115 Started container flink-operator
Normal Pulling 3m9s (x4 over 4m6s) kubelet, 10.148.145.115 Pulling image "gcr.io/flink-operator/flink-operator:latest"
Normal Pulled 3m9s kubelet, 10.148.145.115 Successfully pulled image "gcr.io/flink-operator/flink-operator:latest" in 230.947675ms
Warning BackOff 2m40s (x6 over 3m53s) kubelet, 10.148.145.115 Back-off restarting failed container
@sumchak1 look at the flink-operator container in your describe output:
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: OOMKilled
The container was OOMKilled (exit code 137) under its 30Mi memory limit. Try raising the memory limits and requests: https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/blob/master/helm-chart/flink-operator/templates/flink-operator.yaml#L346-L351
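If you would rather not edit the chart template, the same change can probably be applied to the running Deployment with `kubectl set resources`. This is only a minimal sketch: the deployment, namespace, and container names are taken from the output above, but the `64Mi`/`128Mi` defaults are guesses rather than project recommendations, and `raise_operator_memory` and the `KUBECTL` override are hypothetical helpers introduced here for illustration.

```shell
# Sketch: raise the memory request/limit of the flink-operator container.
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command
# instead of running it against a cluster.
KUBECTL="${KUBECTL:-kubectl}"

raise_operator_memory() {
  # $1 = memory request, $2 = memory limit (both values are assumptions)
  local request="${1:-64Mi}" limit="${2:-128Mi}"
  "$KUBECTL" set resources deployment flink-operator-controller-manager \
    --namespace flink-operator-system \
    --containers=flink-operator \
    --requests="memory=${request}" \
    --limits="memory=${limit}"
}

# Usage: raise_operator_memory 64Mi 128Mi
```

After the new pod rolls out, `k describe pod` should show the updated Limits and, if the values are large enough, no further OOMKilled restarts. Note that a later reinstall or upgrade from the chart will likely reset the values, so changing them in the chart is the more durable fix.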