flink-on-k8s-operator

Getting context deadline exceeded error on eks cluster

Open vinaykw opened this issue 4 years ago • 2 comments

I am trying to deploy Flink on an AWS EKS cluster. The cluster does not have a dedicated master node, as it is an Amazon-managed cluster. I have successfully deployed the flink-operator chart using Helm. Next I tried deploying the flink-session-cluster chart, and I am getting the error below:

helm install flink-session flink/flink-session-cluster/ -f version.yaml --debug
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /home/centos/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /home/centos/.kube/config
install.go:172: [debug] Original chart version: ""
install.go:189: [debug] CHART PATH: /opt/eva/helm/charts/flink/flink-session-cluster

client.go:122: [debug] creating 5 resource(s)
Error: Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post https://flink-operator-webhook-service.default.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=30s: context deadline exceeded
helm.go:81: [debug] Internal error occurred: failed calling webhook "mflinkcluster.flinkoperator.k8s.io": Post https://flink-operator-webhook-service.default.svc:443/mutate-flinkoperator-k8s-io-v1beta1-flinkcluster?timeout=30s: context deadline exceeded
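
For anyone hitting the same timeout, the failing webhook service can be inspected with standard kubectl commands. The service name and the default namespace below are taken from the error message; the pod label selector is an assumption and may need adjusting for your install:

# confirm the webhook service exists and has ready endpoints
kubectl get svc flink-operator-webhook-service -n default
kubectl get endpoints flink-operator-webhook-service -n default
# label selector is an assumption; adjust to match your operator deployment
kubectl get pods -n default -l app=flink-operator

If the endpoints list is empty, the operator pod is not ready; if an endpoint exists but the call still times out, the problem is likely network reachability between the API server and the pod, as discussed below.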

One observation: there is no communication (ping does not work) between the node (the AWS launch pad machine) that I am using to deploy the session cluster chart and the webhook service "flink-operator-webhook-service.default.svc".

I have successfully deployed the flink operator and the flink cluster on a non-AWS k8s cluster, where such communication is present.

Can someone help me figure out what the issue is?

vinaykw avatar Feb 02 '21 07:02 vinaykw

Hello @vinaykw, did you resolve this issue? (more than one year later ^^)

lliknart avatar Jul 28 '22 12:07 lliknart

Hello,

I ran into a similar issue with another operator... maybe this can help:

Take a look at the security groups (with terraform-aws-eks, they are named xxxx-cluster and xxxx-node). By default only ports 443 and 10250 are allowed; if your pod doesn't listen on one of the allowed ports, traffic to it will be blocked (even though the k8s API server calls the service URL on port 443, the service forwards to the pod's target port). To check which port is used, run kubectl get endpoints -n <operator namespace>. You can read more about endpoints here.
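
As a hypothetical illustration of that check (the namespace, endpoint address, and 9443 target port are all assumptions; controller-runtime-based webhooks commonly default to 9443):

kubectl get endpoints -n flink-operator-system
# hypothetical output:
# NAME                             ENDPOINTS            AGE
# flink-operator-webhook-service   192.168.12.34:9443   5d
# If the port shown here is not allowed by the node security group,
# the API server's webhook call times out exactly as in the error above.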

Some operators, like prometheus or cert-manager, don't have this issue because their validation webhooks listen on port 10250.
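
If the endpoint port turns out to be blocked, one possible fix is an extra ingress rule allowing the cluster security group to reach the node security group on that port. A minimal AWS CLI sketch, assuming a 9443 webhook port and using hypothetical security-group IDs:

# group IDs are hypothetical placeholders; look up the actual
# cluster and node security groups for your EKS cluster
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123nodeexample \
  --protocol tcp \
  --port 9443 \
  --source-group sg-0456clusterexample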

emmanuelCarre avatar Aug 05 '22 08:08 emmanuelCarre