
Unable to create endpoint: Cilium API client timeout exceeded

FranAguiar opened this issue 1 year ago • 25 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

What happened?

Since a few weeks ago, some pods in my GKE cluster have been getting stuck in the ContainerCreating state. When I run kubectl describe pod, I get this error:

 Warning  FailedCreatePodSandBox  4m45s (x138 over 4h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6eec890c6d2dbfefc3fa6ab1bf8db4f81ccb0c2f53ad757fb8f573a2bf9eca68": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded

And in the logs of the cilium-agent container I found this:

{"containerID":"","datapathPolicyRevision":0,"desiredPolicyRevision":0,"endpointID":2588,"error":"timeout while waiting for initial endpoint generation to complete: context canceled","ipv4":"","ipv6":"","k8sPodName":"/","level":"warning","msg":"Creation of endpoint failed","subsys":"daemon"}
{"containerID":"","datapathPolicyRevision":0,"desiredPolicyRevision":0,"endpointID":795,"error":"unable to resolve identity: failed to assign a global identity for lables: k8s:app.kubernetes.io/component=prometheus,k8s:app.kubernetes.io/instance=monitoring-kube-prometheus-prometheus,k8s:app.kubernetes.io/managed-by=prometheus-operator,k8s:app.kubernetes.io/name=prometheus,k8s:app.kubernetes.io/version=2.48.1,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=monitoring,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=monitoring-kube-prometheus-prometheus,k8s:io.kubernetes.pod.namespace=monitoring,k8s:operator.prometheus.io/name=monitoring-kube-prometheus-prometheus,k8s:operator.prometheus.io/shard=0,k8s:prometheus=monitoring-kube-prometheus-prometheus","identityLabels":{"app.kubernetes.io/component":{"key":"app.kubernetes.io/component","value":"prometheus","source":"k8s"},"app.kubernetes.io/instance":{"key":"app.kubernetes.io/instance","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"app.kubernetes.io/managed-by":{"key":"app.kubernetes.io/managed-by","value":"prometheus-operator","source":"k8s"},"app.kubernetes.io/name":{"key":"app.kubernetes.io/name","value":"prometheus","source":"k8s"},"app.kubernetes.io/version":{"key":"app.kubernetes.io/version","value":"2.48.1","source":"k8s"},"io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name":{"key":"io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name","value":"monitoring","source":"k8s"},"io.cilium.k8s.policy.cluster":{"key":"io.cilium.k8s.policy.cluster","value":"default","source":"k8s"},"io.cilium.k8s.policy.serviceaccount":{"key":"io.cilium.k8s.policy.serviceaccount","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"io.kubernetes.pod.namespace":{"key":"io.kubernetes.pod.namespace","value":"monitoring","source":"k8s"},"operator.prometheus.io/name":{"key":"operator.prometheus.io/name","value":"monitoring-kube-prometheus-prometheus","s
ource":"k8s"},"operator.prometheus.io/shard":{"key":"operator.prometheus.io/shard","value":"0","source":"k8s"},"prometheus":{"key":"prometheus","value":"monitoring-kube-prometheus-prometheus","source":"k8s"}},"ipv4":"","ipv6":"","k8sPodName":"/","level":"warning","msg":"Error changing endpoint identity","subsys":"endpoint"}

If I manually delete the pod, it starts without any issue.
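That manual workaround can be sketched in a few lines. This is a hypothetical helper, not part of Cilium or kubectl; the field names are illustrative, and in practice the data would come from kubectl get pods -o json plus each pod's events:

```python
# Sketch: select pods stuck Pending on the sandbox-creation timeout so they
# can be deleted and rescheduled. Pure selection logic only; fetching real
# pod/event data from the cluster is left to the caller.
from datetime import datetime, timedelta, timezone

SANDBOX_ERROR = "Cilium API client timeout exceeded"

def stuck_pods(pods, now=None, min_age=timedelta(minutes=5)):
    """Return names of pods Pending longer than `min_age` whose last
    event mentions the Cilium endpoint-creation timeout."""
    now = now or datetime.now(timezone.utc)
    return [
        p["name"]
        for p in pods
        if p["phase"] == "Pending"
        and now - p["created"] >= min_age
        and SANDBOX_ERROR in p.get("last_event", "")
    ]
```

Each returned pod would then be deleted (kubectl delete pod <name>) so the scheduler recreates it, which is exactly the manual fix described above.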

Cilium Version

Client: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64
Daemon: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64

KVStore:                Ok   Disabled
Kubernetes:             Ok   1.29 (v1.29.4-gke.1043000) [linux/amd64]
Kubernetes APIs:        ["cilium/v2::CiliumLocalRedirectPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumEndpointSlice", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement:   Strict   [eth0 10.3.2.219 (Direct Routing)]
Host firewall:          Disabled
CNI Chaining:           generic-veth
CNI Config file:        CNI configuration file management disabled
Cilium:                 Ok   1.13.12 (v1.13.12-38d04fa903)
NodeMonitor:            Listening for events on 8 CPUs with 64x4096 of shared memory
IPAM:                   IPv4: 0/62 allocated from 10.3.48.192/26,
IPv6 BIG TCP:           Disabled
BandwidthManager:       EDT with BPF [CUBIC] [eth0]
Host Routing:           Legacy
Masquerading:           Disabled
Controller Status:      77/77 healthy
Proxy Status:           OK, ip 169.254.4.6, 0 redirects active on ports 10000-20000
Global Identity Range:  min 256, max 65535
Hubble:                 Ok   Current/Max Flows: 63/63 (100.00%), Flows/s: 27.13   Metrics: Ok
Encryption:             Disabled
Cluster health:         Probe disabled

Kernel Version

6.1.75+ #1 SMP PREEMPT_DYNAMIC Sat Mar 30 14:38:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.4-gke.1043000

Anything else?

The cluster is in GKE with Dataplane V2 enabled. I don't have control over the Cilium agents; they're managed by Google. This is only happening on my cluster in the rapid channel, so I'm not sure if it's a bug or some incompatibility between the Cilium version and the affected services.

I can't replicate the error and have no clue about the root cause. Any help would be appreciated, thanks in advance!

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

FranAguiar avatar May 07 '24 14:05 FranAguiar

Hey, I got literally the same issue. I also opened a support case on GCP to see if it helps; I'll update here if I have any news!

Sh4d1 avatar May 10 '24 12:05 Sh4d1

Oh, that's great. Please keep me posted. I also noticed that this only happens to stuff deployed with Helm (maybe it's just a coincidence, but who knows). Is that the case for you too?

FranAguiar avatar May 10 '24 12:05 FranAguiar

Will do! Nope, it's deployed with Pulumi for me (through native k8s objects). Do you have a lot of pod changes? (Like deploys, HPAs that are often scaling, ...)

Sh4d1 avatar May 10 '24 12:05 Sh4d1

No, not in particular. Especially for RabbitMQ, the service that is suffering the most from this issue.

FranAguiar avatar May 10 '24 12:05 FranAguiar

We encountered a similar issue after upgrading our GKE cluster from 1.29.1-gke.1589000 to 1.29.4-gke.1447000. In our case, downgrading the cluster helped, and we were able to fix the issue. @Sh4d1, any luck with GCP support?

r0bj avatar May 10 '24 16:05 r0bj

Interesting! I'm on 1.29.3. And no luck with the support yet (I've linked them the issue as well).

Sh4d1 avatar May 10 '24 16:05 Sh4d1

In my case, this happens after a node upgrade, since we use the rapid channel. But not always; sometimes the pod is recreated successfully.

Also, for now this happens only to StatefulSets.

harispraba avatar May 13 '24 08:05 harispraba

@harispraba Now that you mention it, the services that are experiencing this error in my cluster are also StatefulSets: RabbitMQ, thanos-storage, and Prometheus.

@Sh4d1 Hi, any news from GCP support? Is the ticket public or is it private only with you?

FranAguiar avatar May 13 '24 09:05 FranAguiar

Hmm, I think it's only happened to StatefulSets on my end as well!

@FranAguiar it's private, and no luck yet (they asked for the full Cilium logs, but I don't have them anymore, so I'm waiting for the next occurrence to catch them).

Sh4d1 avatar May 13 '24 09:05 Sh4d1

Assigning to @christarazi, who recently worked on endpoint regeneration and statefulset updates.

squeed avatar May 15 '24 11:05 squeed

Recently faced this as well. In my case, it turns out that Cilium fails to create the endpoint because a label value exceeds the 63-character limit. Looking at the Cilium agent logs on the node, we see:

CiliumIdentity.cilium.io \"47595\" is invalid: metadata.labels: Invalid value: \"xxxxxxx\": must be no more than 63 characters","key":{"LabelArray":

Renaming the chart to a shorter name fixed it.

voltagebots avatar May 15 '24 18:05 voltagebots
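For reference, the limit voltagebots hit is Kubernetes' general rule that label values may be at most 63 characters and must match a restricted character set. A minimal pre-flight check for a candidate chart/release name might look like this (my own sketch, not Cilium code):

```python
import re

# Kubernetes label values: at most 63 characters; the empty string is
# allowed; otherwise must begin and end with an alphanumeric character,
# with '-', '_', and '.' permitted in between.
LABEL_VALUE_RE = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?$")

def valid_label_value(value: str) -> bool:
    """True if `value` is a legal Kubernetes label value."""
    if value == "":
        return True
    return len(value) <= 63 and LABEL_VALUE_RE.match(value) is not None
```

Running chart and release names through a check like this before deploying would catch the "must be no more than 63 characters" rejection before Cilium ever tries to assign an identity.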

Can you try v1.15.5? It contains https://github.com/cilium/cilium/pull/31605 which might resolve this problem.

christarazi avatar May 15 '24 19:05 christarazi

Hello, I just updated my GKE cluster to the latest version:

Server Version: v1.30.0-gke.1457000

And it comes with a new Cilium version:

Client: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
Daemon: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64

I hope that solves the issue.

FranAguiar avatar May 16 '24 09:05 FranAguiar

I have the issue described in the OP on 1.15.5 on my homelab bare-metal Talos cluster, but it seems to only happen on one node. All pods on that node fail to schedule, and the only useful logs are the exact ones reported in the OP.

Killing the Cilium pod on that node fixes it for a while until it happens again; sometimes it happens right from the start of that Cilium pod's lifetime, and that pod needs to be killed too. cilium-dbg status --verbose shows that this node's endpoints are unreachable, while the other 2 nodes are fine.

Versions: Talos: 1.6.4 Kubernetes: 1.29.2 Cilium: 1.15.5

Cilium Helm values (these 2 files get merged by the Flux Helm controller, and hr.yaml will override config/biohazard/helm-values.yaml if there are conflicting values): https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/hr.yaml https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/config/biohazard/helm-values.yaml

JJGadgets avatar May 16 '24 15:05 JJGadgets

@JJGadgets Are the workloads statefulsets? If so, please provide the Cilium logs when that occurs.

christarazi avatar May 16 '24 17:05 christarazi

@christarazi Nope, everything from deployments, daemonsets, jobs, to KubeVirt VMs (which is a custom controller AFAIK).

JJGadgets avatar May 16 '24 17:05 JJGadgets

@JJGadgets Ok, that sounds like a separate issue from this thread. It seems that the initial report is for statefulsets. I would encourage you to file a new issue with a sysdump of when the issue occurred.

christarazi avatar May 16 '24 17:05 christarazi

@christarazi will create the separate issue when I encounter it again; for now the node and its Cilium pod are happy.

JJGadgets avatar May 16 '24 17:05 JJGadgets

> Hello, I just updated my GKE cluster to the latest version:
>
> Server Version: v1.30.0-gke.1457000
>
> And it comes with a new Cilium version:
>
> Client: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
> Daemon: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
>
> I hope that solves the issue.

It just happened again; this is the log from the cilium container: Explore-logs-2024-05-17 09_20_52.txt

@Sh4d1 Share this log with google support if you want

FranAguiar avatar May 17 '24 08:05 FranAguiar

Any version below 1.15.5 will not have the statefulset fix, so please try upgrading to that.

christarazi avatar May 17 '24 17:05 christarazi

In my case, not only were StatefulSets affected, but Deployments were as well.

r0bj avatar May 17 '24 19:05 r0bj

@r0bj That sounds like a separate issue as mentioned in https://github.com/cilium/cilium/issues/32399#issuecomment-2115851239

christarazi avatar May 17 '24 19:05 christarazi

Hey! I have a similar problem with some of my pods in an AWS cluster. This is the error message:

kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f40f8a986e4c53a727336b51fb698fee152f06b3357da0079a5ed204ed7d22a0": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded

The Cilium version is 1.16.0-dev. Has anyone encountered the same issue with that version as well?

liquidiert avatar May 26 '24 18:05 liquidiert

I'm running a k8s cluster on GKE v1.28.9-gke.1000000 and am seeing the same problem with pods that are part of a Deployment.


I am using vCluster to spin up virtual k8s cluster on top of my GKE cluster, but that may not be relevant.

The kernel version on the GKE nodes is:

Linux gke-shared-review-clu-primary-8f67e62-89d2caa4-6nuk 5.15.0-1054-gke #59-Ubuntu SMP Tue Mar 12 22:55:37 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

The output of cilium-cni --version is:

Cilium CNI plugin 1.13.10 3461e7e708 2024-03-18T18:44:09+00:00 go version go1.21.8 linux/amd64
CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0

The kubelet logs don't show anything other than what is displayed in the pod's events table:

E0610 14:29:39.556435    2889 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"istiod-598d555bc6-hhmbf-x-istio-system-x-vcluster-57d3f70b_export-loki-logs-otel-06-10-01-vcluster(b7693025-66fc-4d3b-976b-1ebf6d63599c)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"istiod-598d555bc6-hhmbf-x-istio-system-x-vcluster-57d3f70b_export-loki-logs-otel-06-10-01-vcluster(b7693025-66fc-4d3b-976b-1ebf6d63599c)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"c7d21e4e06b420dad5fc6c0f730e7cba1560a18036c385faf94f3e26361d925f\\\": plugin type=\\\"cilium-cni\\\" failed (add): unable to create endpoint: Cilium API client timeout exceeded\"" pod="export-loki-logs-otel-06-10-01-vcluster/istiod-598d555bc6-hhmbf-x-istio-system-x-vcluster-57d3f70b" podUID="b7693025-66fc-4d3b-976b-1ebf6d63599c"

The cilium-agent logs show a bunch of these errors.

alextricity25 avatar Jun 10 '24 14:06 alextricity25

> Hey! I have a similar problem with some of my pods in an AWS cluster. This is the error message:
>
> kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f40f8a986e4c53a727336b51fb698fee152f06b3357da0079a5ed204ed7d22a0": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
>
> The Cilium version is 1.16.0-dev. Has anyone encountered the same issue with that version as well?

Yes on 1.15.6.

We had to add an alert for pods being in the init phase for longer than X, and the only way to fix this is to manually delete the pod so it gets rescheduled on a different node (and cordon the faulty node as well).

michalschott avatar Jun 28 '24 09:06 michalschott
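For what it's worth, an alert like the one michalschott describes could look roughly like this as a Prometheus rule. This is only a sketch: it assumes kube-state-metrics is installed, and the 15m threshold, alert name, and severity label are all placeholders to adjust.

```yaml
groups:
  - name: pod-scheduling
    rules:
      - alert: PodStuckPending
        # kube_pod_status_phase comes from kube-state-metrics; a pod that
        # stays Pending this long is a likely victim of this bug.
        expr: sum by (namespace, pod) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: >-
            Pod {{ $labels.namespace }}/{{ $labels.pod }} has been Pending
            for more than 15 minutes; check for FailedCreatePodSandBox events.
```

Paired with a runbook step to cordon the node and delete the pod, this automates at least the detection half of the workaround.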

Thank you for all the comments. As Chris pointed out, if you are running a Cilium version below 1.15.5, it will not have the StatefulSet fix, so please try upgrading to that.

For those who are running v1.15.5 or above, please provide reproduction steps so that we can track down the issue.

Thank you

aanm avatar Jul 12 '24 14:07 aanm

Seeing this error on EKS 1.30 in AWS, Cilium 1.15.5. Not sure how to reproduce yet; I have 8 clusters and it's happening on several of them, in one case with the ca-injector pod of the cert-manager service.

Warning FailedCreatePodSandBox 63s (x14 over 23m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "fcbcd10d03a65468af8d17915321cd806aeabcb1645b45258e7d044d608b5555": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded

robpearce-flux avatar Sep 01 '24 23:09 robpearce-flux