Unable to create endpoint: Cilium API client timeout exceeded
Is there an existing issue for this?
- [X] I have searched the existing issues
What happened?
For the past few weeks, some pods in my GKE cluster have been getting stuck in the ContainerCreating state. When I run `kubectl describe pod`, I get this error:
Warning FailedCreatePodSandBox 4m45s (x138 over 4h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "6eec890c6d2dbfefc3fa6ab1bf8db4f81ccb0c2f53ad757fb8f573a2bf9eca68": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
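For anyone triaging the same symptom, the relevant message can be pulled out of the event text with a plain grep. This is just a sketch run against a sample event line; `sandbox_errors` is a made-up helper name, not a kubectl or Cilium tool.

```shell
# Extract the endpoint-creation failure from `kubectl describe pod` output.
# sandbox_errors is a hypothetical helper written for this sketch.
sandbox_errors() {
  grep -o 'unable to create endpoint: .*'
}

# Sample event line, standing in for: kubectl describe pod <name> | sandbox_errors
echo 'Warning FailedCreatePodSandBox 4m45s kubelet: plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded' \
  | sandbox_errors
# -> unable to create endpoint: Cilium API client timeout exceeded
```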
And in the logs of the cilium-agent container I found this:
{"containerID":"","datapathPolicyRevision":0,"desiredPolicyRevision":0,"endpointID":2588,"error":"timeout while waiting for initial endpoint generation to complete: context canceled","ipv4":"","ipv6":"","k8sPodName":"/","level":"warning","msg":"Creation of endpoint failed","subsys":"daemon"}
{"containerID":"","datapathPolicyRevision":0,"desiredPolicyRevision":0,"endpointID":795,"error":"unable to resolve identity: failed to assign a global identity for lables: k8s:app.kubernetes.io/component=prometheus,k8s:app.kubernetes.io/instance=monitoring-kube-prometheus-prometheus,k8s:app.kubernetes.io/managed-by=prometheus-operator,k8s:app.kubernetes.io/name=prometheus,k8s:app.kubernetes.io/version=2.48.1,k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=monitoring,k8s:io.cilium.k8s.policy.cluster=default,k8s:io.cilium.k8s.policy.serviceaccount=monitoring-kube-prometheus-prometheus,k8s:io.kubernetes.pod.namespace=monitoring,k8s:operator.prometheus.io/name=monitoring-kube-prometheus-prometheus,k8s:operator.prometheus.io/shard=0,k8s:prometheus=monitoring-kube-prometheus-prometheus","identityLabels":{"app.kubernetes.io/component":{"key":"app.kubernetes.io/component","value":"prometheus","source":"k8s"},"app.kubernetes.io/instance":{"key":"app.kubernetes.io/instance","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"app.kubernetes.io/managed-by":{"key":"app.kubernetes.io/managed-by","value":"prometheus-operator","source":"k8s"},"app.kubernetes.io/name":{"key":"app.kubernetes.io/name","value":"prometheus","source":"k8s"},"app.kubernetes.io/version":{"key":"app.kubernetes.io/version","value":"2.48.1","source":"k8s"},"io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name":{"key":"io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name","value":"monitoring","source":"k8s"},"io.cilium.k8s.policy.cluster":{"key":"io.cilium.k8s.policy.cluster","value":"default","source":"k8s"},"io.cilium.k8s.policy.serviceaccount":{"key":"io.cilium.k8s.policy.serviceaccount","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"io.kubernetes.pod.namespace":{"key":"io.kubernetes.pod.namespace","value":"monitoring","source":"k8s"},"operator.prometheus.io/name":{"key":"operator.prometheus.io/name","value":"monitoring-kube-prometheus-prometheus","source":"k8s"},"operator.prometheus.io/shard":{"key":"operator.prometheus.io/shard","value":"0","source":"k8s"},"prometheus":{"key":"prometheus","value":"monitoring-kube-prometheus-prometheus","source":"k8s"}},"ipv4":"","ipv6":"","k8sPodName":"/","level":"warning","msg":"Error changing endpoint identity","subsys":"endpoint"}
If I manually delete the pod, it starts without any issue.
Cilium Version
Client: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64
Daemon: 1.13.12 38d04fa903 2024-04-05T00:06:43+00:00 go version go1.21.8 linux/amd64
KVStore: Ok Disabled
Kubernetes: Ok 1.29 (v1.29.4-gke.1043000) [linux/amd64]
Kubernetes APIs: ["cilium/v2::CiliumLocalRedirectPolicy", "cilium/v2::CiliumNode", "cilium/v2alpha1::CiliumEndpointSlice", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "discovery/v1::EndpointSlice", "networking.k8s.io/v1::NetworkPolicy"]
KubeProxyReplacement: Strict [eth0 10.3.2.219 (Direct Routing)]
Host firewall: Disabled
CNI Chaining: generic-veth
CNI Config file: CNI configuration file management disabled
Cilium: Ok 1.13.12 (v1.13.12-38d04fa903)
NodeMonitor: Listening for events on 8 CPUs with 64x4096 of shared memory
IPAM: IPv4: 0/62 allocated from 10.3.48.192/26,
IPv6 BIG TCP: Disabled
BandwidthManager: EDT with BPF [CUBIC] [eth0]
Host Routing: Legacy
Masquerading: Disabled
Controller Status: 77/77 healthy
Proxy Status: OK, ip 169.254.4.6, 0 redirects active on ports 10000-20000
Global Identity Range: min 256, max 65535
Hubble: Ok Current/Max Flows: 63/63 (100.00%), Flows/s: 27.13 Metrics: Ok
Encryption: Disabled
Cluster health: Probe disabled
Kernel Version
6.1.75+ #1 SMP PREEMPT_DYNAMIC Sat Mar 30 14:38:17 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
Client Version: v1.28.2 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: v1.29.4-gke.1043000
Anything else?
The cluster is in GKE with Dataplane V2 enabled. I don't have control over the Cilium agents; they're managed by Google. This is only happening on my cluster in the rapid channel, so I'm not sure if it's a bug or some incompatibility between the Cilium version and the affected services.
I can't replicate the error and have no clue about the root cause. Any help would be appreciated, thanks in advance!!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hey, I'm hitting exactly the same issue. I've also opened a support case with GCP to see if that helps; I'll update here if I have any news!
Oh, that's great. Please keep me posted. I also noticed that this only happens to stuff deployed with Helm (maybe it's just a coincidence, but who knows). Is that the case for you too?
Will do! Nope, it's deployed with Pulumi for me (through native k8s objects). Do you have a lot of pod churn? (Like deploys, HPAs that scale often, ...)
No, not in particular. Especially for RabbitMQ, the service that is suffering the most from this issue.
We encountered a similar issue after upgrading our GKE cluster from 1.29.1-gke.1589000 to 1.29.4-gke.1447000. In our case, downgrading the cluster fixed the issue. @Sh4d1, any luck with GCP support?
Interesting! I'm on 1.29.3. And no luck with the support yet (I've linked them the issue as well).
In my case, this happens after a node upgrade, since we use the rapid channel, but not always; sometimes the pod is recreated successfully.
Also, for now this happens only to statefulsets.
@harispraba Now that you mention it, the services that are experiencing this error in my cluster are also statefulsets. Rabbitmq, thanos-storage, and prometheus.
@Sh4d1 Hi, any news from GCP support? Is the ticket public or is it private only with you?
Hum, I think it's only happened to stateful sets on my end as well!
@FranAguiar it's private, and no luck yet (they asked for the full Cilium logs, but I don't have them anymore, so I'm waiting for the next occurrence to catch them).
Assigning to @christarazi, who recently worked on endpoint regeneration and statefulset updates.
I recently faced this as well. In my case, it turned out that Cilium failed to create the endpoint because a label exceeded the 63-character limit. Looking at the Cilium agent logs on the node, we see:
CiliumIdentity.cilium.io "47595" is invalid: metadata.labels: Invalid value: "xxxxxxx": must be no more than 63 characters
Renaming the chart to a shorter name fixed it.
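For context, Kubernetes rejects any label value longer than 63 characters, and Cilium propagates pod labels onto the CiliumIdentity object, which is how a long chart/release name surfaces as this error. A quick pre-check can be sketched like this; `check_label` is a hypothetical helper written for illustration, not an existing tool.

```shell
# Flag values that would exceed the Kubernetes 63-character label limit.
# check_label is a hypothetical helper, not part of Helm, kubectl, or Cilium.
check_label() {
  if [ "${#1}" -le 63 ]; then
    echo "ok: ${#1} chars"
  else
    echo "too long: ${#1} chars"
  fi
}

check_label "monitoring-kube-prometheus-prometheus"   # well under the limit
```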
Can you try v1.15.5? It contains https://github.com/cilium/cilium/pull/31605 which might resolve this problem.
Hello, I just updated my GKE cluster to the latest version
Server Version: v1.30.0-gke.1457000
And it comes with a new Cilium version
Client: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
Daemon: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
I hope that solves the issue.
I have the issue described in the OP on 1.15.5 on my homelab bare-metal Talos cluster, but it seems to only happen on one node. All pods on that node fail to schedule, and the only useful logs are exactly the ones reported in the OP.
Killing the Cilium pod on that node fixes it for a while until it happens again, and sometimes it happens right from the start of that Cilium pod's lifetime, so that pod needs to be killed too. cilium-dbg status --verbose shows that this node's endpoints are unreachable, while the other 2 nodes are fine.
Versions: Talos: 1.6.4 Kubernetes: 1.29.2 Cilium: 1.15.5
Cilium Helm values (these 2 files get merged by Flux Helm controller, and the hr.yaml will override the config/biohazard/helm-values.yaml if there are conflicting values): https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/hr.yaml https://github.com/JJGadgets/Biohazard/blob/7004140fc1be893e1e35dac1d43148af749eb8da/kube/deploy/core/_networking/cilium/app/config/biohazard/helm-values.yaml
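For reference, the per-node restart workaround described above can be done with a single label/field selector. This sketch only prints the command rather than running it, so nothing is deleted by accident; `build_restart_cmd` is a hypothetical helper, and the `kube-system` namespace and `k8s-app=cilium` label are the common defaults, which may differ per install.

```shell
# Print (rather than run) the command that restarts the Cilium agent on one
# node. build_restart_cmd is a hypothetical helper written for this sketch.
build_restart_cmd() {
  echo "kubectl -n kube-system delete pod -l k8s-app=cilium --field-selector spec.nodeName=$1"
}

build_restart_cmd worker-1   # then copy-paste the printed command deliberately
```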
@JJGadgets Are the workloads statefulsets? If so, please provide the Cilium logs when that occurs.
@christarazi Nope, everything from deployments, daemonsets, jobs, to KubeVirt VMs (which is a custom controller AFAIK).
@JJGadgets Ok, that sounds like a separate issue from this thread. It seems that the initial report is for statefulsets. I would encourage you to file a new issue with a sysdump of when the issue occurred.
@christarazi will create the separate issue when I encounter the issue again, for now the node and its Cilium pod is happy.
> Hello, I just update my GKE cluster to latest version
> Server Version: v1.30.0-gke.1457000
> And it comes with new cilium version
> Client: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
> Daemon: 1.14.7 47ecffbb57 2024-04-25T17:12:33-07:00 go version go1.21.8 linux/amd64
> I hope that solve the issue
It just happened again. This is the log from the cilium container: Explore-logs-2024-05-17 09_20_52.txt
@Sh4d1 Share this log with Google support if you want.
Any version below 1.15.5 will not have the statefulset fix, so please try upgrading to that.
In my case, not only were stateful sets affected, but deployments were as well.
@r0bj That sounds like a separate issue as mentioned in https://github.com/cilium/cilium/issues/32399#issuecomment-2115851239
Hey! I have a similar problem with some of my pods in an AWS cluster. This is the error message:
kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f40f8a986e4c53a727336b51fb698fee152f06b3357da0079a5ed204ed7d22a0": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
The Cilium version is 1.16.0-dev. Has anyone encountered the same issue with that version as well?
I'm running a k8s cluster on GKE v1.28.9-gke.1000000 and am seeing the same problem with pods that are part of a Deployment.
I am using vCluster to spin up virtual k8s cluster on top of my GKE cluster, but that may not be relevant.
The kernel version on the GKE nodes is:
Linux gke-shared-review-clu-primary-8f67e62-89d2caa4-6nuk 5.15.0-1054-gke #59-Ubuntu SMP Tue Mar 12 22:55:37 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
The output of cilium-cni --version is:
Cilium CNI plugin 1.13.10 3461e7e708 2024-03-18T18:44:09+00:00 go version go1.21.8 linux/amd64
CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0
The kubelet logs don't show anything other than what is displayed in the pod's events table:
E0610 14:29:39.556435 2889 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"istiod-598d555bc6-hhmbf-x-istio-system-x-vcluster-57d3f70b_export-loki-logs-otel-06-10-01-vcluster(b7693025-66fc-4d3b-976b-1ebf6d63599c)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"istiod-598d555bc6-hhmbf-x-istio-system-x-vcluster-57d3f70b_export-loki-logs-otel-06-10-01-vcluster(b7693025-66fc-4d3b-976b-1ebf6d63599c)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"c7d21e4e06b420dad5fc6c0f730e7cba1560a18036c385faf94f3e26361d925f\\\": plugin type=\\\"cilium-cni\\\" failed (add): unable to create endpoint: Cilium API client timeout exceeded\"" pod="export-loki-logs-otel-06-10-01-vcluster/istiod-598d555bc6-hhmbf-x-istio-system-x-vcluster-57d3f70b" podUID="b7693025-66fc-4d3b-976b-1ebf6d63599c"
The cilium-agent logs show a bunch of these errors:
> Hey! I have a similiar problem with some of my pods in a AWS cluster. This is the error message:
> kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "f40f8a986e4c53a727336b51fb698fee152f06b3357da0079a5ed204ed7d22a0": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded
> Cilium verison is 1.16.0-dev. Anyone encountered the same issue with that verison as well?
Yes, on 1.15.6.
We had to add an alert for pods being in the init phase for longer than X, and the only way to fix this is to manually delete the pod so it gets rescheduled on a different node (and cordon the faulty node as well).
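The alert-plus-manual-delete loop described above can be sketched as a filter over `kubectl get pods -A` output. `stuck_pods` is a hypothetical helper shown against a sample line so the parsing is visible; it assumes the default six-column output where STATUS is column 4 and AGE is column 6.

```shell
# Print namespace/name for pods that have been Pending longer than a
# threshold (in minutes). stuck_pods is a hypothetical helper that parses
# `kubectl get pods -A` output; AGE values look like "35m", "2h", "45s".
stuck_pods() {
  awk -v max="$1" 'NR > 1 && $4 == "Pending" {
    age = $6 + 0                                        # numeric prefix of AGE
    if (index($6, "h") > 0 || (index($6, "m") > 0 && age >= max))
      print $1 "/" $2
  }'
}

# Sample input in place of: kubectl get pods -A | stuck_pods 20
printf 'NAMESPACE NAME READY STATUS RESTARTS AGE\nmonitoring prometheus-0 0/2 Pending 0 35m\n' \
  | stuck_pods 20
# -> monitoring/prometheus-0
```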
Thank you for all the comments. As Chris pointed out, if you are running a Cilium version below 1.15.5 it will not have the statefulset fix, so please try upgrading to that.
For those who mention they are running on or above v1.15.5, please provide reproducible steps so that we can track down the issue.
Thank you
Seeing this error on EKS 1.30 in AWS, Cilium 1.15.5. Not sure how to reproduce yet; I have 8 clusters and it's happening on several of them, in one case with the ca-injector pod of the cert-manager service.
Warning FailedCreatePodSandBox 63s (x14 over 23m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "fcbcd10d03a65468af8d17915321cd806aeabcb1645b45258e7d044d608b5555": plugin type="cilium-cni" failed (add): unable to create endpoint: Cilium API client timeout exceeded