cilium-etcd-operator
etcd-operator fails to start
Hi,
I ran into an issue where the etcd-operator fails to bring up the etcd cluster; this happened after all my Kubernetes nodes crashed.
The etcd-operator tries to bootstrap from scratch and keeps doing so, never reaching a healthy cluster state:
time="2019-02-05T20:26:07Z" level=info msg="Deploying etcd-operator deployment..."
time="2019-02-05T20:26:07Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:08Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:09Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:10Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:11Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:12Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:13Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:14Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:15Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:16Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:17Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:18Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:19Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:20Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:21Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:22Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:23Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:24Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:25Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:26Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:27Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:28Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:29Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:30Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:31Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:32Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:33Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:34Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:35Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:35Z" level=info msg="Done! Re-creating etcd-operator deployment..."
time="2019-02-05T20:26:35Z" level=info msg="Done!"
time="2019-02-05T20:26:35Z" level=info msg="Deploying Cilium etcd cluster CR..."
time="2019-02-05T20:26:35Z" level=info msg=Done
time="2019-02-05T20:26:35Z" level=info msg="Sleeping for 5m0s to allow cluster to come up..."
time="2019-02-05T20:31:35Z" level=info msg="Starting to monitor cluster health..."
time="2019-02-05T20:31:37Z" level=info msg="Deploying etcd-operator CRD..."
time="2019-02-05T20:31:37Z" level=info msg="Done!"
time="2019-02-05T20:31:37Z" level=info msg="Deploying etcd-operator deployment..."
time="2019-02-05T20:31:37Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:31:39Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:31:40Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:31:41Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:31:41Z" level=info msg="Done! Re-creating etcd-operator deployment..."
time="2019-02-05T20:31:41Z" level=info msg="Done!"
time="2019-02-05T20:31:41Z" level=info msg="No running etcd pod found. Bootstrapping from scratch..."
time="2019-02-05T20:31:41Z" level=info msg="Deploying Cilium etcd cluster CR..."
time="2019-02-05T20:31:41Z" level=info msg=Done
time="2019-02-05T20:31:41Z" level=info msg="Sleeping for 5m0s to allow cluster to come up..."
I ran cleanup.sh and re-deployed, but the issue remains the same.
Is this a bug or am I missing something here?
Hello @githubcdr, the etcd pods require a CNI plugin to be available, in this case Cilium. You can run Cilium and the etcd-operator at the same time.
let me know if this is the main problem.
Hi, I know, and that does not seem to be the issue here. I had a working setup; after a crash, the etcd-operator failed to bootstrap, as shown in the logs.
For now I use a single-instance etcd, but I would like to know how to solve this issue. I will retry when 1.4 goes live.
@githubcdr Ok, please do try it when v1.4 is out. You can already try v1.4.0-rc8 if you want.
I'm having a similar issue (it looks like #5 in the beginning); the etcd nodes are stuck in init.
After a while the cilium nodes fail because they can't connect to etcd, and the etcd-operator cannot be created because there is no CNI...
Basically I see a couple of circular dependencies with the self-hosted etcd:
- kube-dns/coredns require CNI to start and etcd nodes require dns resolution to init
- etcd nodes require CNI and cilium nodes require etcd
- etcd nodes require etcd-operator to work but it requires CNI
- cilium operator requires cilium and cilium requires cilium-operator ...
I have two workers on EKS, this is the blocked state:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-5lzh9 1/1 Running 0 4h15m
kube-system aws-node-lc8ln 1/1 Running 0 4h15m
kube-system cilium-etcd-operator-6b9bb595b8-rxlcv 1/1 Running 0 13m
kube-system cilium-hv2dg 0/1 Running 3 12m
kube-system cilium-lv69c 0/1 Running 3 13m
kube-system cilium-operator-859bf557d8-bn29c 0/1 ContainerCreating 0 11m
kube-system coredns-79c8c8bc-bhv9l 0/1 ContainerCreating 0 13m
kube-system coredns-79c8c8bc-fw82h 0/1 ContainerCreating 0 13m
kube-system etcd-operator-78bcbf4574-lsg5c 0/1 ContainerCreating 0 2m45s
kube-system kube-proxy-f2276 1/1 Running 0 4h15m
kube-system kube-proxy-zckvb 1/1 Running 0 4h15m
$ kubectl -n kube-system exec -ti cilium-hv2dg -- cilium status --verbose
KVStore: Failure Err: Not able to connect to any etcd endpoints
ContainerRuntime: Ok docker daemon: OK
Kubernetes: Ok 1.12+ (v1.12.6-eks-d69f1b) [linux/amd64]
Kubernetes APIs: ["CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
Cilium: Failure Kvstore service is not ready
NodeMonitor: Disabled
Cilium health daemon: Ok
IPv4 address pool: 6/65535 allocated from 10.65.0.0/16
Allocated addresses:
10.65.0.1 (router)
10.65.136.165 (kube-system/coredns-79c8c8bc-bhv9l)
10.65.166.84 (loopback)
10.65.199.138 (health)
10.65.241.229 (kube-system/cilium-operator-859bf557d8-bn29c)
10.65.67.13 (kube-system/etcd-operator-78bcbf4574-lsg5c)
Controller Status: 14/14 healthy
Name Last success Last error Count Message
cilium-health-ep 6s ago never 0 no error
dns-garbage-collector-job 39s ago never 0 no error
kvstore-etcd-session-renew never never 0 no error
lxcmap-bpf-host-sync 5s ago never 0 no error
metricsmap-bpf-prom-sync 5s ago never 0 no error
resolve-identity-0 37s ago never 0 no error
resolve-identity-1644 never never 0 no error
sync-IPv4-identity-mapping (0) never never 0 no error
sync-lb-maps-with-k8s-services 37s ago never 0 no error
sync-policymap-813 36s ago never 0 no error
sync-to-k8s-ciliumendpoint (137) 6s ago never 0 no error
sync-to-k8s-ciliumendpoint (1644) 6s ago never 0 no error
sync-to-k8s-ciliumendpoint (2631) 6s ago never 0 no error
template-dir-watcher never never 0 no error
Proxy Status: OK, ip 10.65.0.1, port-range 10000-20000
command terminated with exit code 1
cilium-operator fails:
Warning FailedCreatePodSandBox 3m43s kubelet, ip-172-16-14-65.eu-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5a7de195dabe3177e475ddbe4ab2680275648c86fa7cf0a3acdc44ba01701006" network for pod "cilium-operator-859bf557d8-bn29c": NetworkPlugin cni failed to set up pod "cilium-operator-859bf557d8-bn29c_kube-system" network: unexpected end of JSON input
Warning FailedCreatePodSandBox 104s kubelet, ip-172-16-14-65.eu-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "3739736e77f63da24962ff869bdefa06200c052a63caeebcbbbdbaf2e85f94c6" network for pod "cilium-operator-859bf557d8-bn29c": NetworkPlugin cni failed to set up pod "cilium-operator-859bf557d8-bn29c_kube-system" network: unexpected end of JSON input, failed to clean up sandbox container "3739736e77f63da24962ff869bdefa06200c052a63caeebcbbbdbaf2e85f94c6" network for pod "cilium-operator-859bf557d8-bn29c": NetworkPlugin cni failed to teardown pod "cilium-operator-859bf557d8-bn29c_kube-system" network: unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get http://%!F(MISSING)var%!F(MISSING)run%!F(MISSING)cilium%!F(MISSING)cilium.sock/v1/config: dial unix /var/run/cilium/cilium.sock: connect: connection refused]
Normal SandboxChanged 103s (x8 over 12m) kubelet, ip-172-16-14-65.eu-west-2.compute.internal Pod sandbox changed, it will be killed and re-created.
etcd-operator fails:
Warning FailedCreatePodSandBox 55s kubelet, ip-172-16-14-65.eu-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1af198dcce2527c8ba6942f3fe775cca94a409772fda556b6a787c66169221b5" network for pod "etcd-operator-78bcbf4574-v44sd": NetworkPlugin cni failed to set up pod "etcd-operator-78bcbf4574-v44sd_kube-system" network: unexpected end of JSON input
Warning FailedCreatePodSandBox 48s kubelet, ip-172-16-14-65.eu-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "e0a79cbfc4bffef4389ef9e76977161a9bb9686773bf7694797f21272b76c1a0" network for pod "etcd-operator-78bcbf4574-v44sd": NetworkPlugin cni failed to set up pod "etcd-operator-78bcbf4574-v44sd_kube-system" network: unexpected end of JSON input, failed to clean up sandbox container "e0a79cbfc4bffef4389ef9e76977161a9bb9686773bf7694797f21272b76c1a0" network for pod "etcd-operator-78bcbf4574-v44sd": NetworkPlugin cni failed to teardown pod "etcd-operator-78bcbf4574-v44sd_kube-system" network: failed to find plugin "cilium-cni" in path [/opt/cni/bin]]
Normal SandboxChanged 47s (x2 over 55s) kubelet, ip-172-16-14-65.eu-west-2.compute.internal Pod sandbox changed, it will be killed and re-created.
core-dns fails:
Warning FailedCreatePodSandBox 5m2s kubelet, ip-172-16-14-65.eu-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "36042c2da39106c5d95e65c7313cc138117c270d229f6b4f405ed433e41c7751" network for pod "coredns-79c8c8bc-bhv9l": NetworkPlugin cni failed to set up pod "coredns-79c8c8bc-bhv9l_kube-system" network: unexpected end of JSON input, failed to clean up sandbox container "36042c2da39106c5d95e65c7313cc138117c270d229f6b4f405ed433e41c7751" network for pod "coredns-79c8c8bc-bhv9l": NetworkPlugin cni failed to teardown pod "coredns-79c8c8bc-bhv9l_kube-system" network: unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get http://%!F(MISSING)var%!F(MISSING)run%!F(MISSING)cilium%!F(MISSING)cilium.sock/v1/config: dial unix /var/run/cilium/cilium.sock: connect: connection refused]
Normal SandboxChanged 14s (x13 over 18m) kubelet, ip-172-16-14-65.eu-west-2.compute.internal Pod sandbox changed, it will be killed and re-created.
Warning FailedCreatePodSandBox 14s (x3 over 3m9s) kubelet, ip-172-16-14-65.eu-west-2.compute.internal (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3fde887abf14d6b79670d4090e4949c6038f3ca1f5af839b1da918a13721459e" network for pod "coredns-79c8c8bc-bhv9l": NetworkPlugin cni failed to set up pod "coredns-79c8c8bc-bhv9l_kube-system" network: unexpected end of JSON input
and in the beginning etcd fails:
Warning FailedCreatePodSandBox 17s kubelet, ip-172-16-3-107.eu-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7842241897cd658624e5bda52a625e4f55a044d49fb8594cce69ca40614e9242" network for pod "cilium-etcd-95tvvp4nsr": NetworkPlugin cni failed to set up pod "cilium-etcd-95tvvp4nsr_kube-system" network: unexpected end of JSON input
Normal SandboxChanged 17s kubelet, ip-172-16-3-107.eu-west-2.compute.internal Pod sandbox changed, it will be killed and re-created.
No idea what went wrong; I tried with both the 1.4 and 1.5 (rc) versions.
I've tried running coredns with hostNetwork: true to break one of the cycles, but with limited success.
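For reference, roughly what that attempt looked like (a sketch only; the coredns deployment name matches the stock EKS add-on shown above, and the dnsPolicy change is my own assumption to keep upstream resolution working with host networking):
# give CoreDNS host networking so it no longer needs the CNI plugin to start
kubectl -n kube-system patch deployment coredns --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"Default"}}}}'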
I wish Cilium used Kubernetes CRDs the way Calico does; it would make things vastly less complex. @tgraf, any chance of that happening?
@pawelprazak Yes, it is happening: https://github.com/cilium/cilium/pull/7573
It is targeted for 1.6.
I've reproduced the problem:
- everything works on a new EKS cluster
- once you delete all worker nodes and recreate them, it breaks into the state described above
I've tried everything I could think of, but I couldn't find a way to make cilium stable again on EKS once it got into this state.
I've tried to cleanup everything:
kubectl -n kube-system delete crd etcdclusters.etcd.database.coreos.com
kubectl -n kube-system delete crd ciliumendpoints.cilium.io
kubectl -n kube-system delete crd ciliumnetworkpolicies.cilium.io
kubectl -n kube-system delete deployment etcd-operator
kubectl -n kube-system delete secrets cilium-etcd-client-tls
kubectl -n kube-system delete secrets cilium-etcd-peer-tls
kubectl -n kube-system delete secrets cilium-etcd-server-tls
kubectl -n kube-system delete secrets cilium-etcd-secrets
kubectl delete -f cilium.yaml
but it didn't help.
Looks like the endpoints are not being created:
# pods
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-q548z 1/1 Running 0 19h
kube-system aws-node-vbmc8 1/1 Running 0 17h
kube-system cilium-5sgcx 1/1 Running 3 11m
kube-system cilium-etcd-operator-76784dc54b-kdc54 1/1 Running 0 11m
kube-system cilium-operator-f7b8f7674-5sj7r 1/1 Running 3 11m
kube-system cilium-qbl44 1/1 Running 3 11m
kube-system coredns-5c9bdb6577-5q49f 0/1 ContainerCreating 0 9m51s
kube-system coredns-5c9bdb6577-gccsx 0/1 ContainerCreating 0 9m51s
kube-system etcd-operator-86bff97c4f-jdhvb 0/1 ContainerCreating 0 71s
kube-system kube-proxy-rhh9b 1/1 Running 0 19h
kube-system kube-proxy-zzk8b 1/1 Running 0 17h
# services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kube-dns ClusterIP 10.100.0.10 <none> 53/UDP,53/TCP,9153/TCP 44m k8s-app=kube-dns
# endpoints
NAME ENDPOINTS AGE
etcd-operator <none> 41m
kube-controller-manager <none> 24h
kube-dns <none> 44m
kube-scheduler <none> 24h
@pawelprazak what's the output of kubectl get pods --all-namespaces -o wide? Thanks
at first:
aws-node-8m9j6 1/1 Running 0 6h3m 172.16.14.128 ip-172-16-14-128.... <none>
cilium-etcd-operator-76784dc54b-pz2h9 1/1 Running 0 84s 172.16.14.128 ip-172-16-14-128.... <none>
cilium-etcd-pptb59fz5b 0/1 Running 0 70s 172.16.14.170 ip-172-16-14-128.... <none>
cilium-etcd-pxnsjrr5vj 0/1 Init:0/1 0 53s <none> ip-172-16-14-128.... <none>
cilium-operator-f7b8f7674-pppch 1/1 Running 0 85s 172.16.14.14 ip-172-16-14-128.... <none>
cilium-t68dr 1/1 Running 0 83s 172.16.14.128 ip-172-16-14-128.... <none>
cluster-autoscaler-7999ccbdbf-b9w9f 1/1 Running 0 50m 172.16.14.237 ip-172-16-14-128.... <none>
cluster-autoscaler-7999ccbdbf-dshfj 1/1 Running 0 50m 172.16.14.193 ip-172-16-14-128.... <none>
coredns-85f75755c7-ssb4v 0/1 ContainerCreating 0 47s <none> ip-172-16-14-128.... <none>
coredns-85f75755c7-x9b2b 0/1 ContainerCreating 0 47s <none> ip-172-16-14-128.... <none>
etcd-operator-86bff97c4f-kx77h 1/1 Running 0 80s 172.16.14.211 ip-172-16-14-128.... <none>
kube-proxy-gw6df 1/1 Running 0 6h3m 172.16.14.128 ip-172-16-14-128.... <none>
kube-state-metrics-b764bb5c7-9ks8l 2/2 Running 0 6h3m 172.16.14.31 ip-172-16-14-128.... <none>
kube2iam-5cmg5 1/1 Running 0 75s 172.16.14.128 ip-172-16-14-128.... <none>
metrics-server-84b6d4774f-cgxhs 1/1 Running 0 6h16m 172.16.14.128 ip-172-16-14-128.... <none>
and after a while:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
aws-node-8m9j6 1/1 Running 0 6h8m 172.16.14.128 ip-172-16-14-128.... <none>
cilium-etcd-operator-76784dc54b-pz2h9 1/1 Running 0 5m48s 172.16.14.128 ip-172-16-14-128.... <none>
cilium-operator-f7b8f7674-pppch 1/1 Running 0 5m49s 172.16.14.14 ip-172-16-14-128.... <none>
cilium-t68dr 1/1 Running 0 5m47s 172.16.14.128 ip-172-16-14-128.... <none>
cluster-autoscaler-7999ccbdbf-b9w9f 1/1 Running 0 55m 172.16.14.237 ip-172-16-14-128.... <none>
cluster-autoscaler-7999ccbdbf-dshfj 1/1 Running 0 55m 172.16.14.193 ip-172-16-14-128.... <none>
coredns-85f75755c7-ssb4v 0/1 ContainerCreating 0 5m11s <none> ip-172-16-14-128.... <none>
coredns-85f75755c7-x9b2b 0/1 ContainerCreating 0 5m11s <none> ip-172-16-14-128.... <none>
etcd-operator-86bff97c4f-vllb7 0/1 ContainerCreating 0 40s <none> ip-172-16-14-128.... <none>
kube-proxy-gw6df 1/1 Running 0 6h8m 172.16.14.128 ip-172-16-14-128.... <none>
kube-state-metrics-b764bb5c7-9ks8l 2/2 Running 0 6h7m 172.16.14.31 ip-172-16-14-128.... <none>
kube2iam-5cmg5 1/1 Running 0 5m39s 172.16.14.128 ip-172-16-14-128.... <none>
metrics-server-84b6d4774f-cgxhs 1/1 Running 0 6h20m 172.16.14.128 ip-172-16-14-128.... <none>
As an aside: while I ABSOLUTELY DO NOT RECOMMEND THIS FOR A PRODUCTION CLUSTER, I was able to repair this by shoving the cilium state into the etcd cluster that kubeadm provides for Kubernetes.
(It's bad for production as you're giving your API server access to its own data storage layer; anyone who can read secrets in the namespace in which cilium is deployed would get direct access to etcd. Privesc ahoy!)
If you're in a bad way in production and need to repair it fast, one strategy might be to deploy an etcd node somewhere safe (kube-master, separate cluster) and patch the cilium config map to point to that.
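Very roughly, something along these lines (a sketch only: the endpoint address is a placeholder, the cilium-config ConfigMap and its etcd-config key follow the stock cilium.yaml manifest, and k8s-app=cilium is the usual agent DaemonSet label; verify all of this against your deployed manifests):
# point Cilium at the externally managed etcd via the etcd-config key
kubectl -n kube-system edit configmap cilium-config
#   etcd-config: |-
#     ---
#     endpoints:
#     - https://10.0.0.5:2379   # placeholder address of the "safe" etcd node
#     # add ca-file/cert-file/key-file entries here if that etcd requires TLS
# then restart the agents so they pick up the new endpoints
kubectl -n kube-system delete pod -l k8s-app=cilium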