etcd-operator fails to start

githubcdr opened this issue 6 years ago • 11 comments

Hi,

I ran into an issue where the etcd-operator fails to bring up the etcd cluster; this happened after a crash of all my Kubernetes nodes.

The etcd-operator tries to bootstrap from scratch and keeps doing so, never reaching a working state:

time="2019-02-05T20:26:07Z" level=info msg="Deploying etcd-operator deployment..."
time="2019-02-05T20:26:07Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:08Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:09Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:10Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:11Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:12Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:13Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:14Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:15Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:16Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:17Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:18Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:19Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:20Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:21Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:22Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:23Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:24Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:25Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:26Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:27Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:28Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:29Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:30Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:31Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:32Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:33Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:34Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:35Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:26:35Z" level=info msg="Done! Re-creating etcd-operator deployment..."
time="2019-02-05T20:26:35Z" level=info msg="Done!"
time="2019-02-05T20:26:35Z" level=info msg="Deploying Cilium etcd cluster CR..."
time="2019-02-05T20:26:35Z" level=info msg=Done
time="2019-02-05T20:26:35Z" level=info msg="Sleeping for 5m0s to allow cluster to come up..."
time="2019-02-05T20:31:35Z" level=info msg="Starting to monitor cluster health..."
time="2019-02-05T20:31:37Z" level=info msg="Deploying etcd-operator CRD..."
time="2019-02-05T20:31:37Z" level=info msg="Done!"
time="2019-02-05T20:31:37Z" level=info msg="Deploying etcd-operator deployment..."
time="2019-02-05T20:31:37Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:31:39Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:31:40Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:31:41Z" level=info msg="Waiting for previous etcd-operator deployment to be removed..."
time="2019-02-05T20:31:41Z" level=info msg="Done! Re-creating etcd-operator deployment..."
time="2019-02-05T20:31:41Z" level=info msg="Done!"
time="2019-02-05T20:31:41Z" level=info msg="No running etcd pod found. Bootstrapping from scratch..."
time="2019-02-05T20:31:41Z" level=info msg="Deploying Cilium etcd cluster CR..."
time="2019-02-05T20:31:41Z" level=info msg=Done
time="2019-02-05T20:31:41Z" level=info msg="Sleeping for 5m0s to allow cluster to come up..."

I ran cleanup.sh and re-deployed, but the issue stays the same.

Is this a bug or am I missing something here?

githubcdr · Feb 05 '19 20:02

Hello @githubcdr, the etcd pods require a CNI plugin to be available, in this case Cilium. You can deploy Cilium and the etcd-operator at the same time.

Let me know if this is the main problem.

aanm · Feb 06 '19 16:02

Hi, I know, and that does not seem to be the issue here. I had a working setup; after a crash the etcd-operator failed to bootstrap, as shown in the logs.

For now I use a single-instance etcd, but I would like to know how to solve this issue. I will retry when 1.4 goes live.

githubcdr · Feb 07 '19 20:02

@githubcdr OK, please do try it when v1.4 is out. You can already try v1.4.0-rc8 if you want.

aanm · Feb 08 '19 15:02

I'm having a similar issue (it looks like #5 in the beginning); the etcd nodes are stuck in Init.

After a while the cilium pods fail because they can't connect to etcd, and the etcd-operator pod cannot be created because there is no CNI...

Basically I see a couple of circular dependencies with the self-hosted etcd:

  • kube-dns/coredns require CNI to start and etcd nodes require dns resolution to init
  • etcd nodes require CNI and cilium nodes require etcd
  • etcd nodes require etcd-operator to work but it requires CNI
  • cilium operator requires cilium and cilium requires cilium-operator ...

I have two workers on EKS; this is the blocked state:

NAMESPACE     NAME                                    READY   STATUS              RESTARTS   AGE
kube-system   aws-node-5lzh9                          1/1     Running             0          4h15m
kube-system   aws-node-lc8ln                          1/1     Running             0          4h15m
kube-system   cilium-etcd-operator-6b9bb595b8-rxlcv   1/1     Running             0          13m
kube-system   cilium-hv2dg                            0/1     Running             3          12m
kube-system   cilium-lv69c                            0/1     Running             3          13m
kube-system   cilium-operator-859bf557d8-bn29c        0/1     ContainerCreating   0          11m
kube-system   coredns-79c8c8bc-bhv9l                  0/1     ContainerCreating   0          13m
kube-system   coredns-79c8c8bc-fw82h                  0/1     ContainerCreating   0          13m
kube-system   etcd-operator-78bcbf4574-lsg5c          0/1     ContainerCreating   0          2m45s
kube-system   kube-proxy-f2276                        1/1     Running             0          4h15m
kube-system   kube-proxy-zckvb                        1/1     Running             0          4h15m
$ kubectl -n kube-system exec -ti cilium-hv2dg -- cilium status --verbose
KVStore:                Failure   Err: Not able to connect to any etcd endpoints
ContainerRuntime:       Ok        docker daemon: OK
Kubernetes:             Ok        1.12+ (v1.12.6-eks-d69f1b) [linux/amd64]
Kubernetes APIs:        ["CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "core/v1::Endpoint", "core/v1::Namespace", "core/v1::Node", "core/v1::Pods", "core/v1::Service", "networking.k8s.io/v1::NetworkPolicy"]
Cilium:                 Failure   Kvstore service is not ready
NodeMonitor:            Disabled
Cilium health daemon:   Ok   
IPv4 address pool:      6/65535 allocated from 10.65.0.0/16
Allocated addresses:
  10.65.0.1 (router)
  10.65.136.165 (kube-system/coredns-79c8c8bc-bhv9l)
  10.65.166.84 (loopback)
  10.65.199.138 (health)
  10.65.241.229 (kube-system/cilium-operator-859bf557d8-bn29c)
  10.65.67.13 (kube-system/etcd-operator-78bcbf4574-lsg5c)
Controller Status:   14/14 healthy
  Name                                Last success   Last error   Count   Message
  cilium-health-ep                    6s ago         never        0       no error   
  dns-garbage-collector-job           39s ago        never        0       no error   
  kvstore-etcd-session-renew          never          never        0       no error   
  lxcmap-bpf-host-sync                5s ago         never        0       no error   
  metricsmap-bpf-prom-sync            5s ago         never        0       no error   
  resolve-identity-0                  37s ago        never        0       no error   
  resolve-identity-1644               never          never        0       no error   
  sync-IPv4-identity-mapping (0)      never          never        0       no error   
  sync-lb-maps-with-k8s-services      37s ago        never        0       no error   
  sync-policymap-813                  36s ago        never        0       no error   
  sync-to-k8s-ciliumendpoint (137)    6s ago         never        0       no error   
  sync-to-k8s-ciliumendpoint (1644)   6s ago         never        0       no error   
  sync-to-k8s-ciliumendpoint (2631)   6s ago         never        0       no error   
  template-dir-watcher                never          never        0       no error   
Proxy Status:   OK, ip 10.65.0.1, port-range 10000-20000
command terminated with exit code 1

cilium-operator fails:

  Warning  FailedCreatePodSandBox  3m43s               kubelet, ip-172-16-14-65.eu-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "5a7de195dabe3177e475ddbe4ab2680275648c86fa7cf0a3acdc44ba01701006" network for pod "cilium-operator-859bf557d8-bn29c": NetworkPlugin cni failed to set up pod "cilium-operator-859bf557d8-bn29c_kube-system" network: unexpected end of JSON input
  Warning  FailedCreatePodSandBox  104s                kubelet, ip-172-16-14-65.eu-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "3739736e77f63da24962ff869bdefa06200c052a63caeebcbbbdbaf2e85f94c6" network for pod "cilium-operator-859bf557d8-bn29c": NetworkPlugin cni failed to set up pod "cilium-operator-859bf557d8-bn29c_kube-system" network: unexpected end of JSON input, failed to clean up sandbox container "3739736e77f63da24962ff869bdefa06200c052a63caeebcbbbdbaf2e85f94c6" network for pod "cilium-operator-859bf557d8-bn29c": NetworkPlugin cni failed to teardown pod "cilium-operator-859bf557d8-bn29c_kube-system" network: unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get http://%!F(MISSING)var%!F(MISSING)run%!F(MISSING)cilium%!F(MISSING)cilium.sock/v1/config: dial unix /var/run/cilium/cilium.sock: connect: connection refused]
  Normal   SandboxChanged          103s (x8 over 12m)  kubelet, ip-172-16-14-65.eu-west-2.compute.internal  Pod sandbox changed, it will be killed and re-created.

etcd-operator fails:

  Warning  FailedCreatePodSandBox  55s                kubelet, ip-172-16-14-65.eu-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "1af198dcce2527c8ba6942f3fe775cca94a409772fda556b6a787c66169221b5" network for pod "etcd-operator-78bcbf4574-v44sd": NetworkPlugin cni failed to set up pod "etcd-operator-78bcbf4574-v44sd_kube-system" network: unexpected end of JSON input
  Warning  FailedCreatePodSandBox  48s                kubelet, ip-172-16-14-65.eu-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "e0a79cbfc4bffef4389ef9e76977161a9bb9686773bf7694797f21272b76c1a0" network for pod "etcd-operator-78bcbf4574-v44sd": NetworkPlugin cni failed to set up pod "etcd-operator-78bcbf4574-v44sd_kube-system" network: unexpected end of JSON input, failed to clean up sandbox container "e0a79cbfc4bffef4389ef9e76977161a9bb9686773bf7694797f21272b76c1a0" network for pod "etcd-operator-78bcbf4574-v44sd": NetworkPlugin cni failed to teardown pod "etcd-operator-78bcbf4574-v44sd_kube-system" network: failed to find plugin "cilium-cni" in path [/opt/cni/bin]]
  Normal   SandboxChanged          47s (x2 over 55s)  kubelet, ip-172-16-14-65.eu-west-2.compute.internal  Pod sandbox changed, it will be killed and re-created.

core-dns fails:

  Warning  FailedCreatePodSandBox  5m2s                kubelet, ip-172-16-14-65.eu-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "36042c2da39106c5d95e65c7313cc138117c270d229f6b4f405ed433e41c7751" network for pod "coredns-79c8c8bc-bhv9l": NetworkPlugin cni failed to set up pod "coredns-79c8c8bc-bhv9l_kube-system" network: unexpected end of JSON input, failed to clean up sandbox container "36042c2da39106c5d95e65c7313cc138117c270d229f6b4f405ed433e41c7751" network for pod "coredns-79c8c8bc-bhv9l": NetworkPlugin cni failed to teardown pod "coredns-79c8c8bc-bhv9l_kube-system" network: unable to connect to Cilium daemon: failed to create cilium agent client after 30.000000 seconds timeout: Get http://%!F(MISSING)var%!F(MISSING)run%!F(MISSING)cilium%!F(MISSING)cilium.sock/v1/config: dial unix /var/run/cilium/cilium.sock: connect: connection refused]
  Normal   SandboxChanged          14s (x13 over 18m)  kubelet, ip-172-16-14-65.eu-west-2.compute.internal  Pod sandbox changed, it will be killed and re-created.
  Warning  FailedCreatePodSandBox  14s (x3 over 3m9s)  kubelet, ip-172-16-14-65.eu-west-2.compute.internal  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "3fde887abf14d6b79670d4090e4949c6038f3ca1f5af839b1da918a13721459e" network for pod "coredns-79c8c8bc-bhv9l": NetworkPlugin cni failed to set up pod "coredns-79c8c8bc-bhv9l_kube-system" network: unexpected end of JSON input

and in the beginning etcd fails:

  Warning  FailedCreatePodSandBox  17s   kubelet, ip-172-16-3-107.eu-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7842241897cd658624e5bda52a625e4f55a044d49fb8594cce69ca40614e9242" network for pod "cilium-etcd-95tvvp4nsr": NetworkPlugin cni failed to set up pod "cilium-etcd-95tvvp4nsr_kube-system" network: unexpected end of JSON input
  Normal   SandboxChanged          17s   kubelet, ip-172-16-3-107.eu-west-2.compute.internal  Pod sandbox changed, it will be killed and re-created.

No idea what went wrong; I tried with both the 1.4 and 1.5 (rc) versions.

pawelprazak · Apr 24 '19 11:04

I've tried to run coredns with hostNetwork: true to break one of the cycles, but with limited success.
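
For reference, a minimal sketch of that change, assuming the stock coredns Deployment in kube-system (dnsPolicy is switched to Default as well so the pod resolves through the node rather than through itself):

kubectl -n kube-system patch deployment coredns --type merge \
  -p '{"spec":{"template":{"spec":{"hostNetwork":true,"dnsPolicy":"Default"}}}}'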

I wish Cilium were using Kubernetes CRDs like Calico does; it would make things vastly less complex. @tgraf, any chance of that happening?

pawelprazak · Apr 24 '19 11:04

@pawelprazak Yes, it is happening: https://github.com/cilium/cilium/pull/7573

It is targeted for 1.6.

tgraf · Apr 25 '19 08:04

I've reproduced the problem:

  • everything works on a new EKS cluster
  • once you delete all the worker nodes and recreate them, it breaks into the state described above

I've tried everything I could think of, but I couldn't find a way to make cilium stable again on EKS once it got into this state.

pawelprazak · Apr 26 '19 07:04

I've tried to clean up everything:

	kubectl -n kube-system delete crd etcdclusters.etcd.database.coreos.com
	kubectl -n kube-system delete crd ciliumendpoints.cilium.io
	kubectl -n kube-system delete crd ciliumnetworkpolicies.cilium.io
	kubectl -n kube-system delete deployment etcd-operator
	kubectl -n kube-system delete secrets cilium-etcd-client-tls
	kubectl -n kube-system delete secrets cilium-etcd-peer-tls
	kubectl -n kube-system delete secrets cilium-etcd-server-tls
	kubectl -n kube-system delete secrets cilium-etcd-secrets
	kubectl delete -f cilium.yaml

but it didn't help.
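
To confirm the cleanup actually removed everything before re-deploying, something like this can be used to check for leftovers (a sketch; the names match the resources deleted above):

kubectl get crd | grep -E 'cilium|etcd'
kubectl -n kube-system get deployments,secrets | grep -E 'cilium|etcd'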

Looks like the endpoints are not being created:

# pods
NAMESPACE     NAME                                    READY   STATUS              RESTARTS   AGE
kube-system   aws-node-q548z                          1/1     Running             0          19h
kube-system   aws-node-vbmc8                          1/1     Running             0          17h
kube-system   cilium-5sgcx                            1/1     Running             3          11m
kube-system   cilium-etcd-operator-76784dc54b-kdc54   1/1     Running             0          11m
kube-system   cilium-operator-f7b8f7674-5sj7r         1/1     Running             3          11m
kube-system   cilium-qbl44                            1/1     Running             3          11m
kube-system   coredns-5c9bdb6577-5q49f                0/1     ContainerCreating   0          9m51s
kube-system   coredns-5c9bdb6577-gccsx                0/1     ContainerCreating   0          9m51s
kube-system   etcd-operator-86bff97c4f-jdhvb          0/1     ContainerCreating   0          71s
kube-system   kube-proxy-rhh9b                        1/1     Running             0          19h
kube-system   kube-proxy-zzk8b                        1/1     Running             0          17h
# services
NAME       TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE   SELECTOR
kube-dns   ClusterIP   10.100.0.10   <none>        53/UDP,53/TCP,9153/TCP   44m   k8s-app=kube-dns
# endpoints
NAME                      ENDPOINTS   AGE
etcd-operator             <none>      41m
kube-controller-manager   <none>      24h
kube-dns                  <none>      44m
kube-scheduler            <none>      24h

pawelprazak · Apr 26 '19 08:04

@pawelprazak what's the output of kubectl get pods --all-namespaces -o wide? Thanks

aanm · Apr 28 '19 20:04

at first:

aws-node-8m9j6                          1/1     Running             0          6h3m    172.16.14.128   ip-172-16-14-128....   <none>
cilium-etcd-operator-76784dc54b-pz2h9   1/1     Running             0          84s     172.16.14.128   ip-172-16-14-128....   <none>
cilium-etcd-pptb59fz5b                  0/1     Running             0          70s     172.16.14.170   ip-172-16-14-128....   <none>
cilium-etcd-pxnsjrr5vj                  0/1     Init:0/1            0          53s     <none>          ip-172-16-14-128....   <none>
cilium-operator-f7b8f7674-pppch         1/1     Running             0          85s     172.16.14.14    ip-172-16-14-128....   <none>
cilium-t68dr                            1/1     Running             0          83s     172.16.14.128   ip-172-16-14-128....   <none>
cluster-autoscaler-7999ccbdbf-b9w9f     1/1     Running             0          50m     172.16.14.237   ip-172-16-14-128....   <none>
cluster-autoscaler-7999ccbdbf-dshfj     1/1     Running             0          50m     172.16.14.193   ip-172-16-14-128....   <none>
coredns-85f75755c7-ssb4v                0/1     ContainerCreating   0          47s     <none>          ip-172-16-14-128....   <none>
coredns-85f75755c7-x9b2b                0/1     ContainerCreating   0          47s     <none>          ip-172-16-14-128....   <none>
etcd-operator-86bff97c4f-kx77h          1/1     Running             0          80s     172.16.14.211   ip-172-16-14-128....   <none>
kube-proxy-gw6df                        1/1     Running             0          6h3m    172.16.14.128   ip-172-16-14-128....   <none>
kube-state-metrics-b764bb5c7-9ks8l      2/2     Running             0          6h3m    172.16.14.31    ip-172-16-14-128....   <none>
kube2iam-5cmg5                          1/1     Running             0          75s     172.16.14.128   ip-172-16-14-128....   <none>
metrics-server-84b6d4774f-cgxhs         1/1     Running             0          6h16m   172.16.14.128   ip-172-16-14-128....   <none>

and after a while:

NAME                                    READY   STATUS              RESTARTS   AGE     IP              NODE                                          NOMINATED NODE
aws-node-8m9j6                          1/1     Running             0          6h8m    172.16.14.128   ip-172-16-14-128....   <none>
cilium-etcd-operator-76784dc54b-pz2h9   1/1     Running             0          5m48s   172.16.14.128   ip-172-16-14-128....   <none>
cilium-operator-f7b8f7674-pppch         1/1     Running             0          5m49s   172.16.14.14    ip-172-16-14-128....   <none>
cilium-t68dr                            1/1     Running             0          5m47s   172.16.14.128   ip-172-16-14-128....   <none>
cluster-autoscaler-7999ccbdbf-b9w9f     1/1     Running             0          55m     172.16.14.237   ip-172-16-14-128....   <none>
cluster-autoscaler-7999ccbdbf-dshfj     1/1     Running             0          55m     172.16.14.193   ip-172-16-14-128....   <none>
coredns-85f75755c7-ssb4v                0/1     ContainerCreating   0          5m11s   <none>          ip-172-16-14-128....   <none>
coredns-85f75755c7-x9b2b                0/1     ContainerCreating   0          5m11s   <none>          ip-172-16-14-128....   <none>
etcd-operator-86bff97c4f-vllb7          0/1     ContainerCreating   0          40s     <none>          ip-172-16-14-128....   <none>
kube-proxy-gw6df                        1/1     Running             0          6h8m    172.16.14.128   ip-172-16-14-128....   <none>
kube-state-metrics-b764bb5c7-9ks8l      2/2     Running             0          6h7m    172.16.14.31    ip-172-16-14-128....   <none>
kube2iam-5cmg5                          1/1     Running             0          5m39s   172.16.14.128   ip-172-16-14-128....   <none>
metrics-server-84b6d4774f-cgxhs         1/1     Running             0          6h20m   172.16.14.128   ip-172-16-14-128....   <none>

pawelprazak · Apr 29 '19 14:04

Incidentally, while I ABSOLUTELY DO NOT RECOMMEND THIS FOR A PRODUCTION CLUSTER, I was able to repair this by shoving the cilium state into the etcd cluster that kubeadm provides for Kubernetes.

(It's bad for production as you're giving access to your API server's own data storage layer; anyone who can read secrets in the namespace in which cilium is deployed would get direct access to etcd. Privesc ahoy!)

If you're in a bad way in production and you need to repair it fast, one strategy might be to deploy an etcd node somewhere safe (kube-master, separate cluster) and patch the cilium config map to point to that.
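
A minimal sketch of that approach, assuming the stock cilium-config ConfigMap layout from the install manifests (the endpoint address below is hypothetical):

kubectl -n kube-system edit configmap cilium-config
# point the etcd-config block at the external etcd, e.g.:
#   etcd-config: |-
#     ---
#     endpoints:
#     - http://172.16.0.10:2379   # hypothetical external etcd endpoint
kubectl -n kube-system delete pod -l k8s-app=cilium   # restart the agents so they pick up the new config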

andrewhowdencom · May 28 '19 20:05