
Bare Metal Cluster Never Starts using 1.7.5-tectonic.1 Installer

Open mmellison opened this issue 7 years ago • 16 comments

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

  • Tectonic version (release or commit hash): 1.7.5-tectonic.1
  • Terraform version (terraform version): v0.10.4 (Bundled w/ Installer)
  • Platform (aws|azure|openstack|metal): Metal

What happened?

I am able to run Terraform successfully; however, the cluster never fully comes up. It appears the apiserver is not starting correctly, possibly because of issues with etcd. I am still learning how everything works, so I'm not sure whether the two are related. See the log snippets at the end of this post.

The cluster starts correctly using the installers from the 1.7.3 releases, but even after three attempts with the 1.7.5-tectonic.1 installer I have been unable to get a cluster going.

What you expected to happen?

The apiserver (and thus the cluster) to become available.

How to reproduce it (as minimally and precisely as possible)?

  • Two Master Nodes
  • One Worker Node
  • Provisioned Etcd

Anything else we need to know?

See log snippets below:

Etcd

Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.382731 W | rafthttp: health check for peer 5d62e2d0c21c6423 could not connect: dial tcp: lookup 1.packet.kube.arroyo.io on [::1]:53: read udp [::1]:43632->[::1]:53: read: connection refused
Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.450686 I | raft: 8b84d3e5347e393e is starting a new election at term 1404
Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.450730 I | raft: 8b84d3e5347e393e became candidate at term 1405
Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.450754 I | raft: 8b84d3e5347e393e received MsgVoteResp from 8b84d3e5347e393e at term 1405
Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.450781 I | raft: 8b84d3e5347e393e [logterm: 78, index: 10] sent MsgVote request to 5d62e2d0c21c6423 at term 1405

Kubelet

Oct 13 18:03:34 0.packet.kube.arroyo.io kubelet-wrapper[847]: I1013 18:03:34.606653     847 kubelet_node_status.go:247] Setting node annotation to enable volume controller attach/detach
Oct 13 18:03:34 0.packet.kube.arroyo.io kubelet-wrapper[847]: I1013 18:03:34.609618     847 kubelet_node_status.go:82] Attempting to register node 0.packet.kube.arroyo.io
Oct 13 18:03:34 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:34.611012     847 kubelet_node_status.go:106] Unable to register node "0.packet.kube.arroyo.io" with API server: Post https://m.kube.arroyo.io:443/api/v1/nodes: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:35 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:35.417229     847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://m.kube.arroyo.io:443/api/v1/pods?fieldSelector=spec.nodeName%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:35 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:35.418403     847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https://m.kube.arroyo.io:443/api/v1/nodes?fieldSelector=metadata.name%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:35 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:35.418914     847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Service: Get https://m.kube.arroyo.io:443/api/v1/services?resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused
Oct 13 18:03:36 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:36.418899     847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://m.kube.arroyo.io:443/api/v1/pods?fieldSelector=spec.nodeName%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:36 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:36.419976     847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https://m.kube.arroyo.io:443/api/v1/nodes?fieldSelector=metadata.name%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused
Oct 13 18:03:36 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:36.421421     847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Service: Get https://m.kube.arroyo.io:443/api/v1/services?resourceVersion=0: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:36 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:36.642093     847 eviction_manager.go:238] eviction manager: unexpected err: failed GetNode: node '0.packet.kube.arroyo.io' not found
Oct 13 18:03:37 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:37.420655     847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://m.kube.arroyo.io:443/api/v1/pods?fieldSelector=spec.nodeName%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused
Oct 13 18:03:37 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:37.421686     847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https://m.kube.arroyo.io:443/api/v1/nodes?fieldSelector=metadata.name%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused
Oct 13 18:03:37 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:37.423432     847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Service: Get https://m.kube.arroyo.io:443/api/v1/services?resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused

The same messages repeat continuously, even after ~1 hour, and they appear on both master nodes.

Bootkube

Oct 13 17:52:19 0.packet.kube.arroyo.io bash[1033]: [ 1239.158042] bootkube[5]: W1013 17:52:19.215269       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:52:24 0.packet.kube.arroyo.io bash[1033]: [ 1244.157950] bootkube[5]: W1013 17:52:24.215174       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:52:29 0.packet.kube.arroyo.io bash[1033]: [ 1249.158826] bootkube[5]: W1013 17:52:29.216045       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:52:34 0.packet.kube.arroyo.io bash[1033]: [ 1254.157924] bootkube[5]: W1013 17:52:34.215142       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:52:39 0.packet.kube.arroyo.io bash[1033]: [ 1259.158363] bootkube[5]: W1013 17:52:39.215522       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:52:44 0.packet.kube.arroyo.io bash[1033]: [ 1264.159427] bootkube[5]: W1013 17:52:44.216659       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:52:49 0.packet.kube.arroyo.io bash[1033]: [ 1269.158017] bootkube[5]: W1013 17:52:49.215207       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:52:54 0.packet.kube.arroyo.io bash[1033]: [ 1274.158081] bootkube[5]: W1013 17:52:54.215280       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:52:59 0.packet.kube.arroyo.io bash[1033]: [ 1279.158274] bootkube[5]: W1013 17:52:59.215488       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:53:04 0.packet.kube.arroyo.io bash[1033]: [ 1284.158153] bootkube[5]: W1013 17:53:04.215394       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.158057] bootkube[5]: W1013 17:53:09.215279       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.162558] bootkube[5]: W1013 17:53:09.218277       5 create.go:31] Unable to determine api-server readiness: API Server http status: 
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.163276] bootkube[5]: E1013 17:53:09.218328       5 create.go:56] API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.163847] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.164405] bootkube[5]: Tearing down temporary bootstrap control plane...
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.164935] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.165483] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.166117] bootkube[5]: API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 13 17:53:09 0.packet.kube.arroyo.io systemd[1]: Failed to start Bootstrap a Kubernetes cluster.
Oct 13 17:53:09 0.packet.kube.arroyo.io systemd[1]: bootkube.service: Unit entered failed state.
Oct 13 17:53:09 0.packet.kube.arroyo.io systemd[1]: bootkube.service: Failed with result 'exit-code'.

Containers

core@0 ~ $ rkt list
UUID            APP             IMAGE NAME                                      STATE   CREATED         STARTED         NETWORKS
1a664adf        etcd            quay.io/coreos/etcd:v3.1.8                      running 35 minutes ago  35 minutes ago
a5441ca9        bootkube        quay.io/coreos/bootkube:v0.6.2                  exited  34 minutes ago  34 minutes ago
c4e1036a        hyperkube       quay.io/coreos/hyperkube:v1.7.5_coreos.1        running 34 minutes ago  34 minutes ago
core@0 ~ $ docker ps
CONTAINER ID        IMAGE                                                                                              COMMAND                  CREATED             STATUS              PORTS               NAMES
3d57ba781a7f        quay.io/coreos/hyperkube@sha256:c51da5106803f4af64e8154392a68d6c2f84499f02b1a70ac3b34a9f555d0aca   "./hyperkube schedule"   34 minutes ago      Up 34 minutes                           k8s_kube-scheduler_bootstrap-kube-scheduler-0.packet.kube.arroyo.io_kube-system_fde0d90cd32b34a95fba6056f3730959_0
2f7a05067bdf        quay.io/coreos/hyperkube@sha256:c51da5106803f4af64e8154392a68d6c2f84499f02b1a70ac3b34a9f555d0aca   "./hyperkube controll"   34 minutes ago      Up 34 minutes                           k8s_kube-controller-manager_bootstrap-kube-controller-manager-0.packet.kube.arroyo.io_kube-system_145e3a1f8b8920882b6bdaf670d9e8cb_0
4d6f8f3af2d4        gcr.io/google_containers/pause-amd64:3.0                                                           "/pause"                 34 minutes ago      Up 34 minutes                           k8s_POD_bootstrap-kube-controller-manager-0.packet.kube.arroyo.io_kube-system_145e3a1f8b8920882b6bdaf670d9e8cb_0
7cac385e84fa        gcr.io/google_containers/pause-amd64:3.0                                                           "/pause"                 34 minutes ago      Up 34 minutes                           k8s_POD_bootstrap-kube-apiserver-0.packet.kube.arroyo.io_kube-system_8409b095d71b74fbfa1127eed6087304_0
27326293828e        gcr.io/google_containers/pause-amd64:3.0                                                           "/pause"                 34 minutes ago      Up 34 minutes                           k8s_POD_bootstrap-kube-scheduler-0.packet.kube.arroyo.io_kube-system_fde0d90cd32b34a95fba6056f3730959_0

I still have the cluster in this state, so I can report back any further information as necessary.

mmellison avatar Oct 13 '17 18:10 mmellison

I had the same issue with 1.7.5, and with the Terraform file as well: the 1.7.5 installer's Terraform file was not recognized and gave strange errors, so I had to use the 1.7.3 installer's Terraform file in the 1.7.5 location, but I still hit the same issue of bootkube exiting quickly.

rushins avatar Oct 16 '17 01:10 rushins

@seglberg I'm also having this problem. Were you able to solve it?

rlenferink avatar Oct 22 '17 19:10 rlenferink

Unfortunately no. I haven't had time to investigate further and ended up going back to a 1.7.3 release for now.

It definitely seems to be an issue with the multiple etcd instances failing to communicate with each other; I'm just not sure how to debug something like that in the Tectonic ecosystem.
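
One way to poke at it is to ask etcd directly for its health on a master, if etcdctl is available on the host. This is only a sketch: the endpoint and the certificate file names below are guesses (the Tectonic-provisioned certs live somewhere under /etc/ssl/etcd), so adjust them to whatever is actually on your nodes.

# etcd 3.1 defaults to the v2 API in etcdctl, so select the v3 API explicitly
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.crt \
  --cert=/etc/ssl/etcd/client.crt \
  --key=/etc/ssl/etcd/client.key \
  endpoint health

If the members cannot reach each other and the cluster has no quorum, this should report the endpoint as unhealthy, which at least confirms where the problem sits.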

mmellison avatar Oct 22 '17 23:10 mmellison

I was able to take another quick look this evening. It appears to be a race condition between the etcd-member service starting (and thus the etcd pod) and /etc/resolv.conf being populated by systemd-resolved.

Calling systemctl restart etcd-member on the master nodes seemed to correct the DNS issue, allowing bootkube to finish setting up the cluster. Hope this helps.
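
To see whether you are hitting the same race, you can check what resolv.conf the etcd pod actually got and then bounce the service. The UUID below comes from my rkt list output earlier in the thread (1a664adf); treat the rest as a sketch rather than an exact recipe.

# Inspect the resolv.conf inside the running etcd pod
sudo rkt enter 1a664adf cat /etc/resolv.conf

# If it is empty or missing a nameserver, restart etcd-member so the pod
# is re-created after systemd-resolved has written /etc/resolv.conf
sudo systemctl restart etcd-member
sudo systemctl status etcd-member --no-pager

Run the restart on every master; once the members can resolve each other, bootkube should be able to finish on its own.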

mmellison avatar Oct 24 '17 02:10 mmellison

To fix the race condition at the Terraform step, instead of restarting the service manually on all the master nodes, adding additional unit file directives seems to work:

--- a/tectonic/assets/platforms/metal/cl/bootkube-controller.yaml.tmpl
+++ b/tectonic/assets/platforms/metal/cl/bootkube-controller.yaml.tmpl
@@ -6,6 +6,9 @@ systemd:
       dropins:
         - name: 40-etcd-cluster.conf
           contents: |
+            [Unit]
+            Wants=network-online.target
+            After=network-online.target
             [Service]
             Environment="ETCD_IMAGE_TAG={{.etcd_image_tag}}"
             Environment="ETCD_NAME={{.etcd_name}}"

I'm not sure whether this is the proper place to do this, but it seems to work. The etcd containers now have a proper resolv.conf even after the node is restarted, whereas before they did not.
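
If you just want to apply the same ordering fix by hand on an already-provisioned node, rather than re-rendering the Terraform templates, a separate systemd drop-in should also do it. The file name below is arbitrary and this is untested on my side, so take it as a sketch:

sudo mkdir -p /etc/systemd/system/etcd-member.service.d
sudo tee /etc/systemd/system/etcd-member.service.d/10-wait-for-network.conf >/dev/null <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
sudo systemctl daemon-reload
sudo systemctl restart etcd-member

systemd merges drop-ins, so this adds the ordering without touching the existing 40-etcd-cluster.conf.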

mmellison avatar Oct 24 '17 12:10 mmellison

I will try both. Thanks for providing a solution for both the installer and the Terraform CLI!

rlenferink avatar Oct 24 '17 17:10 rlenferink

I tried the solutions but had no luck. In my setup I have 3 masters and 5 workers, and I selected the option in the Tectonic installer to provision etcd on the masters (controllers), but etcd keeps exiting.

Any clue what the issue could be?

Also, a question for everyone: has anyone been able to install a multi-master cluster with etcd provisioned on the controllers (master nodes)?

rushins avatar Oct 25 '17 04:10 rushins

We see exactly the same problem with version 1.7.5 on bare metal, but not on Azure.

fforootd avatar Oct 26 '17 17:10 fforootd

Having the exact same problem with 1.7.5 on AWS

allenpc avatar Oct 27 '17 23:10 allenpc

I just tried the solution for the Tectonic installer where I restarted etcd-member on the master node, but had no luck. In my setup I used 1 master node and 1 worker node, with etcd provisioned on the master node. My other setup used 1 master node, 1 worker node, and 1 node running etcd. Neither came up.

On my master node, the bootkube service is unable to start.

journalctl -xe is showing:

Oct 31 07:31:15 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:31:15.355812    3438 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https://node1.coreos.sax:443/api/v1/nodes?fieldSelector=metadata.name%3Dnode1.coreos.sax&resourceVersion=0: dial tcp 10.0.0.200:443: getsockopt: connection refused
Oct 31 07:31:15 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:31:15.358400    3438 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Service: Get https://node1.coreos.sax:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.200:443: getsockopt: connection refused
Oct 31 07:31:16 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:31:16.350778    3438 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://node1.coreos.sax:443/api/v1/pods?fieldSelector=spec.nodeName%3Dnode1.coreos.sax&resourceVersion=0: dial tcp 10.0.0.200:443: getsockopt: connection refused
............
............
Oct 31 07:35:26 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:35:26.210624    3438 kubelet.go:1607] Failed creating a mirror pod for "bootstrap-kube-apiserver-node1.coreos.sax_kube-system(1472dd3c3ae7409ef18710489f25180e)": Post https://node1.coreos.sax:443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.0.200:443: getsockopt: connection refused
Oct 31 07:35:26 node1.coreos.sax kubelet-wrapper[3438]: W1031 07:35:26.898581    3438 cni.go:189] Unable to update cni config: No networks found in /etc/kubernetes/cni/net.d
Oct 31 07:35:26 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:35:26.898726    3438 kubelet.go:2136] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

When executing systemctl start bootkube, systemctl status bootkube is showing:

● bootkube.service - Bootstrap a Kubernetes cluster
   Loaded: loaded (/etc/systemd/system/bootkube.service; disabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-10-31 07:23:31 UTC; 16min ago
 Main PID: 29128 (bash)
    Tasks: 2 (limit: 32768)
   Memory: 2.1M
      CPU: 90ms
   CGroup: /system.slice/bootkube.service
           ├─29128 /usr/bin/bash /opt/tectonic/bootkube.sh
           └─29131 stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/systemd-nspawn --boot --notify-ready=yes -Zsystem_u:system_r:svirt_lxc_net_t:s0:c766,c893 -Lsystem_u:object_r:svirt_lxc_file_t:

Oct 31 07:38:56 node1.coreos.sax bash[29128]: [57132.341369] bootkube[5]: W1031 07:38:56.738918       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:01 node1.coreos.sax bash[29128]: [57137.341286] bootkube[5]: W1031 07:39:01.738840       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:06 node1.coreos.sax bash[29128]: [57142.341200] bootkube[5]: W1031 07:39:06.738773       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:11 node1.coreos.sax bash[29128]: [57147.341342] bootkube[5]: W1031 07:39:11.738875       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:16 node1.coreos.sax bash[29128]: [57152.341373] bootkube[5]: W1031 07:39:16.738941       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:21 node1.coreos.sax bash[29128]: [57157.341490] bootkube[5]: W1031 07:39:21.739036       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:26 node1.coreos.sax bash[29128]: [57162.341276] bootkube[5]: W1031 07:39:26.738838       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:31 node1.coreos.sax bash[29128]: [57167.341245] bootkube[5]: W1031 07:39:31.738783       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:36 node1.coreos.sax bash[29128]: [57172.341457] bootkube[5]: W1031 07:39:36.739015       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:41 node1.coreos.sax bash[29128]: [57177.340953] bootkube[5]: W1031 07:39:41.738499       5 create.go:31] Unable to determine api-server readiness: API Server http status: 0

After a while, bootkube fails to start and systemctl status bootkube shows:

● bootkube.service - Bootstrap a Kubernetes cluster
   Loaded: loaded (/etc/systemd/system/bootkube.service; disabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2017-10-31 07:43:31 UTC; 4min 14s ago
  Process: 29128 ExecStart=/usr/bin/bash /opt/tectonic/bootkube.sh (code=exited, status=1/FAILURE)
 Main PID: 29128 (code=exited, status=1/FAILURE)
      CPU: 94ms

Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.343786] bootkube[5]: E1031 07:43:31.740684       5 create.go:56] API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.344339] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.344700] bootkube[5]: Tearing down temporary bootstrap control plane...
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.345077] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.345417] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.345785] bootkube[5]: API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 31 07:43:31 node1.coreos.sax systemd[1]: Failed to start Bootstrap a Kubernetes cluster.
Oct 31 07:43:31 node1.coreos.sax systemd[1]: bootkube.service: Unit entered failed state.
Oct 31 07:43:31 node1.coreos.sax systemd[1]: bootkube.service: Failed with result 'exit-code'.

When doing an ls on /etc/kubernetes/cni/net.d/ it appears this directory is empty. Can someone from CoreOS confirm this issue and help think of a solution? @yifan-gu @diegs

rlenferink avatar Oct 31 '17 08:10 rlenferink

In case this is helpful for others, I think my problem was that I was trying to deploy into a new, Terraform-managed VPC while configuring an existing internal DNS zone (tectonic_aws_external_private_zone).

So if you are having issues with bootkube not starting because the etcd cluster never successfully bootstraps, make sure you are either:

  1. Deploying into a new, Terraform-managed VPC and not specifying any tectonic_aws_external_private_zone
  2. Deploying into an existing VPC, and if specifying an existing private hosted zone, ensuring that the zone is already associated and accessible from within that VPC (I haven't tested this, but I assume it needs to be true; see the sketch below)
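
For case 2, associating an existing private hosted zone with the VPC can be done with the AWS CLI. The zone ID, VPC ID and region below are placeholders, and this is a sketch of the idea rather than something I ran against this cluster:

aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z1EXAMPLE \
  --vpc VPCRegion=us-west-2,VPCId=vpc-0123456789abcdef0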

allenpc avatar Nov 01 '17 00:11 allenpc

Coming back to my previous comment: I started from scratch using the Tectonic installer with a 1 master and 1 worker setup, where the master is provisioned with etcd. The first time I ran this setup, Tectonic hung at "Starting Tectonic", whereas the second time I ran the installer Tectonic started successfully and I am now able to reach the Tectonic console.

rlenferink avatar Nov 01 '17 07:11 rlenferink

@seglberg: no luck. As you suggested, I ran "systemctl restart etcd-member" on the masters. In my setup I had 3 masters and 3 workers; etcd exited on all 3 masters, and after running "systemctl restart etcd-member" it complains that /etc/ssl/etcd is not a directory or file.

I tried all your steps with no luck, and I see the same issue even in 1.7.3.

It seems the Tectonic installer doesn't support a multi-controller (multi-master) setup. For me, 1 master with 9 workers works every time with no issues, but 3 masters (a typical multi-master setup) never works. Can someone from Tectonic take a look and help us?

rushins avatar Nov 09 '17 04:11 rushins

As per the logs above, I am experiencing the same issue.

yuko11 avatar Jan 19 '18 15:01 yuko11

https://github.com/coreos/tectonic-installer/issues/2129#issuecomment-340948478 is basically spot on.

The issue is that with tectonic_aws_external_private_zone set, the existing zone isn't configured to be associated with the VPC that gets created, meaning EC2 instances in the VPC cannot resolve DNS entries in the private hosted zone. Because the etcd nodes require DNS records in that private hosted zone, they never bootstrap: they cannot resolve their configured hostnames.
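
A quick way to verify this is to check which VPCs the private hosted zone is actually attached to; the zone ID below is a placeholder:

aws route53 get-hosted-zone --id Z1EXAMPLE --query 'VPCs'

If the VPC the installer created is not in that list, the etcd hostnames will not resolve from inside it.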

chancez avatar Feb 09 '18 00:02 chancez

Also specifically mentioned here: https://github.com/coreos/tectonic-installer/issues/1728#issuecomment-323483580

chancez avatar Feb 09 '18 00:02 chancez