tectonic-installer
Bare Metal Cluster Never Starts using 1.7.5-tectonic.1 Installer
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
- Tectonic version (release or commit hash): 1.7.5-tectonic.1
- Terraform version (terraform version): v0.10.4 (bundled with installer)
- Platform (aws|azure|openstack|metal): Metal
What happened?
I am able to run Terraform successfully; however, the cluster never fully comes up. It appears the apiserver is not starting correctly, possibly because of issues with etcd. I am still learning how everything works, so I am not sure whether the two are related. See the log snippets at the end of this post.
The cluster starts correctly using the installers from the 1.7.3 releases, but even after three attempts with the 1.7.5-tectonic.1 installer, I am unable to get a cluster going.
What you expected to happen?
The apiserver (and thus the cluster) to become available.
How to reproduce it (as minimally and precisely as possible)?
- Two Master Nodes
- One Worker Node
- Provisioned Etcd
Anything else we need to know?
See log snippets below.
Etcd
Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.382731 W | rafthttp: health check for peer 5d62e2d0c21c6423 could not connect: dial tcp: lookup 1.packet.kube.arroyo.io on [::1]:53: read udp [::1]:43632->[::1]:53: read: connection refused
Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.450686 I | raft: 8b84d3e5347e393e is starting a new election at term 1404
Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.450730 I | raft: 8b84d3e5347e393e became candidate at term 1405
Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.450754 I | raft: 8b84d3e5347e393e received MsgVoteResp from 8b84d3e5347e393e at term 1405
Oct 13 18:05:52 0.packet.kube.arroyo.io etcd-wrapper[772]: 2017-10-13 18:05:52.450781 I | raft: 8b84d3e5347e393e [logterm: 78, index: 10] sent MsgVote request to 5d62e2d0c21c6423 at term 1405
Kubelet
Oct 13 18:03:34 0.packet.kube.arroyo.io kubelet-wrapper[847]: I1013 18:03:34.606653 847 kubelet_node_status.go:247] Setting node annotation to enable volume controller attach/detach
Oct 13 18:03:34 0.packet.kube.arroyo.io kubelet-wrapper[847]: I1013 18:03:34.609618 847 kubelet_node_status.go:82] Attempting to register node 0.packet.kube.arroyo.io
Oct 13 18:03:34 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:34.611012 847 kubelet_node_status.go:106] Unable to register node "0.packet.kube.arroyo.io" with API server: Post https://m.kube.arroyo.io:443/api/v1/nodes: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:35 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:35.417229 847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://m.kube.arroyo.io:443/api/v1/pods?fieldSelector=spec.nodeName%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:35 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:35.418403 847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https://m.kube.arroyo.io:443/api/v1/nodes?fieldSelector=metadata.name%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:35 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:35.418914 847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Service: Get https://m.kube.arroyo.io:443/api/v1/services?resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused
Oct 13 18:03:36 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:36.418899 847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://m.kube.arroyo.io:443/api/v1/pods?fieldSelector=spec.nodeName%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:36 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:36.419976 847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https://m.kube.arroyo.io:443/api/v1/nodes?fieldSelector=metadata.name%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused
Oct 13 18:03:36 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:36.421421 847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Service: Get https://m.kube.arroyo.io:443/api/v1/services?resourceVersion=0: dial tcp 147.75.77.219:443: getsockopt: connection refused
Oct 13 18:03:36 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:36.642093 847 eviction_manager.go:238] eviction manager: unexpected err: failed GetNode: node '0.packet.kube.arroyo.io' not found
Oct 13 18:03:37 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:37.420655 847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://m.kube.arroyo.io:443/api/v1/pods?fieldSelector=spec.nodeName%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused
Oct 13 18:03:37 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:37.421686 847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https://m.kube.arroyo.io:443/api/v1/nodes?fieldSelector=metadata.name%3D0.packet.kube.arroyo.io&resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused
Oct 13 18:03:37 0.packet.kube.arroyo.io kubelet-wrapper[847]: E1013 18:03:37.423432 847 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Service: Get https://m.kube.arroyo.io:443/api/v1/services?resourceVersion=0: dial tcp 147.75.76.133:443: getsockopt: connection refused
I continually see the same messages repeating, even after roughly an hour. The same messages appear on both master nodes.
Bootkube
Oct 13 17:52:19 0.packet.kube.arroyo.io bash[1033]: [ 1239.158042] bootkube[5]: W1013 17:52:19.215269 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:52:24 0.packet.kube.arroyo.io bash[1033]: [ 1244.157950] bootkube[5]: W1013 17:52:24.215174 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:52:29 0.packet.kube.arroyo.io bash[1033]: [ 1249.158826] bootkube[5]: W1013 17:52:29.216045 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:52:34 0.packet.kube.arroyo.io bash[1033]: [ 1254.157924] bootkube[5]: W1013 17:52:34.215142 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:52:39 0.packet.kube.arroyo.io bash[1033]: [ 1259.158363] bootkube[5]: W1013 17:52:39.215522 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:52:44 0.packet.kube.arroyo.io bash[1033]: [ 1264.159427] bootkube[5]: W1013 17:52:44.216659 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:52:49 0.packet.kube.arroyo.io bash[1033]: [ 1269.158017] bootkube[5]: W1013 17:52:49.215207 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:52:54 0.packet.kube.arroyo.io bash[1033]: [ 1274.158081] bootkube[5]: W1013 17:52:54.215280 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:52:59 0.packet.kube.arroyo.io bash[1033]: [ 1279.158274] bootkube[5]: W1013 17:52:59.215488 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:53:04 0.packet.kube.arroyo.io bash[1033]: [ 1284.158153] bootkube[5]: W1013 17:53:04.215394 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.158057] bootkube[5]: W1013 17:53:09.215279 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.162558] bootkube[5]: W1013 17:53:09.218277 5 create.go:31] Unable to determine api-server readiness: API Server http status:
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.163276] bootkube[5]: E1013 17:53:09.218328 5 create.go:56] API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.163847] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.164405] bootkube[5]: Tearing down temporary bootstrap control plane...
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.164935] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.165483] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io bash[1033]: [ 1289.166117] bootkube[5]: API Server is not ready: timed out waiting for the condition
Oct 13 17:53:09 0.packet.kube.arroyo.io systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 13 17:53:09 0.packet.kube.arroyo.io systemd[1]: Failed to start Bootstrap a Kubernetes cluster.
Oct 13 17:53:09 0.packet.kube.arroyo.io systemd[1]: bootkube.service: Unit entered failed state.
Oct 13 17:53:09 0.packet.kube.arroyo.io systemd[1]: bootkube.service: Failed with result 'exit-code'.
Containers
core@0 ~ $ rkt list
UUID APP IMAGE NAME STATE CREATED STARTED NETWORKS
1a664adf etcd quay.io/coreos/etcd:v3.1.8 running 35 minutes ago 35 minutes ago
a5441ca9 bootkube quay.io/coreos/bootkube:v0.6.2 exited 34 minutes ago 34 minutes ago
c4e1036a hyperkube quay.io/coreos/hyperkube:v1.7.5_coreos.1 running 34 minutes ago 34 minutes ago
core@0 ~ $ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3d57ba781a7f quay.io/coreos/hyperkube@sha256:c51da5106803f4af64e8154392a68d6c2f84499f02b1a70ac3b34a9f555d0aca "./hyperkube schedule" 34 minutes ago Up 34 minutes k8s_kube-scheduler_bootstrap-kube-scheduler-0.packet.kube.arroyo.io_kube-system_fde0d90cd32b34a95fba6056f3730959_0
2f7a05067bdf quay.io/coreos/hyperkube@sha256:c51da5106803f4af64e8154392a68d6c2f84499f02b1a70ac3b34a9f555d0aca "./hyperkube controll" 34 minutes ago Up 34 minutes k8s_kube-controller-manager_bootstrap-kube-controller-manager-0.packet.kube.arroyo.io_kube-system_145e3a1f8b8920882b6bdaf670d9e8cb_0
4d6f8f3af2d4 gcr.io/google_containers/pause-amd64:3.0 "/pause" 34 minutes ago Up 34 minutes k8s_POD_bootstrap-kube-controller-manager-0.packet.kube.arroyo.io_kube-system_145e3a1f8b8920882b6bdaf670d9e8cb_0
7cac385e84fa gcr.io/google_containers/pause-amd64:3.0 "/pause" 34 minutes ago Up 34 minutes k8s_POD_bootstrap-kube-apiserver-0.packet.kube.arroyo.io_kube-system_8409b095d71b74fbfa1127eed6087304_0
27326293828e gcr.io/google_containers/pause-amd64:3.0 "/pause" 34 minutes ago Up 34 minutes k8s_POD_bootstrap-kube-scheduler-0.packet.kube.arroyo.io_kube-system_fde0d90cd32b34a95fba6056f3730959_0
I still have the cluster in this state, so I can report back any further information as necessary.
I had the same issue with 1.7.5, and with its Terraform file as well: the 1.7.5 installer's Terraform file was not recognized and gave odd errors, so I used the 1.7.3 installer's Terraform file in the 1.7.5 location, but I hit the same problem, with bootkube exiting quickly.
@seglberg I'm also having this problem. Were you able to solve it?
Unfortunately no. I haven't had time to investigate further and ended up going back to a 1.7.3 release for now.
It definitely seems to be an issue with the multiple etcd instances failing to communicate with each other; I'm just not sure how to debug something like that in the Tectonic ecosystem.
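Something like the following might help inspect peer health on a master node (a sketch only; the certificate file names under /etc/ssl/etcd and the client port are assumptions and may differ on your cluster):
# Recent etcd-member logs; look for rafthttp / DNS lookup errors
journalctl -u etcd-member -n 50 --no-pager
# If etcdctl is available on the host or inside the etcd container:
ETCDCTL_API=3 etcdctl \
  --endpoints=https://0.packet.kube.arroyo.io:2379 \
  --cacert=/etc/ssl/etcd/ca.crt \
  --cert=/etc/ssl/etcd/client.crt \
  --key=/etc/ssl/etcd/client.key \
  endpoint health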
I was able to take another quick look this evening. It appears to be a race condition between the etcd-member service starting (and thus the etcd pod) and /etc/resolv.conf being populated by systemd-resolved.
Calling systemctl restart etcd-member on the master nodes seemed to correct the DNS issue, allowing bootkube to finish setting up the cluster. Hope this helps.
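For reference, the manual workaround in shell form; a rough sketch assuming SSH access as the core user and the two master hostnames from this report, so substitute your own node names:
for node in 0.packet.kube.arroyo.io 1.packet.kube.arroyo.io; do
  ssh core@"${node}" 'sudo systemctl restart etcd-member'
done
# Afterwards /etc/resolv.conf should be populated inside the etcd container and the
# rafthttp "could not connect" warnings in journalctl -u etcd-member should stop.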
To fix the race condition for the Terraform step, instead of restarting the service manually on all of the master nodes, adding additional unit file directives seems to work:
--- a/tectonic/assets/platforms/metal/cl/bootkube-controller.yaml.tmpl
+++ b/tectonic/assets/platforms/metal/cl/bootkube-controller.yaml.tmpl
@@ -6,6 +6,9 @@ systemd:
dropins:
- name: 40-etcd-cluster.conf
contents: |
+ [Unit]
+ Wants=network-online.target
+ After=network-online.target
[Service]
Environment="ETCD_IMAGE_TAG={{.etcd_image_tag}}"
Environment="ETCD_NAME={{.etcd_name}}"
I'm unsure whether this is the proper place to do this, but it seems to work. The etcd containers now have a properly populated resolv.conf even after restarting the node, whereas before they did not.
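To confirm the extra ordering directives were picked up after provisioning, something along these lines should show network-online.target in the merged unit (a sketch, not Tectonic-specific tooling):
sudo systemctl daemon-reload
systemctl cat etcd-member                        # the 40-etcd-cluster.conf drop-in should now contain the [Unit] ordering
systemctl show etcd-member -p Wants -p After     # both should list network-online.target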
I will try both. Thanks for providing a solution for both the installer and the Terraform CLI!
I tried the solutions, but no luck. In my setup I had 3 masters and 5 workers, and I selected the option to provision etcd on the masters (the controllers option in the Tectonic installer), but etcd keeps exiting.
Any clue what the issue could be?
Also, a question for everyone: has anyone been able to install a multi-master cluster with etcd provisioned on the controllers (master nodes)?
We see exactly the same problem with version 1.7.5 on bare metal, but not on Azure.
Having the exact same problem with 1.7.5 on AWS
I just tried the solution for the Tectonic installer where I restarted etcd-member on the master node, but no luck for me. In my setup I used 1 master node and 1 worker node with etcd provisioned on the master node. My other setup used 1 master node, 1 worker node, and 1 node running etcd. Neither got up and running.
On my master node, the bootkube service is unable to start. journalctl -xe shows:
Oct 31 07:31:15 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:31:15.355812 3438 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:408: Failed to list *v1.Node: Get https://node1.coreos.sax:443/api/v1/nodes?fieldSelector=metadata.name%3Dnode1.coreos.sax&resourceVersion=0: dial tcp 10.0.0.200:443: getsockopt: connection refused
Oct 31 07:31:15 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:31:15.358400 3438 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/kubelet.go:400: Failed to list *v1.Service: Get https://node1.coreos.sax:443/api/v1/services?resourceVersion=0: dial tcp 10.0.0.200:443: getsockopt: connection refused
Oct 31 07:31:16 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:31:16.350778 3438 reflector.go:190] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://node1.coreos.sax:443/api/v1/pods?fieldSelector=spec.nodeName%3Dnode1.coreos.sax&resourceVersion=0: dial tcp 10.0.0.200:443: getsockopt: connection refused
............
............
Oct 31 07:35:26 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:35:26.210624 3438 kubelet.go:1607] Failed creating a mirror pod for "bootstrap-kube-apiserver-node1.coreos.sax_kube-system(1472dd3c3ae7409ef18710489f25180e)": Post https://node1.coreos.sax:443/api/v1/namespaces/kube-system/pods: dial tcp 10.0.0.200:443: getsockopt: connection refused
Oct 31 07:35:26 node1.coreos.sax kubelet-wrapper[3438]: W1031 07:35:26.898581 3438 cni.go:189] Unable to update cni config: No networks found in /etc/kubernetes/cni/net.d
Oct 31 07:35:26 node1.coreos.sax kubelet-wrapper[3438]: E1031 07:35:26.898726 3438 kubelet.go:2136] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
When executing systemctl start bootkube, systemctl status bootkube shows:
● bootkube.service - Bootstrap a Kubernetes cluster
Loaded: loaded (/etc/systemd/system/bootkube.service; disabled; vendor preset: disabled)
Active: activating (start) since Tue 2017-10-31 07:23:31 UTC; 16min ago
Main PID: 29128 (bash)
Tasks: 2 (limit: 32768)
Memory: 2.1M
CPU: 90ms
CGroup: /system.slice/bootkube.service
├─29128 /usr/bin/bash /opt/tectonic/bootkube.sh
└─29131 stage1/rootfs/usr/lib/ld-linux-x86-64.so.2 stage1/rootfs/usr/bin/systemd-nspawn --boot --notify-ready=yes -Zsystem_u:system_r:svirt_lxc_net_t:s0:c766,c893 -Lsystem_u:object_r:svirt_lxc_file_t:
Oct 31 07:38:56 node1.coreos.sax bash[29128]: [57132.341369] bootkube[5]: W1031 07:38:56.738918 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:01 node1.coreos.sax bash[29128]: [57137.341286] bootkube[5]: W1031 07:39:01.738840 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:06 node1.coreos.sax bash[29128]: [57142.341200] bootkube[5]: W1031 07:39:06.738773 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:11 node1.coreos.sax bash[29128]: [57147.341342] bootkube[5]: W1031 07:39:11.738875 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:16 node1.coreos.sax bash[29128]: [57152.341373] bootkube[5]: W1031 07:39:16.738941 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:21 node1.coreos.sax bash[29128]: [57157.341490] bootkube[5]: W1031 07:39:21.739036 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:26 node1.coreos.sax bash[29128]: [57162.341276] bootkube[5]: W1031 07:39:26.738838 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:31 node1.coreos.sax bash[29128]: [57167.341245] bootkube[5]: W1031 07:39:31.738783 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:36 node1.coreos.sax bash[29128]: [57172.341457] bootkube[5]: W1031 07:39:36.739015 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
Oct 31 07:39:41 node1.coreos.sax bash[29128]: [57177.340953] bootkube[5]: W1031 07:39:41.738499 5 create.go:31] Unable to determine api-server readiness: API Server http status: 0
After a while, bootkube fails to start and systemctl status bootkube shows:
● bootkube.service - Bootstrap a Kubernetes cluster
Loaded: loaded (/etc/systemd/system/bootkube.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2017-10-31 07:43:31 UTC; 4min 14s ago
Process: 29128 ExecStart=/usr/bin/bash /opt/tectonic/bootkube.sh (code=exited, status=1/FAILURE)
Main PID: 29128 (code=exited, status=1/FAILURE)
CPU: 94ms
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.343786] bootkube[5]: E1031 07:43:31.740684 5 create.go:56] API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.344339] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.344700] bootkube[5]: Tearing down temporary bootstrap control plane...
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.345077] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.345417] bootkube[5]: Error: API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax bash[29128]: [57407.345785] bootkube[5]: API Server is not ready: timed out waiting for the condition
Oct 31 07:43:31 node1.coreos.sax systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Oct 31 07:43:31 node1.coreos.sax systemd[1]: Failed to start Bootstrap a Kubernetes cluster.
Oct 31 07:43:31 node1.coreos.sax systemd[1]: bootkube.service: Unit entered failed state.
Oct 31 07:43:31 node1.coreos.sax systemd[1]: bootkube.service: Failed with result 'exit-code'.
When doing an ls on /etc/kubernetes/cni/net.d/, it appears this directory is empty. Can someone from CoreOS confirm this issue and help think through a solution? @yifan-gu @diegs
In case this is helpful for others, I think my problem was that I was trying to deploy into a new, Terraform-managed VPC while configuring an existing internal DNS zone (tectonic_aws_external_private_zone).
So if you are having issues with bootkube not starting because the etcd cluster never successfully bootstraps, make sure you are either:
- Deploying into a new, Terraform-managed VPC and not specifying any tectonic_aws_external_private_zone, or
- Deploying into an existing VPC and, if specifying an existing private hosted zone, ensuring that the zone is already associated with and accessible from within that VPC (I haven't tested this, but I assume it needs to be true); see the sketch after this list.
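As a hedged sketch of the second option, an existing private hosted zone can be associated with a VPC via the AWS CLI; the zone ID, region, and VPC ID below are placeholders:
aws route53 associate-vpc-with-hosted-zone \
  --hosted-zone-id Z1EXAMPLE \
  --vpc VPCRegion=us-east-1,VPCId=vpc-0123456789abcdef0
# Verify the association afterwards:
aws route53 get-hosted-zone --id Z1EXAMPLE --query 'VPCs'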
Coming back to my previous comment. I started from scratch with the Tectonic installer using a 1 master and 1 worker setup where the master is provisioned with etcd. The first time I ran this setup, Tectonic hung at "Starting Tectonic"; the second time I ran the installer, Tectonic started successfully and I'm now able to reach the Tectonic Console.
@seglberg: no luck. As you suggested, I ran "systemctl restart etcd-member" on the masters. In my setup I had 3 masters and 3 workers; etcd exited on all 3 masters, and after running "systemctl restart etcd-member" it complains that /etc/ssl/etcd is not a directory or file.
I tried all of your steps, but no luck; I see the same issue even with 1.7.3.
It seems the Tectonic installer doesn't support a multi-controller (multi-master) setup. For me, 1 master with 9 workers works every time with no issues, but a 3-master (typical multi-master) setup never works. Can someone from Tectonic take a look and help us?
As per the logs above, I experience the same issue.
https://github.com/coreos/tectonic-installer/issues/2129#issuecomment-340948478 is basically spot on.
The issue is that, with tectonic_aws_external_private_zone set, the existing zone is not associated with the newly created VPC, meaning EC2 instances in the VPC cannot resolve DNS entries in the private hosted zone. Because the etcd nodes require DNS records in the private hosted zone, they never bootstrap: they cannot resolve their configured hostnames.
Also specifically mentioned here: https://github.com/coreos/tectonic-installer/issues/1728#issuecomment-323483580
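A quick way to confirm this from inside the VPC (a sketch; the record name below is a placeholder, the real names come from the installer's etcd DNS entries):
dig +short mycluster-etcd-0.example.internal
# An empty answer means the private hosted zone is not associated with (or not
# resolvable from) this VPC, and etcd will never bootstrap.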