
[AWS | Terraform] Can't access Kubernetes API server on control nodes.

[Open] FabianSchurig opened this issue 6 months ago • 2 comments

There seems to be a problem with Talos startup and the Kubernetes API server when deployed on AWS through Terraform.

I also noticed that the LoadBalancer's target instances appear unhealthy, probably as a consequence of this?
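
For reference, the target health can be checked with the AWS CLI; a rough sketch, assuming the API server sits behind an NLB/ALB managed via elbv2 and that its DNS name contains "talos-k8s-api" (for a classic ELB, aws elb describe-instance-health would be the equivalent):

# Find the API server load balancer and inspect its target health.
LB_ARN=$(aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?contains(DNSName, 'talos-k8s-api')].LoadBalancerArn | [0]" \
  --output text)
TG_ARN=$(aws elbv2 describe-target-groups --load-balancer-arn "$LB_ARN" \
  --query "TargetGroups[0].TargetGroupArn" --output text)
aws elbv2 describe-target-health --target-group-arn "$TG_ARN"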

This is my terraform.tfvars:

cluster_name = "talos"
vpc_cidr = "172.16.0.0/16"
kubernetes_api_allowed_cidr = "0.0.0.0/0"  # Restrict this in production
talos_api_allowed_cidr = "0.0.0.0/0"       # Restrict this in production

# Optional: Enable AWS Cloud Controller Manager
ccm = true

# Optional: Configure control plane nodes
control_plane = {
  instance_type = "c5.large"
  num_instances = 3
}

# Optional: Configure worker nodes
worker_groups = [
  {
    name = "default-workers"
    instance_type = "c5.large"
    num_instances = 2
  }
]

# Optional: Add any config patch files
config_patch_files = []

# Optional: Add extra tags
extra_tags = {
  Environment = "Development"
}
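
For context, the deployment itself is just the usual Terraform workflow, roughly as follows (the output names in the last two lines are an assumption; check the example's outputs.tf):

# terraform.tfvars in the working directory is picked up automatically.
terraform init
terraform plan
terraform apply

# Pull the generated client configs from the Terraform outputs;
# the output names here are an assumption.
terraform output -raw talosconfig > talosconfig
terraform output -raw kubeconfig > kubeconfig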

support.zip


discovered nodes: ["172.16.188.85" "172.16.60.34" "172.16.68.50" "172.16.6.102" "172.16.70.52"]
waiting for etcd to be healthy: ...
waiting for etcd to be healthy: OK
waiting for etcd members to be consistent across nodes: ...
waiting for etcd members to be consistent across nodes: OK
waiting for etcd members to be control plane nodes: ...
waiting for etcd members to be control plane nodes: OK
waiting for apid to be ready: ...
waiting for apid to be ready: OK
waiting for all nodes memory sizes: ...
waiting for all nodes memory sizes: OK
waiting for all nodes disk sizes: ...
waiting for all nodes disk sizes: OK
waiting for no diagnostics: ...
waiting for no diagnostics: OK
waiting for kubelet to be healthy: ...
waiting for kubelet to be healthy: OK
waiting for all nodes to finish boot sequence: ...
waiting for all nodes to finish boot sequence: OK
waiting for all k8s nodes to report: ...
waiting for all k8s nodes to report: Get "https://talos-k8s-api-**********.us-east-1.elb.amazonaws.com/api/v1/nodes": EOF
talosctl --nodes ********* version
Client:
        Tag:         v1.9.5
        SHA:         d07f6daa
        Built:       
        Go version:  go1.23.7
        OS/Arch:     linux/amd64
Server:
        NODE:        **********
        Tag:         v1.9.5
        SHA:         d07f6daa
        Built:       
        Go version:  go1.23.7
        OS/Arch:     linux/amd64
        Enabled:     RBAC

Am I missing anything in my configuration, or is something missing from the Terraform config itself? Why doesn't the API server start correctly?

Based on the logs, the key issue is that the kube-apiserver is unable to start, and everything else then fails with connectivity errors against it. The main error patterns are:

  1. The kubelet is repeatedly trying to connect to the API server at https://127.0.0.1:7445 but consistently getting EOF errors
  2. The kubelet can't register the node with the API server: "Unable to register node with API server","err":"Post \"https://127.0.0.1:7445/api/v1/nodes\": EOF"
  3. Certificate-related issues: "Failed while requesting a signed certificate from the control plane","err":"cannot create certificate signing request: Post \"https://127.0.0.1:7445/apis/certificates.k8s.io/v1/certificatesigningrequests\": EOF"
  4. The kube-apiserver container is in a CrashLoopBackOff state: "Error syncing pod, skipping","pod":{"name":"kube-apiserver-ip-172-16-119-78.ec2.internal","namespace":"kube-system"},"podUID":"32824c45ee9b9a9eb1524c46093e02da","err":"failed to \"StartContainer\" for \"kube-apiserver\" with CrashLoopBackOff"

The EOF errors specifically indicate that connections to the API server are being reset or closed unexpectedly, suggesting either that the API server isn't running or that there is a network/connectivity issue between the kubelet and the API server endpoint.
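
Since 127.0.0.1:7445 is Talos's local KubePrism endpoint, which forwards to the API servers, the EOFs there point at kube-apiserver itself rather than at the AWS load balancer. The crash-looping container can be inspected directly on a control plane node; a rough sketch, with the node IP and container ID as placeholders:

# List the Kubernetes (CRI) containers on a control plane node and note the
# kube-apiserver entry and its restart count.
talosctl --nodes <control-plane-ip> containers -k

# Fetch the crash-looping kube-apiserver container's logs; <container-id>
# comes from the previous command's output.
talosctl --nodes <control-plane-ip> logs -k <container-id>

# Check the control plane static pod status as Talos tracks it.
talosctl --nodes <control-plane-ip> get staticpodstatus

# Probe the API server directly on the node, bypassing the ELB
# (-k because the cluster CA is not in the local trust store).
curl -k https://<control-plane-ip>:6443/version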

FabianSchurig · Apr 23 '25 15:04