
Flux Bootstrap Failing on Private EKS Cluster

Open preston-willis opened this issue 1 year ago • 0 comments

Describe the bug

Flux bootstrap fails to reconcile and every controller is stuck in a termination loop. flux logs, flux events, and kubectl logs all return nothing, and flux check hangs. All kube-system pods are healthy, yet every Flux controller pod exits with code 2.
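
For reference, a sketch of the diagnostics that come back empty or hang (the helm-controller deployment is used as one example; the same holds for the other controllers):

flux logs --all-namespaces                           # returns nothing
flux events                                          # returns nothing
kubectl logs -n flux-system deploy/helm-controller   # returns nothing
flux check                                           # hangs after "checking controllers"
# Since the containers restart, the pre-restart logs are also worth capturing:
kubectl logs -n flux-system deploy/helm-controller --previous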

kubectl describe pods -n flux-system:

ubuntu@ip-10-1-159-28:~$ kubectl describe pods -n flux-system
Name:                 helm-controller-7c8b698656-k4f4f
Namespace:            flux-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      helm-controller
Node:                 ip-10-1-147-215.us-east-2.compute.internal/10.1.147.215
Start Time:           Mon, 09 Oct 2023 09:36:57 +0000
Labels:               app=helm-controller
                      pod-template-hash=7c8b698656
Annotations:          prometheus.io/port: 8080
                      prometheus.io/scrape: true
Status:               Running
IP:                   10.1.159.4
IPs:
  IP:           10.1.159.4
Controlled By:  ReplicaSet/helm-controller-7c8b698656
Containers:
  manager:
    Container ID:    containerd://b6de55f6dbf193e86bfc502062651766e71e050940eb54c5f4d7e3f6632eba5a
    Image:           ghcr.io/fluxcd/helm-controller:v0.36.1
    Image ID:        ghcr.io/fluxcd/helm-controller@sha256:0378fd84ed0ef430414e0ac5bd79cdc03899ba787c22561e474650498d231ca6
    Ports:           8080/TCP, 9440/TCP
    Host Ports:      0/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    Args:
      --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      --watch-all-namespaces=true
      --log-level=info
      --log-encoding=json
      --enable-leader-election
    State:          Running
      Started:      Mon, 09 Oct 2023 09:37:28 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 09 Oct 2023 09:36:59 +0000
      Finished:     Mon, 09 Oct 2023 09:37:27 +0000
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   64Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  flux-system (v1:metadata.namespace)
    Mounts:
      /tmp from temp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jsv7n (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-jsv7n:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  50s                 default-scheduler  Successfully assigned flux-system/helm-controller-7c8b698656-k4f4f to ip-10-1-147-215.us-east-2.compute.internal
  Normal   Pulled     20s (x2 over 49s)   kubelet            Container image "ghcr.io/fluxcd/helm-controller:v0.36.1" already present on machine
  Normal   Created    20s (x2 over 48s)   kubelet            Created container manager
  Normal   Killing    20s                 kubelet            Container manager failed liveness probe, will be restarted
  Normal   Started    19s (x2 over 48s)   kubelet            Started container manager
  Warning  Unhealthy  10s (x10 over 48s)  kubelet            Readiness probe failed: Get "http://10.1.159.4:9440/readyz": dial tcp 10.1.159.4:9440: connect: connection refused
  Warning  Unhealthy  10s (x4 over 40s)   kubelet            Liveness probe failed: Get "http://10.1.159.4:9440/healthz": dial tcp 10.1.159.4:9440: connect: connection refused


Name:                 kustomize-controller-858996fc8d-ls654
Namespace:            flux-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      kustomize-controller
Node:                 ip-10-1-147-215.us-east-2.compute.internal/10.1.147.215
Start Time:           Mon, 09 Oct 2023 09:36:57 +0000
Labels:               app=kustomize-controller
                      pod-template-hash=858996fc8d
Annotations:          prometheus.io/port: 8080
                      prometheus.io/scrape: true
Status:               Running
IP:                   10.1.147.195
IPs:
  IP:           10.1.147.195
Controlled By:  ReplicaSet/kustomize-controller-858996fc8d
Containers:
  manager:
    Container ID:    containerd://42385ea048ac8ee9368cdd0bf340269cd561cc4d0b274fc866cdb951d6152131
    Image:           ghcr.io/fluxcd/kustomize-controller:v1.1.0
    Image ID:        ghcr.io/fluxcd/kustomize-controller@sha256:1f7380a0c7871a7149ca67fb1ba20865566f8f381d14cec2e5ac6af40d96ca55
    Ports:           8080/TCP, 9440/TCP
    Host Ports:      0/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    Args:
      --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      --watch-all-namespaces=true
      --log-level=info
      --log-encoding=json
      --enable-leader-election
    State:          Running
      Started:      Mon, 09 Oct 2023 09:36:59 +0000
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   64Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  flux-system (v1:metadata.namespace)
    Mounts:
      /tmp from temp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qdzkz (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-qdzkz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  50s                default-scheduler  Successfully assigned flux-system/kustomize-controller-858996fc8d-ls654 to ip-10-1-147-215.us-east-2.compute.internal
  Normal   Pulled     49s                kubelet            Container image "ghcr.io/fluxcd/kustomize-controller:v1.1.0" already present on machine
  Normal   Created    49s                kubelet            Created container manager
  Normal   Started    48s                kubelet            Started container manager
  Warning  Unhealthy  20s (x7 over 48s)  kubelet            Readiness probe failed: Get "http://10.1.147.195:9440/readyz": dial tcp 10.1.147.195:9440: connect: connection refused
  Warning  Unhealthy  20s (x3 over 40s)  kubelet            Liveness probe failed: Get "http://10.1.147.195:9440/healthz": dial tcp 10.1.147.195:9440: connect: connection refused
  Normal   Killing    20s                kubelet            Container manager failed liveness probe, will be restarted
  Warning  Unhealthy  9s                 kubelet            Readiness probe failed: Get "http://10.1.147.195:9440/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)


Name:             notification-controller-ddf44665d-sbl96
Namespace:        flux-system
Priority:         0
Service Account:  notification-controller
Node:             ip-10-1-147-215.us-east-2.compute.internal/10.1.147.215
Start Time:       Mon, 09 Oct 2023 09:36:57 +0000
Labels:           app=notification-controller
                  pod-template-hash=ddf44665d
Annotations:      prometheus.io/port: 8080
                  prometheus.io/scrape: true
Status:           Running
IP:               10.1.144.245
IPs:
  IP:           10.1.144.245
Controlled By:  ReplicaSet/notification-controller-ddf44665d
Containers:
  manager:
    Container ID:    containerd://e3d519db6b0c9f1f64cb4e843c626a82bb28363943ccb5a69a32a634b08daa2f
    Image:           ghcr.io/fluxcd/notification-controller:v1.1.0
    Image ID:        ghcr.io/fluxcd/notification-controller@sha256:21bd40a9856d0faba9d769b7bc9b6153edf26a8e6ef2d1b7c3730c35e1942213
    Ports:           9090/TCP, 9292/TCP, 8080/TCP, 9440/TCP
    Host Ports:      0/TCP, 0/TCP, 0/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    Args:
      --watch-all-namespaces=true
      --log-level=info
      --log-encoding=json
      --enable-leader-election
    State:          Running
      Started:      Mon, 09 Oct 2023 09:37:28 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 09 Oct 2023 09:36:59 +0000
      Finished:     Mon, 09 Oct 2023 09:37:27 +0000
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   64Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  flux-system (v1:metadata.namespace)
    Mounts:
      /tmp from temp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qb5r2 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  temp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-qb5r2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  50s                default-scheduler  Successfully assigned flux-system/notification-controller-ddf44665d-sbl96 to ip-10-1-147-215.us-east-2.compute.internal
  Normal   Pulled     20s (x2 over 49s)  kubelet            Container image "ghcr.io/fluxcd/notification-controller:v1.1.0" already present on machine
  Normal   Created    20s (x2 over 49s)  kubelet            Created container manager
  Normal   Killing    20s                kubelet            Container manager failed liveness probe, will be restarted
  Normal   Started    19s (x2 over 48s)  kubelet            Started container manager
  Warning  Unhealthy  10s (x9 over 48s)  kubelet            Readiness probe failed: Get "http://10.1.144.245:9440/readyz": dial tcp 10.1.144.245:9440: connect: connection refused
  Warning  Unhealthy  10s (x4 over 40s)  kubelet            Liveness probe failed: Get "http://10.1.144.245:9440/healthz": dial tcp 10.1.144.245:9440: connect: connection refused


Name:                 source-controller-594c848975-5wdst
Namespace:            flux-system
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Service Account:      source-controller
Node:                 ip-10-1-147-215.us-east-2.compute.internal/10.1.147.215
Start Time:           Mon, 09 Oct 2023 09:36:57 +0000
Labels:               app=source-controller
                      pod-template-hash=594c848975
Annotations:          prometheus.io/port: 8080
                      prometheus.io/scrape: true
Status:               Running
IP:                   10.1.144.144
IPs:
  IP:           10.1.144.144
Controlled By:  ReplicaSet/source-controller-594c848975
Containers:
  manager:
    Container ID:    containerd://37cc169a6cf4eb355565ee36436d1bcaeb54d6bdbc7044861db7f73bcdc2248b
    Image:           ghcr.io/fluxcd/source-controller:v1.1.1
    Image ID:        ghcr.io/fluxcd/source-controller@sha256:a9b4ffe2c145efd9cb71c3d41824eda17dc41dc9e9e8bc3b51bfc86b2243c6a4
    Ports:           9090/TCP, 8080/TCP, 9440/TCP
    Host Ports:      0/TCP, 0/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    Args:
      --events-addr=http://notification-controller.flux-system.svc.cluster.local./
      --watch-all-namespaces=true
      --log-level=info
      --log-encoding=json
      --enable-leader-election
      --storage-path=/data
      --storage-adv-addr=source-controller.$(RUNTIME_NAMESPACE).svc.cluster.local.
    State:          Running
      Started:      Mon, 09 Oct 2023 09:37:28 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Mon, 09 Oct 2023 09:36:59 +0000
      Finished:     Mon, 09 Oct 2023 09:37:27 +0000
    Ready:          False
    Restart Count:  1
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:      50m
      memory:   64Mi
    Liveness:   http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:http/ delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      RUNTIME_NAMESPACE:  flux-system (v1:metadata.namespace)
      TUF_ROOT:           /tmp/.sigstore
    Mounts:
      /data from data (rw)
      /tmp from tmp (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-btqw5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  kube-api-access-btqw5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  50s                 default-scheduler  Successfully assigned flux-system/source-controller-594c848975-5wdst to ip-10-1-147-215.us-east-2.compute.internal
  Normal   Pulled     20s (x2 over 49s)   kubelet            Container image "ghcr.io/fluxcd/source-controller:v1.1.1" already present on machine
  Normal   Created    20s (x2 over 49s)   kubelet            Created container manager
  Normal   Killing    20s                 kubelet            Container manager failed liveness probe, will be restarted
  Normal   Started    19s (x2 over 48s)   kubelet            Started container manager
  Warning  Unhealthy  10s (x10 over 48s)  kubelet            Readiness probe failed: Get "http://10.1.144.144:9090/": dial tcp 10.1.144.144:9090: connect: connection refused
  Warning  Unhealthy  10s (x4 over 40s)   kubelet            Liveness probe failed: Get "http://10.1.144.144:9440/healthz": dial tcp 10.1.144.144:9440: connect: connection refused

Steps to reproduce

Bootstrap Flux on a private EKS cluster.
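
For context, a representative bootstrap invocation (a sketch only: the hostname, project, and repository values are placeholders, and this assumes Bitbucket Server; Bitbucket Cloud would go through flux bootstrap git instead):

flux bootstrap bitbucket-server \
  --hostname=bitbucket.example.com \
  --owner=my-project \
  --repository=fleet-infra \
  --path=clusters/my-cluster \
  --token-auth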

Expected behavior

Flux bootstrap succeeds

Screenshots and recordings

No response

OS / Distro

AWS EKS (t3.medium, Amazon Linux 2, amd64)

Flux version

v2.1.1

Flux check

ubuntu@ip-10-1-159-28:~$ flux check
► checking prerequisites
✔ Kubernetes 1.27.4-eks-2d98532 >=1.25.0-0
► checking controllers
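
Given that flux check stalls at this point and the probes above fail with connection refused, a quick way to test whether pods on the private cluster can reach the control plane at all (a sketch; the curl image is an arbitrary public image):

kubectl run api-check --rm -i --restart=Never \
  --image=curlimages/curl -- \
  curl -sk --max-time 5 https://kubernetes.default.svc/healthz
# Any HTTP response (even 401/403) proves network connectivity;
# a timeout suggests security groups or the private API endpoint
# are blocking pod-to-control-plane traffic.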

Git provider

Bitbucket

Container Registry provider

ECR

Additional context

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

preston-willis · Oct 09 '23 09:10