
Terraform resources in new Namespaces are never being created

Open · jwitrick opened this issue 1 year ago · 2 comments

I have a cluster setup that involves a lot of namespaces being created and deleted throughout the day, and each of those namespaces has Terraform resources deployed into it. The issue I face is that when the Terraform resources are added to a new namespace, they never get started. While investigating I see errors in the tf-controller logs similar to:

{"level":"error","ts":"2023-09-27T12:41:37.234Z","msg":"Reconciler error","controller":"terraform","controllerGroup":"infra.contrib.fluxcd.io","controllerKind":"Terraform","Terraform":{"name":"iam-rolepolicy-tiles-extract","namespace":"reports130"},"namespace":"reports130","name":"iam-rolepolicy-tiles-extract","reconcileID":"a14489ce-1290-4572-b30b-7f37a31d3c90","error":"resourceVersion should not be set on objects to be created"}

When I look at the Terraform resources in the cluster I see:

cdn-reports130-rails-fulcrum   Unknown   Reconciliation in progress   63m
cdn-reports130-tiles           Unknown   Reconciliation in progress   61m

But there are no tf-runner pods running and no terraform-runner.tls secret.
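
Something like the following reproduces those checks (names are taken from the output above; the exact TLS secret name may differ by setup):

kubectl -n reports130 get terraforms
# no runner pods show up for the stuck resources
kubectl -n reports130 get pods
# and the runner TLS secret is missing
kubectl -n reports130 get secret terraform-runner.tls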

However, I don't believe the missing runner pods are the root cause (the exact same resource templates are used in all my namespaces and only some of them show this behavior).

The issue is that the tf-controller service never recognizes the new namespaces, so by the time the Terraform resources are created (via Helm) they always have resourceVersion set (Kubernetes sets this field automatically when an object is added to the cluster).
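
This can be confirmed in the controller logs; a minimal check, assuming the controller runs as a deployment named tf-controller in flux-system:

kubectl -n flux-system logs deploy/tf-controller | grep 'resourceVersion should not be set'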

Steps

Here is my complete end-to-end workflow (a command sketch follows the list):

  1. Create a new namespace.
  2. In the new namespace, create the tf-runner resources (via Helm). The resources include: a ServiceAccount (tf-runner), a ClusterRoleBinding, and a ClusterRole.
  3. Once that chart is installed, another Helm chart runs and creates the Terraform resources.
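
A sketch of the same workflow as commands (the chart names and paths are hypothetical placeholders, not my actual charts):

# 1. create the namespace
kubectl create namespace reports131
# 2. install the runner RBAC chart (ServiceAccount, ClusterRoleBinding, ClusterRole)
helm install tf-runner-rbac ./charts/tf-runner-rbac --namespace reports131
# 3. install the chart that creates the Terraform resources
helm install cdn-resources ./charts/cdn --namespace reports131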

About 10% of the time a new namespace gets into this stuck state, and the only way to clear it is to delete the tf-controller pod (which leads to other non-ideal behaviors).

What would be ideal is a way to force the tf-controller pod to register a new namespace on demand. Then I could update my CI stack to create the resources and call the register command before the Terraform resources are created.
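
For now the workaround looks roughly like this (the deployment name is assumed from releaseName; the annotation below is the standard Flux reconcile-request mechanism and may or may not unstick a resource in this state):

# restart the controller so it picks up the new namespace
kubectl -n flux-system rollout restart deployment tf-controller
# per-resource nudge via the Flux reconcile-request annotation
kubectl -n reports130 annotate terraform cdn-reports130-tiles \
  reconcile.fluxcd.io/requestedAt="$(date +%s)" --overwrite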

Stack information

Kubernetes: EKS 1.26 (upgrading to 1.27 soon)
tf-controller: 0.16.0-rc.2 (installed via Flux and a HelmRelease)

Files

TF Controller HelmRelease

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: tf-controller
  namespace: flux-system
spec:
  releaseName: tf-controller
  chart:
    spec:
      chart: tf-controller
      version: 0.16.0-rc.2
      sourceRef:
        kind: HelmRepository
        name: weaveworks-tf-controller
        namespace: flux-system
  interval: 10m
  values:
    installCRDs: true
    replicaCount: 1
    concurrency: 48
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
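
For completeness, the release and the resulting controller deployment can be checked with something like this (the deployment name tf-controller is an assumption based on releaseName):

flux get helmreleases -n flux-system
kubectl -n flux-system get deployment tf-controller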

Namespace Runner ServiceAccount

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-namespace: reports130
  labels:
    app.kubernetes.io/managed-by: Helm
  name: tf-runner
  namespace: reports130

ClusterRoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    meta.helm.sh/release-namespace: reports130
  labels:
    app.kubernetes.io/managed-by: Helm
  name: support-common-infra-reports130-tf-runner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tf-runner-role
subjects:
- kind: ServiceAccount
  name: tf-runner
  namespace: reports130
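
Whether the binding actually grants the runner what it needs can be spot-checked with kubectl auth can-i (the verbs worth testing depend on the tf-runner-role ClusterRole, which I haven't included here):

kubectl auth can-i create secrets --namespace reports130 \
  --as system:serviceaccount:reports130:tf-runner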

Terraform resource

apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  annotations:
    helm.sh/resource-policy: keep
    meta.helm.sh/release-namespace: reports130
  labels:
    app.kubernetes.io/managed-by: Helm
    namespace: reports130
    resourceType: cdn
    tf.weave.works/composite: cdn
  name: cdn-reports130-tiles
  namespace: reports130
spec:
  alwaysCleanupRunnerPod: true
  approvePlan: auto
  destroyResourcesOnDeletion: true
  disableDriftDetection: false
  force: false
  interval: 15h0m
  parallelism: 0
  path: terraform/tf-modules/cdn
  refreshBeforeApply: false
  retryInterval: 20m
  runnerPodTemplate:
    spec:
      envFrom:
      - secretRef:
          name: aws-information
  runnerTerminationGracePeriodSeconds: 30
  serviceAccountName: tf-runner
  sourceRef:
    kind: GitRepository
    name: flux-system
    namespace: flux-system
  storeReadablePlan: none
  values:
    create: true
  workspace: default
  writeOutputsToSecret:
    name: cdn-reports130-tiles
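
The stuck resource's status can be inspected directly, which may surface the same reconciler error as the controller logs:

kubectl -n reports130 describe terraform cdn-reports130-tiles
kubectl -n reports130 get terraform cdn-reports130-tiles \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'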

jwitrick · Sep 27 '23 13:09