tofu-controller
Terraform resources in new Namespaces are never being created
I have a cluster setup in which a LOT of namespaces are created and deleted throughout the day, and each of those namespaces has Terraform resources deployed into it. The issue I face is that when the Terraform resources are added to a new namespace, they never get started. While investigating, I see errors in the tf-controller logs similar to:
{"level":"error","ts":"2023-09-27T12:41:37.234Z","msg":"Reconciler error","controller":"terraform","controllerGroup":"infra.contrib.fluxcd.io","controllerKind":"Terraform","Terraform":{"name":"iam-rolepolicy-tiles-extract","namespace":"reports130"},"namespace":"reports130","name":"iam-rolepolicy-tiles-extract","reconcileID":"a14489ce-1290-4572-b30b-7f37a31d3c90","error":"resourceVersion should not be set on objects to be created"}
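To see which namespaces are affected, log lines like the one above can be grouped by namespace. This is a minimal Python sketch, assuming the controller logs have been saved one JSON object per line (for example from `kubectl logs`). The function name `failing_reconciles` is hypothetical, and only fields visible in the log line above are read.

```python
import json
from collections import defaultdict

# Sample entry copied from the tf-controller log above.
SAMPLE_LINE = (
    '{"level":"error","ts":"2023-09-27T12:41:37.234Z","msg":"Reconciler error",'
    '"controller":"terraform","controllerGroup":"infra.contrib.fluxcd.io",'
    '"controllerKind":"Terraform",'
    '"Terraform":{"name":"iam-rolepolicy-tiles-extract","namespace":"reports130"},'
    '"namespace":"reports130","name":"iam-rolepolicy-tiles-extract",'
    '"reconcileID":"a14489ce-1290-4572-b30b-7f37a31d3c90",'
    '"error":"resourceVersion should not be set on objects to be created"}'
)


def failing_reconciles(log_lines):
    """Group 'Reconciler error' entries by namespace.

    Returns a dict mapping namespace -> list of (resource name, error message).
    Lines that are not JSON objects, or are not reconciler errors, are skipped.
    """
    by_namespace = defaultdict(list)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # plain-text lines mixed into the log
        if not isinstance(entry, dict):
            continue
        if entry.get("level") != "error" or entry.get("msg") != "Reconciler error":
            continue
        by_namespace[entry.get("namespace", "")].append(
            (entry.get("name", ""), entry.get("error", ""))
        )
    return dict(by_namespace)


if __name__ == "__main__":
    for namespace, errors in failing_reconciles([SAMPLE_LINE]).items():
        for name, message in errors:
            print(f"{namespace}/{name}: {message}")
```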
When I look at the Terraform resources in the cluster, I see:
cdn-reports130-rails-fulcrum Unknown Reconciliation in progress 63m
cdn-reports130-tiles Unknown Reconciliation in progress 61m
But there are no tf-runner pods running and no terraform-runner.tls secret.
However, I don't believe this is the issue (the exact same resource templates are used in all my namespaces, and only some of them show this behavior).
The issue is that the tf-controller service never recognizes the new namespaces, so by the time the Terraform resources are created (via Helm) they always have resourceVersion set (it is set automatically when a resource is added to the cluster).
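The stuck state described above (Ready reported as Unknown long after creation) can be detected mechanically. This is a rough Python sketch, assuming the input is the output of `kubectl get terraform -n <namespace> -o json` and that the Ready column is backed by a `Ready` entry under `status.conditions`, as is usual for Flux-style resources. The function name `stuck_terraforms` and the 30-minute threshold are my own choices, not anything provided by tf-controller.

```python
import json
from datetime import datetime, timedelta, timezone


def stuck_terraforms(kubectl_json, now, max_age=timedelta(minutes=30)):
    """Return names of Terraform objects that look stuck.

    An object counts as stuck when its Ready condition is "Unknown"
    (or no Ready condition exists yet) and it was created more than
    `max_age` before `now`. Assumes status.conditions holds the Ready
    condition; adjust if the CRD reports status differently.
    """
    stuck = []
    for item in json.loads(kubectl_json).get("items", []):
        meta = item.get("metadata", {})
        # Kubernetes serializes creationTimestamp as RFC 3339 in UTC.
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        conditions = item.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        ready_status = ready.get("status") if ready else "Unknown"
        if ready_status == "Unknown" and now - created > max_age:
            stuck.append(meta["name"])
    return stuck
```

In a CI job, `now` would be `datetime.now(timezone.utc)`. It is passed in explicitly so the check can be tested against a fixed time.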
Steps
Here is my complete end-to-end workflow:
- Create a new namespace.
- In the new namespace, create the tf-runner resources (via Helm). The resources include a ServiceAccount (tf-runner), a ClusterRoleBinding, and a ClusterRole.
- Once the Helm chart from above is installed, another Helm chart runs and creates the Terraform resources.
About 10% of the time the new namespace gets into this stuck state, and the only way to clear it is to delete the tf-controller pod (which leads to other non-ideal behaviors).
What would be ideal is a way to force the tf-controller pod to register the new namespace on demand. Then I could update my CI stack to create the runner resources and call the register command before the Terraform resources are created.
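Until something like that exists, the current workaround could be automated as a CI gate. This is only a sketch of that workaround, not a tf-controller feature. It assumes the controller runs as a Deployment named `tf-controller` in `flux-system` (a guess based on the release name) and that `count_stuck` is a hypothetical callable returning the number of stuck Terraform objects in a namespace.

```python
import subprocess
import time

# Assumption: Deployment name and namespace are guessed from the Helm
# release name; adjust to match the actual install.
RESTART_CMD = [
    "kubectl", "-n", "flux-system",
    "rollout", "restart", "deployment/tf-controller",
]


def restart_controller():
    """Restart the controller pod(s); same effect as deleting the pod by hand."""
    subprocess.run(RESTART_CMD, check=True)


def gate_namespace(namespace, count_stuck, restart=restart_controller,
                   checks=5, delay_seconds=60, sleep=time.sleep):
    """Poll a namespace and fall back to restarting the controller.

    Calls count_stuck(namespace) up to `checks` times, sleeping
    `delay_seconds` between calls. Returns True as soon as nothing is
    stuck. If every check still reports stuck objects, calls `restart`
    once and returns False.
    """
    for attempt in range(checks):
        if count_stuck(namespace) == 0:
            return True
        if attempt < checks - 1:
            sleep(delay_seconds)
    restart()
    return False
```

The restart and sleep functions are injected so the gate can be exercised without a cluster. In CI the defaults would be used.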
Stack information
Kubernetes: EKS 1.26 (upgrading to 1.27 soon)
tf-controller: 0.16.0-rc.2 (installed via flux and helmrelease)
Files
TF Controller HelmRelease
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: tf-controller
  namespace: flux-system
spec:
  releaseName: tf-controller
  chart:
    spec:
      chart: tf-controller
      version: 0.16.0-rc.2
      sourceRef:
        kind: HelmRepository
        name: weaveworks-tf-controller
        namespace: flux-system
  interval: 10m
  values:
    installCRDs: true
    replicaCount: 1
    concurrency: 48
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
Namespace Runner ServiceAccount
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-namespace: reports130
  labels:
    app.kubernetes.io/managed-by: Helm
  name: tf-runner
  namespace: reports130
ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    meta.helm.sh/release-namespace: reports130
  labels:
    app.kubernetes.io/managed-by: Helm
  name: support-common-infra-reports130-tf-runner
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: tf-runner-role
subjects:
- kind: ServiceAccount
  name: tf-runner
  namespace: reports130
Terraform resource
apiVersion: infra.contrib.fluxcd.io/v1alpha2
kind: Terraform
metadata:
  annotations:
    helm.sh/resource-policy: keep
    meta.helm.sh/release-namespace: reports130
  labels:
    app.kubernetes.io/managed-by: Helm
    namespace: reports130
    resourceType: cdn
    tf.weave.works/composite: cdn
  name: cdn-reports130-tiles
  namespace: reports130
spec:
  alwaysCleanupRunnerPod: true
  approvePlan: auto
  destroyResourcesOnDeletion: true
  disableDriftDetection: false
  force: false
  interval: 15h0m
  parallelism: 0
  path: terraform/tf-modules/cdn
  refreshBeforeApply: false
  retryInterval: 20m
  runnerPodTemplate:
    spec:
      envFrom:
      - secretRef:
          name: aws-information
  runnerTerminationGracePeriodSeconds: 30
  serviceAccountName: tf-runner
  sourceRef:
    kind: GitRepository
    name: flux-system
    namespace: flux-system
  storeReadablePlan: none
  values:
    create: true
  workspace: default
  writeOutputsToSecret:
    name: cdn-reports130-tiles