tofu-controller
tf-runner fails: "terraform-runner.tls-*" not found
Hi 👋🏻
We are having issues with the tf-controller. Sometimes the tf-runner pods fail and we see an error message like:
2022/09/28 11:36:38 secrets "terraform-runner.tls-1664811349" not found
This happens when a namespace with the same name existed before and was deleted. If we had a namespace X and deleted it, then the next time we create a namespace named X, the tf-runner in the new namespace fails.
A workaround is to restart the tf-controller deployment.
Any help is highly appreciated. Thanks!
https://github.com/weaveworks/tf-controller/blob/main/api/v1alpha1/terraform_types.go#L35-L44
Thank you for reporting this @faganihajizada !
You are welcome @chanwit. Thanks for a great job!
We are assuming the tf-controller stores/caches information about namespaces somewhere but we are not sure how. Could you please provide some details? Maybe we can contribute to fixing it.
Seems like it is being stored in memory: https://github.com/weaveworks/tf-controller/blob/279ac91ba1a015c128bfea18d8417e423bb9abea/mtls/rotator.go#L287
Yes, it is. We need a mechanism to write it down again to a newly created namespace. To do so, we need a K8s Informer to watch the namespace (re-)creation.
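To make the idea concrete, here is a minimal, stdlib-only sketch of how a "namespace added" handler could decide when to write the secret again. It is not the controller's code: `namespaceEvent`, `certDistributor` and `onNamespaceAdded` are hypothetical names, and the handler is assumed to be called by whatever watch mechanism (for example a namespace informer) delivers namespace creation events. It relies on the fact that Kubernetes gives a re-created namespace a new UID.

```go
package main

import "fmt"

// namespaceEvent is a hypothetical stand-in for the "namespace added"
// notification that a watch or informer would deliver.
type namespaceEvent struct {
	Name string
	UID  string // a re-created namespace gets a new UID
}

// certDistributor is a hypothetical model of an in-memory record of which
// namespaces have already received the runner TLS secret.
type certDistributor struct {
	written map[string]string            // namespace name -> UID the secret was written for
	write   func(namespace string) error // writes the secret (assumed, injected)
}

// onNamespaceAdded writes the secret unless it was already written for this
// exact namespace instance (same name and same UID). It reports whether a
// write happened.
func (d *certDistributor) onNamespaceAdded(ev namespaceEvent) (bool, error) {
	if uid, ok := d.written[ev.Name]; ok && uid == ev.UID {
		return false, nil // same namespace instance, secret is still there
	}
	if err := d.write(ev.Name); err != nil {
		return false, err
	}
	d.written[ev.Name] = ev.UID
	return true, nil
}

func main() {
	d := &certDistributor{
		written: map[string]string{},
		write: func(ns string) error {
			fmt.Println("writing TLS secret to namespace", ns)
			return nil
		},
	}
	// Namespace "x" is created, seen again, then deleted and re-created.
	events := []namespaceEvent{{"x", "uid-1"}, {"x", "uid-1"}, {"x", "uid-2"}}
	for _, ev := range events {
		wrote, err := d.onNamespaceAdded(ev)
		fmt.Println(ev.Name, ev.UID, "wrote:", wrote, "err:", err)
	}
}
```

Keying the record on the UID rather than on the name alone is what lets the re-created namespace be treated as new; a name-only cache would keep reporting the secret as present.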
Actually, decreasing the value of CAValidityDuration might help, right? For example, we could set it so that refreshCertsInMemory runs every 5 minutes.
Yes, you could try that.
If the CA validity duration is 5m and the lookahead time is 2m, the CA will be refreshed roughly every 5 - 2 = 3 minutes.
The flag to trigger re-generation is here: https://github.com/weaveworks/tf-controller/blob/279ac91ba1a015c128bfea18d8417e423bb9abea/mtls/rotator.go#L261
Still hope to see this addressed.
Also looking forward to this being fixed. We run tf-runners in ephemeral namespaces, and the controller gets confused if a namespace is removed and recreated, as described here.
Alternatively, the ability to disable certificate management in the controller in favor of a third-party tool (cert-manager, or a service mesh with mTLS) would be an option too.
Closing as work over the past few months should have addressed this issue and we expect it to only occur in edge-case situations. Feel free to report a new bug should you encounter one.
Is there an ETA when the next stable release with this fix is expected?
Is there a way to regenerate the TLS secret manually?
I suspended all the tf objects in the affected namespace and removed the crashed tf-runner pods, then restarted the tf-controller and resumed the tf objects; after that, the TLS secret was regenerated.
Sorry we don't have it yet.
It's automatically controlled by the mTLS generator inside the controller now.
May I ask how you would suggest generating these TLS secrets?
Maybe via a new tfctl command?