
tf-runner fails: "terraform-runner.tls-*" not found

Open faganihajizada opened this issue 3 years ago • 7 comments

Hi 👋🏻

We are having issues with the tf-controller. Sometimes the tf-runner pods fail and we see an error message like:

2022/09/28 11:36:38 secrets "terraform-runner.tls-1664811349" not found

This happens when a namespace with the same name existed before and was deleted. If we had a namespace X and deleted it, then the next time we create a namespace named X, the tf-runner in the new namespace fails.

A workaround is to restart the tf-controller deployment.

Any help is highly appreciated. Thanks!

faganihajizada avatar Sep 29 '22 16:09 faganihajizada

https://github.com/weaveworks/tf-controller/blob/main/api/v1alpha1/terraform_types.go#L35-L44

faganihajizada avatar Sep 29 '22 16:09 faganihajizada

Thank you for reporting this @faganihajizada !

chanwit avatar Sep 29 '22 16:09 chanwit

You are welcome @chanwit. Thanks for a great job!

We assume the tf-controller stores or caches information about namespaces somewhere, but we are not sure how. Could you please provide some details? Maybe we can contribute a fix.

faganihajizada avatar Sep 29 '22 16:09 faganihajizada

Seems like it is being stored in memory: https://github.com/weaveworks/tf-controller/blob/279ac91ba1a015c128bfea18d8417e423bb9abea/mtls/rotator.go#L287

faganihajizada avatar Sep 30 '22 13:09 faganihajizada

Yes, it is. We need a mechanism to write the secret again to a newly created namespace. To do so, we need a K8s Informer to watch for namespace (re-)creation.
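
To make the idea concrete, here is a minimal sketch of the bookkeeping such a watcher could do. It is not the controller's actual code: `Namespace`, `secretTracker`, and `onNamespaceEvent` are hypothetical names, and the informer itself is left out so the example runs with the standard library only. It relies on the fact that Kubernetes assigns a new UID when a namespace is deleted and created again under the same name, so comparing UIDs is enough to detect a re-creation.

```go
package main

import "fmt"

// Namespace is a hypothetical stand-in for the Kubernetes Namespace object;
// only the fields needed for this sketch are included.
type Namespace struct {
	Name string
	UID  string
}

// secretTracker remembers, per namespace name, the UID of the namespace
// instance that last received the TLS secret. (Hypothetical type.)
type secretTracker struct {
	written map[string]string // namespace name -> UID
	write   func(ns Namespace) error
}

func newSecretTracker(write func(Namespace) error) *secretTracker {
	return &secretTracker{written: map[string]string{}, write: write}
}

// onNamespaceEvent is what an informer's add/update handler might call.
// It writes the secret again when the namespace is new or was re-created
// (same name, different UID) and reports whether a write happened.
func (t *secretTracker) onNamespaceEvent(ns Namespace) (bool, error) {
	if uid, ok := t.written[ns.Name]; ok && uid == ns.UID {
		return false, nil // same instance, secret already written
	}
	if err := t.write(ns); err != nil {
		return false, err
	}
	t.written[ns.Name] = ns.UID
	return true, nil
}

func main() {
	tracker := newSecretTracker(func(ns Namespace) error {
		fmt.Printf("writing TLS secret to %s (uid %s)\n", ns.Name, ns.UID)
		return nil
	})

	// Namespace X is created, seen again, then deleted and re-created.
	events := []Namespace{
		{Name: "x", UID: "uid-1"},
		{Name: "x", UID: "uid-1"},
		{Name: "x", UID: "uid-2"},
	}
	for _, ns := range events {
		if _, err := tracker.onNamespaceEvent(ns); err != nil {
			panic(err)
		}
	}
}
```

In a real controller, the `write` callback would create the secret through the Kubernetes API, and the events would come from a namespace informer rather than a slice.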

chanwit avatar Sep 30 '22 14:09 chanwit

Actually, decreasing the value of CAValidityDuration might help, right? For example, we could set it so that refreshCertsInMemory runs every 5 minutes.

faganihajizada avatar Sep 30 '22 15:09 faganihajizada

Yes, you could try that.

If the CA duration value is 5m and the lookahead time is 2m, the CA will be refreshed roughly every 5 - 2 = 3 minutes.

The flag to trigger re-generation is here: https://github.com/weaveworks/tf-controller/blob/279ac91ba1a015c128bfea18d8417e423bb9abea/mtls/rotator.go#L261

chanwit avatar Sep 30 '22 15:09 chanwit

Still hope to see this addressed.

artem-nefedov avatar Feb 17 '23 11:02 artem-nefedov

Also looking forward to this being fixed. We run tf-runners in ephemeral namespaces, and the controller gets confused if a namespace is removed and recreated, as described here.

rattboi avatar Feb 17 '23 21:02 rattboi

Alternatively, the ability to disable certificate management in the controller in favor of a third-party tool (cert-manager, or just a service mesh with mTLS) would be an option too.

artem-nefedov avatar Feb 19 '23 16:02 artem-nefedov

Closing, as work over the past few months should have addressed this issue and we expect it to occur only in edge cases. Feel free to report a new bug should you encounter one.

lasomethingsomething avatar Nov 08 '23 16:11 lasomethingsomething

Is there an ETA for the next stable release that includes this fix?

artem-nefedov avatar Dec 07 '23 16:12 artem-nefedov

Is there any way to regenerate the TLS secret manually?

I tried suspending all the tf objects in the affected namespace and removing the crashed tf-runner pods, then restarting the tf-controller and resuming the tf objects. After that, the TLS secret was regenerated.

wangyi198682 avatar Mar 27 '24 02:03 wangyi198682

Sorry, we don't have that yet.

It's currently controlled automatically by the mTLS generator inside the controller.

May I ask how you would suggest generating these TLS secrets?

Maybe via a new tfctl command?

chanwit avatar Mar 27 '24 02:03 chanwit