Emissary-apiext pods are not creating the emissary-ingress-webhook-ca secret and hanging on startup
Describe the bug
We have deployed Emissary-apiext via its Helm chart in the normal manner, but the pods go into a constant CrashLoopBackOff state; the pod logs display only the first two lines below:
The namespace events show this:
It appears to be hanging on creation of the 'emissary-ingress-webhook-ca' secret, as the secret is never created. If we copy the same secret from another cluster into the namespace, the pods start fine and the TLS cert is injected into the getambassador.io CRDs' spec->webhook->caBundle as expected.
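For reference, the secret we copy across looks roughly like the sketch below; the type and key names are our assumptions based on what the apiext CA code appears to create, so adjust them to whatever the source cluster actually holds:

apiVersion: v1
kind: Secret
metadata:
  name: emissary-ingress-webhook-ca
  namespace: emissary-system
type: kubernetes.io/tls                      # assumed secret type
data:
  tls.crt: <base64-encoded CA certificate>   # assumed key name
  tls.key: <base64-encoded CA private key>   # assumed key name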
The RBAC seems fine; the 'emissary-apiext' Role and RoleBinding deployed as part of the Helm chart grant the emissary-apiext service account the ability to create and then update this secret.
To Reproduce
Steps to reproduce the behavior:
- Switch into the emissary-system namespace and delete the 'emissary-ingress-webhook-ca' secret
- Restart the emissary-apiext pods - kubectl rollout restart deploy/emissary-apiext
- Watch the pods and logs; they will go into the constant CrashLoopBackOff state
Expected behavior
The emissary-apiext pods should start, then mint the CA and create the secret as per the 'EnsureCA' Go function, and then inject the CA into the getambassador.io CRDs' spec->webhook->caBundle as per the 'apiext.updateCRD' Go function: https://github.com/emissary-ingress/emissary/blob/master/cmd/apiext/ca.go#L49
Versions (please complete the following information):
- Ambassador: Emissary Ingress apiext v2.4.1
- Kubernetes environment: Native K8s
- Version: v1.21.14
Additional context
We have attempted to turn up the emissary-apiext pod logging level from INFO to TRACE by editing the deployment as per below; however, this did not result in any more verbosity. The same two lines were logged as above.
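The edit was roughly the following, a minimal sketch that sets the APIEXT_LOGLEVEL variable seen in the startup logs (the container name emissary-apiext is an assumption about the chart's defaults):

spec:
  template:
    spec:
      containers:
        - name: emissary-apiext        # assumed container name
          env:
            - name: APIEXT_LOGLEVEL    # log level variable printed at startup
              value: TRACE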
Could you please assist us in narrowing down where the problem lies?
Hi @jmboby, I believe there are a couple of secrets that get saved in the emissary-system namespace. If you deleted only one of them, I wonder if that is somehow blocking the creation of the new secret. That's why we've suggested deleting everything in that namespace, as mentioned [here in our docs](https://www.getambassador.io/docs/emissary/latest/topics/install/yaml-install#install-with-yaml): "All users who are running Emissary-ingress/Ambassador Edge Stack 2.x or 3.x with the apiext service should proactively renew their certificate as soon as practical by running kubectl delete --all secrets --namespace=emissary-system to delete the existing certificate." Could you check if any other old secrets are hanging around?
Thanks for the response @cindymullins-dw. I tested that theory by first uninstalling the emissary-apiext Helm chart, since Helm also stores secrets in the emissary-system namespace to track release versions, then deleting the emissary-ingress-webhook-ca secret and confirming no other secrets were present. Following this I re-installed the emissary-apiext Helm chart, but the same error occurred: the pods do not start properly and the secret is not created.
Are we able to somehow increase logging verbosity on your pod's Go function calls so we can see exactly where it's failing? The last line logged is:
time="2023-02-26 21:21:53.4975" level=info msg="APIEXT_LOGLEVEL=info" func=github.com/datawire/ambassador/v2/cmd/apiext.Run file="/go/cmd/apiext/main.go:63" CMD=apiext PID=1
As mentioned, we did add the APIEXT_LOGLEVEL parameter with a value of TRACE to the pod env, and it did take effect after the pod was up and running: we see debug-level logging on all CRD conversions, which is great, but we really need more verbosity on the pod startup...
Hi @cindymullins-dw, we ended up fixing this by adding a startup probe (below) to the emissary-apiext deployment to give the pods the time they need to fully start and create the TLS cert and secret. The liveness probe was killing off the pods too early. If the secret is already present in the namespace, they start quite quickly and the issue is not noticed.
Are the pods making some kind of long-running Kubernetes API call when the secret is not present? Or could the TLS cert generation be taking a long time? I guess this goes back to increasing log verbosity on startup in order to prove it.
startupProbe:
  httpGet:
    scheme: HTTP
    path: /probes/live
    port: 8080
  failureThreshold: 5
  periodSeconds: 10
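With these values the kubelet allows up to failureThreshold × periodSeconds = 5 × 10 = 50 seconds for the container to come up before the liveness probe takes over. For anyone applying the same workaround, a rough sketch of where the probe sits in the deployment spec (the container name emissary-apiext is an assumption about the chart's defaults):

spec:
  template:
    spec:
      containers:
        - name: emissary-apiext      # assumed container name
          startupProbe:              # checked until it succeeds; liveness checks are held off meanwhile
            httpGet:
              scheme: HTTP
              path: /probes/live
              port: 8080
            failureThreshold: 5
            periodSeconds: 10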
Just to clarify, you deleted the 'emissary-ingress-webhook-ca' secret because the pods went into a CrashLoopBackOff state? (I ask because we are proactively recommending users manually delete and re-create their apiext certs before they expire at the 1-year mark.) However, it sounds like you encountered this issue on a normal install?
Yes, the issue we encountered is separate from your 1-year expiry; our issue is that the pods take a long time to start up when the secret is not present, and yes, this is on a fresh install of emissary-apiext.
Ok, thanks for the confirmation and detail. I'll leave this open as a feature request for a startup controller / better logging around the startup process.
This happened for me also today. I tried to upgrade Emissary to 3.7.1 and upgraded the CRDs as part of that. Afterwards the emissary-apiext pods complained about the error below and went into a CrashLoopBackOff state, even though the emissary-ingress-webhook-ca secret was already present and was not more than a year old.
time="2023-07-24 11:52:07.4538" level=info msg="Emissary Ingress apiext (version \"3.7.1\")" func=github.com/emissary-ingress/emissary/v3/cmd/apiext.Main file="/go/cmd/apiext/main.go:41" CMD=apiext PID=1 time="2023-07-24 11:52:07.4549" level=info msg="APIEXT_LOGLEVEL=info" func=github.com/emissary-ingress/emissary/v3/cmd/apiext.Run file="/go/cmd/apiext/main.go:59" CMD=apiext PID=1 time="2023-07-24 11:52:17.5742" level=error msg="shut down with error error: Get \"https://10.96.0.1:443/api/v1/namespaces/emissary-system/secrets/emissary-ingress-webhook-ca\": net/http: TLS handshake timeout" func=github.com/emissary-ingress/emissary/v3/pkg/busy.Main file="/go/pkg/busy/busy.go:87" CMD=apiext PID=1
I tried increasing the timeouts for the probes but nothing worked. I also deleted the secret manually, thinking the new pods would create it, but they did not, and the above is the error I keep getting. Is there anything I need to do to get past this error?
@cindymullins-dw After upgrading to 3.9.1 we've started experiencing the same issue: every 2-3 days the Emissary pods lose connectivity to the emissary-apiext pods because of the cert, and we end up recreating it.