keda
Autoscaling stopped working because of TLS issue
Report
We use KEDA 2.13.1, with cert-manager issuing our TLS certs.
We recently noticed that autoscaling of our workloads, via the Prometheus trigger, stopped working.
The KEDA-created HPA had the following events:
the HPA was unable to compute the replica count: unable to get external metric postman/s0-prometheus/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: postman,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: the server was unable to return a response in the time allotted, but may still be processing the request (get s0-prometheus.external.metrics.k8s.io)
The metric itself could be queried without problems.
The logs of all 3 keda-metrics-apiserver pods had errors like this:
W0410 11:34:19.738975 1 logging.go:59] [core] [Channel #1 SubChannel #5] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"x509: invalid signature: parent certificate cannot sign this kind of certificate\" while trying to verify candidate authority certificate \"keda-operator\")"
The certificate created by the cert-manager issuer was valid and had not been renewed recently.
After restarting keda-metrics-apiserver deployment it worked again.
Expected Behavior
Autoscaling works
Actual Behavior
Autoscaling stopped working
Steps to Reproduce the Problem
Not sure
Logs from KEDA operator
example
KEDA Version
2.13.1
Kubernetes Version
1.28
Platform
Microsoft Azure
Scaler Details
Prometheus
Anything else?
No response
hum... It seems that the certificate isn't hot reloaded on changes. I think that we have to add a watcher for the cert files and restart the server in case of changes :/
But I'm pretty sure the cert did not change. Here is the status of the cert:
status:
  conditions:
  - lastTransitionTime: "2023-12-04T14:08:02Z"
    message: Certificate is up to date and has not expired
    observedGeneration: 2
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2025-04-04T06:08:03Z"
  notBefore: "2024-04-04T06:08:03Z"
  renewalTime: "2024-08-03T22:08:03Z"
  revision: 2
So the only things I can think of are pod restarts caused by node updates, or maybe a short Kubernetes API downtime?
Nevertheless some sort of retry or restart might solve the issue anyway.
The problem is that the log you sent reports a failure validating the cert, not just a connection failure. That's the weird part for me :/
Could you check the secret that contains the cert to verify whether it was recreated? (Check the certificate's creation time, not just the secret's.)
I see notBefore: "2024-04-04T06:08:03Z"
so it could have been updated. If you want, you can share the tls.crt (it's not a secret, the secret is the key) and we can check the creation date
It seems the certificate was indeed updated some days before the issue, and the renewalTime: "2024-08-03T22:08:03Z" in the certificate resource is misleading.
k -n keda get secrets kedaorg-certs -o yaml | yq e '.data."tls.crt"' | base64 -d | openssl x509 -noout -text| grep Validity -A 2
Validity
Not Before: Apr 4 06:08:03 2024 GMT
Not After : Apr 4 06:08:03 2025 GMT
So I guess the issue then popped up when only one of keda-operator or keda-metrics-apiserver was restarted?
At least we hit the same issue in another cluster today: keda-operator was OOM-killed (and likely picked up the new cert when restarting), and afterwards the same errors came up again in keda-metrics-apiserver, which were again fixed by restarting keda-metrics-apiserver.