KEDA not scaling the pods with error grpc: addrConn.createTransport failed to connect
Report
I have KEDA (v2.10.1) enabled in an AKS (v1.26.6) cluster, installed via the Helm chart. It created 2 metrics server pods, but scaling is not working and only 1 worker pod is scaled for the jobs.
The logs of one of the metrics server pods show the error "grpc: addrConn.createTransport failed to connect". The other metrics server pod shows the connection as established.
Err: connection error: desc = "transport: Error while dialing: dial tcp 10.105.162.91:9666: connect: connection timed out"
W1004 22:54:13.289988 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "keda-operator.kube-system.svc.cluster.local:9666",
  "ServerName": "keda-operator.kube-system.svc.cluster.local:9666",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing: dial tcp XX.XX.XX.XX:9666: connect: connection timed out"
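To help rule out a plain networking or DNS problem, a minimal connectivity check against the operator's gRPC port can be run from inside the cluster. The service name, namespace, and port below are taken from the log above; adjust them if your installation differs:

```sh
# Check that the keda-operator service exists and has healthy endpoints
kubectl -n kube-system get svc keda-operator
kubectl -n kube-system get endpoints keda-operator

# From a throwaway pod, test whether the gRPC port 9666 is reachable at all
kubectl -n kube-system run grpc-check --rm -it --restart=Never --image=busybox -- \
  nc -zv -w 5 keda-operator.kube-system.svc.cluster.local 9666
```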
Expected Behavior
The worker pods should scale up to multiple pods as the job requests increase.
Actual Behavior
Only 1 worker pod is scaled up; the worker pods do not scale out as the job requests increase.
Steps to Reproduce the Problem
- Installed KEDA (v2.10.1) in Azure AKS (v1.26.6) using the Helm chart, deployed through Bicep (a minimal install sketch follows this list).
- Set up Airflow in the AKS cluster using its Helm chart.
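For reference, a plain Helm-based install of KEDA into kube-system (the namespace implied by the operator service address in the logs) would look roughly like the sketch below. The actual deployment here was driven by Bicep and, per the later comments, the AKS add-on, so this is only an approximation of the reproduction setup:

```sh
# Minimal sketch of installing KEDA with Helm into kube-system.
# The real installation was done via Bicep / the AKS add-on; adjust names and
# pin a 2.10.x chart version as needed.
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace kube-system
```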
Logs from KEDA operator
2023-10-04T23:30:31Z ERROR cert-rotation Webhook not found. Unable to update certificate. {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io "keda-admission" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
    /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:731
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
    /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:700
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
2023-10-04T23:30:31Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
KEDA Version
2.10.1
Kubernetes Version
1.26
Platform
Microsoft Azure
Scaler Details
No response
Anything else?
No response
@tomkerkhove @JorTurFer This is actually regarding a KEDA installation using the AKS add-on.
You can ignore the missing webhook configuration error; that's an error in the add-on Helm chart that we're already planning to fix. But I need a bit of help here in understanding the constant timeouts the metrics server is hitting when trying to communicate with the operator.
Let me know what information would be required to diagnose this further and I can provide it and work alongside you.
@v-shenoy thanks for the clarification.
Is the timeout message appearing constantly, or only during startup? The latter is okay; in that case the metrics server is just waiting until the operator is up. You should see this message in the logs: https://github.com/kedacore/keda/blob/8adb70e97a08a4690613eef4c4f00391e44e1603/pkg/provider/provider.go#L84C38-L84C97
There are two replicas of the metrics server. One of them is able to connect successfully, while the other one is continuously timing out. We have had multiple clusters hit this issue. In some of them, restarting the metrics server deployment was enough, but not in all.
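For anyone hitting the same symptom, the restart mentioned above is just a rollout restart of the metrics server deployment. The deployment name below is assumed from the chart defaults; verify it with `kubectl get deploy` in the KEDA namespace first:

```sh
# Restart the KEDA metrics server pods (deployment name assumed from chart defaults)
kubectl -n kube-system rollout restart deployment keda-operator-metrics-apiserver
kubectl -n kube-system rollout status deployment keda-operator-metrics-apiserver
```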
Do you see errors in the KEDA operator pod? That message is printed by the metrics server because it tries to establish the gRPC connection with the operator to get metrics (since KEDA 2.9, the metrics server is just a proxy for the HPA controller; all the work is done by the operator).
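Since the operator is the component that actually serves the metrics over gRPC, its logs are the next thing worth checking. A minimal sketch, assuming the default deployment name and the kube-system namespace used in this installation:

```sh
# Look for errors in the operator around the time the metrics server times out
kubectl -n kube-system logs deployment/keda-operator --since=1h | grep -iE "error|grpc"

# Confirm the service in front of the operator still targets port 9666
kubectl -n kube-system describe svc keda-operator
```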
Besides the missing webhook configuration, I don't think we were seeing any other errors in the operator pod. Plus, one of the metric servers did connect successfully. Correct me if I am missing something, @sktemkar.
Any update?
I think the AKS system pods were being throttled because the system node pool nodes were too small. @sktemkar increased the node size and added the CriticalAddonsOnly=true:NoSchedule taint to the system node pool so that it only runs critical add-on pods such as KEDA (we have the corresponding toleration enabled in the add-on). It seems to be working for now, but the plan is to monitor for a few more days and see if the issue re-occurs.
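For reference, tainting the AKS system node pool so that only critical add-on workloads (which carry the matching toleration) are scheduled on it can be done roughly as follows. The resource group, cluster, and node pool names are placeholders, and updating taints with `az aks nodepool update` requires a reasonably recent Azure CLI; otherwise the taint has to be set when the node pool is created:

```sh
# Taint the system node pool so only pods tolerating CriticalAddonsOnly run on it
# (resource group, cluster, and node pool names are placeholders)
az aks nodepool update \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name <system-nodepool> \
  --node-taints CriticalAddonsOnly=true:NoSchedule

# Verify the KEDA pods carry the matching toleration (label assumed from chart defaults)
kubectl -n kube-system get pod -l app=keda-operator \
  -o jsonpath='{.items[*].spec.tolerations}'
```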
Any update on this? Can we close the issue?
This issue is fixed after increasing the size of the system node pool, adding the CriticalAddonsOnly taint, and redeploying the KEDA configuration.