KEDA not scaling the pods with error grpc: addrConn.createTransport failed to connect
Report
I have KEDA (v2.10.1) enabled in an AKS (v1.26.6) cluster, installed via the Helm chart. It created 2 metrics server pods, but scaling is not working and only 1 worker pod is scaled for the jobs.
The logs of one of the metrics server pods show the error "grpc: addrConn.createTransport failed to connect". The other metrics server pod shows the connection as established.
Err: connection error: desc = "transport: Error while dialing: dial tcp 10.105.162.91:9666: connect: connection timed out"
W1004 22:54:13.289988 1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {
  "Addr": "keda-operator.kube-system.svc.cluster.local:9666",
  "ServerName": "keda-operator.kube-system.svc.cluster.local:9666",
  "Attributes": null,
  "BalancerAttributes": null,
  "Type": 0,
  "Metadata": null
}. Err: connection error: desc = "transport: Error while dialing: dial tcp XX.XX.XX.XX:9666: connect: connection timed out"
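To help rule out a plain networking or DNS problem, a minimal connectivity check against the operator's gRPC port can be run from inside the cluster. The service name, namespace, and port below are taken from the log above; adjust them if your installation differs:

```sh
# Check that the keda-operator service exists and has healthy endpoints
kubectl -n kube-system get svc keda-operator
kubectl -n kube-system get endpoints keda-operator

# From a throwaway pod, test whether the gRPC port 9666 is reachable at all
kubectl -n kube-system run grpc-check --rm -it --restart=Never --image=busybox -- \
  nc -zv -w 5 keda-operator.kube-system.svc.cluster.local 9666
```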
Expected Behavior
The worker pods should scale up to multiple pods as the job requests increase.
Actual Behavior
Only 1 worker pod is scaled up; the worker pods do not scale out as the job requests increase.
Steps to Reproduce the Problem
- Installed KEDA (v2.10.1) in Azure AKS (v1.26.6) using the Helm chart, deployed through Bicep (a minimal install sketch follows this list).
- Set up Airflow in the AKS cluster using its Helm chart.
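For reference, a plain Helm-based install of KEDA into kube-system (the namespace implied by the operator service address in the logs) would look roughly like the sketch below. The actual deployment here was driven by Bicep and, per the later comments, the AKS add-on, so this is only an approximation of the reproduction setup:

```sh
# Minimal sketch of installing KEDA with Helm into kube-system.
# The real installation was done via Bicep / the AKS add-on; adjust names and
# pin a 2.10.x chart version as needed.
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace kube-system
```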
Logs from KEDA operator
2023-10-04T23:30:31Z ERROR cert-rotation Webhook not found. Unable to update certificate. {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io "keda-admission" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
    /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:731
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
    /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:700
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
2023-10-04T23:30:31Z INFO cert-rotation Ensuring CA cert {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
KEDA Version
2.10.1
Kubernetes Version
1.26
Platform
Microsoft Azure
Scaler Details
No response
Anything else?
No response
@tomkerkhove @JorTurFer This is actually regarding a KEDA installation using the AKS add-on.
You can ignore the missing webhook configuration error; that's an error in the add-on Helm chart that we're already planning to fix. But I need a bit of help here in understanding the constant timeouts the metrics server is hitting when trying to communicate with the operator.
Let me know what information would be required to diagnose this further and I can provide it and work alongside you.
@v-shenoy thanks for the clarification.
Is the timeout message appearing constantly, or only during startup? The latter is okay; in that case the metrics server is just waiting until the operator is up. You should see this message in the logs: https://github.com/kedacore/keda/blob/8adb70e97a08a4690613eef4c4f00391e44e1603/pkg/provider/provider.go#L84C38-L84C97
There are two replicas of the metrics server. One of them is able to connect successfully, while the other one is continuously timing out. We have had multiple clusters hit this issue. In some of them, restarting the metrics server deployment was enough, but not in all.
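For anyone hitting the same symptom, the restart mentioned above is just a rollout restart of the metrics server deployment. The deployment name below is assumed from the chart defaults; verify it with `kubectl get deploy` in the KEDA namespace first:

```sh
# Restart the KEDA metrics server pods (deployment name assumed from chart defaults)
kubectl -n kube-system rollout restart deployment keda-operator-metrics-apiserver
kubectl -n kube-system rollout status deployment keda-operator-metrics-apiserver
```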
Do you see errors in the KEDA operator pod? That message is printed by the metrics server because it tries to establish the gRPC connection with the operator to get metrics (since KEDA 2.9, the metrics server is just a proxy for the HPA controller; all the work is done by the operator).
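Since the operator is the component that actually serves the metrics over gRPC, its logs are the next thing worth checking. A minimal sketch, assuming the default deployment name and the kube-system namespace used in this installation:

```sh
# Look for errors in the operator around the time the metrics server times out
kubectl -n kube-system logs deployment/keda-operator --since=1h | grep -iE "error|grpc"

# Confirm the service in front of the operator still targets port 9666
kubectl -n kube-system describe svc keda-operator
```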
Besides the missing webhook configuration, I don't think we were seeing any other errors in the operator pod. Plus, one of the metric servers did connect successfully. Correct me if I am missing something, @sktemkar.
Any update?
I think the AKS system pods were being throttled because the system node pool nodes were too small. @sktemkar increased the node size and added the CriticalAddonsOnly=true:NoSchedule taint to the system node pool so that it only runs critical add-on pods such as KEDA (we have the corresponding toleration enabled in the add-on). It seems to be working for now, but the plan is to monitor for a few more days and see if the issue re-occurs.
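For reference, tainting the AKS system node pool so that only critical add-on workloads (which carry the matching toleration) are scheduled on it can be done roughly as follows. The resource group, cluster, and node pool names are placeholders, and updating taints with `az aks nodepool update` requires a reasonably recent Azure CLI; otherwise the taint has to be set when the node pool is created:

```sh
# Taint the system node pool so only pods tolerating CriticalAddonsOnly run on it
# (resource group, cluster, and node pool names are placeholders)
az aks nodepool update \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name <system-nodepool> \
  --node-taints CriticalAddonsOnly=true:NoSchedule

# Verify the KEDA pods carry the matching toleration (label assumed from chart defaults)
kubectl -n kube-system get pod -l app=keda-operator \
  -o jsonpath='{.items[*].spec.tolerations}'
```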
Any update on this? Can we close the issue?
This issue is fixed after increasing the size of the system node pool, adding the CriticalAddonsOnly taint, and redeploying the KEDA configuration.