Keda operator memory leak when using SQS Scaler

Open sushama-kothawale opened this issue 3 months ago • 26 comments

The keda-operator pod's memory increases indefinitely until it is OOM-killed when using the SQS scaler.

Expected Behavior

The KEDA operator should run stably, without its memory growing unbounded until it is OOM-killed.

Actual Behavior

The keda-operator pods are being OOM-killed, causing repeated restarts of the keda-operator.

k get pods -n keda
NAME                                              READY   STATUS             RESTARTS        AGE
keda-admission-webhooks-75985fff96-5b4bj          1/1     Running            0               23h
keda-operator-55f9c6997d-c5lb7                    0/1     Running            4 (54s ago)     6m24s
keda-operator-55f9c6997d-d7b9r                    0/1     Running            1 (22s ago)     3m39s
keda-operator-55f9c6997d-dqn8l                    0/1     CrashLoopBackOff   5 (78s ago)     11m
keda-operator-55f9c6997d-j8dzg                    1/1     Running            2 (108s ago)    4m28s
keda-operator-55f9c6997d-zrcp5                    1/1     Running            4 (2m18s ago)   11m
keda-operator-metrics-apiserver-86448cf44-5xbhg   1/1     Running            0               23h
keda-operator-metrics-apiserver-86448cf44-hvl69   1/1     Running            0               44m

Steps to Reproduce the Problem

  1. Install the KEDA Helm chart [tested with 2.12.0, 2.12.1, 2.13.0, 2.13.1].
  2. At this point the KEDA operator works properly; monitor it for 10-15 minutes.
  3. Enable a ScaledObject with the SQS scaler config below (a complete ScaledObject sketch follows these reproduction steps):
triggers:
- metadata:
    value: '80'
  metricType: Utilization
  type: memory
- type: aws-sqs-queue
  authenticationRef:
    name: keda-trigger-auth-aws-credentials-irsa
  metadata:
    queueURL: https://sqs.us-east-1.amazonaws.com/<account_id>/prod-us-east-1_mq_nh_immediate
    queueLength: "500"
    awsRegion: "us-east-1"
- type: aws-sqs-queue
  authenticationRef:
    name: keda-trigger-auth-aws-credentials-irsa
  metadata:
    queueURL: https://sqs.us-east-1.amazonaws.com/<account_id>/prod-us-east-1_mq_nh_actnmediate
    queueLength: "500"
    awsRegion: "us-east-1"
  4. Once we deploy the above ScaledObject, the keda-operator goes OOM more frequently. We tried allocating more resources (from 4 GB to 8 GB), but the operator still goes OOM indefinitely.
  5. Below are the k8s pod events:

Containers:
  keda-operator:
    Container ID:  containerd://87b2b423d007d1fb948ff21b3ef0836bf073ad6c09a18159a8c3f89023f29e57
    Image:         ghcr.io/kedacore/keda:2.13.0
    Image ID:      ghcr.io/kedacore/keda@sha256:4b1d7c1a0f58175e212939aa440e6c6b4a1815d471390fc1f27bf9b6e9598c71
    Port:          9666/TCP
    Host Port:     0/TCP
    Command:
      /keda
    Args:
      --leader-elect
      --disable-compression=true
      --zap-log-level=info
      --zap-encoder=console
      --zap-time-encoding=rfc3339
      --cert-dir=/certs
      --enable-cert-rotation=true
      --cert-secret-name=kedaorg-certs
      --operator-service-name=keda-operator
      --metrics-server-service-name=keda-operator-metrics-apiserver
      --webhooks-service-name=keda-admission-webhooks
      --k8s-cluster-name=kubernetes-default
      --k8s-cluster-domain=cluster.local
      --enable-prometheus-metrics=false
    State:          Running
      Started:      Tue, 09 Apr 2024 11:40:18 +0530
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
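For context, a complete ScaledObject wrapping the triggers from step 3 would look roughly like this sketch (the target Deployment, namespace, and replica counts are illustrative, taken from the ScaledObject listing later in the thread; the second SQS queue trigger is omitted for brevity):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ntf-svc-nh-c          # illustrative
  namespace: staging
spec:
  scaleTargetRef:
    name: ntf-svc-nh-c        # apps/v1 Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 2
  triggers:
  - type: memory
    metricType: Utilization
    metadata:
      value: "80"
  - type: aws-sqs-queue
    authenticationRef:
      name: keda-trigger-auth-aws-credentials-irsa
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/<account_id>/prod-us-east-1_mq_nh_immediate
      queueLength: "500"
      awsRegion: "us-east-1"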

Specifications

  • KEDA Version: 2.13.1
  • Platform & Version: AWS EKS
  • Kubernetes Version: 1.28
  • Scaler(s): SQS

Attaching the memory usage snapshot from Grafana; memory spikes as soon as the SQS scaler is added. Screenshot 2024-04-09 at 11 59 38 AM

@JorTurFer Can you please check this? This issue is blocking the KEDA integration of a few apps in our prod environment.

sushama-kothawale avatar Apr 09 '24 06:04 sushama-kothawale

Hello! Do you see any errors in the KEDA operator logs?

JorTurFer avatar Apr 10 '24 21:04 JorTurFer

@JorTurFer As the keda-operator goes into CrashLoopBackOff (i.e. restarting due to OOM), I do not see any error logs there, but in the keda-metrics-apiserver I see the logs below:

g/server/filters.(*timeoutHandler).ServeHTTP in goroutine 407499
	/workspace/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:101 +0x1b2
W0408 05:23:19.046791       1 logging.go:59] [core] [Channel #1 SubChannel #85] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.125.254:9666: connect: connection refused"
I0408 05:24:04.961431       1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server"
I0408 05:24:25.534490       1 client.go:88] keda_metrics_adapter/provider

sushama-kothawale avatar Apr 11 '24 05:04 sushama-kothawale

The problem is the KEDA operator pod in this case, so we need those logs. What about previous container logs?

JorTurFer avatar Apr 11 '24 06:04 JorTurFer

How many ScaledObjects do you have?

JorTurFer avatar Apr 11 '24 06:04 JorTurFer

Currently we have only 2 ScaledObjects: one for CPU- and memory-based scaling, and another with the SQS scaler added.

> k get scaledobject -A
> NAMESPACE   NAME              SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS   AUTHENTICATION                           READY     ACTIVE    FALLBACK   PAUSED    AGE
> staging     dtl-dlm-pythonb   apps/v1.Deployment   dtl-dlm-pythonb   1     5     cpu                                                 True      False     False      Unknown   2d21h
> staging     ntf-svc-nh-c      apps/v1.Deployment   ntf-svc-nh-c      1     2     memory     keda-trigger-auth-aws-credentials-irsa   Unknown   Unknown   Unknown    Unknown   27s

Here is the KEDA pod status:

k get pods -n keda                                             
NAME                                               READY   STATUS             RESTARTS        AGE
keda-admission-webhooks-595bf74cdf-swjdr           1/1     Running            0               2d21h
keda-operator-865595df44-r4p8h                     0/1     CrashLoopBackOff   6 (119s ago)    94m
keda-operator-865595df44-xjk47                     0/1     CrashLoopBackOff   6 (2m39s ago)   3h39m
keda-operator-metrics-apiserver-84f9498d4d-6h26r   1/1     Running            0               176m
keda-operator-metrics-apiserver-84f9498d4d-k5xqs   1/1     Running            0               27h

Here are the previous container logs of the keda-operator:

> k logs keda-operator-865595df44-r4p8h  -n keda -p keda-operator
> 2024/04/11 07:23:08 maxprocs: Updating GOMAXPROCS=2: determined from CPU quota
> 2024-04-11T07:23:08Z	INFO	setup	Starting manager
> 2024-04-11T07:23:08Z	INFO	setup	KEDA Version: 2.13.0
> 2024-04-11T07:23:08Z	INFO	setup	Git Commit: 367fcd32f1e72be504ada53165d07c33d75fc0f7
> 2024-04-11T07:23:08Z	INFO	setup	Go Version: go1.21.6
> 2024-04-11T07:23:08Z	INFO	setup	Go OS/Arch: linux/amd64
> 2024-04-11T07:23:08Z	INFO	setup	Running on Kubernetes 1.28+	{"version": "v1.28.7-eks-b9c9ed7"}
> 2024-04-11T07:23:08Z	INFO	starting server	{"kind": "health probe", "addr": "[::]:8081"}
> I0411 07:23:08.977079       1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
> I0411 07:23:25.828622       1 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
> 2024-04-11T07:23:25Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
> 2024-04-11T07:23:25Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
> 2024-04-11T07:23:25Z	INFO	Starting Controller	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
> 2024-04-11T07:23:25Z	INFO	Starting EventSource	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
> 2024-04-11T07:23:25Z	INFO	Starting Controller	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
> 2024-04-11T07:23:25Z	INFO	Starting EventSource	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
> 2024-04-11T07:23:25Z	INFO	Starting Controller	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
> 2024-04-11T07:23:25Z	INFO	Starting EventSource	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "source": "kind source: *v1alpha1.CloudEventSource"}
> 2024-04-11T07:23:25Z	INFO	Starting Controller	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
> 2024-04-11T07:23:25Z	INFO	Starting EventSource	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
> 2024-04-11T07:23:25Z	INFO	cert-rotation	starting cert rotator controller
> 2024-04-11T07:23:25Z	INFO	Starting Controller	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
> 2024-04-11T07:23:25Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
> 2024-04-11T07:23:25Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
> 2024-04-11T07:23:25Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
> 2024-04-11T07:23:25Z	INFO	Starting Controller	{"controller": "cert-rotator"}
> 2024-04-11T07:23:25Z	INFO	cert-rotation	no cert refresh needed
> 2024-04-11T07:23:25Z	INFO	Starting workers	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
> 2024-04-11T07:23:25Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "07b413e1-63a5-44aa-bbff-c4117d7b8be8"}
> 2024-04-11T07:23:25Z	INFO	Starting workers	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
> 2024-04-11T07:23:25Z	INFO	Starting workers	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
> 2024-04-11T07:23:25Z	INFO	cert-rotation	certs are ready in /certs
> 2024-04-11T07:23:25Z	INFO	Starting workers	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
> 2024-04-11T07:23:25Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2e5589f4-e024-4cdb-b398-b14fa69cde73"}
> 2024-04-11T07:23:25Z	INFO	Starting workers	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "worker count": 1}
> 2024-04-11T07:23:25Z	INFO	Detected resource targeted for scaling	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "07b413e1-63a5-44aa-bbff-c4117d7b8be8", "resource": "apps/v1.Deployment", "name": "dtl-dlm-pythonb"}
> 2024-04-11T07:23:25Z	INFO	Starting workers	{"controller": "cert-rotator", "worker count": 1}
> 2024-04-11T07:23:25Z	INFO	cert-rotation	no cert refresh needed
> 2024-04-11T07:23:25Z	INFO	cert-rotation	Ensuring CA cert	{"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
> 2024-04-11T07:23:25Z	INFO	Detected resource targeted for scaling	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2e5589f4-e024-4cdb-b398-b14fa69cde73", "resource": "apps/v1.Deployment", "name": "ntf-svc-nh-c"}
> 2024-04-11T07:23:25Z	INFO	Creating a new HPA	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2e5589f4-e024-4cdb-b398-b14fa69cde73", "HPA.Namespace": "staging", "HPA.Name": "keda-hpa-ntf-svc-nh-c"}
> 2024-04-11T07:23:25Z	INFO	cert-rotation	Ensuring CA cert	{"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
> 2024-04-11T07:23:25Z	INFO	cert-rotation	no cert refresh needed
> 2024-04-11T07:23:25Z	INFO	cert-rotation	Ensuring CA cert	{"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
> 2024-04-11T07:23:26Z	INFO	cert-rotation	Ensuring CA cert	{"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
> 2024-04-11T07:23:27Z	INFO	cert-rotation	CA certs are injected to webhooks
> 2024-04-11T07:23:27Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
> 2024-04-11T07:23:27Z	INFO	Initializing Scaling logic according to ScaledObject Specification	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "07b413e1-63a5-44aa-bbff-c4117d7b8be8"}
> 2024-04-11T07:23:27Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "943c5385-e3b4-4740-9d95-95ba12c15c8a"}
> 2024-04-11T07:23:27Z	INFO	Detected resource targeted for scaling	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "943c5385-e3b4-4740-9d95-95ba12c15c8a", "resource": "apps/v1.Deployment", "name": "dtl-dlm-pythonb"}

keda-operator-metrics-apiserver logs:

> k logs keda-operator-metrics-apiserver-84f9498d4d-6h26r -n keda
> 2024/04/11 04:35:21 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
> I0411 04:35:22.152840       1 welcome.go:34] keda_metrics_adapter "msg"="Starting metrics server" 
> I0411 04:35:22.152863       1 welcome.go:35] keda_metrics_adapter "msg"="KEDA Version: 2.13.0" 
> I0411 04:35:22.152875       1 welcome.go:36] keda_metrics_adapter "msg"="Git Commit: 367fcd32f1e72be504ada53165d07c33d75fc0f7" 
> I0411 04:35:22.152888       1 welcome.go:37] keda_metrics_adapter "msg"="Go Version: go1.21.6" 
> I0411 04:35:22.152900       1 welcome.go:38] keda_metrics_adapter "msg"="Go OS/Arch: linux/amd64" 
> I0411 04:35:22.152962       1 welcome.go:39] keda_metrics_adapter "msg"="Running on Kubernetes 1.28+" "version"={"major":"1","minor":"28+","gitVersion":"v1.28.7-eks-b9c9ed7","gitCommit":"217b17b5aed5ddc370672ff6c12d78037cbb8e7a","gitTreeState":"clean","buildDate":"2024-03-02T03:47:14Z","goVersion":"go1.21.7","compiler":"gc","platform":"linux/amd64"}
> I0411 04:35:22.153683       1 main.go:125] keda_metrics_adapter "msg"="Connecting Metrics Service gRPC client to the server" "address"="keda-operator.keda.svc.cluster.local:9666"
> I0411 04:35:22.177466       1 provider.go:58] keda_metrics_adapter/provider "msg"="starting" 
> I0411 04:35:22.177522       1 main.go:259] keda_metrics_adapter "msg"="starting adapter..." 
> I0411 04:35:22.179088       1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server" 
> I0411 04:35:22.179128       1 main.go:176] keda_metrics_adapter "msg"="starting /metrics server endpoint" 
> W0411 04:35:22.436600       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> I0411 04:35:22.933054       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/certs/tls.crt::/certs/tls.key"
> I0411 04:35:22.933336       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
> I0411 04:35:22.933358       1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
> I0411 04:35:22.934083       1 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/certs/ca.crt"
> I0411 04:35:22.934213       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
> I0411 04:35:22.934227       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
> I0411 04:35:22.935197       1 secure_serving.go:213] Serving securely on [::]:6443
> I0411 04:35:22.935257       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
> I0411 04:35:23.033444       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
> I0411 04:35:23.034803       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
> W0411 04:35:23.444580       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 04:35:25.230358       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> I0411 04:35:28.697084       1 provider.go:66] keda_metrics_adapter/provider "msg"="Connection to KEDA Metrics Service gRPC server has been successfully established" "server"="keda-operator.keda.svc.cluster.local:9666"
> I0411 04:45:15.807582       1 trace.go:236] Trace[643673315]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:49d6f012-da42-466f-819c-269e60d8a349,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 04:45:15.199) (total time: 608ms):
> Trace[643673315]: ---"Listing from storage done" 607ms (04:45:15.807)
> Trace[643673315]: [608.170519ms] [608.170519ms] END
> I0411 05:16:56.619705       1 trace.go:236] Trace[644596009]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:0c46820e-d630-421c-b20d-f79061b2e28e,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:16:54.699) (total time: 1919ms):
> Trace[644596009]: ---"Listing from storage done" 1919ms (05:16:56.619)
> Trace[644596009]: [1.919699701s] [1.919699701s] END
> I0411 05:26:32.863772       1 trace.go:236] Trace[1355079054]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:f281c6b5-69ad-4456-a9c2-4fba9f70c0c2,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:26:32.350) (total time: 513ms):
> Trace[1355079054]: ---"Listing from storage done" 513ms (05:26:32.863)
> Trace[1355079054]: [513.637067ms] [513.637067ms] END
> I0411 05:44:01.477798       1 trace.go:236] Trace[922117958]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:87df784c-3234-416c-97bd-17824c9cf0ab,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:44:00.907) (total time: 570ms):
> Trace[922117958]: ---"Listing from storage done" 570ms (05:44:01.477)
> Trace[922117958]: [570.608793ms] [570.608793ms] END
> I0411 05:49:36.374413       1 trace.go:236] Trace[1657932732]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:359dce89-e345-4722-adbc-7d77117fade3,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:49:35.792) (total time: 581ms):
> Trace[1657932732]: ---"Listing from storage done" 581ms (05:49:36.374)
> Trace[1657932732]: [581.756559ms] [581.756559ms] END
> I0411 05:51:07.990366       1 trace.go:236] Trace[239221352]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:4c292ca4-0fba-420f-b448-fa1bde5c05ff,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:51:07.410) (total time: 580ms):
> Trace[239221352]: ---"Listing from storage done" 579ms (05:51:07.990)
> Trace[239221352]: [580.10165ms] [580.10165ms] END
> I0411 05:57:19.533902       1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server" 
> I0411 05:57:20.236923       1 trace.go:236] Trace[1925121293]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:f9575a56-81a5-47de-921e-b400aa30e159,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 05:57:19.533) (total time: 703ms):
> Trace[1925121293]: ---"Listing from storage done" 702ms (05:57:20.236)
> Trace[1925121293]: [703.04749ms] [703.04749ms] END
> I0411 06:11:02.377385       1 trace.go:236] Trace[1044216767]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:779a2f79-2716-44d8-9e5c-a9826c31e329,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:11:01.794) (total time: 582ms):
> Trace[1044216767]: ---"Listing from storage done" 581ms (06:11:02.375)
> Trace[1044216767]: [582.313639ms] [582.313639ms] END
> I0411 06:22:10.802200       1 trace.go:236] Trace[793805765]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:09c2c32c-ea08-44bc-8c2b-236fd83169ac,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:22:10.189) (total time: 612ms):
> Trace[793805765]: ---"Listing from storage done" 612ms (06:22:10.802)
> Trace[793805765]: [612.320822ms] [612.320822ms] END
> I0411 06:30:17.878940       1 trace.go:236] Trace[1231255657]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:fa2585f1-15b9-4b60-a255-36b1ba92c92b,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:30:15.815) (total time: 2063ms):
> Trace[1231255657]: ---"Listing from storage done" 2063ms (06:30:17.878)
> Trace[1231255657]: [2.06330352s] [2.06330352s] END
> I0411 06:37:25.510212       1 trace.go:236] Trace[805102546]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:ee38a2f0-35dc-4e14-8d5b-1f33ea94de6e,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:37:24.949) (total time: 560ms):
> Trace[805102546]: ---"Listing from storage done" 560ms (06:37:25.509)
> Trace[805102546]: [560.520059ms] [560.520059ms] END
> I0411 06:52:36.208355       1 trace.go:236] Trace[1665426851]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:e4487d9f-7d98-4f19-9559-3c1d783c66fe,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 06:52:35.640) (total time: 567ms):
> Trace[1665426851]: ---"Listing from storage done" 567ms (06:52:36.208)
> Trace[1665426851]: [567.6487ms] [567.6487ms] END
> I0411 07:19:56.653991       1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server" 
> I0411 07:19:57.344252       1 trace.go:236] Trace[1664014102]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:26bfc4dd-bd9f-4b5e-94d4-23ff2cdc04a0,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 07:19:56.653) (total time: 690ms):
> Trace[1664014102]: ---"Listing from storage done" 690ms (07:19:57.344)
> Trace[1664014102]: [690.301606ms] [690.301606ms] END
> I0411 07:20:56.166708       1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server" 
> I0411 07:20:56.838948       1 trace.go:236] Trace[271217008]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:f78e8767-fa07-4dbd-a504-344560dadfc1,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 07:20:56.166) (total time: 672ms):
> Trace[271217008]: ---"Listing from storage done" 672ms (07:20:56.838)
> Trace[271217008]: [672.262607ms] [672.262607ms] END
> I0411 07:21:36.861992       1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server" 
> W0411 07:21:36.868593       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:21:37.871959       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:21:39.519084       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:21:41.619342       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:21:46.149051       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> I0411 07:21:52.561806       1 trace.go:236] Trace[346857927]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:9fd2f6c1-cff8-41c9-872b-4b5657db0a4e,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 07:21:36.861) (total time: 15699ms):
> Trace[346857927]: ---"Listing from storage done" 15699ms (07:21:52.561)
> Trace[346857927]: [15.699852644s] [15.699852644s] END
> I0411 07:22:10.441735       1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server" 
> W0411 07:22:10.453372       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:11.458402       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:12.849287       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:14.909921       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:19.246492       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:27.023611       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:36.183544       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:22:56.137941       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> E0411 07:23:10.506869       1 provider.go:91] keda_metrics_adapter/provider "msg"="timeout" "error"="timeout while waiting to establish gRPC connection to KEDA Metrics Service server" "server"="keda-operator.keda.svc.cluster.local:9666"
> I0411 07:23:10.506985       1 trace.go:236] Trace[1755005721]: "List" accept:application/vnd.kubernetes.protobuf, */*,audit-id:04f1837a-d0db-4408-86ae-97ae6306d089,client:172.16.108.42,protocol:HTTP/2.0,resource:s2-prometheus,scope:namespace,url:/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus,user-agent:kube-controller-manager/v1.28.7 (linux/amd64) kubernetes/217b17b/system:serviceaccount:kube-system:horizontal-pod-autoscaler,verb:LIST (11-Apr-2024 07:22:10.441) (total time: 60065ms):
> Trace[1755005721]: [1m0.065285782s] [1m0.065285782s] END
> E0411 07:23:10.507391       1 timeout.go:142] post-timeout activity - time-elapsed: 71.754312ms, GET "/apis/external.metrics.k8s.io/v1beta1/namespaces/staging/s2-prometheus" result: runtime error: invalid memory address or nil pointer dereference
> goroutine 61560 [running]:
> k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1.1()
> 	/workspace/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:110 +0x9c
> panic({0x26453c0?, 0x476d830?})
> 	/usr/local/go/src/runtime/panic.go:914 +0x21f
> sigs.k8s.io/custom-metrics-apiserver/pkg/registry/external_metrics.(*REST).List(0xc000860e10, {0x2fa2630, 0xc0014bec90}, 0x0?)
> 	/workspace/vendor/sigs.k8s.io/custom-metrics-apiserver/pkg/registry/external_metrics/reststorage.go:92 +0x112
> k8s.io/apiserver/pkg/endpoints/handlers.ListResource.func1({0x2f93740?, 0xc001223fc0}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/handlers/get.go:278 +0xe9c
> sigs.k8s.io/custom-metrics-apiserver/pkg/apiserver/installer.(*EMHandlers).registerResourceHandlers.restfulListResource.func1(0xc001223fa0, 0xc00061f570)
> 	/workspace/vendor/sigs.k8s.io/custom-metrics-apiserver/pkg/apiserver/installer/installer.go:274 +0x6a
> sigs.k8s.io/custom-metrics-apiserver/pkg/apiserver/installer.(*EMHandlers).registerResourceHandlers.InstrumentRouteFunc.func2(0xc001223fa0, 0xc00061f570)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/metrics/metrics.go:571 +0x1ce
> github.com/emicklei/go-restful/v3.(*Container).dispatch(0xc000b6ec60, {0x2f93740, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/github.com/emicklei/go-restful/v3/container.go:299 +0x5f0
> github.com/emicklei/go-restful/v3.(*Container).Dispatch(...)
> 	/workspace/vendor/github.com/emicklei/go-restful/v3/container.go:204
> k8s.io/apiserver/pkg/server.director.ServeHTTP({{0x2ae7a39?, 0x2976b00?}, 0xc000b6ec60?, 0xc0004f0d20?}, {0x2f93740, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/server/handler.go:146 +0x56c
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func21({0x2f93740?, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
> net/http.HandlerFunc.ServeHTTP(0x2fa2630?, {0x2f93740?, 0xc001223d60?}, 0x4?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filters.withAuthorization.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filters/authorization.go:78 +0x639
> net/http.HandlerFunc.ServeHTTP(0xc17e01c89a01ca85?, {0x2f93740?, 0xc001223d60?}, 0xc0006c58b8?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192
> net/http.HandlerFunc.ServeHTTP(0x47875e0?, {0x2f93740?, 0xc001223d60?}, 0x4?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/server/filters/maxinflight.go:196 +0x262
> net/http.HandlerFunc.ServeHTTP(0x192bf2f?, {0x2f93740?, 0xc001223d60?}, 0x2f6b130?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func23({0x2f93740?, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
> net/http.HandlerFunc.ServeHTTP(0x1?, {0x2f93740?, 0xc001223d60?}, 0x1?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.WithImpersonation.func4({0x2f93740, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filters/impersonation.go:50 +0x1c3
> net/http.HandlerFunc.ServeHTTP(0x812080?, {0x2f93740?, 0xc001223d60?}, 0x0?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192
> net/http.HandlerFunc.ServeHTTP(0x192bf2f?, {0x2f93740?, 0xc001223d60?}, 0x2f6b130?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func24({0x2f93740?, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
> net/http.HandlerFunc.ServeHTTP(0x812080?, {0x2f93740?, 0xc001223d60?}, 0x2?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:84 +0x192
> net/http.HandlerFunc.ServeHTTP(0x192bf2f?, {0x2f93740?, 0xc001223d60?}, 0x2f6b130?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server.DefaultBuildHandlerChain.TrackCompleted.trackCompleted.func26({0x2f93740?, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:110 +0x177
> net/http.HandlerFunc.ServeHTTP(0x2fa2630?, {0x2f93740?, 0xc001223d60?}, 0x2f651f8?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filters.withAuthentication.func1({0x2f93740, 0xc001223d60}, 0xc000d85400)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filters/authentication.go:120 +0x7e5
> net/http.HandlerFunc.ServeHTTP(0x2fa2630?, {0x2f93740?, 0xc001223d60?}, 0x2f6b130?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/endpoints/filterlatency.trackStarted.func1({0x2f93740, 0xc001223d60}, 0xc000d85200)
> 	/workspace/vendor/k8s.io/apiserver/pkg/endpoints/filterlatency/filterlatency.go:94 +0x37a
> net/http.HandlerFunc.ServeHTTP(0x0?, {0x2f93740?, 0xc001223d60?}, 0x476d6b0?)
> 	/usr/local/go/src/net/http/server.go:2136 +0x29
> k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP.func1()
> 	/workspace/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:115 +0x62
> created by k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP in goroutine 61559
> 	/workspace/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:101 +0x1b2
> W0411 07:23:22.835977       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:24:02.456999       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"
> W0411 07:25:21.600341       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.keda.svc.cluster.local:9666", ServerName: "keda-operator.keda.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.20.75.224:9666: connect: connection refused"

Pod events:

>  State:          Waiting
>       Reason:       CrashLoopBackOff
>     Last State:     Terminated
>       Reason:       OOMKilled
>       Exit Code:    137
>       Started:      Thu, 11 Apr 2024 12:55:10 +0530
>       Finished:     Thu, 11 Apr 2024 12:55:36 +0530
>     Ready:          False
>     Restart Count:  5
>     Limits:
>       cpu:     2
>       memory:  4Gi
>     Requests:
>       cpu:      500m
>       memory:   1Gi
>     Liveness:   http-get http://:8081/healthz delay=25s timeout=1s period=10s #success=1 #failure=3
>     Readiness:  http-get http://:8081/readyz delay=20s timeout=1s period=3s #success=1 #failure=3
>     Environment:
>       WATCH_NAMESPACE:              
>       POD_NAME:                     keda-operator-865595df44-r4p8h (v1:metadata.name)
>       POD_NAMESPACE:                keda (v1:metadata.namespace)
>       OPERATOR_NAME:                keda-operator
>       KEDA_HTTP_DEFAULT_TIMEOUT:    3000
>       KEDA_HTTP_MIN_TLS_VERSION:    TLS12
>       AWS_STS_REGIONAL_ENDPOINTS:   regional
>       AWS_DEFAULT_REGION:           ap-south-1
>       AWS_REGION:                   ap-south-1
>       AWS_ROLE_ARN:                 arn:aws:iam::845925808232:role/mtDevopsKedaReadRole
>       AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
>     Mounts:
>       /certs from certificates (ro)
>       /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
>       /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-m2s7l (ro)
> Conditions:
>   Type              Status
>   Initialized       True 
>   Ready             False 
>   ContainersReady   False 
>   PodScheduled      True 
> Volumes:
>   aws-iam-token:
>     Type:                    Projected (a volume that contains injected data from multiple sources)
>     TokenExpirationSeconds:  86400
>   certificates:
>     Type:        Secret (a volume populated by a Secret)
>     SecretName:  kedaorg-certs
>     Optional:    true
>   kube-api-access-m2s7l:
>     Type:                    Projected (a volume that contains injected data from multiple sources)
>     TokenExpirationSeconds:  3607
>     ConfigMapName:           kube-root-ca.crt
>     ConfigMapOptional:       <nil>
>     DownwardAPI:             true
> QoS Class:                   Burstable
> Node-Selectors:              kubernetes.io/os=linux
> Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
>                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
> Events:
>   Type     Reason     Age                   From     Message
>   ----     ------     ----                  ----     -------
>   Warning  Unhealthy  6m40s                 kubelet  Readiness probe failed: Get "http://10.83.52.218:8081/readyz": dial tcp 10.83.52.218:8081: connect: connection refused
>   Normal   Pulled     6m39s                 kubelet  Successfully pulled image "ghcr.io/kedacore/keda:2.13.0" in 745ms (745ms including waiting)
>   Warning  Unhealthy  5m40s                 kubelet  Readiness probe failed: Get "http://10.83.52.218:8081/readyz": read tcp 10.83.63.130:59102->10.83.52.218:8081: read: connection reset by peer
>   Normal   Pulled     5m22s                 kubelet  Successfully pulled image "ghcr.io/kedacore/keda:2.13.0" in 752ms (752ms including waiting)
>   Warning  Unhealthy  4m51s                 kubelet  Readiness probe failed: Get "http://10.83.52.218:8081/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
>   Normal   Pulling    4m18s (x4 over 89m)   kubelet  Pulling image "ghcr.io/kedacore/keda:2.13.0"
>   Normal   Created    4m17s (x4 over 88m)   kubelet  Created container keda-operator
>   Normal   Started    4m17s (x4 over 88m)   kubelet  Started container keda-operator
>   Normal   Pulled     4m17s                 kubelet  Successfully pulled image "ghcr.io/kedacore/keda:2.13.0" in 748ms (748ms including waiting)
>   Warning  BackOff    93s (x19 over 5m39s)  kubelet  Back-off restarting failed container keda-operator in pod keda-operator-865595df44-r4p8h_keda(4d548af6-80fe-43c5-806f-b8f80eed33f7)

Let me know if you need any more details. Thanks @JorTurFer for looking into it.

sushama-kothawale avatar Apr 11 '24 07:04 sushama-kothawale

You are using the latest version of KEDA, right? Could you extract a memory profile and share it, please? https://keda.sh/docs/2.13/troubleshooting/#troubleshoot-keda-errors-using-profiling

JorTurFer avatar Apr 11 '24 07:04 JorTurFer

Yes, it's on the latest version, i.e. 2.13.1

> helm history keda -n keda
> REVISION	UPDATED                 	STATUS    	CHART      	APP VERSION	DESCRIPTION     
> 1       	Mon Apr  8 15:04:10 2024	superseded	keda-2.13.1	2.13.0     	Install complete
> 2       	Thu Apr 11 13:10:31 2024	deployed  	keda-2.13.1	2.13.0     	Upgrade complete

sushama-kothawale avatar Apr 11 '24 07:04 sushama-kothawale

Could you extract a memory profile and share it please? https://keda.sh/docs/2.13/troubleshooting/#troubleshoot-keda-errors-using-profiling

JorTurFer avatar Apr 11 '24 07:04 JorTurFer

I have an ARM machine:

uname -m
arm64

@JorTurFer It looks like the pprof tool (https://go.dev/blog/pprof) runs only on amd64 machines. Is there any other way/tool for profiling on ARM?

sushama-kothawale avatar Apr 11 '24 07:04 sushama-kothawale

I think it'll work with arm64 too; that post is quite old. Can you give it a try? pprof is the standard in Go.

JorTurFer avatar Apr 11 '24 08:04 JorTurFer

As the keda-operator pods are restarting very frequently, I am not able to run profiling against them. I get the error below:

go tool pprof http://localhost:8082/debug/pprof/profile

Fetching profile over HTTP from http://localhost:8082/debug/pprof/profile
http://localhost:8082/debug/pprof/profile: Get "http://localhost:8082/debug/pprof/profile": EOF
failed to fetch any source profiles

And the port-forward also exits after the above command:

kubectl port-forward keda-operator-8888c8544-qk8wn   -n keda 8082:8082
Forwarding from 127.0.0.1:8082 -> 8082
Forwarding from [::1]:8082 -> 8082
Handling connection for 8082
E0411 14:02:25.170512   38035 portforward.go:406] an error occurred forwarding 8082 -> 8082: error forwarding port 8082 to pod 339e41bc9574f9fa109a350c1efe3f3bdbdbc416338031e55f399b3a999cf672, uid : failed to execute portforward in network namespace "/var/run/netns/cni-a9b64bd0-a4db-c87e-373e-ba510cba82c3": failed to connect to localhost:8082 inside namespace "339e41bc9574f9fa109a350c1efe3f3bdbdbc416338031e55f399b3a999cf672", IPv4: dial tcp4 127.0.0.1:8082: connect: connection refused IPv6 dial tcp6: address localhost: no suitable address found
E0411 14:02:25.171839   38035 portforward.go:234] lost connection to pod
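One thing worth noting (a sketch, not something from the KEDA docs): the default /debug/pprof/profile capture runs for 30 seconds, which is longer than the crashing pod survives, whereas a heap snapshot returns immediately, so it may be possible to grab one between restarts while the port-forward is briefly alive:

# assumes the port-forward to 8082 from above is (briefly) active
curl -s -o heap.pprof http://localhost:8082/debug/pprof/heap
go tool pprof -top heap.pprof   # inspect the top in-use allocations offline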

sushama-kothawale avatar Apr 11 '24 08:04 sushama-kothawale

Could you share your TriggerAuthentication too? Are you using aws-eks or aws?

JorTurFer avatar Apr 11 '24 10:04 JorTurFer

We are using aws-eks.

k get scaledobject -A 
NAMESPACE   NAME              SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS   AUTHENTICATION                           READY     ACTIVE    FALLBACK   PAUSED    AGE
staging     dtl-dlm-pythonb   apps/v1.Deployment   dtl-dlm-pythonb   1     5     cpu                                                 True      False     False      Unknown   3d
staging     ntf-svc-nh-c      apps/v1.Deployment   ntf-svc-nh-c      1     2     memory     keda-trigger-auth-aws-credentials-irsa   Unknown   Unknown   Unknown    Unknown   177m
k get triggerauth -A  
NAMESPACE   NAME                                     PODIDENTITY   SECRET            ENV   VAULTADDRESS
staging     keda-trigger-auth-aws-credentials-irsa   aws-eks        

sushama-kothawale avatar Apr 11 '24 10:04 sushama-kothawale

Okay, if you remove the SQS ScaledObject temporarily, does KEDA become stable? I mean, are we sure that it's the SQS scaler, right?

We detected some problems with the aws-eks authentication and developed a new one, aws. From the AWS point of view it's the same, but we have implemented caching and other improvements.

If you are using the scaled workload's role, you have to configure the TriggerAuthentication like this:

spec:
  podIdentity:
    provider: aws
    identityOwner: workload
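For completeness, a full TriggerAuthentication using this provider could look like the following sketch (the metadata name and namespace are illustrative):

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-trigger-auth-aws-credentials-irsa
  namespace: staging
spec:
  podIdentity:
    provider: aws
    identityOwner: workload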

JorTurFer avatar Apr 11 '24 10:04 JorTurFer

Yes, when we remove the SQS ScaledObject, KEDA becomes stable: not a single restart on the keda-operator pods.

@JorTurFer I just tried with aws and identityOwner: workload, and I'm still getting OOM issues on the keda-operator:

k get scaledobject -A 
NAMESPACE   NAME              SCALETARGETKIND      SCALETARGETNAME   MIN   MAX   TRIGGERS   AUTHENTICATION                           READY     ACTIVE    FALLBACK   PAUSED    AGE
staging     dtl-dlm-pythonb   apps/v1.Deployment   dtl-dlm-pythonb   1     5     cpu                                                 True      False     False      Unknown   3d2h
staging     ntf-svc-nh-c      apps/v1.Deployment   ntf-svc-nh-c      1     2     memory     keda-trigger-auth-aws-credentials-irsa   Unknown   Unknown   Unknown    Unknown   5h21m
k get triggerauth -A  
NAMESPACE   NAME                                     PODIDENTITY   SECRET            ENV   VAULTADDRESS
staging     keda-trigger-auth-aws-credentials-irsa   aws          
 k get pods -n keda
NAME                                               READY   STATUS             RESTARTS      AGE
keda-admission-webhooks-595bf74cdf-swjdr           1/1     Running            0             3d3h
keda-operator-768cf8cd46-pcjs9                     0/1     CrashLoopBackOff   4 (13s ago)   7m48s
keda-operator-768cf8cd46-z6j9b                     0/1     CrashLoopBackOff   4 (45s ago)   7m48s
keda-operator-metrics-apiserver-84f9498d4d-k5xqs   1/1     Running            0             32h
keda-operator-metrics-apiserver-84f9498d4d-nlx6g   1/1     Running            0             4h59m

Adding triggerauth definition here:

k get triggerauth keda-trigger-auth-aws-credentials-irsa -n staging -o yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  annotations:
    meta.helm.sh/release-name: devops--configs-staging
    meta.helm.sh/release-namespace: staging
  creationTimestamp: "2024-04-08T09:35:55Z"
  finalizers:
  - finalizer.keda.sh
  generation: 4
  labels:
    app.kubernetes.io/managed-by: Helm
  name: keda-trigger-auth-aws-credentials-irsa
  namespace: staging
  resourceVersion: "3074490872"
  uid: da7e8702-058e-400a-99c4-ae47c7adda75
spec:
  podIdentity:
    identityOwner: workload
    provider: aws
    roleArn: ""

sushama-kothawale avatar Apr 11 '24 12:04 sushama-kothawale

@zroubalik , any idea?

JorTurFer avatar Apr 11 '24 20:04 JorTurFer

@JorTurFer @zroubalik Any update on this? We are blocked on this for onboarding some apps with the KEDA SQS scaler in prod.

sushama-kothawale avatar Apr 15 '24 04:04 sushama-kothawale

This is strange, and hard to say without details. Could you please try a different AWS scaler with the same style of auth? And maybe start profiling on a running KEDA and then add the SQS ScaledObject?

zroubalik avatar Apr 15 '24 11:04 zroubalik

We observed that KEDA is stable after a fresh install. Once we add any ScaledObject, even a CPU/memory one, it starts going OOM, and it continues in an OOM loop.

Is there any way to write profiling data outside the pod, so we can visualize it even after a pod restart?
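For example, something along these lines from a workstation would keep heap snapshots on disk across restarts (just a sketch, assuming the pprof endpoint is port-forwarded to 8082 as in the earlier attempt):

# poll the pprof heap endpoint and keep timestamped snapshots locally,
# so the data survives operator pod restarts
while true; do
  curl -s -o "heap-$(date +%s).pprof" http://localhost:8082/debug/pprof/heap || true
  sleep 10
done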

prayag-patel-mt avatar Apr 19 '24 14:04 prayag-patel-mt

> Is there any way to write profiling data outside the pod, so we can visualize it even after a pod restart?

It's not possible to profile KEDA if it dies automatically right after starting. Do you have a huge amount of resources in your cluster or something like that? I mean, are you facing this on a huge cluster? Maybe there is an issue with the operator caches trying to pull a lot of resources?

JorTurFer avatar Apr 29 '24 22:04 JorTurFer

@JorTurFer Yes, our cluster is big, but we are facing this issue only after upgrading KEDA from 2.11.* to 2.12 and then 2.13.

We even tried to uninstall KEDA and re-install it again. But as soon as we add an SQS or Prometheus scaler, the keda-operator pods go OOM.

We have been blocked on this for a long time, and it is impacting the scalability of some of the apps in the prod environment.

Since we can't do profiling due to the pod crashing, are there any other ways or flags that we can test out?

sushama-kothawale avatar Apr 30 '24 10:04 sushama-kothawale

@JorTurFer I did some more analysis:

  1. Uninstalled KEDA (Helm chart + all KEDA CRDs + all ScaledObject and TriggerAuthentication resources).
  2. Installed the latest KEDA version, i.e. keda-2.14.0.
  3. Monitored the KEDA operator for 30 minutes; CPU and memory usage of the 2 keda-operator pods was as below:
cpu - 0.0007, 0.0112
mem - 16MB, 38MB
  4. Deployed one app which has KEDA enabled with only CPU and memory scalers.
  5. Monitored the KEDA operator for the next 30 minutes; CPU and memory usage of the 2 keda-operator pods was as below:
cpu - 0.0007, 0.0184
mem - 16MB, 256MB
  6. Deployed one app which has KEDA enabled with the SQS scaler.
  7. As soon as this scaler was added, the KEDA operator pods started crashing... pod status:
terminated
Reason: OOMKilled - exit code: 137
k get pods -n keda
NAME                                               READY   STATUS             RESTARTS        AGE
keda-admission-webhooks-579c8f6d95-tcbl7           1/1     Running            0               119m
keda-operator-75f5dd7c96-n7tnd                     0/1     CrashLoopBackOff   7 (2m15s ago)   119m
keda-operator-75f5dd7c96-nsw52                     0/1     CrashLoopBackOff   6 (4m11s ago)   15m
keda-operator-75f5dd7c96-q8snr                     0/1     CrashLoopBackOff   7 (2m46s ago)   119m
keda-operator-metrics-apiserver-55b44f9b88-972cs   1/1     Running            0               119m
keda-operator-metrics-apiserver-55b44f9b88-t2jds   1/1     Running            0               119m

  8. When we check usage in Grafana, we see it is within the specified limits, since we have allocated the below resources to the keda-operator:
      resources:
        limits:
          cpu: '1'
          memory: 4Gi
        requests:
          cpu: 100m
          memory: 2Gi

Attaching the Grafana screenshot here. Screenshot 2024-04-30 at 5 46 16 PM

I can see a lot of network I/O on the keda-operator as soon as the SQS scaler is added (please refer to the screenshot above). Could this be the issue?

I'd appreciate it if you could look into this again.

sushama-kothawale avatar Apr 30 '24 12:04 sushama-kothawale

Is the network activity between the operator and the internet, or between the KEDA components? The problem isn't v2.13 but v2.12, as far as I understand, right?

JorTurFer avatar May 02 '24 05:05 JorTurFer

I'm checking, and there is one important change between 2.11 and 2.12 related to the SDK, https://github.com/kedacore/keda/commit/791c895bf53984220bef9981e161cf00d5fcc9ee, where we moved from the AWS SDK v1 to the SDK v2. But it's the only AWS-related change that I can see, and it doesn't explain the issue with Prometheus.

When you get the pod logs, you are also getting the previous container's logs, right? (the -p flag)

JorTurFer avatar May 02 '24 05:05 JorTurFer
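
To illustrate what that migration means for the SQS path, the v2 SDK reads the queue length roughly like the sketch below. This is an illustrative example only, not the actual scaler code, and the region and queue URL are placeholders.

```go
// Sketch of an aws-sdk-go-v2 call to read an SQS queue's approximate length.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/sqs"
	"github.com/aws/aws-sdk-go-v2/service/sqs/types"
)

func main() {
	// Placeholder region; credentials come from the default chain (env, IRSA, etc.).
	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("us-east-1"))
	if err != nil {
		log.Fatal(err)
	}
	// The client (and its HTTP transport) is built once and reused here.
	client := sqs.NewFromConfig(cfg)

	out, err := client.GetQueueAttributes(context.TODO(), &sqs.GetQueueAttributesInput{
		QueueUrl: aws.String("https://sqs.us-east-1.amazonaws.com/<account_id>/my-queue"),
		AttributeNames: []types.QueueAttributeName{
			types.QueueAttributeNameApproximateNumberOfMessages,
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("ApproximateNumberOfMessages:", out.Attributes["ApproximateNumberOfMessages"])
}
```

In the sketch the client is created once and reused; whether the scaler does the same across poll intervals (rather than rebuilding clients and transports on every poll) would be one of the interesting things to confirm with a heap profile.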

@JorTurFer The network activity is between the operator and the internet. Until KEDA v2.11 everything was working fine; after upgrading to 2.12, 2.13 and 2.14 we are seeing OOM issues. Yes, I can see the previous OOM-killed container's logs too.

k get pods -n keda                                             
NAME                                               READY   STATUS             RESTARTS      AGE
keda-admission-webhooks-579c8f6d95-tcbl7           1/1     Running            0             44h
keda-operator-75f5dd7c96-krghr                     0/1     CrashLoopBackOff   5 (80s ago)   18m
keda-operator-75f5dd7c96-tgzr4                     0/1     CrashLoopBackOff   5 (24s ago)   7m56s
keda-operator-metrics-apiserver-55b44f9b88-252qk   1/1     Running            0             19h
keda-operator-metrics-apiserver-55b44f9b88-972cs   1/1     Running            0             44h
sushamakothawale@Sushama-Kothawale DOPS-6666 % k logs keda-operator-75f5dd7c96-krghr  -n keda -p 
2024/05/02 06:24:51 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
2024-05-02T06:24:51Z	INFO	setup	Starting manager
2024-05-02T06:24:51Z	INFO	setup	KEDA Version: 2.14.0
2024-05-02T06:24:51Z	INFO	setup	Git Commit: 6681d5ecbc82293626cab693625c9cc9fb453381
2024-05-02T06:24:51Z	INFO	setup	Go Version: go1.21.9
2024-05-02T06:24:51Z	INFO	setup	Go OS/Arch: linux/amd64
2024-05-02T06:24:51Z	INFO	setup	Running on Kubernetes 1.28+	{"version": "v1.28.7-eks-b9c9ed7"}
2024-05-02T06:24:51Z	INFO	starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0502 06:24:51.761839       1 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
I0502 06:25:08.793430       1 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
2024-05-02T06:25:08Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2024-05-02T06:25:08Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
2024-05-02T06:25:08Z	INFO	Starting Controller	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-05-02T06:25:08Z	INFO	Starting EventSource	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2024-05-02T06:25:08Z	INFO	Starting Controller	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-05-02T06:25:08Z	INFO	Starting EventSource	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2024-05-02T06:25:08Z	INFO	Starting Controller	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-05-02T06:25:08Z	INFO	Starting EventSource	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "source": "kind source: *v1alpha1.CloudEventSource"}
2024-05-02T06:25:08Z	INFO	Starting Controller	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-05-02T06:25:08Z	INFO	Starting EventSource	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2024-05-02T06:25:08Z	INFO	Starting Controller	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-05-02T06:25:08Z	INFO	cert-rotation	starting cert rotator controller
2024-05-02T06:25:08Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-05-02T06:25:08Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-05-02T06:25:08Z	INFO	Starting EventSource	{"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-05-02T06:25:08Z	INFO	Starting Controller	{"controller": "cert-rotator"}
2024-05-02T06:25:08Z	INFO	cert-rotation	no cert refresh needed
2024-05-02T06:25:08Z	INFO	cert-rotation	certs are ready in /certs
2024-05-02T06:25:08Z	INFO	Starting workers	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
2024-05-02T06:25:08Z	INFO	Starting workers	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
2024-05-02T06:25:08Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2d7f0ebe-8be2-482f-a9a1-57fc7988071b"}
2024-05-02T06:25:08Z	INFO	Starting workers	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
2024-05-02T06:25:08Z	INFO	Reconciling ScaledObject	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "fa96a24f-b9b3-43a7-ab06-120fe844d5a3"}
2024-05-02T06:25:08Z	INFO	Detected resource targeted for scaling	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "fa96a24f-b9b3-43a7-ab06-120fe844d5a3", "resource": "apps/v1.Deployment", "name": "dtl-dlm-pythonb"}
2024-05-02T06:25:08Z	INFO	Detected resource targeted for scaling	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"ntf-svc-nh-c","namespace":"staging"}, "namespace": "staging", "name": "ntf-svc-nh-c", "reconcileID": "2d7f0ebe-8be2-482f-a9a1-57fc7988071b", "resource": "apps/v1.Deployment", "name": "ntf-svc-nh-c"}
2024-05-02T06:25:08Z	INFO	Starting workers	{"controller": "cert-rotator", "worker count": 1}
2024-05-02T06:25:08Z	INFO	cert-rotation	no cert refresh needed
2024-05-02T06:25:08Z	INFO	cert-rotation	Ensuring CA cert	{"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-02T06:25:09Z	INFO	cert-rotation	Ensuring CA cert	{"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-05-02T06:25:09Z	INFO	cert-rotation	no cert refresh needed
2024-05-02T06:25:09Z	INFO	cert-rotation	Ensuring CA cert	{"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration"}
2024-05-02T06:25:09Z	INFO	cert-rotation	Ensuring CA cert	{"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-05-02T06:25:09Z	INFO	Starting workers	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "worker count": 1}
2024-05-02T06:25:09Z	INFO	Starting workers	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
2024-05-02T06:25:10Z	INFO	cert-rotation	CA certs are injected to webhooks
2024-05-02T06:25:10Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
2024-05-02T06:25:10Z	INFO	Initializing Scaling logic according to ScaledObject Specification	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "ScaledObject": {"name":"dtl-dlm-pythonb","namespace":"staging"}, "namespace": "staging", "name": "dtl-dlm-pythonb", "reconcileID": "fa96a24f-b9b3-43a7-ab06-120fe844d5a3"}

We are using only 2 external scalers as of now, i.e. SQS and Prometheus, and we see OOM issues with both of them.

sushama-kothawale avatar May 02 '24 06:05 sushama-kothawale

At this point, without any way to extract the pprof info, the only option that I can offer is debugging it locally. You can use your current cluster configuration, just by having the cluster selected in your kubeconfig, and check from your local machine whether you hit the same issue. To debug the code locally, just follow this step-by-step guide and you will be able to run the KEDA operator on your local machine: https://github.com/kedacore/keda/blob/main/BUILD.md#debugging-with-vs-code (just ignore the make deploy part because you already have KEDA installed on your cluster). There you can see whether the memory grows, and if it does, you can use pprof on your local process.

JorTurFer avatar May 04 '24 11:05 JorTurFer
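
When running the operator locally like this, one possible way to get pprof data is controller-runtime's built-in pprof server. The following is a minimal sketch, assuming a controller-runtime version (v0.15+) that exposes PprofBindAddress; it is not KEDA's actual main().

```go
// Sketch: start a controller-runtime manager with pprof enabled on localhost:6060,
// so heap profiles can be taken while reproducing the leak locally.
package main

import (
	"log"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Serves /debug/pprof/* on this address for the lifetime of the manager.
		PprofBindAddress: "127.0.0.1:6060",
	})
	if err != nil {
		log.Fatal(err)
	}
	// Controllers would be registered with mgr here in a real setup.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		log.Fatal(err)
	}
}
```

With that running, `go tool pprof -http=:8081 http://127.0.0.1:6060/debug/pprof/heap` opens an interactive view of what is holding memory while the SQS ScaledObject is being added.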