
KEDA operator pod crashes once daily with an error code 2

Open mustaFAB53 opened this issue 1 year ago • 7 comments

The KEDA operator pod crashes once daily with an error code 2, even when kept idle (whether or not autoscaling is triggered). Previous logs showed the following errors:

  • panic: reflect: slice index out of range
  • panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19c9182]

Expected Behavior

keda-operator should not crash.

Actual Behavior

The KEDA operator pod crashes once daily with an error code 2.

Steps to Reproduce the Problem

  1. Install the KEDA Helm chart version 2.13.0 on GKE 1.27
  2. Wait for a day, with or without any load / autoscaling
  3. The KEDA operator pod will show restart(s)

Specifications

  • KEDA Version: 2.13.0
  • Kubernetes Version: 1.27
  • Scaler(s): Prometheus

KEDA operator pod status: [screenshot: Screenshot from 2024-02-28 16-13-17]

Attaching the complete KEDA operator stack trace from the previous container run.

PS: Autoscaling is not significantly affected (even though we get Prometheus query timeouts at random intervals, the metric is fetched on retries), but we would like to find the root cause of the KEDA pod crashing.

mustaFAB53 avatar Feb 28 '24 10:02 mustaFAB53

> PS: Autoscaling is not significantly affected (even though we get Prometheus query timeouts at random intervals, the metric is fetched on retries), but we would like to find the root cause of the KEDA pod crashing.

I haven't checked it yet, but it looks like an issue with the internal cache. WDYT @zroubalik?

JorTurFer avatar Feb 29 '24 17:02 JorTurFer

@mustaFAB53 thanks for reporting. Could you please also share the ScaledObject that causes this?

zroubalik avatar Mar 03 '24 18:03 zroubalik

Hi @zroubalik,

Attaching the ScaledObject Kubernetes manifest being applied: scaledobject.zip

mustaFAB53 avatar Mar 07 '24 10:03 mustaFAB53

Hi, the polling interval set to 1s is too aggressive. Your Prometheus server instance is not able to respond in time. I would definitely recommend extending the polling interval to at least 30s and then trying to find a lower value that is reasonable for you and doesn't produce the following errors in the output:

	{"type": "ScaledObject", "namespace": "app1", "name": "myapp", "error": "Get \"http://prometheus_frontend:9090/api/v1/query?query=truncated_query&time=2024-02-28T09:59:41Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/prometheus_scaler.go:391
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:130
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:743
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
	/workspace/pkg/scaling/scale_handler.go:628
2024-02-28T10:00:48Z	ERROR	prometheus_scaler	error executing prometheus query	{"type": "ScaledObject", "namespace": "app1", "name": "myapp", "error": "Get \"http://prometheus_frontend:9090/api/v1/query?query=truncated_query&time=2024-02-28T10:00:45Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/prometheus_scaler.go:391
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:130
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:743
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
	/workspace/pkg/scaling/scale_handler.go:628
2024-02-28T10:02:53Z	ERROR	prometheus_scaler	error executing prometheus query

You can also try to tweak the HTTP-related settings: https://keda.sh/docs/2.13/operate/cluster/#http-timeouts
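
For illustration, here is a minimal ScaledObject sketch along the lines of the recommendation above, with the polling interval relaxed to 30s. The namespace, object name, and server address are taken from the error logs; the target, query, threshold, and replica bounds are placeholders, not values from the attached manifest:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp                # name/namespace as they appear in the logs above
  namespace: app1
spec:
  scaleTargetRef:
    name: myapp              # placeholder target Deployment
  pollingInterval: 30        # seconds; start at 30 and lower it only if Prometheus keeps up
  minReplicaCount: 1         # placeholder replica bounds
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus_frontend:9090           # from the logs above
        query: sum(rate(http_requests_total{app="myapp"}[2m]))   # placeholder query
        threshold: "100"                                          # placeholder threshold
```

With a longer polling interval, each query has more headroom before the client-side timeout, which should make the "context deadline exceeded" errors shown above disappear.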

zroubalik avatar Mar 07 '24 10:03 zroubalik

Hi @zroubalik,

We kept the polling interval this aggressive because we wanted scale-up to happen immediately during traffic spikes. I will try increasing it and check whether the KEDA pod stops crashing.

Regarding the timeout settings, I had already tried setting the timeout to 20000 (20s) but could not see any improvement.
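
For reference, the timeout discussed here is the operator-wide HTTP timeout from the docs page linked above, set through the KEDA_HTTP_DEFAULT_TIMEOUT environment variable (in milliseconds) on the keda-operator Deployment. A minimal sketch of the relevant fragment, assuming the default Deployment name and namespace from the Helm chart:

```yaml
# Fragment of the keda-operator Deployment showing where the HTTP timeout is set.
# Deployment name and namespace are the chart defaults (assumption); adjust to your install.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keda-operator
  namespace: keda
spec:
  template:
    spec:
      containers:
        - name: keda-operator
          env:
            - name: KEDA_HTTP_DEFAULT_TIMEOUT
              value: "20000"   # milliseconds; the 20s value mentioned above
```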

mustaFAB53 avatar Mar 08 '24 04:03 mustaFAB53

@zroubalik I am also facing this issue on KEDA version 2.11.0.

Pixis-Akshay-Gopani avatar Apr 02 '24 05:04 Pixis-Akshay-Gopani

@mustaFAB53 I understand, but in this case you should also boost your Prometheus, as it is the origin of the problem - it is not able to respond in time.

zroubalik avatar Apr 10 '24 16:04 zroubalik

+1 for panic: runtime error: invalid memory address or nil pointer dereference. Is anyone working on a fix? Is there something we can do to avoid this?

KEDA 2.11, K8s 1.27

shmuelarditi avatar May 21 '24 11:05 shmuelarditi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 20 '24 18:07 stale[bot]

This issue has been automatically closed due to inactivity.

stale[bot] avatar Jul 30 '24 23:07 stale[bot]