
KEDA operator pod crashes once daily with an error code 2

Open mustaFAB53 opened this issue 1 year ago • 7 comments

The KEDA operator pod crashes once daily with an error code 2, even when kept idle (whether or not autoscaling is triggered). Previous logs showed the following errors:

  • panic: reflect: slice index out of range
  • panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19c9182]

Expected Behavior

keda-operator should not crash.

Actual Behavior

The KEDA operator pod crashes once daily with an error code 2.

Steps to Reproduce the Problem

  1. Install the KEDA Helm chart version 2.13.0 on GKE 1.27
  2. Wait for a day, with or without any load / autoscaling
  3. The KEDA operator pod will show restart(s)

Specifications

  • KEDA Version: 2.13.0
  • Kubernetes Version: 1.27
  • Scaler(s): Prometheus

KEDA operator pod status: [screenshot: Screenshot from 2024-02-28 16-13-17]

Attaching the complete KEDA operator stack trace from the previous container run.

PS: Autoscaling is not significantly affected (even though we get Prometheus query timeouts at random intervals, the metric is fetched on retries), but we would like to find the root cause of the KEDA pod crashing.

mustaFAB53 avatar Feb 28 '24 10:02 mustaFAB53

> PS: Autoscaling is not significantly affected (even though we get Prometheus query timeouts at random intervals, the metric is fetched on retries), but we would like to find the root cause of the KEDA pod crashing.

I haven't checked it yet, but it looks like an issue with the internal cache. WDYT @zroubalik?

JorTurFer avatar Feb 29 '24 17:02 JorTurFer

@mustaFAB53 thanks for reporting. Could you please also share the ScaledObject that causes this?

zroubalik avatar Mar 03 '24 18:03 zroubalik

Hi @zroubalik,

Attaching the ScaledObject Kubernetes manifest being applied: scaledobject.zip

mustaFAB53 avatar Mar 07 '24 10:03 mustaFAB53

Hi, the polling interval set to 1s is too aggressive. Your Prometheus server instance is not able to respond in time. I would definitely recommend extending the polling interval to at least 30s and then trying to find a lower value that is reasonable for you and doesn't produce the following errors in the output:

	{"type": "ScaledObject", "namespace": "app1", "name": "myapp", "error": "Get \"http://prometheus_frontend:9090/api/v1/query?query=truncated_query&time=2024-02-28T09:59:41Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/prometheus_scaler.go:391
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:130
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:743
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
	/workspace/pkg/scaling/scale_handler.go:628
2024-02-28T10:00:48Z	ERROR	prometheus_scaler	error executing prometheus query	{"type": "ScaledObject", "namespace": "app1", "name": "myapp", "error": "Get \"http://prometheus_frontend:9090/api/v1/query?query=truncated_query&time=2024-02-28T10:00:45Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
	/workspace/pkg/scalers/prometheus_scaler.go:391
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
	/workspace/pkg/scaling/cache/scalers_cache.go:130
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
	/workspace/pkg/scaling/scale_handler.go:743
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
	/workspace/pkg/scaling/scale_handler.go:628
2024-02-28T10:02:53Z	ERROR	prometheus_scaler	error executing prometheus query

You can also try to tweak the HTTP-related settings: https://keda.sh/docs/2.13/operate/cluster/#http-timeouts
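
For illustration, here is a minimal ScaledObject sketch along the lines of the recommendation above, with the polling interval relaxed to 30s. The namespace, object name, and server address are taken from the error logs; the target, query, threshold, and replica bounds are placeholders, not values from the attached manifest:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp                # name/namespace as they appear in the logs above
  namespace: app1
spec:
  scaleTargetRef:
    name: myapp              # placeholder target Deployment
  pollingInterval: 30        # seconds; start at 30 and lower it only if Prometheus keeps up
  minReplicaCount: 1         # placeholder replica bounds
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus_frontend:9090           # from the logs above
        query: sum(rate(http_requests_total{app="myapp"}[2m]))   # placeholder query
        threshold: "100"                                          # placeholder threshold
```

With a longer polling interval, each query has more headroom before the client-side timeout, which should make the "context deadline exceeded" errors shown above disappear.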

zroubalik avatar Mar 07 '24 10:03 zroubalik

Hi @zroubalik,

We kept the polling interval this aggressive because we wanted scale-up to happen immediately during traffic spikes. I will try increasing it and check whether the KEDA pod stops crashing.

Regarding the timeout settings, I had already tried setting the timeout to 20000 (20s) but could not see any improvement.
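
For reference, the timeout discussed here is the operator-wide HTTP timeout from the docs page linked above, set through the KEDA_HTTP_DEFAULT_TIMEOUT environment variable (in milliseconds) on the keda-operator Deployment. A minimal sketch of the relevant fragment, assuming the default Deployment name and namespace from the Helm chart:

```yaml
# Fragment of the keda-operator Deployment showing where the HTTP timeout is set.
# Deployment name and namespace are the chart defaults (assumption); adjust to your install.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keda-operator
  namespace: keda
spec:
  template:
    spec:
      containers:
        - name: keda-operator
          env:
            - name: KEDA_HTTP_DEFAULT_TIMEOUT
              value: "20000"   # milliseconds; the 20s value mentioned above
```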

mustaFAB53 avatar Mar 08 '24 04:03 mustaFAB53

@zroubalik I am also facing this issue on KEDA version 2.11.0.

Pixis-Akshay-Gopani avatar Apr 02 '24 05:04 Pixis-Akshay-Gopani

@mustaFAB53 I understand, but in this case you should also boost your Prometheus, as it is the origin of the problem - it is not able to respond in time.

zroubalik avatar Apr 10 '24 16:04 zroubalik

+1 for panic: runtime error: invalid memory address or nil pointer dereference. Is anyone working on a fix? Is there something we can do to avoid this?

KEDA 2.11, K8s 1.27

shmuelarditi avatar May 21 '24 11:05 shmuelarditi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 20 '24 18:07 stale[bot]

This issue has been automatically closed due to inactivity.

stale[bot] avatar Jul 30 '24 23:07 stale[bot]