KEDA operator pod crashes once daily with exit code 2
The KEDA operator pod crashes once daily with exit code 2, even when kept idle (whether or not autoscaling was triggered). Previous logs showed the following errors:
- panic: reflect: slice index out of range
- panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19c9182]
Expected Behavior
The keda-operator pod should not crash
Actual Behavior
The KEDA operator pod crashes once daily with exit code 2
Steps to Reproduce the Problem
- Install the KEDA Helm chart (version 2.13.0) on GKE 1.27
- Wait a day, with or without any load / autoscaling
- The KEDA operator pod will show restart(s)
Specifications
- KEDA Version: 2.13.0
- Kubernetes Version: 1.27
- Scaler(s): Prometheus
KEDA Operator Pod Status:
Attaching the complete KEDA operator stack traces from the previous container runs:
- slice index out of range issue: keda-operator-stacktrace.log
- invalid memory address or nil pointer dereference: keda-stacktrace-SIGSEGV.log
PS: Autoscaling is not significantly affected (even though we get Prometheus query timeouts at random intervals, the metric is fetched on retry), but we would like to find the root cause of the KEDA pod crashes.
I've not checked it yet, but it looks like an issue with the internal cache. WDYT @zroubalik?
@mustaFAB53 thanks for reporting. Could you please also share the ScaledObject that causes this?
Hi, the polling interval set to 1s is too aggressive. Your Prometheus server instance is not able to respond in time. I would definitely recommend extending the polling interval to at least 30s, and then trying to find the lowest value that is reasonable for you and does not produce the following errors in the output:
{"type": "ScaledObject", "namespace": "app1", "name": "myapp", "error": "Get \"http://prometheus_frontend:9090/api/v1/query?query=truncated_query&time=2024-02-28T09:59:41Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
/workspace/pkg/scalers/prometheus_scaler.go:391
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
/workspace/pkg/scaling/cache/scalers_cache.go:130
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
/workspace/pkg/scaling/scale_handler.go:743
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
/workspace/pkg/scaling/scale_handler.go:628
2024-02-28T10:00:48Z ERROR prometheus_scaler error executing prometheus query {"type": "ScaledObject", "namespace": "app1", "name": "myapp", "error": "Get \"http://prometheus_frontend:9090/api/v1/query?query=truncated_query&time=2024-02-28T10:00:45Z\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
github.com/kedacore/keda/v2/pkg/scalers.(*prometheusScaler).GetMetricsAndActivity
/workspace/pkg/scalers/prometheus_scaler.go:391
github.com/kedacore/keda/v2/pkg/scaling/cache.(*ScalersCache).GetMetricsAndActivityForScaler
/workspace/pkg/scaling/cache/scalers_cache.go:130
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
/workspace/pkg/scaling/scale_handler.go:743
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
/workspace/pkg/scaling/scale_handler.go:628
2024-02-28T10:02:53Z ERROR prometheus_scaler error executing prometheus query
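The suggested change can be sketched on the ScaledObject itself. This is only a minimal illustration, not the reporter's actual manifest: the name, namespace, target, query, and threshold below are hypothetical placeholders; only the server address comes from the logs above.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp           # hypothetical name
  namespace: app1       # hypothetical namespace
spec:
  scaleTargetRef:
    name: myapp-deployment        # hypothetical scale target
  pollingInterval: 30             # was 1; 30s gives Prometheus time to respond
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus_frontend:9090
        query: sum(rate(http_requests_total{app="myapp"}[2m]))  # hypothetical query
        threshold: "100"                                        # hypothetical threshold
```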
You can also try to tweak HTTP related settings: https://keda.sh/docs/2.13/operate/cluster/#http-timeouts
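For reference, the cluster-wide HTTP timeout from that page is controlled by the `KEDA_HTTP_DEFAULT_TIMEOUT` environment variable (in milliseconds) on the keda-operator deployment. A sketch of the relevant fragment, assuming a timeout of 10s:

```yaml
# Fragment of the keda-operator Deployment spec (not a complete manifest)
spec:
  template:
    spec:
      containers:
        - name: keda-operator
          env:
            - name: KEDA_HTTP_DEFAULT_TIMEOUT
              value: "10000"   # milliseconds; the default is 3000 (3s)
```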
Hi @zroubalik,
We kept the polling interval this aggressive because we wanted scale-up to happen immediately on traffic spikes. I will try increasing it to check whether the KEDA pod stops crashing.
Regarding the timeout settings, I had already tried setting it to 20000 (20s) but did not see any improvement.
@zroubalik I am also facing this issue on KEDA version 2.11.0.
@mustaFAB53 I understand, but in that case you should also scale up your Prometheus, as it is the origin of the problem: it is not able to respond in time.
+1, panic: runtime error: invalid memory address or nil pointer dereference
Is anyone working on a fix? Is there something we can do to avoid this?
KEDA 2.11, Kubernetes 1.27
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity.