Could not wait for Cache to sync - Keda pod restarts
Report
Hi,
We have a ScaledJob object that runs a long-running job (around 45 min) whenever an AWS SQS message is received. Randomly, in the middle of the job's execution, the Job terminates (and not because it finished running). I tailed the KEDA Operator logs while the job ran and noticed the pod had crashed with the error "failed to wait for scaledjob caches to sync: timed out waiting for cache to be synced". Sometimes there is no error, just "Stopping and waiting for non leader election runnables", which looks like an expected shutdown of the pod. At the exact moment the KEDA pod restarted, the Job terminated, which caused jobs to spawn and terminate in an infinite loop (since the SQS message is still there, but in-flight).
Important to note: the KEDA Operator pods are not killed when no jobs are running, so this is related to the jobs being executed and does not happen on its own.
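For reference, the setup looks roughly like this (an illustrative sketch only; the name, image, queue URL, and region are placeholders, not our exact manifest):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: long-running-consumer        # placeholder name
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: worker
            image: example/worker:latest   # placeholder image, runs ~45 min per message
        restartPolicy: Never
  pollingInterval: 30
  maxReplicaCount: 15                 # we saw 10-15 Jobs in parallel
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/123456789012/example-queue   # placeholder
        queueLength: "1"              # 1 SQS message per job
        awsRegion: eu-west-1          # placeholder region
```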
KEDA version is 2.7.2 (2.7.1 is the latest version listed in the options below).
Thanks a lot in advance.
Expected Behavior
The KEDA Operator pod should continue running, and even if it dies, it should not kill its associated Jobs with it.
Actual Behavior
The KEDA Operator pod restarts in the middle of job execution, and its associated Jobs are terminated with it.
Steps to Reproduce the Problem
- Create a ScaledJob triggered by SQS messages (in our environment it was 10-15 Jobs, with 1 SQS message per job)
- Let the Jobs execute for around 45 min
- Watch the KEDA Operator pod
Logs from KEDA operator
[keda-operator-68cd48977c-nbrc7] 1.655822348640486e+09 INFO Stopping and waiting for non leader election runnables
[keda-operator-68cd48977c-nbrc7] 1.6558223486405544e+09 INFO Stopping and waiting for leader election runnables
[keda-operator-68cd48977c-nbrc7] 1.6558223486407125e+09 INFO controller.scaledobject Starting EventSource {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
[keda-operator-68cd48977c-nbrc7] 1.655822348640763e+09 INFO controller.scaledobject Starting EventSource {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "source": "kind source: *v2beta2.HorizontalPodAutoscaler"}
[keda-operator-68cd48977c-nbrc7] 1.6558223486407743e+09 INFO controller.scaledobject Starting Controller {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject"}
[keda-operator-68cd48977c-nbrc7] 1.655822348640792e+09 ERROR controller.scaledobject Could not wait for Cache to sync {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "error": "failed to wait for scaledobject caches to sync: timed out waiting for cache to be synced"}
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218
[keda-operator-68cd48977c-nbrc7] 1.6558223486408758e+09 ERROR error received after stop sequence was engaged {"error": "failed to wait for scaledobject caches to sync: timed out waiting for cache to be synced"}
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541
[keda-operator-68cd48977c-nbrc7] 1.6558223486410434e+09 INFO controller.clustertriggerauthentication Starting EventSource {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
[keda-operator-68cd48977c-nbrc7] 1.6558223486410754e+09 INFO controller.clustertriggerauthentication Starting Controller {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication"}
[keda-operator-68cd48977c-nbrc7] 1.6558223486411026e+09 ERROR controller.clustertriggerauthentication Could not wait for Cache to sync {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication", "error": "failed to wait for clustertriggerauthentication caches to sync: timed out waiting for cache to be synced"}
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218
[keda-operator-68cd48977c-nbrc7] 1.655822348641151e+09 ERROR error received after stop sequence was engaged {"error": "failed to wait for clustertriggerauthentication caches to sync: timed out waiting for cache to be synced"}
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541
[keda-operator-68cd48977c-nbrc7] 1.655822348641224e+09 INFO controller.triggerauthentication Starting EventSource {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
[keda-operator-68cd48977c-nbrc7] 1.6558223486412575e+09 INFO controller.triggerauthentication Starting Controller {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication"}
[keda-operator-68cd48977c-nbrc7] 1.6558223486412725e+09 ERROR controller.triggerauthentication Could not wait for Cache to sync {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication", "error": "failed to wait for triggerauthentication caches to sync: timed out waiting for cache to be synced"}
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218
[keda-operator-68cd48977c-nbrc7] 1.6558223486413043e+09 INFO controller.scaledjob Starting EventSource {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
[keda-operator-68cd48977c-nbrc7] 1.6558223486413198e+09 ERROR error received after stop sequence was engaged {"error": "failed to wait for triggerauthentication caches to sync: timed out waiting for cache to be synced"}
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541
[keda-operator-68cd48977c-nbrc7] 1.655822348641335e+09 INFO controller.scaledjob Starting Controller {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob"}
[keda-operator-68cd48977c-nbrc7] 1.655822348641384e+09 ERROR controller.scaledjob Could not wait for Cache to sync {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob", "error": "failed to wait for scaledjob caches to sync: timed out waiting for cache to be synced"}
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218
[keda-operator-68cd48977c-nbrc7] 1.655822348641413e+09 INFO Stopping and waiting for caches
[keda-operator-68cd48977c-nbrc7] 1.6558223486415174e+09 ERROR error received after stop sequence was engaged {"error": "failed to wait for scaledjob caches to sync: timed out waiting for cache to be synced"}
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1
[keda-operator-68cd48977c-nbrc7] /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541
[keda-operator-68cd48977c-nbrc7] 1.655822348641581e+09 INFO Stopping and waiting for webhooks
[keda-operator-68cd48977c-nbrc7] 1.6558223486416256e+09 INFO Wait completed, proceeding to shutdown the manager
Alternatively, sometimes there is no error and the logs show only a clean shutdown:
[keda-operator-68cd48977c-sgkbq] 1.655881240258443e+09 INFO Stopping and waiting for non leader election runnables
[keda-operator-68cd48977c-sgkbq] 1.655881240258501e+09 INFO Stopping and waiting for leader election runnables
[keda-operator-68cd48977c-sgkbq] 1.6558812402585893e+09 INFO controller.clustertriggerauthentication Shutdown signal received, waiting for all workers to finish {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication"}
[keda-operator-68cd48977c-sgkbq] 1.6558812402586226e+09 INFO controller.scaledjob Shutdown signal received, waiting for all workers to finish {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob"}
[keda-operator-68cd48977c-sgkbq] 1.6558812402586598e+09 INFO controller.scaledobject Shutdown signal received, waiting for all workers to finish {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject"}
[keda-operator-68cd48977c-sgkbq] 1.6558812402586727e+09 INFO controller.triggerauthentication Shutdown signal received, waiting for all workers to finish {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication"}
[keda-operator-68cd48977c-sgkbq] 1.6558812402587812e+09 INFO controller.clustertriggerauthentication All workers finished {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication"}
[keda-operator-68cd48977c-sgkbq] 1.655881240258801e+09 INFO controller.scaledjob All workers finished {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob"}
[keda-operator-68cd48977c-sgkbq] 1.655881240258813e+09 INFO controller.triggerauthentication All workers finished {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication"}
[keda-operator-68cd48977c-sgkbq] 1.655881240258837e+09 INFO controller.scaledobject All workers finished {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject"}
[keda-operator-68cd48977c-sgkbq] 1.6558812402588584e+09 INFO Stopping and waiting for caches
[keda-operator-68cd48977c-sgkbq] 1.6558812402593951e+09 INFO Stopping and waiting for webhooks
[keda-operator-68cd48977c-sgkbq] 1.6558812402594314e+09 INFO Wait completed, proceeding to shutdown the manager
KEDA Version
2.7.1
Kubernetes Version
1.21
Platform
Amazon Web Services
Scaler Details
AWS SQS
Anything else?
No response
Based on the event, it seems that the Operator is not able to talk to the Kubernetes cache. Is your Kubernetes API server all right?
[keda-operator-68cd48977c-nbrc7] 1.655822348640792e+09 ERROR controller.scaledobject Could not wait for Cache to sync {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "error": "failed to wait for scaledobject caches to sync: timed out waiting for cache to be synced"}
We're running on EKS, and this happened on different days, so I don't think it is related to AWS. Actually, the cache error only appeared once; the second message ("Stopping and waiting for non leader election runnables") happens far more frequently.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity.