keda icon indicating copy to clipboard operation
keda copied to clipboard

Could not wait for Cache to sync - Keda pod restarts

Open shalevdad opened this issue 3 years ago • 2 comments

Report

Hi,

We have a ScaledJob object, which runs a long running job (around 45 min) upon a AWS SQS message received. Randomly in the middle of the execution of the job, the Job terminates (and not because the job is finished running). I've tailed Keda Operator logs while the execution rans, and I noticed the pod has crashed with the error - "failed to wait for scaledjob caches to sync: timed out waiting for cache to be synced". sometimes, there is no error and a simple "Stopping and waiting for non leader election runnables" and seems like an expected stopping of the pod. The exact same time the Keda pod restarted, the Job terminated, which caused the job to spawn up and down in an infinite loop (since the SQS message is still there but is in-flight).

Important to note - Keda Operator pods are not killed when jobs are not running, meaning it's related to the jobs being executed and not happens on its own.

Keda version is 2.7.2 (2.7.1 is the latest version listed in the options below)

Thanks a lot in advance.

Expected Behavior

Keda Operator pod to continue running, and even if dies, not to kill it's associated jobs with it.

Actual Behavior

Keda Operator to keep run normally.

Steps to Reproduce the Problem

  1. ScaledJob upon SQS message (In our env it was 10-15 Jobs and 1 SQS message per job)
  2. Let the Jobs execute for around 45 min
  3. watch keda operator pod

Logs from KEDA operator

[keda-operator-68cd48977c-nbrc7] 1.655822348640486e+09  INFO    Stopping and waiting for non leader election runnables 
[keda-operator-68cd48977c-nbrc7] 1.6558223486405544e+09 INFO    Stopping and waiting for leader election runnables 
[keda-operator-68cd48977c-nbrc7] 1.6558223486407125e+09 INFO    controller.scaledobject Starting EventSource     {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"} 
[keda-operator-68cd48977c-nbrc7] 1.655822348640763e+09  INFO    controller.scaledobject Starting EventSource     {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "source": "kind source: *v2beta2.HorizontalPodAutoscaler"} 
[keda-operator-68cd48977c-nbrc7] 1.6558223486407743e+09 INFO    controller.scaledobject Starting Controller      {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject"} 
[keda-operator-68cd48977c-nbrc7] 1.655822348640792e+09  ERROR   controller.scaledobject Could not wait for Cache to sync {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "error": "failed to wait for scaledobject caches to sync: timed out waiting for cache to be synced"} 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218 
[keda-operator-68cd48977c-nbrc7] 1.6558223486408758e+09 ERROR   error received after stop sequence was engaged   {"error": "failed to wait for scaledobject caches to sync: timed out waiting for cache to be synced"} 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541 
[keda-operator-68cd48977c-nbrc7] 1.6558223486410434e+09 INFO    controller.clustertriggerauthentication  Starting EventSource    {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"} 
[keda-operator-68cd48977c-nbrc7] 1.6558223486410754e+09 INFO    controller.clustertriggerauthentication  Starting Controller     {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication"} 
[keda-operator-68cd48977c-nbrc7] 1.6558223486411026e+09 ERROR   controller.clustertriggerauthentication  Could not wait for Cache to sync        {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication", "error": "failed to wait for clustertriggerauthentication caches to sync: timed out waiting for cache to be synced"} 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218 
[keda-operator-68cd48977c-nbrc7] 1.655822348641151e+09  ERROR   error received after stop sequence was engaged   {"error": "failed to wait for clustertriggerauthentication caches to sync: timed out waiting for cache to be synced"} 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541 
[keda-operator-68cd48977c-nbrc7] 1.655822348641224e+09  INFO    controller.triggerauthenticationStarting EventSource     {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"} 
[keda-operator-68cd48977c-nbrc7] 1.6558223486412575e+09 INFO    controller.triggerauthenticationStarting Controller      {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication"} 
[keda-operator-68cd48977c-nbrc7] 1.6558223486412725e+09 ERROR   controller.triggerauthenticationCould not wait for Cache to sync {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication", "error": "failed to wait for triggerauthentication caches to sync: timed out waiting for cache to be synced"} 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218 
[keda-operator-68cd48977c-nbrc7] 1.6558223486413043e+09 INFO    controller.scaledjob    Starting EventSource     {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"} 
[keda-operator-68cd48977c-nbrc7] 1.6558223486413198e+09 ERROR   error received after stop sequence was engaged   {"error": "failed to wait for triggerauthentication caches to sync: timed out waiting for cache to be synced"} 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541 
[keda-operator-68cd48977c-nbrc7] 1.655822348641335e+09  INFO    controller.scaledjob    Starting Controller      {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob"} 
[keda-operator-68cd48977c-nbrc7] 1.655822348641384e+09  ERROR   controller.scaledjob    Could not wait for Cache to sync {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob", "error": "failed to wait for scaledjob caches to sync: timed out waiting for cache to be synced"} 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218 
[keda-operator-68cd48977c-nbrc7] 1.655822348641413e+09  INFO    Stopping and waiting for caches 
[keda-operator-68cd48977c-nbrc7] 1.6558223486415174e+09 ERROR   error received after stop sequence was engaged   {"error": "failed to wait for scaledjob caches to sync: timed out waiting for cache to be synced"} 
[keda-operator-68cd48977c-nbrc7] sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1 
[keda-operator-68cd48977c-nbrc7]        /go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541 
[keda-operator-68cd48977c-nbrc7] 1.655822348641581e+09  INFO    Stopping and waiting for webhooks 
[keda-operator-68cd48977c-nbrc7] 1.6558223486416256e+09 INFO    Wait completed, proceeding to shutdown the manager

Alternatively -

[keda-operator-68cd48977c-sgkbq] 1.655881240258443e+09  INFO    Stopping and waiting for non leader election runnables 
[keda-operator-68cd48977c-sgkbq] 1.655881240258501e+09  INFO    Stopping and waiting for leader election runnables 
[keda-operator-68cd48977c-sgkbq] 1.6558812402585893e+09 INFO    controller.clustertriggerauthentication Shutdown signal received, waiting for all workers to finish     {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication"} 
[keda-operator-68cd48977c-sgkbq] 1.6558812402586226e+09 INFO    controller.scaledjob    Shutdown signal received, waiting for all workers to finish     {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob"} 
[keda-operator-68cd48977c-sgkbq] 1.6558812402586598e+09 INFO    controller.scaledobject Shutdown signal received, waiting for all workers to finish     {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject"} 
[keda-operator-68cd48977c-sgkbq] 1.6558812402586727e+09 INFO    controller.triggerauthentication        Shutdown signal received, waiting for all workers to finish     {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication"} 
[keda-operator-68cd48977c-sgkbq] 1.6558812402587812e+09 INFO    controller.clustertriggerauthentication All workers finished    {"reconciler group": "keda.sh", "reconciler kind": "ClusterTriggerAuthentication"} 
[keda-operator-68cd48977c-sgkbq] 1.655881240258801e+09  INFO    controller.scaledjob    All workers finished    {"reconciler group": "keda.sh", "reconciler kind": "ScaledJob"} 
[keda-operator-68cd48977c-sgkbq] 1.655881240258813e+09  INFO    controller.triggerauthentication        All workers finished    {"reconciler group": "keda.sh", "reconciler kind": "TriggerAuthentication"} 
[keda-operator-68cd48977c-sgkbq] 1.655881240258837e+09  INFO    controller.scaledobject All workers finished    {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject"} 
[keda-operator-68cd48977c-sgkbq] 1.6558812402588584e+09 INFO    Stopping and waiting for caches 
[keda-operator-68cd48977c-sgkbq] 1.6558812402593951e+09 INFO    Stopping and waiting for webhooks 
[keda-operator-68cd48977c-sgkbq] 1.6558812402594314e+09 INFO    Wait completed, proceeding to shutdown the manager 

KEDA Version

2.7.1

Kubernetes Version

1.21

Platform

Amazon Web Services

Scaler Details

AWS SQS

Anything else?

No response

shalevdad avatar Jun 22 '22 07:06 shalevdad

Based on the event, it seems like that the Operator is not able to talk to k8s cache. Is your Kubernetes server all right?

[keda-operator-68cd48977c-nbrc7] 1.655822348640792e+09  ERROR   controller.scaledobject Could not wait for Cache to sync {"reconciler group": "keda.sh", "reconciler kind": "ScaledObject", "error": "failed to wait for scaledobject caches to sync: timed out waiting for cache to be synced"} 

zroubalik avatar Jun 23 '22 14:06 zroubalik

Running on EKS, happened on different days so I don't think it is related to AWS. Actually the cache error just appeared once, the second message happens way more frequently (Stopping and waiting for non leader election runnables)

shalevdad avatar Jun 23 '22 16:06 shalevdad

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 31 '22 00:08 stale[bot]

This issue has been automatically closed due to inactivity.

stale[bot] avatar Sep 07 '22 13:09 stale[bot]