
RabbitMQ Operator pods restart every 3-4 days (leader election lost)

bleleve opened this issue 3 years ago • 2 comments

Describe the bug

After deploying the RabbitMQ operator with the Bitnami Helm charts, the RabbitMQ cluster works perfectly, but the operator pods restart every 3 to 4 days, indicating that they could not renew the leader election lease.

To Reproduce

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.6", GitCommit:"ad3338546da947756e8a88aa6822e9c11e7eac22", GitTreeState:"clean", BuildDate:"2022-04-14T08:43:11Z", GoVersion:"go1.17.9", Compiler:"gc", Platform:"linux/amd64"}
  1. Deploy RabbitMQ Cluster Operator
  2. Deploy a RabbitMQ Cluster

messaging-topology-operator logs :

2022-07-04 11:29:58	{"level":"error","ts":1656926998.6014247,"logger":"messaging-topology-operator","msg":"Failed to update lock: Put \"https://X.X.X.X:443/apis/coordination.k8s.io/v1/namespaces/k8s-system/leases/messaging-topology-operator-leader-election\": context deadline exceeded\n","stacktrace":"k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1.1\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:272\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.poll\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:580\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:545\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntil\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:536\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:271\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).renew\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:268\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:212\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:643"}
2022-07-04 11:29:58	{"level":"info","ts":1656926998.601745,"logger":"messaging-topology-operator","msg":"failed to renew lease k8s-system/messaging-topology-operator-leader-election: timed out waiting for the condition\n"}
2022-07-04 11:29:58	{"level":"error","ts":1656926998.6018116,"logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"main.main\n\t/bitnami/blacksmith-sandox/rmq-messaging-topology-operator-1.7.0/src/github.com/rabbitmq/rmq-messaging-topology-operator/main.go:286\nruntime.main\n\t/usr/local/go-1.17/src/runtime/proc.go:255"}

rabbitmq-cluster-operator logs :

2022-07-04 11:29:59	{"level":"error","ts":1656926999.274908,"logger":"rabbitmq-cluster-operator","msg":"error retrieving resource lock k8s-system/rabbitmq-cluster-operator-leader-election: Get \"https://X.X.X.X:443/apis/coordination.k8s.io/v1/namespaces/k8s-system/leases/rabbitmq-cluster-operator-leader-election\": context deadline exceeded\n","stacktrace":"k8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1.1\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:272\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.poll\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:580\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:545\nk8s.io/apimachinery/pkg/util/wait.PollImmediateUntil\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:536\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).renew.func1\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:271\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).renew\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:268\nk8s.io/client-go/tools/leaderelection.(*LeaderElector).Run\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/k8s.io/[email protected]/tools/leaderelection/leaderelection.go:212\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).startLeaderElection.func3\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:643"}
2022-07-04 11:29:59	{"level":"info","ts":1656926999.2751236,"logger":"rabbitmq-cluster-operator","msg":"failed to renew lease k8s-system/rabbitmq-cluster-operator-leader-election: timed out waiting for the condition\n"}
2022-07-04 11:29:59	{"level":"error","ts":1656926999.2752035,"logger":"setup","msg":"problem running manager","error":"leader election lost","stacktrace":"main.main\n\t/bitnami/blacksmith-sandox/rabbitmq-cluster-operator-1.14.0/src/github.com/rabbitmq/cluster-operator/main.go:151\nruntime.main\n\t/usr/local/go-1.17/src/runtime/proc.go:255"}

After the pods restart, the lease is acquired again without issue:

2022-07-04 11:30:01	{"level":"info","ts":1656927001.1831055,"logger":"rabbitmq-cluster-operator","msg":"attempting to acquire leader lease k8s-system/rabbitmq-cluster-operator-leader-election...\n"}
2022-07-04 11:30:01	{"level":"info","ts":1656927001.512139,"logger":"messaging-topology-operator","msg":"attempting to acquire leader lease k8s-system/messaging-topology-operator-leader-election...\n"}
2022-07-04 11:30:17	{"level":"info","ts":1656927017.9897554,"logger":"rabbitmq-cluster-operator","msg":"successfully acquired lease k8s-system/rabbitmq-cluster-operator-leader-election\n"}
2022-07-04 11:30:20	{"level":"info","ts":1656927020.980656,"logger":"messaging-topology-operator","msg":"successfully acquired lease k8s-system/messaging-topology-operator-leader-election\n"}

Expected behavior

Operator pods should not be restarted.

Version and environment information

  • RabbitMQ: 3.10.0
  • RabbitMQ Cluster Operator: 1.14.0
  • Kubernetes: 1.23.6
  • Cloud provider or hardware configuration: Scaleway, 3 nodes with 32 GB of RAM and 8 cores each.

bleleve avatar Jul 04 '22 11:07 bleleve

I'm not sure there's much we can do on our end, other than exposing the values to tune the leader election timeouts (as we already do).

This line is the root cause of the restart:

Put \"https://X.X.X.X:443/apis/coordination.k8s.io/v1/namespaces/k8s-system/leases/messaging-topology-operator-leader-election\": context deadline exceeded\n"

The Operator makes a PUT request to the Kubernetes API to renew the lease, and the Kubernetes API does not respond within the expected deadline.

You can use the following env variables in the Operator Deployment to tweak the leader election:

  • LEASE_DURATION — a string value in seconds, e.g. "15"
  • RENEW_DEADLINE — a string value in seconds, e.g. "10"
  • RETRY_PERIOD — a string value in seconds, e.g. "2"

To modify the Operator env vars, read this link.
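As a sketch, the relevant part of the Deployment could look like the fragment below. The namespace and Deployment name are taken from this issue's logs, the container name and the specific timing values are illustrative assumptions, and only the env entries matter:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rabbitmq-cluster-operator   # name as seen in the logs above
  namespace: k8s-system
spec:
  template:
    spec:
      containers:
        - name: operator            # container name is an assumption
          env:
            # Example values, more tolerant of API-server latency than
            # the controller-runtime defaults (15s / 10s / 2s).
            - name: LEASE_DURATION
              value: "60"
            - name: RENEW_DEADLINE
              value: "40"
            - name: RETRY_PERIOD
              value: "5"
```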

I will leave a link to the manager package documenting the meaning of each:

https://github.com/kubernetes-sigs/controller-runtime/blob/365ae09c4c6c466edaa91c919f7654944057e0b6/pkg/manager/manager.go#L196-L205
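When tuning these values, note that client-go rejects configurations where the lease duration does not exceed the renew deadline, or where the renew deadline does not exceed the retry period scaled by its internal jitter factor. A minimal sketch for sanity-checking a candidate configuration before applying it (the `JITTER_FACTOR` value mirrors client-go's internal constant; this is not an official tool):

```python
# Sketch: check the ordering constraints that client-go's leader election
# enforces on LeaseDuration / RenewDeadline / RetryPeriod (all in seconds).
JITTER_FACTOR = 1.2  # mirrors client-go's internal JitterFactor

def validate(lease_duration: float, renew_deadline: float, retry_period: float) -> list[str]:
    """Return a list of constraint violations; empty means the values are consistent."""
    errors = []
    if lease_duration <= renew_deadline:
        errors.append("LEASE_DURATION must be greater than RENEW_DEADLINE")
    if renew_deadline <= JITTER_FACTOR * retry_period:
        errors.append("RENEW_DEADLINE must be greater than JITTER_FACTOR * RETRY_PERIOD")
    return errors

# controller-runtime defaults: 15s / 10s / 2s
print(validate(15, 10, 2))   # → [] (valid)
# a more latency-tolerant candidate, e.g. 60s / 40s / 5s
print(validate(60, 40, 5))
```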

These options are not yet exposed in the Messaging Topology Operator. Please let us know if this resolves the problem in the Cluster Operator, and we will expose them in the Topology Operator as well.

Zerpet avatar Jul 05 '22 10:07 Zerpet

Hello and thank you for your answer.

I will check with our hosting provider's technical team whether the problem is network-related. I will update this ticket as soon as a solution is found.

Thanks again.

bleleve avatar Jul 05 '22 10:07 bleleve