
The controller manager restarts frequently

Open sunjq1 opened this issue 1 year ago • 1 comments

I started the controller manager by applying dlrover/go/operator/config/manifests/bases/deployment.yaml, but found that it restarts frequently.

[root@vadmin14 ~]# kubectl -n dlrover get po 
NAME                                          READY   STATUS    RESTARTS         AGE
dlrover-brain-5b866c8c44-n9cjp                1/1     Running   0                27h
dlrover-controller-manager-5884d84c4d-lz8th   2/2     Running   30 (2m50s ago)   27h
dlrover-kube-monitor-67c4ccf78d-lwmfv         1/1     Running   0                27h
mysql-6877845b96-j8sbg                        1/1     Running   0                27h

Viewing the logs:

[root@vadmin14 ~]# kubectl -n dlrover logs dlrover-controller-manager-5884d84c4d-lz8th -f
... ...
E1025 06:38:52.463386       1 leaderelection.go:330] error retrieving resource lock dlrover/9b6611a4.iml.github.io: Get "https://10.66.0.1:443/apis/coordination.k8s.io/v1/namespaces/dlrover/leases/9b6611a4.iml.github.io": context deadline exceeded
I1025 06:38:52.463492       1 leaderelection.go:283] failed to renew lease dlrover/9b6611a4.iml.github.io: timed out waiting for the condition
1.729838332463565e+09   ERROR   setup   problem running manager {"error": "leader election lost"}
main.main
        /workspace/main.go:119
runtime.main
        /usr/local/go/src/runtime/proc.go:250
1.7298383324636865e+09  INFO    Stopping and waiting for non leader election runnables

When I set leader-elect to false, the controller manager stopped restarting.

[root@vadmin14 ~]# kubectl -n dlrover edit deployments.apps dlrover-controller-manager 
... ...
spec:
  replicas: 1
... ...
      - args:
        - --health-probe-bind-address=:8081
        - --metrics-bind-address=127.0.0.1:8080
        - --leader-elect=false
... ...

So why is leader-elect set to true when the controller manager has only 1 replica?

sunjq1 avatar Oct 26 '24 02:10 sunjq1

Maybe we need to increase the LeaseDuration or RenewDeadline, as discussed in this issue: https://github.com/operator-framework/operator-sdk/issues/1813#issuecomment-523713555

workingloong avatar Feb 18 '25 10:02 workingloong