
Replication: After failover, the pod with the name of the previous master does not rejoin the replication as a slave.

Open · wkd-woo opened this issue 1 year ago · 11 comments

Hi there, hope you are doing well :)

I am a platform engineer in Korea, and I am analyzing and testing this Redis operator to apply it to our production environment. I was debating whether to develop an operator myself, but I'm hoping this project will make that easier.

Anyway, I'm running HA and reconciliation tests on the Replication setup before rolling it out to staging, but the behavior differs from what I expected, so I have some questions.

I hope I did something wrong.



What version of redis operator are you using?

helm ls
NAME             	NAMESPACE     	REVISION	UPDATED                             	STATUS  	CHART                    	APP VERSION
redis-cluster    	redis-operator	1       	2024-02-14 17:43:41.633268 +0900 KST	deployed	redis-cluster-0.15.11    	0.15.1
redis-operator   	redis-operator	1       	2024-02-16 18:47:41.198563 +0900 KST	deployed	redis-operator-0.15.9    	0.15.1
redis-replication	redis-operator	2       	2024-02-16 18:00:11.414029 +0900 KST	deployed	redis-replication-0.15.11	0.15.1
redis-sentinel   	redis-operator	1       	2024-02-16 17:32:15.183352 +0900 KST	deployed	redis-sentinel-0.15.12   	0.15.1

redis-operator version: 0.15.1

Does this issue reproduce with the latest release?

Yes, it's the latest version.

What operating system and processor architecture are you using (kubectl version)?

$ k get nodes -l "topology=sentinel" -o wide
NAME                STATUS   ROLES    AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
host-1   Ready    <none>   16d   v1.26.6   host-1   <none>        Ubuntu 20.04.6 LTS   5.4.0-167-generic   containerd://1.7.6
host-2   Ready    <none>   16d   v1.26.6   host-2   <none>        Ubuntu 20.04.6 LTS   5.4.0-167-generic   containerd://1.7.6
host-3   Ready    <none>   16d   v1.26.6   host-3   <none>        Ubuntu 20.04.6 LTS   5.4.0-167-generic   containerd://1.7.6

Ubuntu 20.04.6 LTS and amd64

kubectl version Output
$ kubectl version
Client Version: v1.28.4
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.26.6
WARNING: version difference between client (1.28) and server (1.26) exceeds the supported minor version skew of +/-1

What did you do?

$ k get all -l duty=operator
NAME                                  READY   STATUS    RESTARTS   AGE
pod/redis-operator-7688f78d4b-4fwx6   1/1     Running   0          53m
pod/redis-operator-7688f78d4b-pd2xs   1/1     Running   0          53m
pod/redis-operator-7688f78d4b-x9857   1/1     Running   0          53m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/redis-operator-7688f78d4b   3         3         3       53m

$ k get all -l topology=sentinel
NAME                            READY   STATUS    RESTARTS   AGE
pod/redis-replication-0         2/2     Running   0          44h
pod/redis-replication-1         2/2     Running   0          44h
pod/redis-replication-2         2/2     Running   0          44h
pod/redis-sentinel-sentinel-0   1/1     Running   0          54s
pod/redis-sentinel-sentinel-1   1/1     Running   0          52s
pod/redis-sentinel-sentinel-2   1/1     Running   0          50s

NAME                                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)             AGE
service/redis-replication                    ClusterIP   10.106.145.77   <none>        6379/TCP,9121/TCP   44h
service/redis-replication-additional         ClusterIP   10.110.154.69   <none>        6379/TCP            44h
service/redis-replication-headless           ClusterIP   None            <none>        6379/TCP            44h
service/redis-sentinel-sentinel              ClusterIP   10.96.154.206   <none>        26379/TCP           54s
service/redis-sentinel-sentinel-additional   ClusterIP   10.105.236.46   <none>        26379/TCP           54s
service/redis-sentinel-sentinel-headless     ClusterIP   None            <none>        26379/TCP           54s

NAME                                       READY   AGE
statefulset.apps/redis-replication         3/3     44h
statefulset.apps/redis-sentinel-sentinel   3/3     54s

Of course, they're all in the same namespace.


$ k exec -it pod/redis-sentinel-sentinel-0 -- redis-cli -p 26379 INFO Sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_tilt_since_seconds:-1
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
sentinel_simulate_failure_flags:0
master0:name=myMaster,status=ok,address=192.168.251.15:6379,slaves=2,sentinels=3


$ k exec -it pod/redis-replication-0 -- redis-cli INFO REPLICATION
Defaulted container "redis-replication" out of: redis-replication, redis-exporter
# Replication
role:master
connected_slaves:2
slave0:ip=192.168.52.201,port=6379,state=online,offset=345503,lag=1
slave1:ip=192.168.43.215,port=6379,state=online,offset=345503,lag=1
master_failover_state:no-failover
master_replid:4b6083c41c12eb6ffaf7923c38702303a1d03d33
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:345503
second_repl_offset:-1
repl_backlog_active:1
repl_backlog_size:27262976
repl_backlog_first_byte_offset:1
repl_backlog_histlen:345503

  1. Using the Helm charts, I installed the redis-operator in the k8s cluster, then installed Sentinel and Replication; there were no configuration problems (rough install commands sketched below).
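
For reference, the install itself was done with the published Helm charts, roughly as follows (the repo URL is written from memory of the chart docs, so treat it as an assumption and double-check it):

$ helm repo add ot-helm https://ot-container-kit.github.io/helm-charts/
$ helm install redis-operator ot-helm/redis-operator -n redis-operator --create-namespace
$ helm install redis-replication ot-helm/redis-replication -n redis-operator
$ helm install redis-sentinel ot-helm/redis-sentinel -n redis-operator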


$ k exec -it pod/redis-replication-0 -- redis-cli ROLE
Defaulted container "redis-replication" out of: redis-replication, redis-exporter
1) "master"
2) (integer) 1084021
3) 1) 1) "192.168.52.201"
      2) "6379"
      3) "1084021"
   2) 1) "192.168.43.215"
      2) "6379"
      3) "1083735"

$ k delete pod/redis-replication-0
pod "redis-replication-0" deleted

  2. To conduct the failover test, I deleted pod/redis-replication-0, which held the master role; Sentinel successfully failed over to pod/redis-replication-1.
$ k exec -it pod/redis-replication-0 -- redis-cli ROLE
Defaulted container "redis-replication" out of: redis-replication, redis-exporter
1) "master"
2) (integer) 0
3) (empty array)

$ k exec -it pod/redis-replication-1 -- redis-cli ROLE
Defaulted container "redis-replication" out of: redis-replication, redis-exporter
1) "master"
2) (integer) 1164513
3) 1) 1) "192.168.43.215"
      2) "6379"
      3) "1164084"

$ k exec -it pod/redis-replication-2 -- redis-cli ROLE
Defaulted container "redis-replication" out of: redis-replication, redis-exporter
1) "slave"
2) "192.168.52.201"
3) (integer) 6379
4) "connected"
5) (integer) 1165671

$ k exec -it pod/redis-replication-1 -- redis-cli INFO REPLICATION
Defaulted container "redis-replication" out of: redis-replication, redis-exporter
# Replication
role:master
connected_slaves:1
slave0:ip=192.168.43.215,port=6379,state=online,offset=1600959,lag=1
master_failover_state:no-failover
master_replid:981ab552aabdb90c981b06e9ef4730020e9e9273
master_replid2:4b6083c41c12eb6ffaf7923c38702303a1d03d33
master_repl_offset:1600959
second_repl_offset:1100007
repl_backlog_active:1
repl_backlog_size:27262976
repl_backlog_first_byte_offset:15
repl_backlog_histlen:1600945
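
To double-check the failover from Sentinel's side as well, the monitored master address can be queried directly (master name myMaster taken from the INFO Sentinel output above); just a sanity-check sketch:

$ k exec -it pod/redis-sentinel-sentinel-0 -- redis-cli -p 26379 SENTINEL get-master-addr-by-name myMaster
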
$ k logs -f pod/redis-replication-0
Defaulted container "redis-replication" out of: redis-replication, redis-exporter
Redis is running without password which is not recommended
Setting up redis in standalone mode
Running without TLS mode
Starting redis service in standalone mode.....
8:C 16 Feb 2024 09:38:16.167 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
8:C 16 Feb 2024 09:38:16.167 # Redis version=7.0.5, bits=64, commit=00000000, modified=0, pid=8, just started
8:C 16 Feb 2024 09:38:16.167 # Configuration loaded
8:M 16 Feb 2024 09:38:16.167 * monotonic clock: POSIX clock_gettime
8:M 16 Feb 2024 09:38:16.167 * Running mode=standalone, port=6379.
8:M 16 Feb 2024 09:38:16.167 # Server initialized
8:M 16 Feb 2024 09:38:16.167 * Ready to accept connections

pod/redis-replication-0 is now a standalone master.


  3. Since then, pod/redis-replication-0 has been recreated, but it is not included in the replication and remains a standalone master.

My guess is that the operator is able to follow the replication state but did nothing about it.
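
As a temporary manual workaround (not something the operator did for me), the recreated pod can presumably be reattached to the new master by hand with the standard REPLICAOF command, using the new master's address from the ROLE output above:

$ k exec -it pod/redis-replication-0 -- redis-cli REPLICAOF 192.168.52.201 6379

But of course, I would expect the operator to do this automatically.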

Here are the logs of the operator:

$ k logs -f replicaset.apps/redis-operator-7688f78d4b
Found 3 pods, using pod/redis-operator-7688f78d4b-h4hdr
{"level":"info","ts":1707900179.925099,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1707900179.9253988,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1707900179.9256382,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":1707900179.9256592,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
I0214 08:42:59.925670       1 leaderelection.go:248] attempting to acquire leader lease redis-operator/6cab913b.redis.opstreelabs.in...
{"level":"info","ts":1708076825.037801,"msg":"Stopping and waiting for non leader election runnables"}
{"level":"info","ts":1708076825.0378394,"msg":"Stopping and waiting for leader election runnables"}
{"level":"info","ts":1708076825.03794,"logger":"controller.redis","msg":"Starting EventSource","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"Redis","source":"kind source: *v1beta2.Redis"}
{"level":"info","ts":1708076825.0379727,"logger":"controller.redis","msg":"Starting Controller","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"Redis"}
{"level":"info","ts":1708076825.0379717,"logger":"controller.rediscluster","msg":"Starting EventSource","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"RedisCluster","source":"kind source: *v1beta2.RedisCluster"}
{"level":"info","ts":1708076825.037995,"logger":"controller.rediscluster","msg":"Starting Controller","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"RedisCluster"}
{"level":"info","ts":1708076825.0380416,"logger":"controller.redissentinel","msg":"Starting EventSource","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"RedisSentinel","source":"kind source: *v1beta2.RedisSentinel"}
{"level":"info","ts":1708076825.038054,"logger":"controller.redissentinel","msg":"Starting Controller","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"RedisSentinel"}
{"level":"error","ts":1708076825.0380597,"logger":"controller.redissentinel","msg":"Could not wait for Cache to sync","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"RedisSentinel","error":"failed to wait for redissentinel caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218"}
{"level":"error","ts":1708076825.0380514,"logger":"controller.redis","msg":"Could not wait for Cache to sync","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"Redis","error":"failed to wait for redis caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218"}
{"level":"error","ts":1708076825.0381067,"msg":"error received after stop sequence was engaged","error":"failed to wait for redissentinel caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541"}
{"level":"error","ts":1708076825.0381339,"msg":"error received after stop sequence was engaged","error":"failed to wait for redis caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541"}
{"level":"error","ts":1708076825.038145,"logger":"controller.rediscluster","msg":"Could not wait for Cache to sync","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"RedisCluster","error":"failed to wait for rediscluster caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218"}
{"level":"error","ts":1708076825.0381563,"msg":"error received after stop sequence was engaged","error":"failed to wait for rediscluster caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541"}
{"level":"info","ts":1708076825.0381808,"logger":"controller.redisreplication","msg":"Starting EventSource","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"RedisReplication","source":"kind source: *v1beta2.RedisReplication"}
{"level":"info","ts":1708076825.038193,"logger":"controller.redisreplication","msg":"Starting Controller","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"RedisReplication"}
{"level":"error","ts":1708076825.0381975,"logger":"controller.redisreplication","msg":"Could not wait for Cache to sync","reconciler group":"redis.redis.opstreelabs.in","reconciler kind":"RedisReplication","error":"failed to wait for redisreplication caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:208\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234\nsigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/runnable_group.go:218"}
{"level":"info","ts":1708076825.038211,"msg":"Stopping and waiting for caches"}
{"level":"error","ts":1708076825.0382483,"msg":"error received after stop sequence was engaged","error":"failed to wait for redisreplication caches to sync: timed out waiting for cache to be synced","stacktrace":"sigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:541"}
{"level":"info","ts":1708076825.038272,"msg":"Stopping and waiting for webhooks"}
{"level":"info","ts":1708076825.0382922,"msg":"Wait completed, proceeding to shutdown the manager"}

What did you expect to see? I expected the standalone master (the pod with the name of the previous master) to rejoin the replication.

What did you see instead? It didn't; it still remains a standalone master.

wkd-woo · Feb 16 '24

I believe I had a similar experience, but in cluster mode. I was testing replication in clusters and deleted one of the masters. The remaining nodes properly failed over to a follower; however, once the leader was restored, it was still listed as master,fail by CLUSTER NODES:

01858797a3a6c45049469467b8546fa5e5986099 10.244.1.52:6379@16379 slave b38b750af35568ccf7c58168fd5a24a7dc979037 0 1708217007583 4 connected
b38b750af35568ccf7c58168fd5a24a7dc979037 10.244.8.87:6379@16379 master - 0 1708217007585 4 connected 5461-10922
bab8c97d4348f03083806fec03a7f4f3fc5a8c08 10.244.2.230:6379@16379 slave d3f40d6627ed74d87583700480d6fcc39a214f0e 0 1708217007084 1 connected
e9a0a511775ec502173b933fe9a2b15d165012eb 10.244.10.143:6379@16379 slave 6f3de0912b1539fdb927f7d7631510748e37808c 0 1708217008086 3 connected
0e61980cb13c27f324a911280290659f0ac03a4a 10.244.3.63:6379@16379 slave 6f3de0912b1539fdb927f7d7631510748e37808c 0 1708217008590 3 connected
d3f40d6627ed74d87583700480d6fcc39a214f0e 10.244.8.86:6379@16379 master - 0 1708217007084 1 connected 0-5460
494de123674917d18dbb7fd03214434220c3058d 10.244.3.62:6379@16379 master,fail - 1708166684557 1708166682000 2 connected
6f3de0912b1539fdb927f7d7631510748e37808c 10.244.2.229:6379@16379 master - 0 1708217007583 3 connected 10923-16383
a5cfb5b3e9c7db59793843febd25661bc0745559 10.244.9.221:6379@16379 myself,slave d3f40d6627ed74d87583700480d6fcc39a214f0e 0 1708217006000 1 connected

I connected to the newly created master and ran CLUSTER NODES on it to see what it was seeing, and it sees itself as a standalone master:

127.0.0.1:6379> cluster nodes
c0d02edb263bc4891047fa3237d37129f920d459 :6379@16379 myself,master - 0 0 0 connected

I suspect there is some issue here with the operator being unable to reconcile a node back into the cluster, or to remove an old node from the cluster upon deletion.
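
As a manual stopgap (my own guess, not the operator's recovery path), the stale entry can usually be dropped and the recreated node rejoined by hand with the standard cluster commands; <new-pod-ip> below is a placeholder for the recreated pod's IP:

redis-cli -h 10.244.8.86 -p 6379 cluster forget 494de123674917d18dbb7fd03214434220c3058d   # repeat on every remaining node
redis-cli -h 10.244.8.86 -p 6379 cluster meet <new-pod-ip> 6379
redis-cli -h <new-pod-ip> -p 6379 cluster replicate b38b750af35568ccf7c58168fd5a24a7dc979037   # follow the master that took over slots 5461-10922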

I will note that I am specifically disabling persistence on these nodes, which may be part of the issue. I encountered what I think are separate issues caused by persistence in #773, but they may also be symptoms of the same problem.

deefdragon · Feb 18 '24

@deefdragon I also disabled the persistence option. But my understanding is that the option only configures RDB/AOF persistence. (Of course, I could be wrong.)

What about the volume? I am using local volume storage on our PMs (physical machines).

wkd-woo · Feb 18 '24

I think we found the cause of this problem: the operator keeps trying to elect a leader.

The operator can't move on to the next stage because it cannot choose a leader.

$ k logs -f deployment.apps/redis-operator
Found 3 pods, using pod/redis-operator-7688f78d4b-gshwc
{"level":"info","ts":1708256654.5194058,"logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1708256654.519777,"logger":"setup","msg":"starting manager"}
{"level":"info","ts":1708256654.5200722,"msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"info","ts":1708256654.5201178,"msg":"Starting server","kind":"health probe","addr":"[::]:8081"}
I0218 11:44:14.520161       1 leaderelection.go:248] attempting to acquire leader lease redis-operator/6cab913b.redis.opstreelabs.in...
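
For anyone hitting the same symptom, the leader-election state can be inspected directly; the lease name comes from the log line above, and the RBAC check assumes the chart's default service account name:

$ kubectl get lease 6cab913b.redis.opstreelabs.in -n redis-operator -o yaml
$ kubectl auth can-i create leases.coordination.k8s.io -n redis-operator --as=system:serviceaccount:redis-operator:redis-operator
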

wkd-woo · Feb 18 '24

And here are my values settings:

redisOperator:
  name: redis-operator
  imageName: { $REDIS-OPERATOR-REPOSITORY }
  imageTag: "v0.15.1-amd64"
  imagePullPolicy: IfNotPresent

  podAnnotations: {}
  podLabels:
    duty: operator
    service: jude-service

  extraArgs: []

  watch_namespace: "redis-operator"
  env: []
  webhook: false

resources:
  limits:
    cpu: 500m
    memory: 500Mi
  requests:
    cpu: 500m
    memory: 500Mi

replicas: 3

serviceAccountName: redis-operator

service:
  name: webhook-service
  namespace: redis-operator

certificate:
  name: serving-cert
  secretName: webhook-server-cert

issuer:
  type: selfSigned
  name: redis-operator-issuer
  email: [email protected]
  server: https://acme-v02.api.letsencrypt.org/directory
  privateKeySecretName: letsencrypt-prod
  solver:
    enabled: true
    ingressClass: nginx

certmanager:
  enabled: false

priorityClassName: ""
nodeSelector:
  kubernetes.io/arch: amd64
  kubernetes.io/os: linux
  dbservice: redis
  topology: operator
  duty: operator

tolerateAllTaints: false
tolerations: []
affinity: {}

wkd-woo · Feb 18 '24

Thank you, @wkd-woo, for your feedback. We should support the old Redis master rejoining the cluster after failover.

drivebyer · Mar 07 '24

I am exploring Redis Sentinel for my team's requirements, and I have noticed the same issue. Failover is not working as expected in Sentinel mode.

Here is my environment setup:

  1. Replication - cluster size 3 (10.24.1.10, 10.24.1.11, 10.24.1.12)
  2. Sentinel - cluster size 3 (10.24.2.10, 10.24.2.11, 10.24.2.12)
  3. Exposed the Replication pods with a LoadBalancer service and was able to connect to the cluster using redis-cli -h 10.23.4.88
  4. Everything works as expected:
10.23.4.88:6379> role
1) "master"
2) (integer) 42
3) 1) 1) "10.24.1.11"
      2) "6379"
      3) "42"
   2) 1) "10.24.1.12"
      2) "6379"
      3) "42"

10.23.4.88:6379> info replication
# Replication
role:master
connected_slaves:2
slave0:ip=10.24.1.11,port=6379,state=online,offset=70,lag=1
slave1:ip=10.24.1.12,port=6379,state=online,offset=70,lag=0
....
  5. Killed/deleted the Master pod, expecting one of the slaves to take over the Master role.
  6. But Sentinel waits until the new Master node comes up (instead of failing over to one of the Slave nodes) and assigns the Master IP once the new master is up, e.g.:
redis-cli -h 10.23.11.164 -p 26379 sentinel get-master-addr-by-name myMaster
1) "10.24.4.214" # new master pod IP
2) "6379"
  7. Since the existing slaves still reference the old Pod IP (i.e. 10.24.1.10 in this case), the sync fails as follows:
# Timeout connecting to the MASTER...
* Reconnecting to MASTER 10.24.1.10:6379 after failure
  8. To move on, we need to restart the Slave pods so they pick up the new Master pod IP (a Sentinel-side check for this is sketched below).
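
Just a sketch of that check: Sentinel's own view of the replicas can be queried directly (sentinel address and master name as in the commands above), and SENTINEL RESET can force it to re-discover replicas, though use that with care:

redis-cli -h 10.23.11.164 -p 26379 sentinel replicas myMaster
redis-cli -h 10.23.11.164 -p 26379 sentinel reset myMaster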

Is there any configuration I need to set explicitly to make this failover work? It seems I cannot go ahead with this operator until this issue is fixed.

maheshglm · Mar 26 '24

Hey @maheshglm! The issue stems from the redis sentinel controller's reconciliation mechanism. Redis Sentinel reconciles every 10 seconds. When you kill a pod, the recreated pod becomes the master pod, and the sentinel then monitors that newly created master pod during the restart process.
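
One way to observe this (just a sketch; <sentinel-ip> is a placeholder for any sentinel pod or service IP) is to watch what Sentinel reports as the master while the killed pod restarts:

$ watch -n 2 "redis-cli -h <sentinel-ip> -p 26379 sentinel get-master-addr-by-name myMaster"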

I think this issue has been temporarily fixed with the recent changes made at https://github.com/OT-CONTAINER-KIT/redis-operator/pull/803/files.

drivebyer · Mar 27 '24

@drivebyer Thanks for the quick response. It seems the latest code is not tagged. Let me deploy the operator from the master branch and test the same flow again. Thanks again.

maheshglm · Mar 27 '24

@drivebyer I have built the operator from master and deployed the new image, but I still see the same behaviour when I delete the master pod:

  1. The redis replication controller notices that its desired state is 3 replicas, but there are currently only 2, so it creates a new redis server to bring the replica count back up to 3 -- this is working.

  2. The redis sentinels themselves realize that the redis master has disappeared from the cluster and begin the election procedure for selecting a new master. They perform this election and selection, and choose one of the existing redis server replicas to be the new master. -- This is not working.

maheshglm · Mar 27 '24

Hey @maheshglm! Could you please verify whether the sentinel pod is recreated after you kill the master pod? If the sentinel pod is not recreated, that's the expected behavior.
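
A simple way to check is to compare pod creation times after killing the master, e.g.:

$ kubectl get pods --sort-by=.metadata.creationTimestamp

If the sentinel pods' creation timestamps change after the master pod deletion, they were recreated.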

drivebyer · Mar 27 '24

I deployed a new version of the operator with the fixes in #803, and that fixed the failover, at least for me. Sentinel reacts to the dropped master and elects a new one, and the replacement pod is configured as a replica when it comes back up.

jmtsi · Mar 27 '24