
Problem with automatic master failover in RedisReplication using Redis Sentinel (delete master)

Open abix5 opened this issue 1 year ago • 2 comments

redis-operator version: 0.18.0

Does this issue reproduce with the latest release?

Yes, the issue reproduces on version 0.18.0.

What operating system and processor architecture are you using (kubectl version)?

kubectl version Output
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.27.3

What did you do?

I used the following configuration to deploy RedisReplication and RedisSentinel; after applying it, all pods start successfully:

---
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisReplication
metadata:
  name: redis-replication
  namespace: redis-operator-ot
spec:
  clusterSize: 3
  podSecurityContext:
    runAsUser: 1000
    fsGroup: 1000
  kubernetesConfig:
    image: quay.io/opstree/redis:v7.0.12
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 101m
        memory: 128Mi
      limits:
        cpu: 101m
        memory: 128Mi
---
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisSentinel
metadata:
  name: redis-sentinel
  namespace: redis-operator-ot
spec:
  clusterSize: 3
  podSecurityContext:
    runAsUser: 1000
    fsGroup: 1000
  pdb:
    enabled: false
    minAvailable: 1
  redisSentinelConfig:
    redisReplicationName: redis-replication
  kubernetesConfig:
    image: quay.io/opstree/redis-sentinel:v7.0.12
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 100m
        memory: 128Mi
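
A quick way to confirm the rollout (nothing operator-specific assumed, just the namespace from the manifests above):

# All RedisReplication and RedisSentinel pods should be Running
$ kubectl get pods -n redis-operator-ot -o wide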

What did you expect to see?

I expected that when the Redis master pod is deleted, the cluster would automatically perform a master re-election, and Redis Sentinel would handle this process correctly without any errors.
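
Concretely, after the master pod is deleted I would expect a query like the one below to return the address of a newly promoted replica (the pod name is an assumption based on the default StatefulSet naming; the master name myMaster and port 26379 match the Sentinel logs further down):

# Ask any Sentinel which address it currently considers the master
$ kubectl exec -n redis-operator-ot redis-sentinel-sentinel-0 -- redis-cli -p 26379 sentinel get-master-addr-by-name myMaster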

What did you see instead?

When I deleted the Redis master pod, I saw the following error in the redis-operator logs:

{"level":"info","ts":"2024-09-27T17:12:21Z","logger":"controllers.RedisSentinel","msg":"Reconciling opstree redis controller","Request.Namespace":"redis-operator-ot","Request.Name":"redis-sentinel"} {"level":"error","ts":"2024-09-27T17:12:22Z","logger":"controllers.RedisSentinel","msg":"","Request.Namespace":"redis-operator-ot","Request.Name":"redis-sentinel","error":"no master pods found","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisReplicationMasterIP\n\t/workspace/k8sutils/redis-sentinel.go:331\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.IsRedisReplicationReady\n\t/workspace/k8sutils/redis-replication.go:224\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisSentinelReconciler).Reconcile\n\t/workspace/controllers/redissentinel_controller.go:54\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:316\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}

The cluster does not recover automatically, and Redis Sentinel cannot find a new master. To bring the cluster out of this state, I have to delete all the pods in RedisReplication simultaneously, after which the cluster reinitializes and starts working again.
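
A rough way to see what each pod thinks its role is in that state (pod names assumed from the default StatefulSet naming):

# Print the replication role reported by each RedisReplication pod
$ for i in 0 1 2; do kubectl exec -n redis-operator-ot redis-replication-$i -- redis-cli info replication | grep -E '^(role|master_host)'; done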

Please help me figure out this issue. Maybe I am missing something in the configuration, or there is a bug in the operator when handling master failover.

abix5 · Sep 27 '24 17:09

related to https://github.com/OT-CONTAINER-KIT/redis-operator/issues/802

drivebyer · Oct 08 '24 10:10

The issue seems to be that the Redis master pod IP is passed as an env variable to the Sentinel StatefulSet. When the master pod gets restarted it receives a new IP address, and the one on the Sentinels is never updated.

Or maybe I'm getting it wrong how failover should work 😅

EDIT: this might actually be the fault of my CiliumNetworkPolicy, but on a kind cluster I can quite consistently reproduce a scenario where all 3 Redis replicas are marked as slaves without a master, and I can't recover from this state.
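
If the stale-IP theory is right, it should show up when comparing whatever master address was injected into the Sentinel pods at creation time with the master pod's current IP. Something along these lines (the pod name, env var filter and sentinel.conf path are all guesses):

# What the sentinel container was started with
$ kubectl exec -n redis-operator-ot redis-sentinel-sentinel-0 -- env | grep -iE 'master|ip|port'
$ kubectl exec -n redis-operator-ot redis-sentinel-sentinel-0 -- grep monitor /etc/redis/sentinel.conf

# Current pod IPs for comparison
$ kubectl get pods -n redis-operator-ot -o wide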

michalschott · Oct 10 '24 14:10

After building a new image with the changes from the master branch, I tried deleting the Redis master pod again. The failover issue in Redis Sentinel persists, and the same errors as before appear in the Sentinel logs. The redis-operator pod also shows the same error.

Sentinel is running without password which is not recommended
Running sentinel without TLS mode
ACL_MODE is not true, skipping ACL file modification
Starting  sentinel service .....
9:X 07 Nov 2024 11:12:30.027 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
9:X 07 Nov 2024 11:12:30.027 * Redis version=7.2.1, bits=64, commit=00000000, modified=0, pid=9, just started
9:X 07 Nov 2024 11:12:30.027 * Configuration loaded
9:X 07 Nov 2024 11:12:30.027 * monotonic clock: POSIX clock_gettime
9:X 07 Nov 2024 11:12:30.028 # Failed to write PID file: Permission denied
9:X 07 Nov 2024 11:12:30.028 * Running mode=sentinel, port=26379.
9:X 07 Nov 2024 11:12:30.041 * Sentinel new configuration saved on disk
9:X 07 Nov 2024 11:12:30.041 * Sentinel ID is 6f59ecd4933ff536fc31c6ddbd4285f4f6b5f049
9:X 07 Nov 2024 11:12:30.041 # +monitor master myMaster 10.3.129.99 6379 quorum 2
9:X 07 Nov 2024 11:12:30.126 * +slave slave 10.97.46.21:6379 10.97.46.21 6379 @ myMaster 10.3.129.99 6379
9:X 07 Nov 2024 11:12:30.138 * Sentinel new configuration saved on disk
9:X 07 Nov 2024 11:12:30.138 * +slave slave 10.3.129.101:6379 10.3.129.101 6379 @ myMaster 10.3.129.99 6379
9:X 07 Nov 2024 11:12:30.144 * Sentinel new configuration saved on disk
9:X 07 Nov 2024 11:12:32.040 * +sentinel sentinel 065f70f1d1d7b2896087caaed07a83b2b05d4c4d 10.3.132.142 26379 @ myMaster 10.3.129.99 6379
9:X 07 Nov 2024 11:12:32.049 * Sentinel new configuration saved on disk
9:X 07 Nov 2024 11:12:33.646 * +sentinel sentinel c202e2c56ccc910fb6a0284c76c86deddcf6d273 10.3.129.104 26379 @ myMaster 10.3.129.99 6379
9:X 07 Nov 2024 11:12:33.651 * Sentinel new configuration saved on disk
9:X 07 Nov 2024 11:12:35.158 # +sdown slave 10.97.46.21:6379 10.97.46.21 6379 @ myMaster 10.3.129.99 6379

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
master down
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
9:X 07 Nov 2024 11:16:00.771 # +sdown master myMaster 10.3.129.99 6379
9:X 07 Nov 2024 11:16:00.917 * Sentinel new configuration saved on disk
9:X 07 Nov 2024 11:16:00.917 # +new-epoch 1
9:X 07 Nov 2024 11:16:00.926 * Sentinel new configuration saved on disk
9:X 07 Nov 2024 11:16:00.926 # +vote-for-leader 065f70f1d1d7b2896087caaed07a83b2b05d4c4d 1
9:X 07 Nov 2024 11:16:01.863 # +odown master myMaster 10.3.129.99 6379 #quorum 3/2
9:X 07 Nov 2024 11:16:01.863 * Next failover delay: I will not start a failover before Thu Nov  7 11:16:21 2024
9:X 07 Nov 2024 11:16:02.031 # +config-update-from sentinel 065f70f1d1d7b2896087caaed07a83b2b05d4c4d 10.3.132.142 26379 @ myMaster 10.3.129.99 6379
9:X 07 Nov 2024 11:16:02.032 # +switch-master myMaster 10.3.129.99 6379 10.3.129.101 6379
9:X 07 Nov 2024 11:16:02.032 * +slave slave 10.97.46.21:6379 10.97.46.21 6379 @ myMaster 10.3.129.101 6379
9:X 07 Nov 2024 11:16:02.032 * +slave slave 10.3.129.99:6379 10.3.129.99 6379 @ myMaster 10.3.129.101 6379
9:X 07 Nov 2024 11:16:02.046 * Sentinel new configuration saved on disk
9:X 07 Nov 2024 11:16:07.067 # +sdown slave 10.97.46.21:6379 10.97.46.21 6379 @ myMaster 10.3.129.101 6379
9:X 07 Nov 2024 11:16:07.067 # +sdown slave 10.3.129.99:6379 10.3.129.99 6379 @ myMaster 10.3.129.101 6379
9:X 07 Nov 2024 11:16:37.210 # +sdown master myMaster 10.3.129.101 6379
9:X 07 Nov 2024 11:16:37.269 # +odown master myMaster 10.3.129.101 6379 #quorum 2/2
9:X 07 Nov 2024 11:16:37.269 # +new-epoch 2
9:X 07 Nov 2024 11:16:37.269 # +try-failover master myMaster 10.3.129.101 6379
9:X 07 Nov 2024 11:16:37.277 * Sentinel new configuration saved on disk
9:X 07 Nov 2024 11:16:37.277 # +vote-for-leader 6f59ecd4933ff536fc31c6ddbd4285f4f6b5f049 2
9:X 07 Nov 2024 11:16:37.299 * 065f70f1d1d7b2896087caaed07a83b2b05d4c4d voted for 6f59ecd4933ff536fc31c6ddbd4285f4f6b5f049 2
9:X 07 Nov 2024 11:16:37.302 * c202e2c56ccc910fb6a0284c76c86deddcf6d273 voted for 6f59ecd4933ff536fc31c6ddbd4285f4f6b5f049 2
9:X 07 Nov 2024 11:16:37.349 # +elected-leader master myMaster 10.3.129.101 6379
9:X 07 Nov 2024 11:16:37.349 # +failover-state-select-slave master myMaster 10.3.129.101 6379
9:X 07 Nov 2024 11:16:37.421 # -failover-abort-no-good-slave master myMaster 10.3.129.101 6379
9:X 07 Nov 2024 11:16:37.512 * Next failover delay: I will not start a failover before Thu Nov  7 11:16:57 2024

Redis Replication pod IPs:

  • 10.3.129.99 # master
  • 10.3.129.141
  • 10.3.129.101

New pod IP after deleting the master: 10.3.129.105
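
The new pod IP (10.3.129.105) never shows up in the Sentinel log above, only the old addresses. This can also be checked directly against Sentinel (pod name assumed; master name and port taken from the log):

# Sentinel's current view of the replicas still lists only the old IPs
$ kubectl exec -n redis-operator-ot redis-sentinel-sentinel-0 -- redis-cli -p 26379 sentinel replicas myMaster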

abix5 · Nov 07 '24 11:11

I also tested Sentinel by killing the master pod, and it seems the problem still persists. When I check the Redis pods I see 2 master nodes in the cluster. How is it possible that this critical problem has not been fixed for almost a year?

zugao · Sep 03 '25 08:09