redis-operator

Redis Replication with Sentinels configure empty master after statefulset rollout

Open · omrirosner-clx opened this issue 1 year ago · 18 comments

What version of redis operator are you using?

v0.14.0

{"level":"error","ts":1686838587.4277818,"logger":"controller_redis","msg":"Error in getting redis pod IP","Request.RedisManager.Namespace":"omri-test","Request.RedisManager.Name":"","error":"resource name may not be empty","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisReplicationMasterIP\n\t/workspace/k8sutils/redis-sentinel.go:309\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getSentinelEnvVariable\n\t/workspace/k8sutils/redis-sentinel.go:239\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.generateRedisSentinelContainerParams\n\t/workspace/k8sutils/redis-sentinel.go:145\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.RedisSentinelSTS.CreateRedisSentinelSetup\n\t/workspace/k8sutils/redis-sentinel.go:72\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.CreateRedisSentinel\n\t/workspace/k8sutils/redis-sentinel.go:45\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisSentinelReconciler).Reconcile\n\t/workspace/controllers/redissentinel_controller.go:50\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":1686838587.4279523,"logger":"controller_redis","msg":"Successfully got the ip for redis","Request.RedisManager.Namespace":"omri-test","Request.RedisManager.Name":"","ip":""}

What operating system and processor architecture are you using (kubectl version)?

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"darwin/arm64"} Kustomize Version: v4.5.7 Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.8", GitCommit:"a12b886b1da059e0190c54d09c5eab5219dd7acf", GitTreeState:"clean", BuildDate:"2022-06-16T05:51:36Z", GoVersion:"go1.17.11", Compiler:"gc", Platform:"linux/arm64"}

The node that the operator was running on was amd64.

What did you do?

We want to start using HA Redis with Sentinels in our Kubernetes clusters. I finished setting everything up (replication, sentinels, and so on), but after some testing I found that every time all of our RedisReplication pods are rolled out, whether via a manual rollout restart or a change to a field in the RedisReplication custom resource, the whole setup stops working because the sentinels and the replicas are not configured with the correct master.

Deleting just the master pod manually worked fine, but 80% of the time, restarting all pods would cause issues.

This is probably related to the checkAttachedSlave function: after the restarts the master Redis instances have no attached slaves, and the sentinels fall back to their default IP (0.0.0.0).
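A quick way to check for this state from inside a replica pod is to parse the `info Replication` output. This is just a sketch; the sample output below is abbreviated from the broken replica shown later in this report:

```shell
# Abbreviated "info Replication" output captured from a broken replica.
info='role:slave
master_host:
master_port:6379
master_link_status:down'

# Extract master_host; an empty value means the replica was configured
# with the empty master IP rendered by the operator.
master_host=$(printf '%s\n' "$info" | awk -F: '/^master_host:/{print $2}')
if [ -z "$master_host" ]; then
  echo "replica has empty master_host"
fi
```

In a live cluster the `info` variable would instead come from `redis-cli info replication` inside each replica pod.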

Redis replication spec:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: role
                operator: In
                values:
                  - platform
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - test-sentinel-redis
            topologyKey: kubernetes.io/hostname
          weight: 50
  clusterSize: 3
  kubernetesConfig: {...}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  priorityClassName: datastores
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  redisExporter:
    enabled: true
    env:
      - name: REDIS_EXPORTER_INCL_SYSTEM_METRICS
        value: 'true'
    image: '<our mirror image repo>/redis-exporter:v1.45.0'
    imagePullPolicy: Always
  securityContext:
    runAsUser: 0
  tolerations:
    - effect: NoSchedule
      key: role
      operator: Equal
      value: platform

Redis Sentinel spec:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: role
                operator: In
                values:
                  - platform
  clusterSize: 3
  kubernetesConfig: {...}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  priorityClassName: datastores
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  redisSentinelConfig:
    downAfterMilliseconds: '30000'
    failoverTimeout: '180000'
    masterGroupName: myMaster
    parallelSyncs: '1'
    quorum: '2'
    redisPort: '6379'
    redisReplicationName: test-sentinel-redis
  securityContext:
    runAsUser: 0
  tolerations:
    - effect: NoSchedule
      key: role
      operator: Equal
      value: platform

What did you expect to see?

After the restart, I would expect the redis sentinel statefulset to be updated with the correct env vars for the master Redis IP and port, and the Redis replicas to recognize each other as master/slaves.

What did you see instead?

The redis sentinel statefulset was created with these env vars:

...
- name: MASTER_GROUP_NAME
  value: myMaster
- name: IP
- name: PORT
  value: "6379"
- name: QUORUM
  value: "2"
...
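For comparison, after a healthy reconcile the IP entry should carry the current master pod's IP, something like the following (the address is purely illustrative):

```yaml
- name: IP
  value: "10.42.0.17"   # illustrative: the current master pod's IP
- name: PORT
  value: "6379"
```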

The empty IP was configured on the redis instances, and I saw these logs:

test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:46.867 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:46.867 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:47.869 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:47.869 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:48.871 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:48.871 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:49.873 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:49.873 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:50.875 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:50.875 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:51.877 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:51.877 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:52.879 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:52.879 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:53.881 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:53.881 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-1 test-sentinel-redis 8:S 15 Jun 2023 14:37:54.512 * Connecting to MASTER :6379
test-sentinel-redis-1 test-sentinel-redis 8:S 15 Jun 2023 14:37:54.512 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:54.883 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:54.883 # Unable to connect to MASTER: Invalid argument

Running redis-cli info on one of the slaves:

root@test-sentinel-redis-0:/data# redis-cli
127.0.0.1:6379> info Replication
# Replication
role:slave
master_host:
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_read_repl_offset:0
slave_repl_offset:0
master_link_down_since_seconds:-1
slave_priority:100
slave_read_only:1
replica_announced:1
connected_slaves:0
master_failover_state:no-failover
master_replid:c6c32fbe5260385663832cc0018538b78636dd6b
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0

Running redis-cli info on the master:

root@test-sentinel-redis-2:/data# redis-cli
127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:541d07b48b19e68a56530cc0b3aee1cb096937e5
master_replid2:003e8e82cfe53ec5e7286432d7a9919254ffcaa0
master_repl_offset:61464
second_repl_offset:53029
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:41179
repl_backlog_histlen:20286
127.0.0.1:6379>

omrirosner-clx avatar Jun 15 '23 14:06 omrirosner-clx

By the way, I checked with the most recent version, v0.15.0, as well, and the issues are similar, except that I keep getting the log no master pods found from the redis operator. It also looks like none of the redis replication pods is promoted to master: all 3 of them are slaves when using this new version.

omrirosner-clx avatar Jun 18 '23 08:06 omrirosner-clx

@omrirosner-clx Thanks for reporting this issue. There seems to be a problem during rollout. That should be fixed.

shubham-cmyk avatar Jul 05 '23 07:07 shubham-cmyk

hey, any update on this? @shubham-cmyk

omrirosner-clx avatar Aug 01 '23 08:08 omrirosner-clx

My schedule has been busy; I will try to fix this bug this week. Sorry for the delay. I am treating this as a priority.

shubham-cmyk avatar Aug 08 '23 08:08 shubham-cmyk

Any updates on this @shubham-cmyk ?

steintore avatar Sep 22 '23 06:09 steintore

Any news on this? We are running into this issue too. It is currently blocking our production updates, since every change might leave a dysfunctional Redis here and there.

landorg avatar Oct 11 '23 09:10 landorg

@landorg Which redis CR setup are you using?

shubham-cmyk avatar Oct 11 '23 19:10 shubham-cmyk

@shubham-cmyk redisreplication and redissentinel:

apiVersion: redis.redis.opstreelabs.in/v1beta1
kind: RedisReplication
metadata:
  annotations:
    meta.helm.sh/release-name: bms-redis-persistent-replication
    meta.helm.sh/release-namespace: bms-databases
  creationTimestamp: "2023-09-25T17:34:47Z"
  finalizers:
  - redisReplicationFinalizer
  generation: 3
  labels:
    app.kubernetes.io/component: middleware
    app.kubernetes.io/instance: bms-redis-persistent-replication
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: bms-redis-persistent-replication
    app.kubernetes.io/version: 0.15.0
    helm.sh/chart: redis-replication-0.15.3
  name: bms-redis-persistent-replication
  namespace: bms-databases
  resourceVersion: "675854871"
  uid: 87e3850a-ee6a-452f-8d92-a945fdbbd1df
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - redis-sentinel
        topologyKey: kubernetes.io/hostname
  clusterSize: 3
  kubernetesConfig:
    image: quay.io/opstree/redis:v7.0.5
    imagePullPolicy: IfNotPresent
    redisSecret:
      key: password
      name: bms-redis-persistent-secret
    resources:
      limits:
        memory: 3Gi
      requests:
        cpu: 1m
        memory: 1Mi
    updateStrategy: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  nodeSelector:
    deploy/dbs: "true"
  podSecurityContext:
    fsGroup: 1000
    runAsUser: 1000
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  redisConfig:
    additionalRedisConfig: bms-redis-persistent-replication-ext-config
  redisExporter:
    image: quay.io/opstree/redis-exporter:v1.44.0
    imagePullPolicy: IfNotPresent
    resources: {}
  storage:
    volumeClaimTemplate:
      metadata: {}
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
        storageClassName: local-path
      status: {}
    volumeMount: {}
---
apiVersion: redis.redis.opstreelabs.in/v1beta1
kind: RedisSentinel
metadata:
  annotations:
    meta.helm.sh/release-name: bms-redis-persistent-sentinel
    meta.helm.sh/release-namespace: bms-databases
  creationTimestamp: "2023-09-25T17:34:55Z"
  finalizers:
  - redisSentinelFinalizer
  generation: 2
  labels:
    app.kubernetes.io/component: middleware
    app.kubernetes.io/instance: bms-redis-persistent-sentinel
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: bms-redis-persistent-sentinel
    app.kubernetes.io/version: 0.15.0
    helm.sh/chart: redis-sentinel-0.15.3
  name: bms-redis-persistent-sentinel
  namespace: bms-databases
  resourceVersion: "660146071"
  uid: e9bd5d08-400b-4e20-b42a-6a3f7d7ea771
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - redis-sentinel
        topologyKey: kubernetes.io/hostname
  clusterSize: 3
  kubernetesConfig:
    image: quay.io/opstree/redis-sentinel:v7.0.7
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        memory: 256Mi
      requests:
        cpu: 1m
        memory: 1Mi
    updateStrategy: {}
  livenessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  nodeSelector:
    deploy/dbs: "true"
  podSecurityContext:
    fsGroup: 1000
    runAsUser: 1000
  readinessProbe:
    failureThreshold: 3
    initialDelaySeconds: 1
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  redisSentinelConfig:
    additionalSentinelConfig: bms-redis-persistent-sentinel-ext-config
    downAfterMilliseconds: "30000"
    failoverTimeout: "180000"
    masterGroupName: bms-redis-persistent
    parallelSyncs: "1"
    quorum: "2"
    redisPort: "6379"
    redisReplicationName: bms-redis-persistent-replication

landorg avatar Oct 12 '23 08:10 landorg

redis.redis.opstreelabs.in/v1beta1

I don't think this is expected to work on v1beta1. Can you try this with v1beta2? For that, keep in mind:

Operator version is 0.15.1 and the Redis image is v7.0.12

shubham-cmyk avatar Oct 12 '23 08:10 shubham-cmyk

I don't think this is expected to work on v1beta1. Can you try this with v1beta2?

Oh, I see. Interesting. How should we have known? https://ot-redis-operator.netlify.app/docs/getting-started/sentinel/ https://ot-redis-operator.netlify.app/docs/getting-started/replication/

For redis sentinel, what image should we use? Also v7.0.12? Is there a place where we can find this?

Thank you

landorg avatar Oct 12 '23 09:10 landorg

Ah, and we are using the Helm charts to set up the CRs, where v1beta1 is unfortunately hardcoded: https://github.com/OT-CONTAINER-KIT/helm-charts/blob/main/charts/redis-sentinel/templates/redis-sentinel.yaml#L2 https://github.com/OT-CONTAINER-KIT/helm-charts/blob/main/charts/redis-replication/templates/redis-replication.yaml#L2

landorg avatar Oct 12 '23 09:10 landorg

I have also encountered this problem several times with the newest 0.16.0 operator version. I also use redisreplication + sentinel setup and in some cases the operator configures replication with an empty IP and the cluster ends up in a broken state and can't heal without manually rolling the whole cluster.

jmtsi avatar Apr 08 '24 15:04 jmtsi

Hi @shubham-cmyk. Any update on this? This is a dealbreaker for us as we already run these instances in production. We can't change anything until this is fixed.

landorg avatar Apr 16 '24 13:04 landorg

@landorg I ditched this operator and decided to just use plain simple statefulsets instead, configuring the replication between nodes after first boot and letting Sentinel take over after that.

Using pod disruption budgets and other k8s primitives is enough to make the setup stable and fault-tolerant; the operator only breaks things in its current form and adds unnecessary delays and Sentinel restarts.

There is a use case for the operator if creating a cluster-style setup, but for replication+sentinel, it's not really needed at all.

jmtsi avatar Apr 16 '24 16:04 jmtsi

I'm on Operator 0.15.9 and Image 7.0.12, and bumping up against this exact issue right now.

The 0.16.0 operator has been released but the Helm chart hasn't been updated, so I guess I'm sort of stuck with manual intervention for the time being.

I had to issue replicaof <ip> <port> commands on my replicas to get over the issue.
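For anyone else hitting this, here is a dry-run sketch of that manual recovery. The pod names follow this issue's examples, and MASTER_IP is a placeholder you must replace with the IP of the pod you choose as master; the script only prints the commands rather than executing them:

```shell
# Dry-run sketch: print the commands that re-point each replica at a
# chosen master via "replicaof <ip> <port>". Review the output, then
# run the printed commands (or pipe to "sh") with your real values.
MASTER_IP="10.42.0.17"   # placeholder: IP of the pod chosen as master
MASTER_PORT=6379
for pod in test-sentinel-redis-0 test-sentinel-redis-1; do
  echo "kubectl exec $pod -- redis-cli replicaof $MASTER_IP $MASTER_PORT"
done
```

After re-pointing the replicas, running `sentinel reset <master-group-name>` on each sentinel may also be needed so the sentinels pick up the new topology.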

nathan-bowman avatar Apr 30 '24 17:04 nathan-bowman

I am also hitting this issue. I was able to fix my cluster using the command from @nathan-bowman's comment. I'm on Operator 0.15.1 and image 7.0.12.

I have the same issue on Operator v0.16.0 and redis v7.0.12 & v7.2.3

wkd-woo avatar May 09 '24 08:05 wkd-woo

Operator v0.17.0 released today, still waiting on the Helm chart to get updated...

nathan-bowman avatar May 14 '24 14:05 nathan-bowman

Hello @shubham-cmyk, I wanted to follow up on the status of this issue. We are also experiencing the same problem in our production environments and would appreciate any updates you can provide. Thank you for your assistance.

abdul90082 avatar Jun 03 '24 14:06 abdul90082

Happy this got addressed. When can we expect a release with this fix?

landorg avatar Jun 05 '24 09:06 landorg