redis-operator
Redis Replication with Sentinels configure empty master after statefulset rollout
What version of redis operator are you using?
v0.14.0
{"level":"error","ts":1686838587.4277818,"logger":"controller_redis","msg":"Error in getting redis pod IP","Request.RedisManager.Namespace":"omri-test","Request.RedisManager.Name":"","error":"resource name may not be empty","stacktrace":"github.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getRedisReplicationMasterIP\n\t/workspace/k8sutils/redis-sentinel.go:309\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.getSentinelEnvVariable\n\t/workspace/k8sutils/redis-sentinel.go:239\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.generateRedisSentinelContainerParams\n\t/workspace/k8sutils/redis-sentinel.go:145\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.RedisSentinelSTS.CreateRedisSentinelSetup\n\t/workspace/k8sutils/redis-sentinel.go:72\ngithub.com/OT-CONTAINER-KIT/redis-operator/k8sutils.CreateRedisSentinel\n\t/workspace/k8sutils/redis-sentinel.go:45\ngithub.com/OT-CONTAINER-KIT/redis-operator/controllers.(*RedisSentinelReconciler).Reconcile\n\t/workspace/controllers/redissentinel_controller.go:50\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
{"level":"info","ts":1686838587.4279523,"logger":"controller_redis","msg":"Successfully got the ip for redis","Request.RedisManager.Namespace":"omri-test","Request.RedisManager.Name":"","ip":""}
What operating system and processor architecture are you using (kubectl version)?
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-08-23T17:44:59Z", GoVersion:"go1.19", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.8", GitCommit:"a12b886b1da059e0190c54d09c5eab5219dd7acf", GitTreeState:"clean", BuildDate:"2022-06-16T05:51:36Z", GoVersion:"go1.17.11", Compiler:"gc", Platform:"linux/arm64"}
The node that the operator was running on was amd64.
What did you do?
We want to start using HA Redis with sentinels in our k8s clusters. I finished setting everything up (replication, sentinels...), but after some testing I found that every time I roll out all of our Redis replication pods, either via a manual rollout restart or a change to a field in the RedisReplication custom resource, the whole setup stops working because the sentinels and the replicas do not set the correct master.
Deleting just the master pod manually worked fine, but restarting all pods caused issues about 80% of the time.
This is probably related to the checkAttachedSlave function. After the restarts, the master Redis instance would have no slaves, and the sentinels would fall back to their default IP (0.0.0.0).
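To reproduce the broken state, it is usually enough to restart the replication StatefulSet and wait for all pods to come back. A minimal sketch, assuming the operator created a StatefulSet with the same name as the RedisReplication CR (test-sentinel-redis) in the omri-test namespace:
# Restart all replication pods at once (this is what triggers the empty-master state for us)
kubectl -n omri-test rollout restart statefulset test-sentinel-redis
# Wait until every pod has been recreated before checking the replication state
kubectl -n omri-test rollout status statefulset test-sentinel-redis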
Redis replication spec:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: role
operator: In
values:
- platform
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- test-sentinel-redis
topologyKey: kubernetes.io/hostname
weight: 50
clusterSize: 3
kubernetesConfig: {...}
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
priorityClassName: datastores
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
redisExporter:
enabled: true
env:
- name: REDIS_EXPORTER_INCL_SYSTEM_METRICS
value: 'true'
image: '<our mirror image repo>/redis-exporter:v1.45.0'
imagePullPolicy: Always
securityContext:
runAsUser: 0
tolerations:
- effect: NoSchedule
key: role
operator: Equal
value: platform
Redis Sentinel spec:
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: role
operator: In
values:
- platform
clusterSize: 3
kubernetesConfig: {...}
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
priorityClassName: datastores
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
redisSentinelConfig:
downAfterMilliseconds: '30000'
failoverTimeout: '180000'
masterGroupName: myMaster
parallelSyncs: '1'
quorum: '2'
redisPort: '6379'
redisReplicationName: test-sentinel-redis
securityContext:
runAsUser: 0
tolerations:
- effect: NoSchedule
key: role
operator: Equal
value: platform
What did you expect to see?
After the restart, I would expect the Redis Sentinel StatefulSet to be updated with the correct env vars (IP and PORT) for the master Redis instance, and I would expect the Redis replicas to recognize each other as master/slaves.
What did you see instead?
The redis sentinel statefulset was created with these env vars:
...
- name: MASTER_GROUP_NAME
value: myMaster
- name: IP
- name: PORT
value: "6379"
- name: QUORUM
value: "2"
...
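The rendered environment can also be checked directly on the Sentinel StatefulSet; a sketch, assuming the StatefulSet is named test-sentinel-redis-sentinel (the actual name depends on your RedisSentinel CR):
# Print the env vars of the sentinel container; IP shows up with no value
kubectl -n omri-test get statefulset test-sentinel-redis-sentinel \
  -o jsonpath='{.spec.template.spec.containers[0].env}'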
The empty IP would then be configured on the redis instances, and I would see these logs:
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:46.867 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:46.867 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:47.869 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:47.869 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:48.871 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:48.871 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:49.873 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:49.873 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:50.875 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:50.875 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:51.877 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:51.877 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:52.879 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:52.879 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:53.881 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:53.881 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-1 test-sentinel-redis 8:S 15 Jun 2023 14:37:54.512 * Connecting to MASTER :6379
test-sentinel-redis-1 test-sentinel-redis 8:S 15 Jun 2023 14:37:54.512 # Unable to connect to MASTER: Invalid argument
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:54.883 * Connecting to MASTER :6379
test-sentinel-redis-0 test-sentinel-redis 8:S 15 Jun 2023 14:37:54.883 # Unable to connect to MASTER: Invalid argument
Running redis-cli info on one of the slaves:
root@test-sentinel-redis-0:/data# redis-cli
127.0.0.1:6379> info Replication
# Replication
role:slave
master_host:
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_read_repl_offset:0
slave_repl_offset:0
master_link_down_since_seconds:-1
slave_priority:100
slave_read_only:1
replica_announced:1
connected_slaves:0
master_failover_state:no-failover
master_replid:c6c32fbe5260385663832cc0018538b78636dd6b
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
Running redis-cli info on the master:
root@test-sentinel-redis-2:/data# redis-cli
127.0.0.1:6379> info Replication
# Replication
role:master
connected_slaves:0
master_failover_state:no-failover
master_replid:541d07b48b19e68a56530cc0b3aee1cb096937e5
master_replid2:003e8e82cfe53ec5e7286432d7a9919254ffcaa0
master_repl_offset:61464
second_repl_offset:53029
repl_backlog_active:1
repl_backlog_size:1048576
repl_backlog_first_byte_offset:41179
repl_backlog_histlen:20286
127.0.0.1:6379>
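The sentinels' own view of the master can also be queried; a sketch, assuming the sentinel pods expose the standard 26379 port and use the masterGroupName myMaster from the spec above (replace the pod name with one of your sentinel pods):
# Ask a sentinel which address it currently considers the master;
# in the broken state this matches the 0.0.0.0 default described above
kubectl -n omri-test exec <sentinel-pod> -- redis-cli -p 26379 sentinel get-master-addr-by-name myMaster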
By the way, I checked with the most recent version, v0.15.0, as well, and the issues are similar, except that I keep getting the log no master pods found from the redis operator.
It also looks like none of the redis replication pods is being set to master; all 3 of them are slaves when using this new version.
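To confirm that no pod was promoted, each replication pod can be asked for its role; a sketch, assuming the pods are named test-sentinel-redis-0..2 in the omri-test namespace (add -a <password> to redis-cli if auth is enabled):
# Print the replication role of every pod; in the broken state all three report role:slave
for i in 0 1 2; do
  kubectl -n omri-test exec test-sentinel-redis-$i -- redis-cli info replication | grep '^role:'
done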
@omrirosner-clx Thanks for reporting this issue. There seems to be a problem during rollout; that should be fixed.
hey, any update on this? @shubham-cmyk
My schedule has been busy; I will try to fix this bug this week. Sorry for the delay. I am treating this as a priority.
Any updates on this @shubham-cmyk ?
Any news on this? We are running into this issue too. It is currently blocking our production updates, as every change might leave Redis dysfunctional here and there.
@landorg Which Redis CR setup are you using?
@shubham-cmyk redisreplication and redissentinel:
apiVersion: redis.redis.opstreelabs.in/v1beta1
kind: RedisReplication
metadata:
annotations:
meta.helm.sh/release-name: bms-redis-persistent-replication
meta.helm.sh/release-namespace: bms-databases
creationTimestamp: "2023-09-25T17:34:47Z"
finalizers:
- redisReplicationFinalizer
generation: 3
labels:
app.kubernetes.io/component: middleware
app.kubernetes.io/instance: bms-redis-persistent-replication
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: bms-redis-persistent-replication
app.kubernetes.io/version: 0.15.0
helm.sh/chart: redis-replication-0.15.3
name: bms-redis-persistent-replication
namespace: bms-databases
resourceVersion: "675854871"
uid: 87e3850a-ee6a-452f-8d92-a945fdbbd1df
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- redis-sentinel
topologyKey: kubernetes.io/hostname
clusterSize: 3
kubernetesConfig:
image: quay.io/opstree/redis:v7.0.5
imagePullPolicy: IfNotPresent
redisSecret:
key: password
name: bms-redis-persistent-secret
resources:
limits:
memory: 3Gi
requests:
cpu: 1m
memory: 1Mi
updateStrategy: {}
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
nodeSelector:
deploy/dbs: "true"
podSecurityContext:
fsGroup: 1000
runAsUser: 1000
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
redisConfig:
additionalRedisConfig: bms-redis-persistent-replication-ext-config
redisExporter:
image: quay.io/opstree/redis-exporter:v1.44.0
imagePullPolicy: IfNotPresent
resources: {}
storage:
volumeClaimTemplate:
metadata: {}
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
storageClassName: local-path
status: {}
volumeMount: {}
apiVersion: redis.redis.opstreelabs.in/v1beta1
kind: RedisSentinel
metadata:
annotations:
meta.helm.sh/release-name: bms-redis-persistent-sentinel
meta.helm.sh/release-namespace: bms-databases
creationTimestamp: "2023-09-25T17:34:55Z"
finalizers:
- redisSentinelFinalizer
generation: 2
labels:
app.kubernetes.io/component: middleware
app.kubernetes.io/instance: bms-redis-persistent-sentinel
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: bms-redis-persistent-sentinel
app.kubernetes.io/version: 0.15.0
helm.sh/chart: redis-sentinel-0.15.3
name: bms-redis-persistent-sentinel
namespace: bms-databases
resourceVersion: "660146071"
uid: e9bd5d08-400b-4e20-b42a-6a3f7d7ea771
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- redis-sentinel
topologyKey: kubernetes.io/hostname
clusterSize: 3
kubernetesConfig:
image: quay.io/opstree/redis-sentinel:v7.0.7
imagePullPolicy: IfNotPresent
resources:
limits:
memory: 256Mi
requests:
cpu: 1m
memory: 1Mi
updateStrategy: {}
livenessProbe:
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
nodeSelector:
deploy/dbs: "true"
podSecurityContext:
fsGroup: 1000
runAsUser: 1000
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 1
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
redisSentinelConfig:
additionalSentinelConfig: bms-redis-persistent-sentinel-ext-config
downAfterMilliseconds: "30000"
failoverTimeout: "180000"
masterGroupName: bms-redis-persistent
parallelSyncs: "1"
quorum: "2"
redisPort: "6379"
redisReplicationName: bms-redis-persistent-replication
redis.redis.opstreelabs.in/v1beta1
I don't think this should work on v1beta1; can you try it with v1beta2? For that, keep in mind that the operator version is 0.15.1 and the Redis image is v7.0.12.
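A minimal sketch of the suggested change, trimmed to the fields needed to illustrate it (the rest of the spec from the CR above would carry over, and the field layout is assumed to be the same under v1beta2):
kubectl apply -f - <<'EOF'
apiVersion: redis.redis.opstreelabs.in/v1beta2
kind: RedisReplication
metadata:
  name: bms-redis-persistent-replication
  namespace: bms-databases
spec:
  clusterSize: 3
  kubernetesConfig:
    image: quay.io/opstree/redis:v7.0.12
    imagePullPolicy: IfNotPresent
EOF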
I don't think this should work on v1beta1; can you try it with v1beta2?
Oh, I see. Interesting. How should we know? https://ot-redis-operator.netlify.app/docs/getting-started/sentinel/ https://ot-redis-operator.netlify.app/docs/getting-started/replication/
For Redis Sentinel, what image should we use? Also v7.0.12? Is there a place where we can find this?
Thank you
Ah, and we are using the Helm charts to set up the CRs, where v1beta1 is unfortunately hardcoded: https://github.com/OT-CONTAINER-KIT/helm-charts/blob/main/charts/redis-sentinel/templates/redis-sentinel.yaml#L2 https://github.com/OT-CONTAINER-KIT/helm-charts/blob/main/charts/redis-replication/templates/redis-replication.yaml#L2
I have also encountered this problem several times with the newest 0.16.0 operator version. I also use a RedisReplication + Sentinel setup, and in some cases the operator configures replication with an empty IP; the cluster ends up in a broken state and cannot heal without manually rolling the whole cluster.
Hi @shubham-cmyk. Any update on this? This is a dealbreaker for us as we already run these instances in production. We can't change anything until this is fixed.
@landorg I ditched this operator and decided to just use plain simple statefulsets instead, configuring the replication between nodes after first boot and letting Sentinel take over after that.
Using pod disruption budgets and other k8s primitives is enough to make the setup stable and fault-tolerant; the operator only breaks things in its current form and adds unnecessary delays and Sentinel restarts.
There is a use case for the operator if creating a cluster-style setup, but for replication+sentinel, it's not really needed at all.
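For reference, a PodDisruptionBudget of the kind mentioned above is a one-off manifest; a sketch, with an illustrative name, namespace, and label selector that would need to match your replication pods:
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: redis-replication-pdb
  namespace: omri-test
spec:
  minAvailable: 2        # keep a quorum of replication pods during voluntary disruptions
  selector:
    matchLabels:
      app: test-sentinel-redis
EOF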
I'm on Operator 0.15.9 and Image 7.0.12, and bumping up against this exact issue right now.
The 0.16.0 operator has been released but the Helm chart hasn't been updated, so I guess I'm sort of stuck with manual intervention for the time being.
I had to issue replicaof <ip> <port> commands on my replicas to get over the issue.
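For anyone else needing the manual workaround, it looks roughly like this; a sketch where the namespace, pod names, and master IP are placeholders for your own values (add -a <password> to redis-cli if requirepass is set):
# Find the pod IP of whichever pod is currently the real master
kubectl -n <namespace> get pod <master-pod> -o jsonpath='{.status.podIP}'
# Point each stuck replica back at that master
kubectl -n <namespace> exec <replica-pod> -- redis-cli replicaof <master-ip> 6379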
I am also hitting this issue. I was able to fix my cluster using the command from @nathan-bowman's comment. I'm on Operator 0.15.1 and image 7.0.12.
I have the same issue on Operator v0.16.0 with Redis v7.0.12 and v7.2.3.
Operator v0.17.0 was released today; still waiting on the Helm chart to get updated...
Hello @shubham-cmyk, I wanted to follow up on the status of this issue. We are also experiencing the same problem in our production environments and would appreciate any updates you can provide. Thank you for your assistance.
Happy this got addressed. When can we expect a release of this?