redis-operator
Redis can't recover after node is down with redis-replicas + HA sentinel
V0.18.0
redis-operator version: v0.18.0
Does this issue reproduce with the latest release? Yes
What operating system and processor architecture are you using (kubectl version)?
Ubuntu 22 + Kubernetes (k3s via k3d)
kubectl version output:
$ kubectl version
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.3+k3s1
WARNING: version difference between client (1.28) and server (1.30) exceeds the supported minor version skew of +/-1
What did you do?
Create local cluster with k3d
k3d cluster create redis-operator-ot-container --servers 3 --agents 3
kubectl taint nodes k3d-redis-operator-ot-container-server-0 node-role.kubernetes.io/master=:NoSchedule
kubectl taint nodes k3d-redis-operator-ot-container-server-1 node-role.kubernetes.io/master=:NoSchedule
kubectl taint nodes k3d-redis-operator-ot-container-server-2 node-role.kubernetes.io/master=:NoSchedule
Create namespace
kubectl create namespace redis-dev-ot-operator
Configure helm
helm repo add ot-helm https://ot-container-kit.github.io/helm-charts/
helm repo update
Create redis replication OT operator
helm upgrade redis-operator ot-helm/redis-operator --install --namespace redis-dev-ot-operator
helm test redis-operator --namespace redis-dev-ot-operator
Create redis sentinel
helm upgrade redis-sentinel ot-helm/redis-sentinel --install --namespace redis-dev-ot-operator
Create redis replication
helm upgrade redis-replication ot-helm/redis-replication --install --namespace redis-dev-ot-operator
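At this point you can check where each pod landed (the NODE column confirms the layout described next):
kubectl get pods -n redis-dev-ot-operator -o wide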
At the end you should have 3 agent nodes, each running 1 sentinel and 1 replica. One replica is the master and the others are slaves.
Then find the agent node where the master Redis is deployed.
Test a chaos scenario where the node hosting the master goes down:
kubectl drain <node>
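A minimal sketch of the "find the master's node" step, assuming the replication pods carry an app=redis-replication label and that redis-cli is reachable in the default container (both are assumptions about the chart's defaults, not verified):
for pod in $(kubectl get pods -n redis-dev-ot-operator -l app=redis-replication -o name); do
  name="${pod#pod/}"
  # role:master / role:slave from INFO replication
  role=$(kubectl exec -n redis-dev-ot-operator "$name" -- redis-cli info replication | grep '^role')
  # node the pod is scheduled on
  node=$(kubectl get pod "$name" -n redis-dev-ot-operator -o jsonpath='{.spec.nodeName}')
  echo "$name -> $role (node: $node)"
done
The node reported for role:master is the one to drain.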
What did you expect to see? The Sentinel and replica pods are redeployed on another node and Sentinel runs a new master election.
What did you see instead? The Sentinel and replica pods are redeployed on a node that already hosts Sentinel and replica instances. As a result, no master election happens and every Sentinel is stuck.
Below log from sentinel
Defaulted container "redis-sentinel-sentinel" out of: redis-sentinel-sentinel, redis-exporter
Running sentinel without TLS mode
ACL_MODE is not true, skipping ACL file modification
Starting sentinel service .....
7:X 27 Aug 2024 09:41:07.340 * oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
7:X 27 Aug 2024 09:41:07.340 * Redis version=7.2.1, bits=64, commit=00000000, modified=0, pid=7, just started
7:X 27 Aug 2024 09:41:07.340 * Configuration loaded
7:X 27 Aug 2024 09:41:07.341 * monotonic clock: POSIX clock_gettime
7:X 27 Aug 2024 09:41:07.345 # Failed to write PID file: Permission denied
7:X 27 Aug 2024 09:41:07.345 * Running mode=sentinel, port=26379.
7:X 27 Aug 2024 09:41:07.353 * Sentinel new configuration saved on disk
7:X 27 Aug 2024 09:41:07.353 * Sentinel ID is 772d1d4234446162e55d26c4472ad3e5b2d52f28
7:X 27 Aug 2024 09:41:07.353 # +monitor master myMaster 10.42.3.4 6379 quorum 2
7:X 27 Aug 2024 09:41:12.352 # +sdown master myMaster 10.42.3.4 6379
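For reference, the placement behaviour described above (the rescheduled Sentinel and replica landing on a node that already runs both) is what pod anti-affinity is normally used to prevent. A minimal sketch of such a constraint, assuming the chart accepts an affinity value and labels its pods with app: redis-replication (both key name and label are assumptions, not verified against the chart):
# hypothetical values file with a hard anti-affinity rule per hostname
cat > redis-replication-values.yaml <<'EOF'
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis-replication
        topologyKey: kubernetes.io/hostname
EOF
helm upgrade redis-replication ot-helm/redis-replication --install \
  --namespace redis-dev-ot-operator -f redis-replication-values.yaml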
Had a very similar incident after a node autoscaling event followed by a pod rebalance. One thing worth noting is that my Redis replication nodes (the non-master ones) printed a lot of logs like this, with an empty master host before :6379:
2024-11-16T05:42:17.116997335Z 1:S 16 Nov 2024 05:42:17.116 * Connecting to MASTER :6379
2024-11-16T05:42:17.117026707Z 1:S 16 Nov 2024 05:42:17.116 # Unable to connect to MASTER: Invalid argument
2024-11-16T05:42:18.119024659Z 1:S 16 Nov 2024 05:42:18.118 * Connecting to MASTER :6379
2024-11-16T05:42:18.119048774Z 1:S 16 Nov 2024 05:42:18.118 # Unable to connect to MASTER: Invalid argument
2024-11-16T05:42:19.120973784Z 1:S 16 Nov 2024 05:42:19.120 * Connecting to MASTER :6379
2024-11-16T05:42:19.121016883Z 1:S 16 Nov 2024 05:42:19.120 # Unable to connect to MASTER: Invalid argument
2024-11-16T05:42:20.123027140Z 1:S 16 Nov 2024 05:42:20.122 * Connecting to MASTER :6379
While the master Redis node prints:
2024-11-16T01:33:11.144362406Z Setting up redis in standalone mode
2024-11-16T01:33:11.144763663Z Running without TLS mode
2024-11-16T01:33:11.144770083Z ACL_MODE is not true, skipping ACL file modification
2024-11-16T01:33:11.144772981Z Starting redis service in standalone mode.....
2024-11-16T01:33:11.150182784Z 1:C 16 Nov 2024 01:33:11.149 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
2024-11-16T01:33:11.150194669Z 1:C 16 Nov 2024 01:33:11.150 # Redis version=7.0.15, bits=64, commit=00000000, modified=0, pid=1, just started
2024-11-16T01:33:11.150198672Z 1:C 16 Nov 2024 01:33:11.150 # Configuration loaded
2024-11-16T01:33:11.150548785Z 1:M 16 Nov 2024 01:33:11.150 * monotonic clock: POSIX clock_gettime
2024-11-16T01:33:11.151035317Z 1:M 16 Nov 2024 01:33:11.150 * Running mode=standalone, port=6379.
2024-11-16T01:33:11.151042926Z 1:M 16 Nov 2024 01:33:11.151 # Server initialized
2024-11-16T01:33:11.156624576Z 1:M 16 Nov 2024 01:33:11.156 * Reading RDB base file on AOF loading...
2024-11-16T01:33:11.156637212Z 1:M 16 Nov 2024 01:33:11.156 * Loading RDB produced by version 7.0.15
2024-11-16T01:33:11.156640260Z 1:M 16 Nov 2024 01:33:11.156 * RDB age 119424 seconds
2024-11-16T01:33:11.156643125Z 1:M 16 Nov 2024 01:33:11.156 * RDB memory usage when created 5.28 Mb
2024-11-16T01:33:11.156689047Z 1:M 16 Nov 2024 01:33:11.156 * RDB is base AOF
2024-11-16T01:33:11.181245948Z 1:M 16 Nov 2024 01:33:11.181 * Done loading RDB, keys loaded: 621, keys expired: 0.
2024-11-16T01:33:11.181282094Z 1:M 16 Nov 2024 01:33:11.181 * DB loaded from base file appendonly.aof.7.base.rdb: 0.027 seconds
2024-11-16T01:33:12.567363747Z 1:M 16 Nov 2024 01:33:12.567 * DB loaded from incr file appendonly.aof.7.incr.aof: 1.386 seconds
2024-11-16T01:33:12.567387139Z 1:M 16 Nov 2024 01:33:12.567 * DB loaded from append only file: 1.413 seconds
2024-11-16T01:33:12.567390607Z 1:M 16 Nov 2024 01:33:12.567 * Opening AOF incr file appendonly.aof.7.incr.aof on server start
2024-11-16T01:33:12.567394224Z 1:M 16 Nov 2024 01:33:12.567 * Ready to accept connections
2024-11-16T01:34:12.099062295Z 1:M 16 Nov 2024 01:34:12.098 * 10000 changes in 60 seconds. Saving...
2024-11-16T01:34:12.099330444Z 1:M 16 Nov 2024 01:34:12.099 * Background saving started by pid 57
2024-11-16T01:34:12.123990362Z 57:C 16 Nov 2024 01:34:12.123 * DB saved on disk
2024-11-16T01:34:12.124270877Z 57:C 16 Nov 2024 01:34:12.124 * Fork CoW for RDB: current 0 MB, peak 0 MB, average 0 MB
2024-11-16T01:34:12.199655561Z 1:M 16 Nov 2024 01:34:12.199 * Background saving terminated with success
It looks like no Redis instances are receiving the correct replication config.
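A quick way to confirm that state from inside the pods (a sketch; <replica-pod> and <sentinel-pod> are placeholders for the actual pod names, and the master name myMaster and port 26379 are taken from the sentinel log above):
# on a replica: an empty master_host here matches the "Connecting to MASTER :6379" lines
kubectl exec <replica-pod> -- redis-cli info replication | grep -E '^(role|master_host|master_link_status)'
# on a sentinel: which master address Sentinel currently advertises
kubectl exec <sentinel-pod> -c redis-sentinel-sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name myMaster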
facing similar issues, cc: @drivebyer @shubham-cmyk
I am facing the exact same issue. Has anyone found a solution?
@husnialhamdani @sho34215
It's the same issue for me. If the Redis replica pods are restarted and their IPs change, Sentinel cannot connect to the old master and does not start an election to switch to another one.
I have to manually reconfigure Sentinel and specify a new master by providing the Pod IP. This makes the whole setup unusable in production.
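For context, the manual reconfiguration described above amounts to something like this (a sketch; <sentinel-pod> and <new-master-pod-ip> are placeholders, and the master name and quorum of 2 come from the sentinel log above):
kubectl exec <sentinel-pod> -c redis-sentinel-sentinel -- redis-cli -p 26379 sentinel remove myMaster
kubectl exec <sentinel-pod> -c redis-sentinel-sentinel -- redis-cli -p 26379 sentinel monitor myMaster <new-master-pod-ip> 6379 2
It also has to be repeated on every sentinel pod after each IP change, which is why it is not workable in production.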
This issue looks similar to this already-fixed one: https://github.com/OT-CONTAINER-KIT/redis-operator/issues/522
EDIT: Forget it, that fix was released in 0.18.0, but I'm on 0.18.1 and experiencing the same problem as others here. I hope an update to the latest version fixes it.
Please try 0.21.0
As of version 0.21.0, we are still experiencing this issue.