StackExchange.Redis
Sentinel failover for hung master server
Hello all, We are in the process of trying to get a production-ready implementation of Redis/Sentinel, .NET core 3 and SE.Redis 2.2.4 working. We have 3 sentinels and 3 Redis servers running in docker containers. I am using what I think is standard initialization with certificate validation:
ConfigurationOptions redisOptions = new ConfigurationOptions();
redisOptions.EndPoints.Add("localhost:26379");
redisOptions.EndPoints.Add("localhost:26380");
redisOptions.EndPoints.Add("localhost:26381");
redisOptions.ServiceName = "testService";
redisOptions.Ssl = true;
redisOptions.TieBreaker = "";
redisOptions.AllowAdmin = true;
redisOptions.AbortOnConnectFail = false;
redisOptions.CertificateSelection += ConfigurationOptions_CertificateSelection;
redisOptions.CertificateValidation += RedisOptions_CertificateValidation;
_redis = ConnectionMultiplexer.Connect(redisOptions);
For the most part, everything seems to be working well; if I shut down the docker container for the master server, Sentinel notifies SE.Redis and the current master is changed. All I had to do was add a little retry logic to my session helper code to make it fail over gracefully.
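For context, the retry logic in the session helper is roughly like the sketch below (the method name and delays are illustrative, not the exact code); it retries on the connection-related exceptions SE.Redis throws, giving the multiplexer a moment to switch to the newly promoted master:

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

static class RedisRetry
{
    // Illustrative retry wrapper around a Redis call. Retries on
    // timeout/connection exceptions with a short, growing back-off so a
    // Sentinel failover has a chance to complete between attempts.
    public static async Task<T> WithRetryAsync<T>(Func<Task<T>> action, int retries = 3)
    {
        for (int attempt = 0; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception ex) when (attempt < retries &&
                (ex is RedisTimeoutException || ex is RedisConnectionException))
            {
                // Back off briefly before the next attempt.
                await Task.Delay(TimeSpan.FromMilliseconds(500 * (attempt + 1)));
            }
        }
    }
}

// Usage (db is an IDatabase from the multiplexer):
// var value = await RedisRetry.WithRetryAsync(() => db.StringGetAsync("session:42"));
```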
This all works great for "downed" servers: either I take it down for maintenance or it crashes and goes away. Where it's not working so well is with a "hung" server. For example, I am using the Redis command "debug sleep 60" to test, which effectively makes the server unreachable as if it were hung (according to Sentinel docs). In this case, SE.Redis does not appear to recover, because it just keeps trying the non-responsive master, waiting for timeout, etc. Once the master comes online it starts working again.
Should I be doing something differently when I receive a RedisTimeoutException in order to compensate, or should this be working as configured? I tried calling Configure in the ConnectionFailed delegate, but that didn't work. Also tried instantiating a new ConnectionMultiplexer in the delegate with similar results.
Most of all, thanks for a great open source package, and your dedication to it.
Similar issue on my side as well. I don't use docker though. My setup:
- 3 Ubuntu VMs, each with a server and a sentinel pair. One master, two replicas
- Redis 5.0.7 installed on the VMs
- StackExchange.Redis (2.2.4)
- .NET 5 and ASP.NET Core
If I manually bring down the master by stopping the service, e.g. with `systemctl stop redis`, failover happens after the `down-after-milliseconds` time span and the SE.Redis client properly switches to the new master.
If I instead run the `debug sleep x` command on the master, the SE.Redis client will not switch to the new master, even though I confirmed directly on the servers (via redis-cli, by subscribing to sentinel events) that a new master was elected. From that point on, until the sleep is over, all requests time out.
If I restart the application (and therefore initialize a new ConnectionMultiplexer), the connection is restored before the sleep expires.
My connection string is `redis1,redis2,redis3,serviceName=mymaster,allowAdmin=true`, where redis1-3 are the hostnames of the VMs. I instantiate only one ConnectionMultiplexer for the entire app, within Startup.cs:
var redisConnection = ConnectionMultiplexer.Connect(cacheConfiguration["Configuration"]);
services.AddSingleton<IConnectionMultiplexer>(redisConnection);
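To get more visibility into what the client believes is happening during a failover, it may help to log the multiplexer's connection and configuration events right after connecting. A minimal sketch (these events do exist on ConnectionMultiplexer in SE.Redis 2.x; the log messages are my own):

```csharp
// Wire up failover-related diagnostics on the shared multiplexer.
redisConnection.ConnectionFailed += (s, e) =>
    Console.WriteLine($"Connection failed: {e.EndPoint} ({e.FailureType})");
redisConnection.ConnectionRestored += (s, e) =>
    Console.WriteLine($"Connection restored: {e.EndPoint}");
redisConnection.ConfigurationChanged += (s, e) =>
    Console.WriteLine($"Configuration changed: {e.EndPoint}");
redisConnection.ConfigurationChangedBroadcast += (s, e) =>
    Console.WriteLine($"Config change broadcast from: {e.EndPoint}");
```

Comparing the timestamps of these events against the sentinel events seen in redis-cli would show whether the client ever learns about the promotion.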
Hope this is just some weird interaction between SE client and the debug command.
We made some improvements here in the 2.2.50 release. It should help this specific scenario (and many others) which had the chance of race conditions before. We'd appreciate any feedback after trying the new version (and logging if possible, please!)
We are now testing 2.2.79 (2.2.50 and above contain the fix for #1632) and also checking for any outstanding issues around 2.2.50/79. For this issue, when I kicked off "debug sleep 60" the failover occurred quickly; so far so good. Will keep testing and report any issues.
Update: using exactly the same debug sleep on the same Redis cluster, I now hit the issue. All reads/writes time out on the master that is under debug sleep, and things return to normal when the debug sleep ends.
Further testing shows the following chronology of events (`down-after-milliseconds` set to 3000; times below are mm:ss):
- 00:00 `debug sleep 30` on current master (172.17.6.70)
- 00:03 New master is elected; `sentinel master mymaster` shows the IP of the new master
- 00:05 Started seeing lots of read/write failures due to the 5000ms timeout, but some reads/writes still succeeded
- 00:10 A config change event shows the old master in `not in use: DidNotRespond` status, but printing the endpoint info shows that both the new and old master are marked as `master`, like this:
  172.17.6.18:6379: Standalone v4.0.14, master; sub: ConnectedEstablished, 1 active
  172.17.6.70:7379: Standalone v4.0.14, master; not in use: DidNotRespond
- 00:30 Around 30 seconds in (aligned with the debug sleep value), the timeout errors stopped appearing
- 00:40 Around 40 seconds in, another config change event shows that the old master has been demoted to replica:
  172.17.6.18:6379: Standalone v4.0.14, master; sub: ConnectedEstablished, 1 active
  172.17.6.70:7379: Standalone v4.0.14, replica; sub: ConnectedEstablished, 1 active

If I use `debug sleep 60` instead, the timeout errors stop around 60 seconds and this last event happens around 01:10, and so on.
I am re-using the same ConnectionMultiplexer instance across the whole test, instead of repeatedly calling ConnectionMultiplexer.Connect() and redis.Close().
I did not yet spend time diving into the code, but my wild guess is that the connections in the connection pool might need a refresh for connections that still point to the old master.
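One way to inspect which role the client has currently recorded for each endpoint, and (with allowAdmin enabled) to nudge it into re-running its configuration pass, is something like this sketch (diagnostic only, not a proposed fix):

```csharp
// Dump the role the multiplexer has recorded per endpoint.
// _redis is the shared ConnectionMultiplexer; requires AllowAdmin=true
// for the Configure() call below.
foreach (var endpoint in _redis.GetEndPoints())
{
    var server = _redis.GetServer(endpoint);
    Console.WriteLine(
        $"{endpoint}: connected={server.IsConnected}, replica={server.IsReplica}");
}

// Ask the multiplexer to re-run its configuration pass, which should
// re-discover the current master from the reachable nodes.
_redis.Configure();
```

If the dump shows two endpoints with replica=false while Sentinel reports only one master, that would confirm the stale-role theory.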
Updates:
I also tried the scenario with an older version, 1.2.6 (it has no sentinel mode, so the connection string is a list of all nodes' IPs). Failover occurs in around 3 seconds, and SE.Redis 1.2.6 is able to redirect traffic to the new master instead of timing out.
I then wondered whether this issue is sentinel-mode specific, so I tried 2.2.79 without sentinel mode and still hit the issue.
@Victor-Tseng what errors are you getting while this happens? This is a set of scenarios I'd like to poke at soon, along the lines of "we're getting a lot of X errors, let's re-check state", or re-checking on some backoff throttle, which may help this and some cluster/replica scenarios. The errors you get while this happens would help a lot!
@NickCraver After the config change event at 00:10, the driver should be fully aware of the new master 172.17.6.18. However, we still hit timeout errors for both GET and SET after that event, as shown below. The log shows that it is still trying to access the old master at 172.17.6.70, and to me it looks like some connections are not swung over to the new master (not every GET/SET request after this config change event times out).
Timeout performing GET (5000ms), next: SET TimeNow, inst: 0, qu: 0, qs: 19, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: 172.17.6.70:7379, mc: 1/1/0, mgr: 10 of 10 available, clientName: VT108, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=8,Free=2039,Min=8,Max=2047), v: 2.2.79.4591 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
Timeout performing SET (5000ms), next: SET TimeNow, inst: 0, qu: 0, qs: 19, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0, serverEndpoint: 172.17.6.70:7379, mc: 1/1/0, mgr: 10 of 10 available, clientName: VT108, IOCP: (Busy=0,Free=1000,Min=8,Max=1000), WORKER: (Busy=8,Free=2039,Min=8,Max=2047), v: 2.2.79.4591 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)
I put together a test bed to see how the client reacts to failovers here.
> The log shows that it is still trying to access the old master at 172.17.6.70 and to me it looks like some connections are not swung to new master (not every GET/SET request after this config change event gets timed-out).
This is the exact behavior I see from my test. The client registers the promotion of the former slave, but not the demotion of the former master. Client logs show 2 nodes as being master after the failover. In this state commands will timeout on the original master before being sent to the new one.