Seamlessly handling ElastiCache (AWS) failovers?
Hi, is anyone else connecting to a Multi-AZ ElastiCache cluster and able to handle a failover seamlessly, so that the client blocks until the connection is reestablished and then resumes? I'm using these options:
```csharp
var options = ConfigurationOptions.Parse(address);
options.ReconnectRetryPolicy = new ExponentialRetry(250);
options.SyncTimeout = 10000;
options.AsyncTimeout = 10000;
options.AbortOnConnectFail = false;
```
but when I initiate a failover, I immediately get this exception:
redis_connection_failed_message: Interactive#1@#####.usw2.cache.amazonaws.com:6379 (Idle) redis_connection_failed_message: Subscription#2@#####.usw2.cache.amazonaws.com:6379 (Idle)
Unhandled Exception: StackExchange.Redis.RedisConnectionException: No connection is active/available to service this operation: GET pubsub.test.value; SocketClosed (ReadEndOfStream, last-recv: 0) on #####.usw2.cache.amazonaws.com:6379/Subscription, Idle/MarkProcessed, last: SUBSCRIBE, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 21s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.50.36290, mc: 1/1/0, mgr: 10 of 10 available, clientName: lax-jchow03-mac, IOCP: (Busy=0,Free=200,Min=16,Max=200), WORKER: (Busy=0,Free=1600,Min=16,Max=1600), v: 2.2.50.36290 ---> StackExchange.Redis.RedisConnectionException: SocketClosed (ReadEndOfStream, last-recv: 0) on #####.usw2.cache.amazonaws.com:6379/Subscription, Idle/MarkProcessed, last: SUBSCRIBE, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 21s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.50.36290 --- End of inner exception stack trace ---
Is there another group of settings I'm missing? (I don't think 'connectRetry' is relevant)
Thanks in advance, Jeff
@jchowdown: Did you have any success? I am in a similar situation, though not on AWS: I'm on self-hosted Kubernetes.
There's nothing built in to help with AWS here, but we are working with the Azure folks (who reached out to us) on something and making it reusable (basically a pub/sub channel we'll listen on so we can proactively reconnect when we know it's coming). If anyone knows anyone at AWS, I'd welcome making connections to their service more reliable in the same way - see #1876 as a template.
AWS reports maintenance events using AWS SNS notifications, not through Redis' pub/sub mechanism. What would be the best way to inject external messages to the client?
@shachlanAmazon If y'all can provide them via pub/sub, the client can easily listen - we can't consume any other source from the base library (without adding dependencies, which we strive to avoid). The choices are: wrapping the library and maintaining that wrapper, or a pub/sub channel, which we can build support for in the library and tie to recognized AWS endpoints. If there are DNS patterns to recognize, we already have full support for that and just need the details to add AWS. See the PR above - that's the preferred method and would be useful for any client to implement.
@NickCraver We're looking into adding pub/sub maintenance events.
ATM, what should be the user's expectation regarding a cluster mode disabled server with replicas? I see that the Jedis client can handle failover there, and use the promoted replica as the new master. Is this possible in SE.R?
In my tests, setting values after a failover throws a timeout exception if the attempt happens shortly after the failover, and a connection exception (No connection is active/available to service this operation) if a couple of minutes pass between the failover and the set attempt. This difference seems to be determined by whether a heartbeat was sent, which triggered a call to OnDisconnected on the server endpoint.
Can the replicas be reevaluated if a heartbeat disconnects the primary, to see if one of them was promoted to primary? Is it possible to check if an endpoint is connected if a request times out, and again reevaluate the replicas if it isn't connected?
@shachlanAmazon The expectation at the moment is "some heartbeats will recover it". It's actually next up on my list after #2050, to see how we can better recognize "hey, something's off - go reconfigure" like we do with .SetAuthSuspect() today. It takes a bit of setup so need to take a few evenings and do that (test suite doesn't have replicas to the cluster atm).
TL;DR: The current behavior sucks for some scenarios we don't get a heads-up on, because AFAIK there's no event from Redis to let us know this happened (unlike, say, Sentinel, which publishes changes). Definitely on the radar to improve - I count 7 issues right now which are all ultimately this same scenario.
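Until something like that exists in the library, one possible stopgap is to retry on the caller side when either of the two exceptions reported in this thread is thrown. This is only a minimal sketch under stated assumptions: `FailoverRetry` and `ExecuteAsync` are hypothetical names, not part of StackExchange.Redis, and the 5-second backoff cap is arbitrary. It also assumes the multiplexer was created with `AbortOnConnectFail = false` so that it keeps reconnecting in the background.

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Hypothetical helper (not a StackExchange.Redis API): retries an operation
// while the connection is down, giving up once maxWait would be exceeded.
public static class FailoverRetry
{
    public static async Task<T> ExecuteAsync<T>(
        Func<Task<T>> operation,
        TimeSpan maxWait,
        TimeSpan initialDelay)
    {
        var deadline = DateTime.UtcNow + maxWait;
        var delay = initialDelay;

        while (true)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            // Only the two failure modes described in this thread are retried.
            // Once the next wait would pass the deadline, the filter is false
            // and the exception propagates to the caller unchanged.
            catch (Exception ex) when (
                (ex is RedisConnectionException || ex is RedisTimeoutException)
                && DateTime.UtcNow + delay < deadline)
            {
                await Task.Delay(delay).ConfigureAwait(false);

                // Exponential backoff, capped at an arbitrary 5 seconds.
                delay = TimeSpan.FromMilliseconds(
                    Math.Min(delay.TotalMilliseconds * 2, 5000));
            }
        }
    }
}
```

Usage would look something like `var value = await FailoverRetry.ExecuteAsync(() => db.StringGetAsync("pubsub.test.value"), TimeSpan.FromSeconds(60), TimeSpan.FromMilliseconds(250));`. Note that a command that timed out may still have been applied on the server, so retrying is only safe for idempotent operations such as GET or a plain SET.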
Great, happy to know this is a known issue. Anything I can do to help? As I mentioned, I have an EC2 machine on which I can consistently reproduce such a failover.
@shachlanAmazon If you're in a position to test a branch when ready, that'd be awesome - are you in a position to build/run such a build locally? If not, can figure out package logistics.
I can test a branch on Linux or Windows machines using an ElastiCache server.
Hi @NickCraver
> It's actually next up on my list after https://github.com/StackExchange/StackExchange.Redis/pull/2050, to see how we can better recognize "hey, something's off - go reconfigure" like we do with .SetAuthSuspect() today

Did you make progress on this topic?
Would be very interested in a reliable failover on the AWS environment.