StackExchange.Redis

SignalR backplane with disabled CLUSTER command intermittently gets stuck in disconnect/reconnect loop

Open tylerohlsen opened this issue 3 years ago • 6 comments

Problem Description

Very similar to issue #2012, we are intermittently running into a pattern of events that causes more attempts to use the CLUSTER command even though it is disabled via the connection string. It starts with a single HashSlotMoved event and then it gets stuck in a loop back and forth between ConnectionFailed and ConnectionRestored. After a while (duration varies) the events stop and we are back to normal. While this is happening, the SignalR backplane is not functional. The duration from start to finish can vary from 1 second to 10 minutes.

It seems to start with a HashSlotMoved event with args OldEndPoint = null, NewEndPoint = <MasterIpAddress>, HashSlot = 6202 (always exactly that number).

This event is odd because we don't explicitly store anything in the Redis instances. We're only using the Pub/Sub feature via a SignalR backplane. And it doesn't look like the SignalR backplane Redis adapter stores anything either (source).

Then the ConnectionFailed events start with args EndPoint = InterNetwork/<MasterDnsName>, FailureType = InternalFailure, ConnectionType = Interactive, and the below exception.

StackExchange.Redis.RedisConnectionException: InternalFailure (None, last-recv: 364) on <MasterDnsName>:6379/Interactive, Writing/ReadAsync, last: CLUSTER, origin: WriteMessageToServerInsideWriteLock, outstanding: 2, last-read: 0s ago, last-write: 0s ago, unanswered-write: 0s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 10 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: never, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
 ---> StackExchange.Redis.RedisCommandException: This operation has been disabled in the command-map and cannot be used: CLUSTER
   at StackExchange.Redis.PhysicalConnection.WriteHeader(RedisCommand command, Int32 arguments, CommandBytes commandBytes) in /_/src/StackExchange.Redis/PhysicalConnection.cs:line 715
   at StackExchange.Redis.Message.CommandValueMessage.WriteImpl(PhysicalConnection physical) in /_/src/StackExchange.Redis/Message.cs:line 1252
   at StackExchange.Redis.Message.WriteTo(PhysicalConnection physical) in /_/src/StackExchange.Redis/Message.cs:line 775
   at StackExchange.Redis.PhysicalBridge.WriteMessageToServerInsideWriteLock(PhysicalConnection connection, Message message) in /_/src/StackExchange.Redis/PhysicalBridge.cs:line 1361
   --- End of inner exception stack trace ---

And then immediately after the failed event, we get ConnectionRestored event with args EndPoint = InterNetwork/<MasterDnsName>, ConnectionType = Interactive.
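
For anyone trying to reproduce this, here is a minimal sketch of how the three events described above could be logged with timestamps, so that the length of a disconnect/reconnect loop can be measured. The endpoint is a placeholder (an assumption, not our real configuration), and the program needs a reachable Redis server to run.

```csharp
using System;
using StackExchange.Redis;

// Placeholder endpoint (assumption); "$CLUSTER=" disables the CLUSTER command.
var muxer = await ConnectionMultiplexer.ConnectAsync("localhost:6379,$CLUSTER=");

// Raised when the library learns that a hash slot is served by a different endpoint.
muxer.HashSlotMoved += (_, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} HashSlotMoved slot={e.HashSlot} old={e.OldEndPoint} new={e.NewEndPoint}");

// ConnectionFailed and ConnectionRestored share the same event-args type.
muxer.ConnectionFailed += (_, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} ConnectionFailed {e.EndPoint} {e.FailureType} {e.ConnectionType}: {e.Exception}");

muxer.ConnectionRestored += (_, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} ConnectionRestored {e.EndPoint} {e.ConnectionType}");

// Keep the process alive while events are collected.
Console.ReadLine();
```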


Questions

  1. Any thoughts on why hash slot 6202 is significant?
  2. Do you know if there is metadata stored on the server, either by Redis itself or by the .NET driver, and whether slot 6202 holds some of it?
  3. The stack trace doesn't give us any indication of where the CLUSTER command was issued from. Is there another way to determine this?
  4. I see from documentation that HashSlotMoved "will normally be automatically re-routed". Would that re-routing cause a CLUSTER command?

Thoughts

Maybe we could do something similar to what was done to fix #2012 and add another CLUSTER command availability check to short-circuit some logic? Just a wild guess, but maybe here:

https://github.com/StackExchange/StackExchange.Redis/blob/c1aaf4f990544b17a7cdcf773bbef1309c88d73c/src/StackExchange.Redis/ServerSelectionStrategy.cs#L167

tylerohlsen avatar Apr 27 '22 15:04 tylerohlsen

I'm not sure about the 6202; it may be the library trying to get the tiebreaker key (you can set it to null in config to disable it). Overall though, the CLUSTER issue in #2012 was already fixed in the latest release, so if you grab a library update it should resolve the clustering issues :)

NickCraver avatar Apr 27 '22 15:04 NickCraver

The CLUSTER command is fundamental to this library working correctly - I'm amazed things are working at all without that, quite honestly. We use CLUSTER NODES to discover the shard topology (there is also CLUSTER SLOTS and CLUSTER SHARDS, but CLUSTER NODES was the only version that existed when we wrote the cluster code). Without an overview of the topology, the performance will be terrible, as it will rely on MOVED errors to build things, and it won't have a clue where to send things initially. I don't think we've ever tested stability with CLUSTER disabled, and I'm not sure I can recommend that option.

Can I ask: why is that disabled?

Re the questions:

  1. no significance; if you're seeing that one a lot then either it corresponds to a key you're using a lot / early, or you're using hash-tags and the hash-tag is being used a lot / early; I guess there's a chance it relates to the tiebreaker key we use for non-cluster?
  2. possibly the tie-breaker, but that's a guess
  3. the library issues this
  4. yes: when we see MOVED, we have reason to believe that we don't have a good understanding of the shard topology, so we attempt to rebuild it by asking the server for the current state
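
To test the guesses in answers 1 and 2, the slot for any candidate key can be computed locally. The sketch below follows the key-to-slot rule from the Redis Cluster specification (CRC16/XMODEM of the key, or of its hash-tag if it has one, modulo 16384). It is independent of StackExchange.Redis. The tiebreaker key name used at the end is what I believe the library's default to be, so treat it as an assumption and check it against your version; no expected slot is claimed for it.

```csharp
using System;
using System.Text;

static class HashSlotSketch
{
    const int SlotCount = 16384;

    // CRC16/XMODEM: polynomial 0x1021, initial value 0, no bit reflection.
    static ushort Crc16(ReadOnlySpan<byte> data)
    {
        ushort crc = 0;
        foreach (byte b in data)
        {
            crc ^= (ushort)(b << 8);
            for (int i = 0; i < 8; i++)
            {
                crc = (crc & 0x8000) != 0
                    ? (ushort)((crc << 1) ^ 0x1021)
                    : (ushort)(crc << 1);
            }
        }
        return crc;
    }

    public static int HashSlot(string key)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(key);

        // Hash-tag rule: if the key contains "{...}" with at least one
        // character between the first '{' and the next '}', hash only that part.
        int start = Array.IndexOf(bytes, (byte)'{');
        if (start >= 0)
        {
            int end = Array.IndexOf(bytes, (byte)'}', start + 1);
            if (end > start + 1)
            {
                return Crc16(bytes.AsSpan(start + 1, end - start - 1)) % SlotCount;
            }
        }
        return Crc16(bytes) % SlotCount;
    }

    static void Main()
    {
        // Test vector from the Redis Cluster specification: CRC16("123456789") = 0x31C3.
        Console.WriteLine(HashSlot("123456789")); // 12739

        // Keys sharing a hash-tag land in the same slot.
        Console.WriteLine(HashSlot("{user1000}.following") == HashSlot("{user1000}.followers")); // True

        // Assumed default tiebreaker key name; compare the printed slot with 6202.
        Console.WriteLine(HashSlot("__Booksleeve_TieBreak"));
    }
}
```

The same function can be pointed at any channel or key name an application uses to see whether one of them maps to the slot reported in HashSlotMoved.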

mgravell avatar Apr 27 '22 15:04 mgravell

@NickCraver I already have the fix from #2012. This is similar; I imagine there must be another code path that's trying to run a CLUSTER command that isn't protected by a CommandMap.IsAvailable(RedisCommand.CLUSTER) check. Unfortunately, the stack trace doesn't give enough information to know which CLUSTER command was attempted.

tylerohlsen avatar Apr 27 '22 16:04 tylerohlsen

@mgravell I'm using a Redis cluster as a SignalR backplane and using it for PubSub only. Each node of the cluster is physically separated by a large distance. I have the networking set up so all nodes of the Redis cluster can talk to each other on a private subnet, but the applications that connect to the Redis nodes can only communicate with the Redis node in the same physical location.

This setup works very well: users that connect to the application in the same physical location have very low latency with each other, and users that connect to different physical locations have higher latency but can still receive messages across the backplane.

I disable the CLUSTER command because I don't want the applications to discover the other nodes in the cluster. I want to control which node they connect to (via the connection string). In fact, if they did discover the other nodes in the cluster, they wouldn't be able to make a connection anyway.
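
For context, this is roughly what disabling the command looks like. It is a sketch with a made-up host name (an assumption, not the real endpoint). The first form uses the connection-string syntax in which `$COMMAND=` with an empty value disables a command, and the second builds an equivalent command map in code.

```csharp
using System.Collections.Generic;
using StackExchange.Redis;

// Form 1: connection string. "$CLUSTER=" maps CLUSTER to nothing, which disables it.
// The host name is hypothetical; use the node local to the application.
var fromString = ConfigurationOptions.Parse("redis-local.example.internal:6379,$CLUSTER=");

// Form 2: the same restriction expressed in code.
var fromCode = new ConfigurationOptions
{
    EndPoints = { { "redis-local.example.internal", 6379 } },
    // available: false means the listed commands are the ones to disable.
    CommandMap = CommandMap.Create(new HashSet<string> { "CLUSTER" }, available: false),
};

// Either options object can then be passed to ConnectionMultiplexer.Connect(...),
// which requires a reachable server and is therefore left out of this sketch.
```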

tylerohlsen avatar Apr 27 '22 16:04 tylerohlsen

@tylerohlsen I'm not sure where the mismatch is, but the error message you posted has v: 2.2.88.56325, and the fix for #2012 (in #2014) wasn't released until 2.5.43 (see here: https://stackexchange.github.io/StackExchange.Redis/ReleaseNotes#2543). If you're still seeing errors with the old version, my best guess is that either the package isn't updated somewhere or the update isn't deployed. That's a key reason we put the version in the message: no guessing, and there's no other way for that version number to get in there, since it's stamped in the DLL :)
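
A quick way to confirm which build is actually loaded at runtime is to ask the assembly itself. This is a minimal sketch using standard .NET reflection only; which of the two version values corresponds to the `v:` field in the exception message is not something this sketch asserts, so print both and compare.

```csharp
using System;
using System.Reflection;
using StackExchange.Redis;

// The assembly that was actually loaded into this process.
Assembly asm = typeof(ConnectionMultiplexer).Assembly;

// Assembly version and file version can differ; print both.
Console.WriteLine($"AssemblyVersion: {asm.GetName().Version}");
Console.WriteLine($"FileVersion:     {asm.GetCustomAttribute<AssemblyFileVersionAttribute>()?.Version}");

// The path shows which copy of the DLL was picked up (may be empty for single-file apps).
Console.WriteLine($"Location:        {asm.Location}");
```

Running this inside the deployed application, rather than on a development machine, is what reveals a stale package reference or an undeployed update.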

NickCraver avatar Apr 27 '22 16:04 NickCraver

@NickCraver I thought that version stamp was incorrect because I thought I was using a MyGet pre-release version of the library. I see now that you are correct and that pre-release version is no longer referenced.

Sorry about the rabbit hole. I'll update to latest with the #2012 fix. 🤦

tylerohlsen avatar Apr 27 '22 16:04 tylerohlsen

@tylerohlsen checking in - did the package update get you sorted?

NickCraver avatar Aug 21 '22 14:08 NickCraver

Yep, all good. Thanks

tylerohlsen avatar Aug 21 '22 20:08 tylerohlsen