StackExchange.Redis icon indicating copy to clipboard operation
StackExchange.Redis copied to clipboard

Error message improvement when we're unable to reconnect

Open stambunan opened this issue 6 months ago • 9 comments
trafficstars

We've been trying to debug an intermittent issue where our application gets disconnected from Azure Redis and we're seeing dead scokets detected with a lot of timeouts.

In addition to that, we're also seeing the below message which confused me at first. The first part of the message seems to indicate an authentication failure but it seems to be an IOException issue where the application isn't able to connect to redis.

Can the message be made clearer since it's not an issue with the passwords, but there was something else going on which prevented a reconnect?

Perhaps we can update this to check for IOExceptions and return a different message?

Error: [StackExchange.Redis.ConnectionMultiplexer] It was not possible to connect to the redis server(s) server:1234/Interactive. There was an authentication failure; check that passwords (or client certificates) are configured correctly: (IOException) Unable to read data from the transport connection: Operation canceled. ConnectTimeout

stambunan avatar May 13 '25 04:05 stambunan

You say "it's not an issue with the passwords" - does that mean you're using Entra ID auth? If so, are you using the latest version of Microsoft.Azure.StackExchangeRedis?

philon-msft avatar May 13 '25 16:05 philon-msft

We're using a username/password to connect to Azure Cache for Redis and not Entra ID. I feel our issue wasn't caused by an authentication failure, but instead by a dead socket. We're still investigating why this seems to happen though.

I see the following logs as well and the (IOException) above seem to support that it's a connection problem caused by networking issues.

Warning: [StackExchange.Redis.ConnectionMultiplexer] Dead socket detected, no reads in 170 seconds with 1 timeouts, issuing disconnect

I was confused for a while about the authentication failure message when the actual issue was caused by something other than an authentication failure.

stambunan avatar May 13 '25 23:05 stambunan

The dead socket is usually a result of maintenance on low-level cloud infrastructure. How often are you seeing those?

philon-msft avatar May 14 '25 00:05 philon-msft

We're seeing it once every few days. I'm not sure if it's related but we've moved from windows to linux instances when we started noticing more of these issues. Since then, we've done a bunch of improvements by asyncifying methods, and updating to the latest stackexhcange.redis which seems to have reduced the number of errors.

Should we subscribe to https://stackexchange.github.io/StackExchange.Redis/ServerMaintenanceEvent and potentially create a new multiplexer when maintenance occurs?

stambunan avatar May 14 '25 05:05 stambunan

It sounds like you're experiencing "stalled sockets", which are more common for client apps running on Linux. StackExchange.Redis is correctly detecting and reconnecting (see #2610), so I wouldn't recommend subscribing to maintenance notifications or making any other code changes. But you may want to look into why your Redis connections are being lost abruptly. Are you running in Kubernetes or some other environment that could interfere with network connections?

More details on stalled sockets:

When a Redis client connection is lost and a client does not receive an explicit signal to close the connection via a TCP FIN or RST message, the network socket on that client will go into a stalled state. In this state the client OS cannot tell that the connection is lost and it continues trying to transmit network traffic to the Redis cache.

On Windows clients the OS will quickly recognize that there's no response from Redis, and close the socket within seconds. When StackExchange.Redis sees the socket close, it immediately creates a replacement connection and restores communication with the Redis cache.

Linux handles stalled sockets differently, causing a long delay before the Redis connection can be automatically restored. By default, most Linux distros are configured to retransmit messages on a non-responsive TCP socket for up to 15 minutes before they give up and close the socket. This means that Redis client libraries will continue to send commands to the cache for 15 minutes without receiving any errors about failed network calls. Redis commands will simply time out, making it appear that the Redis cache is unresponsive. In fact, the cache is fully available and responsive to clients that have (or create) a functioning network connection to it.

philon-msft avatar May 14 '25 15:05 philon-msft

Thanks for explaining @philon-msft.

Our setup is currently a Linux app service inside a VNET that uses a private endpoint to connect to Azure cache. We added a private endpoint to try and mitigate SNAT issues we were facing when moving from a Windows app service to a Linux one. We're unable to move the azure cache into the vnet since it was created a while ago without a vnet.

We're on the latest version, 2.8.37, which should have the fix in https://github.com/StackExchange/StackExchange.Redis/pull/2610.

We have syncTimeout set to 10 seconds, and we should be forcing a disconnect after 10 * 4 (40 seconds). However, I'm seeing a few logs where no reads have been done in 200+ seconds before a disconnect happens. It's a bit hard to follow the logs since we have a few multiplexers instantiated (long living) and the logs don't indicate which multiplexer it's coming from. I'm not sure if the codepath is hit multiple times and I'm just seeing logs from one multiplexer, but I suppose that https://github.com/StackExchange/StackExchange.Redis/blob/4b6ab83ab39cbd28f5b2c72789f7fbd6842dbdf8/src/StackExchange.Redis/PhysicalBridge.cs#L635 would dispose the physical connection?

Warning: [StackExchange.Redis.ConnectionMultiplexer] Dead socket detected, no reads in 212 seconds with 11 timeouts, issuing disconnect

Do you have any suggestions on our setup or why in some cases it takes a while to issue a disconnect?

stambunan avatar May 15 '25 00:05 stambunan

Many Linux distros set the tcp_retries2 to 15. Combined with exponential retries, this can add up to 15 minutes easily. Imagine a silent disconnnect on a socket would cause to OS to wait that amount of time to finally release the socket I/O and return a 0-byte read. Have you tried capturing a tcpdump full verbose or running netstat -tanop and check the timers column to see how often has an connection retried and how many bytes have are stuck in the send or receive buffer?

milope avatar May 15 '25 02:05 milope

I thought the newer versions of StackExchange.Redis have been updated so it doesn't wait the 15 minutes? https://github.com/StackExchange/StackExchange.Redis/issues/2595

We're also running a docker container on linux app service which is making it tough to do a tcp dump since we need to install it in the container.

stambunan avatar May 20 '25 01:05 stambunan

I thought the newer versions of StackExchange.Redis have been updated so it doesn't wait the 15 minutes? https://github.com/StackExchange/StackExchange.Redis/issues/2595

We're also running a docker container on linux app service which is making it tough to do a tcp dump since we need to install it in the container.

Try running netstat -tanop several times during an incident. If Recv-Q and Send-Q are leaving bytes on the send and receive buffers, those numbers will rise while the timers for retransmission keep rising. If the numbers stay in 0, then I would lean towards application threads being blocked rather than socket I/O.

miklopz avatar May 21 '25 00:05 miklopz