
StackExchange.Redis.RedisConnectionException: SocketClosed on Azure Redis Cache

Open ms92ita opened this issue 4 years ago • 17 comments

In the last 2 months, we have encountered many connection errors like "No connection is available", but the most blocking are the SocketClosed errors.

The first error type lasts a few seconds and then goes away by itself, but the second (SocketClosed) blocks all connections and Redis-related application tasks until we manually restart the machine and/or kill the application pool.

Error details: StackExchange.Redis.RedisConnectionException: SocketClosed on StackExchange.Redis.RedisConnectionException: No connection is available to service this operation: GET geoinfos_states; SocketClosed on XXXXXXXXXX.redis.cache.windows.net:6380/Interactive, Idle/MarkProcessed, last: SETEX, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 54s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.0.601.3402; IOCP: (Busy=1,Free=999,Min=4,Max=1000), WORKER: (Busy=0,Free=32767,Min=4,Max=32767), Local-CPU: n/a ---> StackExchange.Redis.RedisConnectionException: SocketClosed on XXXXXXXXXX.redis.cache.windows.net:6380/Interactive, Idle/MarkProcessed, last: SETEX, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 54s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.0.601.3402

Code for Redis connections:

// Multiplexer is created lazily on first use and shared for the lifetime of the process.
private static Lazy<ConnectionMultiplexer> lazyConnection = new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(GetConnectionString("RedisCacheConnection")));

public static ConnectionMultiplexer Connection { get { return lazyConnection.Value; } }

In other issues here, upgrading the library does not definitively resolve the problem. Can we manage and/or find a workaround for these connection errors (with priority on the socket issue) without restarting the service (for example, by recreating the multiplexer)?

Version in use: v2.0.601.3402
Related Service: Azure Redis Cache, C2 tier, single instance
Our application is running .NET v4.7.2

Connection pattern in use: XXXXXXXXXX.redis.cache.windows.net,abortConnect=false,ssl=true,password=XXXXXXXXXX,syncTimeout=30000,allowAdmin=true

ms92ita avatar Jan 11 '21 09:01 ms92ita

There are additional fixes for this in the 2.2.x series, but we're working with the Azure team to detect when they do maintenance on the server side and make the library handle it much more quickly when a failover is going to happen behind the scenes.

cc @deepakverma on this one - something we're actively looking at.

NickCraver avatar May 20 '21 23:05 NickCraver

@NickCraver Is this the issue to follow to see updates on this situation, or is there another one that tracks related development?

jaswope avatar May 21 '21 14:05 jaswope

"Can we manage and/or find a workaround for these connection errors (with priority on the socket issue) without restarting the service (for example, by recreating the multiplexer)?"

@ms92ita here's an approach to follow: https://gist.github.com/JonCole/925630df72be1351b21440625ff2671f#reconnecting-with-lazyt-pattern
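
For anyone landing here later, below is a condensed sketch of that Lazy<T> reconnect idea; it assumes the GetConnectionString helper from the original post, and the ForceReconnect name, one-minute throttle, and error-counting details from the full gist are simplified and illustrative, not part of the library's API.

using System;
using System.Threading;
using StackExchange.Redis;

public static class RedisConnectionHolder
{
    // Swappable lazy multiplexer: callers always go through Connection.
    private static Lazy<ConnectionMultiplexer> lazyConnection = CreateLazy();
    private static long lastReconnectTicks = DateTimeOffset.MinValue.UtcTicks;
    private static readonly object reconnectLock = new object();
    private static readonly TimeSpan ReconnectMinInterval = TimeSpan.FromSeconds(60);

    public static ConnectionMultiplexer Connection { get { return lazyConnection.Value; } }

    private static Lazy<ConnectionMultiplexer> CreateLazy()
    {
        return new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(GetConnectionString("RedisCacheConnection")));
    }

    // Call from catch blocks for RedisConnectionException / SocketException.
    public static void ForceReconnect()
    {
        var previous = new DateTimeOffset(Interlocked.Read(ref lastReconnectTicks), TimeSpan.Zero);
        if (DateTimeOffset.UtcNow - previous < ReconnectMinInterval)
            return; // throttle: at most one forced reconnect per minute

        lock (reconnectLock)
        {
            previous = new DateTimeOffset(Interlocked.Read(ref lastReconnectTicks), TimeSpan.Zero);
            if (DateTimeOffset.UtcNow - previous < ReconnectMinInterval)
                return; // another thread already reconnected

            var old = lazyConnection;
            lazyConnection = CreateLazy(); // subsequent callers get a fresh multiplexer
            Interlocked.Exchange(ref lastReconnectTicks, DateTimeOffset.UtcNow.UtcTicks);

            try
            {
                if (old.IsValueCreated) old.Value.Close(); // tear down the wedged connection
            }
            catch { /* the old connection is already broken; nothing useful to do */ }
        }
    }

    private static string GetConnectionString(string name)
    {
        // Placeholder for the application's own configuration lookup.
        return Environment.GetEnvironmentVariable(name);
    }
}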

deepakverma avatar May 26 '21 21:05 deepakverma

We've been working hard on connections, especially in cloud/high-latency environments. For anyone hitting connect and reconnect issues in particular, please give 2.2.50 or higher a try (available on NuGet) - it has quite a few improvements in this area and better logging for anything you hit. If you still see an issue after upgrading, please post the connection logs, as they'll be more helpful than ever.
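
If it helps anyone gathering those logs, here's a small sketch of capturing them; the file name, the RedisCacheConnection setting name, and the console logging are placeholders, not something prescribed by the library.

using System;
using System.IO;
using StackExchange.Redis;

static class RedisDiagnostics
{
    static ConnectionMultiplexer ConnectWithLog()
    {
        ConnectionMultiplexer muxer;

        // The optional TextWriter parameter captures the connection/handshake log
        // that is useful to attach when reporting connect issues.
        using (var log = new StreamWriter("redis-connect.log"))
        {
            muxer = ConnectionMultiplexer.Connect(
                Environment.GetEnvironmentVariable("RedisCacheConnection"), log);
        }

        // Record later connection drops/restores as they happen.
        muxer.ConnectionFailed += (s, e) =>
            Console.WriteLine($"ConnectionFailed: {e.EndPoint} {e.FailureType} {e.Exception?.Message}");
        muxer.ConnectionRestored += (s, e) =>
            Console.WriteLine($"ConnectionRestored: {e.EndPoint} {e.FailureType}");

        return muxer;
    }
}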

NickCraver avatar Jun 23 '21 02:06 NickCraver

Still seeing these errors on 2.2.62. They occur randomly, and when they do, it's always several in a row. After that, it is quiet again, sometimes for days.

SocketClosed (ReadEndOfStream, last-recv: 0) on xyz.redis.cache.windows.net:6379/Interactive, Idle/MarkProcessed, last: PING, origin: ReadFromPipe, outstanding: 1, last-read: 0s ago, last-write: 0s ago, unanswered-write: 0s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.62.27853

DarthSonic avatar Jul 06 '21 11:07 DarthSonic

@NickCraver We have a memory dump from when this issue was reproducing on 2.2.4. We just upgraded to 2.2.62 and are currently monitoring to see if it happens again. Unfortunately, we cannot share the memory dump, as it's from our production systems, but we would be more than happy to provide any kind of info from it (the state of any StackExchange.Redis object). Would that help? Later edit: the issue happened with 2.2.62 too; we have a memory dump from that one as well.

cosminvlad avatar Jul 24 '21 06:07 cosminvlad

We are getting the following message 8 times, but that happens more than 15 minutes after everything blocks in Redis:

SocketClosed on xxx.redis.cache.windows.net:6380/Interactive, Flushing/MarkProcessed, last: HMSET, origin: ReadFromPipe, outstanding: 1406, last-read: 0s ago, last-write: 922s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.62.27853

This can be seen on only one out of 6-8 containers. It happened several days in a row, and not always with the same pod.

I'm also attaching an analysis done on version 2.2.62, where you can see some of the internal state of the library. Happy to provide more if needed: core_20210724_063807-redis-analysis.txt

cosminvlad avatar Jul 24 '21 14:07 cosminvlad

@cosminvlad Is anything happening on the server side here? It looks like the socket is actually getting torn down - any reason that may be happening in the environment?

NickCraver avatar Jul 25 '21 21:07 NickCraver

@NickCraver We suspect a networking issue in Azure, as we see disconnects of both Redis and SignalR connections in multiple environments distributed across a few geographies. Maybe the fact that the app is deployed in Azure Kubernetes Service plays a role in it. We are working with Microsoft Support to figure out that part.

What clues do you see in the dump that the socket is getting torn down? I dumped the SafeSocketHandle of the Socket from PhysicalConnection and its _state is 4, which means it is not closed or disposed: safesockethandle.txt

cosminvlad avatar Jul 26 '21 07:07 cosminvlad

I'm going to leave some findings from the memory dump analysis that @cosminvlad mentioned above. First, let me describe what we're observing:

  • Initially, it starts with a few hundred RedisTimeoutExceptions (based on Application Insights telemetry)
  • Then the exceptions stop, but HTTP requests stop advancing (based on the current-requests event counter from the Microsoft.AspNetCore.Hosting event source)
  • 7-8 minutes after the initial timeout exceptions, we managed to take a memory dump (one for 2.2.4 and another for 2.2.62)
  • 15 minutes after the initial timeouts, we usually observe the first RedisConnectionException (SocketClosed connection failures) and a burst of timeout exceptions (thousands)

The first thing we did was run the !dumpasync SOS command, which outputs async state machine stats; it revealed that most requests (7K) were "stuck" awaiting a Redis command to complete. The oldest requests were a few minutes old, so one of our concerns was why Redis commands weren't timing out after 5000 ms.

In the 2.2.4 memory dump, the PhysicalBridge backlog had 26K messages in it (for 2.2.62, which switched to a ConcurrentQueue, I wasn't able to count the items, but I suspect a similar number). The PhysicalConnection also had about 1K items in its _writtenAwaitingResponse queue. The difference between the oldest and newest message in both queues is about 7-8 minutes (based on the Message.CreatedDateTime property).

I managed to find the head of the backlog, and it was a FireAndForget message. If I'm reading the code correctly, the periodic "heartbeat" that checks the backlog for timeouts will not advance if the message doesn't have an async timeout, which is the case for FireAndForget messages.
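
For readers less familiar with the term: a fire-and-forget command is one issued with CommandFlags.FireAndForget, so no caller is awaiting a result (and hence no per-call timeout) for it. A quick sketch, with placeholder keys and "muxer" standing in for the shared multiplexer:

using System.Threading.Tasks;
using StackExchange.Redis;

static class FireAndForgetSketch
{
    static async Task Demo(ConnectionMultiplexer muxer)
    {
        IDatabase db = muxer.GetDatabase();

        // Awaited call: the caller observes the configured sync/async timeout if no reply arrives.
        await db.StringSetAsync("key-a", "value-a");

        // Fire-and-forget: the call returns immediately and no result is ever awaited,
        // so there is no caller-side timeout attached to this message.
        db.StringSet("key-b", "value-b", flags: CommandFlags.FireAndForget);
    }
}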

The PhysicalConnection is stuck in "Flushing" status:

0:000> !DumpObj /d 00007f0190abda48
Name:        StackExchange.Redis.PhysicalConnection
MethodTable: 00007f0bafc80fa8
EEClass:     00007f0bafc6d008
Size:        136(0x88) bytes
File:        /app/StackExchange.Redis.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007f0ba66e0b48  400029d        8        System.Byte[]  0 instance 0000000000000000 ChannelPrefix
00007f0bafc5e810  40002a5       58         System.Int32  1 instance                1 connectionType
00007f0bafc89030  40002a6       10 ...kExchange.Redis]]  0 instance 00007f0190abdad0 _writtenAwaitingResponse
00007f0ba6647a60  40002a7       18        System.String  0 instance 00007f0190abea68 _physicalName
00007f0ba658b158  40002a8       5c         System.Int32  1 instance                0 currentDatabase
00007f0bafc80e50  40002a9       78          System.Byte  1 instance                0 currentReadMode
00007f0ba658b158  40002aa       60         System.Int32  1 instance                0 failureReported
00007f0ba658b158  40002ab       64         System.Int32  1 instance         61676312 lastWriteTickCount
00007f0ba658b158  40002ac       68         System.Int32  1 instance         61659455 lastReadTickCount
00007f0ba658b158  40002ad       6c         System.Int32  1 instance         62115244 lastBeatTickCount
00007f0bafc5fc18  40002ae       20 ...lines.IDuplexPipe  0 instance 00007f0190ac13f0 _ioPipe
00007f0bab0c2990  40002af       28 ...et.Sockets.Socket  0 instance 00007f0190abec30 _socket
00007f0ba8ed5b90  40002b0       30 System.WeakReference  0 instance 00007f0190abea50 _bridge
00007f0ba658c4d8  40002b1       50         System.Int64  1 instance                0 <SubscriptionCount>k__BackingField
00007f0ba6587138  40002b2       79       System.Boolean  1 instance                0 <TransactionActive>k__BackingField
00007f0bafc7d528  40002b3       70         System.Int32  1 instance                3 _writeStatus
00007f0ba90a1680  40002b7       38 ...lationTokenSource  0 instance 00007f0190ac6f60 _reusableFlushSyncTokenSource
00007f0bafc7dfe8  40002bb       40 ...nge.Redis.Message  0 instance 0000000000000000 _activeMessage
00007f0bafc8a7c8  40002bd       48 ...kExchange.Redis]]  0 instance 00007f0190abdb10 _arena
00007f0bafc7d3d8  40002be       74         System.Int32  1 instance                5 _readStatus
00007f0bafc7adc0  400029f       a8 ...edis.CommandBytes  1   static 00007f079003bd48 message
00007f0bafc7adc0  40002a0       b0 ...edis.CommandBytes  1   static 00007f079003bd50 pmessage
00007f0bafc8ab98  40002a1       c8 ...e.Redis.Message[]  0   static 00007f0190abdc88 ReusableChangeDatabaseCommands
00007f0bafc7dfe8  40002a2       d0 ...nge.Redis.Message  0   static 00007f0190abe3a0 ReusableReadOnlyCommand
00007f0bafc7dfe8  40002a3       d8 ...nge.Redis.Message  0   static 00007f0190abe408 ReusableReadWriteCommand
00007f0ba658b158  40002a4      1d0         System.Int32  1   static                4 totalCount
00007f0ba7cfaa60  40002b8       b8 ...Private.CoreLib]]  1   static 00007f079003bd58 NullBulkString
00007f0ba7cfaa60  40002b9       c0 ...Private.CoreLib]]  1   static 00007f079003bd60 EmptyBulkString
00007f0bafc893c0  40002bc       e0 ...enas.ArenaOptions  0   static 00007f0190abe4b0 s_arenaOptions
00007f0ba79bad38  40002ba        8  System.Text.Encoder  0 TLstatic  s_PerThreadEncoder

Above, _writeStatus is Flushing and the difference between lastBeatTickCount and lastWriteTickCount is 438,932 milliseconds (> 7 min). If I'm reading the code correctly, lastWriteTickCount is updated just after the flush completes, which could mean the flush operation has been going on for more than 7 minutes. This explains why backlog processing, which also checks for timeouts, isn't advancing, and why Redis commands aren't timing out after 5 seconds.
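
The subtraction behind that figure, using the tick counts from the dump (these are millisecond counters):

// Quick check of the "> 7 min" claim using the values dumped above.
int lastBeatTickCount  = 62115244;
int lastWriteTickCount = 61676312;
int elapsedMs = unchecked(lastBeatTickCount - lastWriteTickCount); // 438932 ms, roughly 7.3 minutes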

Here's the tip of the "async stack" for the flush, found with !dumpasync:

00007f0196c0d3f0 00007f0bbb84bd08      168          0 StackExchange.Redis.PhysicalConnection+<FlushAsync_Awaited>d__79
Async "stack":
.00007f0196c0d4e0 (0) StackExchange.Redis.PhysicalBridge+<CompleteWriteAndReleaseLockAsync>d__100
..00007f0196c0d5a0 (0) StackExchange.Redis.ConnectionMultiplexer+<ExecuteAsyncImpl_Awaited>d__211`1[[System.Boolean, System.Private.CoreLib]]

I also looked into why the pipe's flush isn't completing; for that, I dumped the _ioPipe._writePipe object:

0:000> !DumpObj /d 00007f0190ac13f0
Name:        Pipelines.Sockets.Unofficial.StreamConnection+AsyncStreamPipe
MethodTable: 00007f0bafcb1678
EEClass:     00007f0bafc9f118
Size:        64(0x40) bytes
File:        /app/Pipelines.Sockets.Unofficial.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007f0bafc73068  400012d        8 ...IO.Pipelines.Pipe  0 instance 00007f0190ac1430 _readPipe
00007f0bafc73068  400012e       10 ...IO.Pipelines.Pipe  0 instance 00007f0190ac16a0 _writePipe
00007f0ba6ae8af0  400012f       18     System.IO.Stream  0 instance 00007f0190ac0010 _inner
00007f0ba6647a60  4000130       20        System.String  0 instance 00007f0190abd8a0 <Name>k__BackingField
00007f0ba658c4d8  4000131       28         System.Int64  1 instance       9419807132 _totalBytesSent
00007f0ba658c4d8  4000132       30         System.Int64  1 instance       1945899205 _totalBytesReceived
0:000> !DumpObj /d 00007f0190ac16a0
Name:        System.IO.Pipelines.Pipe
MethodTable: 00007f0bafc73068
EEClass:     00007f0bafc66510
Size:        384(0x180) bytes
File:        /usr/share/dotnet/shared/Microsoft.AspNetCore.App/5.0.8/System.IO.Pipelines.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007f0ba6580c00  400001b        8        System.Object  0 instance 00007f0190ac1820 _sync
00007f0bae275a30  400001c       10 ...Private.CoreLib]]  0 instance 0000000000000000 _pool
00007f0ba658b158  400001d       78         System.Int32  1 instance             8192 _minimumSegmentSize
00007f0ba658c4d8  400001e       50         System.Int64  1 instance           524288 _pauseWriterThreshold
00007f0ba658c4d8  400001f       58         System.Int64  1 instance           262144 _resumeWriterThreshold
00007f0bae27f4b8  4000020       18 ...nes.PipeScheduler  0 instance 00007f0190ab65b0 _readerScheduler
00007f0bae27f4b8  4000021       20 ...nes.PipeScheduler  0 instance 00007f0190ab65b0 _writerScheduler
00007f0bafc725d8  4000022       90 ...ufferSegmentStack  1 instance 00007f0190ac1730 _bufferSegmentPool
00007f0bafc73920  4000023       28 ...DefaultPipeReader  0 instance 00007f0190ac18d0 _reader
00007f0bafc73c30  4000024       30 ...DefaultPipeWriter  0 instance 00007f0190ac18f0 _writer
00007f0ba6587138  4000025       8c       System.Boolean  1 instance                0 _useSynchronizationContext
00007f0ba658c4d8  4000026       60         System.Int64  1 instance           524377 _unconsumedBytes
00007f0ba658c4d8  4000027       68         System.Int64  1 instance                0 _unflushedBytes
00007f0bafc72980  4000028       a0 ...nes.PipeAwaitable  1 instance 00007f0190ac1740 _readerAwaitable
00007f0bafc72980  4000029       d8 ...nes.PipeAwaitable  1 instance 00007f0190ac1778 _writerAwaitable
00007f0bafc72c98  400002a      110 ...es.PipeCompletion  1 instance 00007f0190ac17b0 _writerCompletion
00007f0bafc72c98  400002b      138 ...es.PipeCompletion  1 instance 00007f0190ac17d8 _readerCompletion
00007f0ba658c4d8  400002c       70         System.Int64  1 instance                0 _lastExaminedIndex
00007f0bafc74ad0  400002d       38 ...nes.BufferSegment  0 instance 00007f0190ac1910 _readHead
00007f0ba658b158  400002e       7c         System.Int32  1 instance                0 _readHeadIndex
00007f0ba658b158  400002f       80         System.Int32  1 instance                0 _maxPooledBufferSize
00007f0ba6587138  4000030       8d       System.Boolean  1 instance                0 _disposed
00007f0bafc74ad0  4000031       40 ...nes.BufferSegment  0 instance 00007f0196c0b338 _readTail
00007f0ba658b158  4000032       84         System.Int32  1 instance             1366 _readTailIndex
00007f0bafc74ad0  4000033       48 ...nes.BufferSegment  0 instance 00007f0196c0b338 _writingHead
00007f0ba7cfb158  4000034      160 ...Private.CoreLib]]  1 instance 00007f0190ac1800 _writingHeadMemory
00007f0ba658b158  4000035       88         System.Int32  1 instance                0 _writingHeadBytesBuffered
00007f0bafc72f38  4000036      170 ...ipeOperationState  1 instance 00007f0190ac1810 _operationState
00007f0ba91129d8  4000014       10 ...Private.CoreLib]]  0   static 00007f02105db2c0 s_signalReaderAwaitable
00007f0ba91129d8  4000015       18 ...Private.CoreLib]]  0   static 00007f02105db318 s_signalWriterAwaitable
00007f0ba91129d8  4000016       20 ...Private.CoreLib]]  0   static 00007f02105db358 s_invokeCompletionCallbacks
00007f0ba9112b30  4000017       28 ...g.ContextCallback  0   static 00007f02105db398 s_executionContextRawCallback
00007f0ba924d680  4000018       30 ...endOrPostCallback  0   static 00007f02105db3d8 s_syncContextExecutionContextCallback
00007f0ba924d680  4000019       38 ...endOrPostCallback  0   static 00007f02105db418 s_syncContextExecuteWithoutExecutionContextCallback
00007f0ba91129d8  400001a       40 ...Private.CoreLib]]  0   static 00007f02105db458 s_scheduleWithExecutionContextCallback

Above, _unconsumedBytes is greater than _pauseWriterThreshold; I'm assuming that means the flush isn't completing because of backpressure - the reader on the other end of the pipe isn't advancing fast enough, or at all. I went looking for CopyFromWritePipeToStream from the Pipelines.Sockets.Unofficial async state machines and found this (the Pipelines.Sockets.Unofficial.StreamConnection+AsyncStreamPipe reference on the state machine matches the one in the PhysicalConnection displayed above):

00007f0194eabcc0 00007f0bb34e0b90 144 0 System.Net.Security.SslStream+<<WriteSingleChunk>g__CompleteWriteAsync|177_1>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]]
Async "stack":
.00007f0194eabd50 (0) System.Net.Security.SslStream+<WriteAsyncInternal>d__184`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]]
..00007f03108acc00 (2) Pipelines.Sockets.Unofficial.StreamConnection+AsyncStreamPipe+<CopyFromWritePipeToStream>d__20
...00007f03108acd20 (Pipelines.Sockets.Unofficial.Helpers+<>c.<PipelinesFireAndForget>b__14_0(System.Threading.Tasks.Task)) System.Threading.Tasks.ContinuationTaskFromTask

It looks like reading from the write pipe (CopyFromWritePipeToStream) isn't advancing because it's stuck on an SslStream.WriteAsyncInternal call. This is as far as I've gotten with the dump analysis; I hope I got at least some of it right.
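
For context, the pause/resume thresholds in the dump are standard System.IO.Pipelines backpressure; here is a minimal illustration of the mechanism (not the library's actual pipe setup - the values just mirror the dump):

using System.IO.Pipelines;

class BackpressureSketch
{
    static Pipe CreatePipe()
    {
        // Once unconsumed bytes exceed pauseWriterThreshold, the writer's FlushAsync stops
        // completing until the reader drains the pipe back below resumeWriterThreshold.
        return new Pipe(new PipeOptions(
            pauseWriterThreshold: 524288,
            resumeWriterThreshold: 262144));
    }
}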

We've suspected this to be a network issue; however, Azure support engineers haven't found any evidence of network instability in their metrics or in the TCP dumps taken.

vas6ili avatar Jul 26 '21 16:07 vas6ili

@vas6ili @cosminvlad is it possible for you to share the Azure ticket number?

deepakverma avatar Jul 27 '21 16:07 deepakverma

@deepakverma Sure, ticket number is 2107210050001789

vas6ili avatar Jul 27 '21 17:07 vas6ili

For the past few days, the exception has started happening again, several times a day:

SocketClosed (ReadEndOfStream, last-recv: 0) on xyz.redis.cache.windows.net:6379/Interactive, Idle/MarkProcessed, last: INFO, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 44s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325

DarthSonic avatar Feb 11 '22 09:02 DarthSonic

cc @philon-msft since "several times a day" is odd for failover situations

NickCraver avatar Feb 11 '22 12:02 NickCraver

@DarthSonic there are a few factors that can cause errors like this for Redis client apps running in Kubernetes. Typically, it's noisy neighbor pods, service mesh sidecars like Istio intercepting traffic, or updates being applied to cluster nodes. The unique requirements in Redis clients for persistent connections and unusual ports can make them more sensitive to the complex environment of a cluster.

Reviewing recent changes in app deployments or environment configuration might provide an explanation. Failing that, a packet capture during a connection loss may help.

philon-msft avatar Feb 11 '22 21:02 philon-msft

For the past few days, we have noticed the SocketClosed exception:

Idle/MarkProcessed, last: PING, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 22s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.79.4591

AngelaCalboreanVisma avatar Mar 04 '22 04:03 AngelaCalboreanVisma

@AngelaCalboreanVisma if you're using an Azure Cache for Redis, your cache was probably undergoing routine planned maintenance. For more info see: https://docs.microsoft.com/en-us/azure/azure-cache-for-redis/cache-failover

See the "Build in resiliency" section in that document for guidance on how to make your application resilient to the connection failovers that occur during maintenance in cloud-hosted Redis.

philon-msft avatar Mar 04 '22 04:03 philon-msft

Best advice is above - if anyone else is hitting this please upgrade to a newer version where we handle disconnects more and more gracefully/automatically. Closing out to tidy up.

NickCraver avatar Aug 21 '22 14:08 NickCraver