StackExchange.Redis
StackExchange.Redis.RedisConnectionException: SocketClosed on Azure Redis Cache
In the last 2 months we have encountered many connection errors such as "No connection is available", but the most blocking are SocketClosed.
The first type lasts a few seconds and then goes away on its own, but the second (SocketClosed) blocks all connections and Redis-related application tasks until we manually restart the machine and/or kill the application pool.
Error details: StackExchange.Redis.RedisConnectionException: SocketClosed on StackExchange.Redis.RedisConnectionException: No connection is available to service this operation: GET geoinfos_states; SocketClosed on XXXXXXXXXX.redis.cache.windows.net:6380/Interactive, Idle/MarkProcessed, last: SETEX, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 54s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.0.601.3402; IOCP: (Busy=1,Free=999,Min=4,Max=1000), WORKER: (Busy=0,Free=32767,Min=4,Max=32767), Local-CPU: n/a ---> StackExchange.Redis.RedisConnectionException: SocketClosed on XXXXXXXXXX.redis.cache.windows.net:6380/Interactive, Idle/MarkProcessed, last: SETEX, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 54s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.0.601.3402
Code for Redis connections:
private static Lazy<ConnectionMultiplexer> lazyConnection = new Lazy<ConnectionMultiplexer>(
    () => ConnectionMultiplexer.Connect(GetConnectionString("RedisCacheConnection")));
public static ConnectionMultiplexer Connection { get { return lazyConnection.Value; } }
As noted in other issues here, upgrading the library does not definitively resolve the problem. Can we manage and/or work around these connection errors (with priority on the socket issue) without restarting the service, for example by recreating the multiplexer?
Version in use: v2.0.601.3402
Related service: Azure Redis Cache, C2 tier, single instance
Application runtime: .NET Framework 4.7.2
Connection pattern in use: XXXXXXXXXX.redis.cache.windows.net,abortConnect=false,ssl=true,password=XXXXXXXXXX,syncTimeout=30000,allowAdmin=true
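For readability, the same options can also be set programmatically; this is just a sketch of the equivalent ConfigurationOptions for the string above (host and password redacted as in the original):

using StackExchange.Redis;

var options = ConfigurationOptions.Parse("XXXXXXXXXX.redis.cache.windows.net");
options.AbortOnConnectFail = false;  // abortConnect=false
options.Ssl = true;                  // ssl=true
options.Password = "XXXXXXXXXX";     // password=XXXXXXXXXX
options.SyncTimeout = 30000;         // syncTimeout=30000 (ms)
options.AllowAdmin = true;           // allowAdmin=true
var muxer = ConnectionMultiplexer.Connect(options);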
There are additional fixes for this in the 2.2.x series, but we're also working with the Azure team to detect when they do maintenance on the server side, so the library can handle it much more quickly when a failover is about to happen behind the scenes.
cc @deepakverma on this one - something we're actively looking at.
@NickCraver Is this the issue to follow to see updates on this situation, or is there another one that tracks related development?
Can we manage and/or get a workaround for these connection errors (with priority for the socket issue) without restarting the service (for example, recreating the multiplexer)?
@ms92ita here's an approach to follow: https://gist.github.com/JonCole/925630df72be1351b21440625ff2671f#reconnecting-with-lazyt-pattern
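Roughly, the idea in that gist looks like this (a condensed sketch, not a verbatim copy of the gist; GetConnectionString is the app's own helper from the snippet above, stubbed here): keep the multiplexer behind a Lazy<T>, and when persistent connection errors are caught, swap in a new Lazy<T> and dispose the old instance, rate-limited so concurrent callers don't stampede.

using System;
using StackExchange.Redis;

public static class RedisConnection
{
    private static Lazy<ConnectionMultiplexer> _lazy = CreateLazy();
    private static DateTimeOffset _lastReconnect = DateTimeOffset.MinValue;
    private static readonly object _reconnectLock = new object();
    private static readonly TimeSpan ReconnectMinInterval = TimeSpan.FromSeconds(60);

    public static ConnectionMultiplexer Connection => _lazy.Value;

    private static Lazy<ConnectionMultiplexer> CreateLazy() =>
        new Lazy<ConnectionMultiplexer>(() =>
            ConnectionMultiplexer.Connect(GetConnectionString("RedisCacheConnection")));

    // Call this from catch blocks for RedisConnectionException / SocketException;
    // it swaps in a fresh multiplexer at most once per ReconnectMinInterval.
    public static void ForceReconnect()
    {
        if (DateTimeOffset.UtcNow - _lastReconnect < ReconnectMinInterval) return;
        lock (_reconnectLock)
        {
            if (DateTimeOffset.UtcNow - _lastReconnect < ReconnectMinInterval) return;
            var old = _lazy;
            _lazy = CreateLazy();                 // subsequent callers get a new connection
            _lastReconnect = DateTimeOffset.UtcNow;
            try { if (old.IsValueCreated) old.Value.Dispose(); } catch { /* best effort */ }
        }
    }

    // Placeholder for the app's own configuration lookup used in the snippet above.
    private static string GetConnectionString(string name) =>
        Environment.GetEnvironmentVariable(name);
}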
We've been working hard on connections, especially in cloud/high latency environments. For anyone hitting connect and reconnect issues especially, please give 2.2.50 or higher a try (available in NuGet) - it has quite a few improvements in this area and better logging for anything hit. If you still see an issue after upgrading please post the connection logs, as they'll be more helpful than ever.
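As a hint for gathering those logs: ConnectionMultiplexer.Connect accepts a TextWriter that receives the connection log, which can then be attached to an issue report. A minimal sketch (connection string shortened/redacted):

using System;
using System.IO;
using StackExchange.Redis;

var log = new StringWriter();
var muxer = ConnectionMultiplexer.Connect(
    "XXXXXXXXXX.redis.cache.windows.net,abortConnect=false,ssl=true,password=XXXXXXXXXX",
    log);
Console.WriteLine(log.ToString());  // connect/handshake log to include when reporting an issue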
Still seeing these errors on 2.2.62. They occur randomly and, when they do, always several in a row. After that it is quiet again, sometimes for days.
SocketClosed (ReadEndOfStream, last-recv: 0) on xyz.redis.cache.windows.net:6379/Interactive, Idle/MarkProcessed, last: PING, origin: ReadFromPipe, outstanding: 1, last-read: 0s ago, last-write: 0s ago, unanswered-write: 0s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.62.27853
@NickCraver We have a memory dump from when this issue was reproducing on 2.2.4. We just upgraded to 2.2.62 and are currently monitoring to see if it happens again. Unfortunately, we cannot share the memory dump, as it's from our production systems, but we would be more than happy to provide any kind of info from it (the state of any StackExchange.Redis object). Would that help? Later edit: the issue happened with 2.2.62 too, and we have a memory dump from that one as well.
We get the following message 8 times, but that happens more than 15 minutes after everything Redis-related blocks:
SocketClosed on xxx.redis.cache.windows.net:6380/Interactive, Flushing/MarkProcessed, last: HMSET, origin: ReadFromPipe, outstanding: 1406, last-read: 0s ago, last-write: 922s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 8 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.62.27853
This can be seen only on one out of 6-8 containers. It happened several days in a row and not always with the same pod.
I'm also attaching an analysis done on version 2.2.62, where you can see some of the internal state of the library. Happy to provide more if needed: core_20210724_063807-redis-analysis.txt
@cosminvlad Is anything happening on the server side here? It looks like the socket is actually getting torn down - any reason that may be happening in the environment?
@NickCraver We suspect a networking issue in Azure, as we see disconnects of both Redis and SignalR connections in multiple environments that are distributed in a few geographies. Maybe the fact that the app is deployed in Azure Kubernetes Services plays a role in it. We are working with Microsoft Support to figure out that part.
What clues in the dump suggest that the socket is getting torn down? I dumped the SafeSocketHandle of the Socket from PhysicalConnection and its _state is 4, which means it is not closed or disposed. safesockethandle.txt.
I'm going to leave some findings from the memory dump analysis that @cosminvlad was talking about above. First, let me describe what we're observing:
- Initially, it starts with a few hundred RedisTimeoutExceptions (based on Application Insights telemetry)
- Then the exceptions stop but HTTP requests are not advancing (based on the current-requests event counter from the Microsoft.AspNetCore.Hosting event source)
- 7-8 min after the initial timeout exceptions we managed to take a memory dump (one for 2.2.4 and another for 2.2.62)
- 15 min after the initial timeouts we usually observe the first RedisConnectionException (SocketClosed connection failures) and a burst of timeout exceptions (thousands)
The first thing we did was run the !dumpasync SOS command, which outputs async state machine stats; it revealed that most requests (7K) were "stuck" awaiting a Redis command to complete. The oldest requests were a few minutes old, so one of our concerns was why Redis commands weren't timing out after 5000 ms.
In the 2.2.4 memory dump, the PhysicalBridge backlog had 26K messages in it (for 2.2.62, which switched to a ConcurrentQueue, I wasn't able to count the number of items, but I suspect a similar number). PhysicalConnection also had about 1K items in its _writtenAwaitingResponse queue. The difference between the oldest and newest message in both queues is about 7-8 minutes (based on the Message.CreatedDateTime property).
I managed to find the head of the backlog and it was a FireAndForget message. If I'm reading the code correctly, the periodic "heartbeat" that checks the backlog for timeouts will not advance if the head message doesn't have an async timeout, which FireAndForget messages don't (see the sketch below).
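Purely as an illustration of that reading (this is not the library's actual code, and BacklogMessage and its members are invented for the sketch), a head-of-line timeout scan like the following stalls as soon as the head is a fire-and-forget message:

using System;
using System.Collections.Generic;

class BacklogMessage
{
    public bool HasAsyncTimeout;              // false for fire-and-forget in this sketch
    public DateTime CreatedDateTime;
    public TimeSpan Timeout = TimeSpan.FromSeconds(5);
    public void FailWithTimeout() => Console.WriteLine("completed with a timeout");
}

static class BacklogSketch
{
    // Scans strictly from the head: a head that never times out blocks everything queued behind it.
    public static void CheckForTimeouts(Queue<BacklogMessage> backlog, DateTime now)
    {
        while (backlog.Count > 0)
        {
            var head = backlog.Peek();
            if (!head.HasAsyncTimeout) break;                      // fire-and-forget head: the scan stops here
            if (now - head.CreatedDateTime < head.Timeout) break;  // head still within its timeout window
            backlog.Dequeue();
            head.FailWithTimeout();                                // fault the timed-out message and continue
        }
    }
}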
The PhysicalConnection is stuck in "Flushing" status:
0:000> !DumpObj /d 00007f0190abda48
Name: StackExchange.Redis.PhysicalConnection
MethodTable: 00007f0bafc80fa8
EEClass: 00007f0bafc6d008
Size: 136(0x88) bytes
File: /app/StackExchange.Redis.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007f0ba66e0b48 400029d 8 System.Byte[] 0 instance 0000000000000000 ChannelPrefix
00007f0bafc5e810 40002a5 58 System.Int32 1 instance 1 connectionType
00007f0bafc89030 40002a6 10 ...kExchange.Redis]] 0 instance 00007f0190abdad0 _writtenAwaitingResponse
00007f0ba6647a60 40002a7 18 System.String 0 instance 00007f0190abea68 _physicalName
00007f0ba658b158 40002a8 5c System.Int32 1 instance 0 currentDatabase
00007f0bafc80e50 40002a9 78 System.Byte 1 instance 0 currentReadMode
00007f0ba658b158 40002aa 60 System.Int32 1 instance 0 failureReported
00007f0ba658b158 40002ab 64 System.Int32 1 instance 61676312 lastWriteTickCount
00007f0ba658b158 40002ac 68 System.Int32 1 instance 61659455 lastReadTickCount
00007f0ba658b158 40002ad 6c System.Int32 1 instance 62115244 lastBeatTickCount
00007f0bafc5fc18 40002ae 20 ...lines.IDuplexPipe 0 instance 00007f0190ac13f0 _ioPipe
00007f0bab0c2990 40002af 28 ...et.Sockets.Socket 0 instance 00007f0190abec30 _socket
00007f0ba8ed5b90 40002b0 30 System.WeakReference 0 instance 00007f0190abea50 _bridge
00007f0ba658c4d8 40002b1 50 System.Int64 1 instance 0 <SubscriptionCount>k__BackingField
00007f0ba6587138 40002b2 79 System.Boolean 1 instance 0 <TransactionActive>k__BackingField
00007f0bafc7d528 40002b3 70 System.Int32 1 instance 3 _writeStatus
00007f0ba90a1680 40002b7 38 ...lationTokenSource 0 instance 00007f0190ac6f60 _reusableFlushSyncTokenSource
00007f0bafc7dfe8 40002bb 40 ...nge.Redis.Message 0 instance 0000000000000000 _activeMessage
00007f0bafc8a7c8 40002bd 48 ...kExchange.Redis]] 0 instance 00007f0190abdb10 _arena
00007f0bafc7d3d8 40002be 74 System.Int32 1 instance 5 _readStatus
00007f0bafc7adc0 400029f a8 ...edis.CommandBytes 1 static 00007f079003bd48 message
00007f0bafc7adc0 40002a0 b0 ...edis.CommandBytes 1 static 00007f079003bd50 pmessage
00007f0bafc8ab98 40002a1 c8 ...e.Redis.Message[] 0 static 00007f0190abdc88 ReusableChangeDatabaseCommands
00007f0bafc7dfe8 40002a2 d0 ...nge.Redis.Message 0 static 00007f0190abe3a0 ReusableReadOnlyCommand
00007f0bafc7dfe8 40002a3 d8 ...nge.Redis.Message 0 static 00007f0190abe408 ReusableReadWriteCommand
00007f0ba658b158 40002a4 1d0 System.Int32 1 static 4 totalCount
00007f0ba7cfaa60 40002b8 b8 ...Private.CoreLib]] 1 static 00007f079003bd58 NullBulkString
00007f0ba7cfaa60 40002b9 c0 ...Private.CoreLib]] 1 static 00007f079003bd60 EmptyBulkString
00007f0bafc893c0 40002bc e0 ...enas.ArenaOptions 0 static 00007f0190abe4b0 s_arenaOptions
00007f0ba79bad38 40002ba 8 System.Text.Encoder 0 TLstatic s_PerThreadEncoder
Above, _writeStatus is Flushing, and the difference between lastBeatTickCount and lastWriteTickCount is 438932 milliseconds (> 7 min). If I'm reading the code correctly, lastWriteTickCount is updated just after the flush completes, which could mean the flush operation has been going on for more than 7 minutes. This explains why backlog processing, which also checks for timeouts, isn't advancing and why Redis commands are not timing out after 5 seconds.
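For reference, the 438932 ms figure is just the difference between the two tick counts in the dump (tick counts are wrapping Int32 values, so the delta is taken with unchecked subtraction):

using System;

int lastWriteTickCount = 61676312;  // from the dump above
int lastBeatTickCount  = 62115244;  // from the dump above
int elapsedMs = unchecked(lastBeatTickCount - lastWriteTickCount);
Console.WriteLine(elapsedMs);       // 438932 ms, i.e. a bit over 7 minutes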
Here's the tip of the "async stack" for the flush found with !dumpasync
00007f0196c0d3f0 00007f0bbb84bd08 168 0 StackExchange.Redis.PhysicalConnection+<FlushAsync_Awaited>d__79
Async "stack":
.00007f0196c0d4e0 (0) StackExchange.Redis.PhysicalBridge+<CompleteWriteAndReleaseLockAsync>d__100
..00007f0196c0d5a0 (0) StackExchange.Redis.ConnectionMultiplexer+<ExecuteAsyncImpl_Awaited>d__211`1[[System.Boolean, System.Private.CoreLib]]
I also looked into why the pipe's flush isn't completing; for that, I dumped the _ioPipe._writePipe object:
0:000> !DumpObj /d 00007f0190ac13f0
Name: Pipelines.Sockets.Unofficial.StreamConnection+AsyncStreamPipe
MethodTable: 00007f0bafcb1678
EEClass: 00007f0bafc9f118
Size: 64(0x40) bytes
File: /app/Pipelines.Sockets.Unofficial.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007f0bafc73068 400012d 8 ...IO.Pipelines.Pipe 0 instance 00007f0190ac1430 _readPipe
00007f0bafc73068 400012e 10 ...IO.Pipelines.Pipe 0 instance 00007f0190ac16a0 _writePipe
00007f0ba6ae8af0 400012f 18 System.IO.Stream 0 instance 00007f0190ac0010 _inner
00007f0ba6647a60 4000130 20 System.String 0 instance 00007f0190abd8a0 <Name>k__BackingField
00007f0ba658c4d8 4000131 28 System.Int64 1 instance 9419807132 _totalBytesSent
00007f0ba658c4d8 4000132 30 System.Int64 1 instance 1945899205 _totalBytesReceived
0:000> !DumpObj /d 00007f0190ac16a0
Name: System.IO.Pipelines.Pipe
MethodTable: 00007f0bafc73068
EEClass: 00007f0bafc66510
Size: 384(0x180) bytes
File: /usr/share/dotnet/shared/Microsoft.AspNetCore.App/5.0.8/System.IO.Pipelines.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007f0ba6580c00 400001b 8 System.Object 0 instance 00007f0190ac1820 _sync
00007f0bae275a30 400001c 10 ...Private.CoreLib]] 0 instance 0000000000000000 _pool
00007f0ba658b158 400001d 78 System.Int32 1 instance 8192 _minimumSegmentSize
00007f0ba658c4d8 400001e 50 System.Int64 1 instance 524288 _pauseWriterThreshold
00007f0ba658c4d8 400001f 58 System.Int64 1 instance 262144 _resumeWriterThreshold
00007f0bae27f4b8 4000020 18 ...nes.PipeScheduler 0 instance 00007f0190ab65b0 _readerScheduler
00007f0bae27f4b8 4000021 20 ...nes.PipeScheduler 0 instance 00007f0190ab65b0 _writerScheduler
00007f0bafc725d8 4000022 90 ...ufferSegmentStack 1 instance 00007f0190ac1730 _bufferSegmentPool
00007f0bafc73920 4000023 28 ...DefaultPipeReader 0 instance 00007f0190ac18d0 _reader
00007f0bafc73c30 4000024 30 ...DefaultPipeWriter 0 instance 00007f0190ac18f0 _writer
00007f0ba6587138 4000025 8c System.Boolean 1 instance 0 _useSynchronizationContext
00007f0ba658c4d8 4000026 60 System.Int64 1 instance 524377 _unconsumedBytes
00007f0ba658c4d8 4000027 68 System.Int64 1 instance 0 _unflushedBytes
00007f0bafc72980 4000028 a0 ...nes.PipeAwaitable 1 instance 00007f0190ac1740 _readerAwaitable
00007f0bafc72980 4000029 d8 ...nes.PipeAwaitable 1 instance 00007f0190ac1778 _writerAwaitable
00007f0bafc72c98 400002a 110 ...es.PipeCompletion 1 instance 00007f0190ac17b0 _writerCompletion
00007f0bafc72c98 400002b 138 ...es.PipeCompletion 1 instance 00007f0190ac17d8 _readerCompletion
00007f0ba658c4d8 400002c 70 System.Int64 1 instance 0 _lastExaminedIndex
00007f0bafc74ad0 400002d 38 ...nes.BufferSegment 0 instance 00007f0190ac1910 _readHead
00007f0ba658b158 400002e 7c System.Int32 1 instance 0 _readHeadIndex
00007f0ba658b158 400002f 80 System.Int32 1 instance 0 _maxPooledBufferSize
00007f0ba6587138 4000030 8d System.Boolean 1 instance 0 _disposed
00007f0bafc74ad0 4000031 40 ...nes.BufferSegment 0 instance 00007f0196c0b338 _readTail
00007f0ba658b158 4000032 84 System.Int32 1 instance 1366 _readTailIndex
00007f0bafc74ad0 4000033 48 ...nes.BufferSegment 0 instance 00007f0196c0b338 _writingHead
00007f0ba7cfb158 4000034 160 ...Private.CoreLib]] 1 instance 00007f0190ac1800 _writingHeadMemory
00007f0ba658b158 4000035 88 System.Int32 1 instance 0 _writingHeadBytesBuffered
00007f0bafc72f38 4000036 170 ...ipeOperationState 1 instance 00007f0190ac1810 _operationState
00007f0ba91129d8 4000014 10 ...Private.CoreLib]] 0 static 00007f02105db2c0 s_signalReaderAwaitable
00007f0ba91129d8 4000015 18 ...Private.CoreLib]] 0 static 00007f02105db318 s_signalWriterAwaitable
00007f0ba91129d8 4000016 20 ...Private.CoreLib]] 0 static 00007f02105db358 s_invokeCompletionCallbacks
00007f0ba9112b30 4000017 28 ...g.ContextCallback 0 static 00007f02105db398 s_executionContextRawCallback
00007f0ba924d680 4000018 30 ...endOrPostCallback 0 static 00007f02105db3d8 s_syncContextExecutionContextCallback
00007f0ba924d680 4000019 38 ...endOrPostCallback 0 static 00007f02105db418 s_syncContextExecuteWithoutExecutionContextCallback
00007f0ba91129d8 400001a 40 ...Private.CoreLib]] 0 static 00007f02105db458 s_scheduleWithExecutionContextCallback
Above, _unconsumedBytes is greater than _pauseWriterThreshold, so I'm assuming the flush isn't completing because of backpressure: the reader on the other end of the pipe isn't advancing fast enough, or at all (a minimal illustration of this mechanism is sketched further below). I went looking for CopyFromWritePipeToStream from the Pipelines.Sockets.Unofficial async state machines and found this (the Pipelines.Sockets.Unofficial.StreamConnection+AsyncStreamPipe reference on the state machine matches the one in the PhysicalConnection displayed above):
00007f0194eabcc0 00007f0bb34e0b90 144 0 System.Net.Security.SslStream+<<WriteSingleChunk>g__CompleteWriteAsync|177_1>d`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]]
Async "stack":
.00007f0194eabd50 (0) System.Net.Security.SslStream+<WriteAsyncInternal>d__184`1[[System.Net.Security.AsyncReadWriteAdapter, System.Net.Security]]
..00007f03108acc00 (2) Pipelines.Sockets.Unofficial.StreamConnection+AsyncStreamPipe+<CopyFromWritePipeToStream>d__20
...00007f03108acd20 (Pipelines.Sockets.Unofficial.Helpers+<>c.<PipelinesFireAndForget>b__14_0(System.Threading.Tasks.Task)) System.Threading.Tasks.ContinuationTaskFromTask
It looks like reading from the write pipe (CopyFromWritePipeToStream) isn't advancing because it's stuck on an SslStream.WriteAsyncInternal call. This is as far as I've got with the dump analysis; I hope I got at least some of it right.
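To sanity-check the backpressure reading above, here is a minimal, self-contained System.IO.Pipelines example (not code from StackExchange.Redis; the thresholds are copied from the Pipe dump) showing that a writer's flush stays pending once unconsumed bytes pass pauseWriterThreshold and only completes after the reader advances:

using System;
using System.IO.Pipelines;
using System.Threading.Tasks;

class BackpressureDemo
{
    static async Task Main()
    {
        var pipe = new Pipe(new PipeOptions(
            pauseWriterThreshold: 524288,     // values from the Pipe dump above
            resumeWriterThreshold: 262144));

        var chunk = new byte[64 * 1024];
        ValueTask<FlushResult> pendingFlush = default;

        // With no reader consuming, each WriteAsync buffers more unconsumed bytes;
        // once they exceed pauseWriterThreshold the returned flush stops completing.
        for (var i = 0; i < 16; i++)
        {
            pendingFlush = pipe.Writer.WriteAsync(chunk);
            if (!pendingFlush.IsCompleted)
            {
                Console.WriteLine($"Backpressure after ~{(i + 1) * chunk.Length} bytes: flush is pending.");
                break;
            }
        }

        // The pending flush only completes after the reader consumes enough data.
        var read = await pipe.Reader.ReadAsync();
        pipe.Reader.AdvanceTo(read.Buffer.End);   // consume everything that was buffered
        await pendingFlush;
        Console.WriteLine("Flush completed once the reader advanced.");
    }
}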
We've suspected this to be a network issue; however, Azure support engineers haven't found any evidence of network instability in their metrics or in the TCP dumps taken.
@vas6ili @cosminvlad is it possible for you to share the Azure ticket number?
@deepakverma Sure, ticket number is 2107210050001789
For the past few days, the exception has been happening again, several times a day:
SocketClosed (ReadEndOfStream, last-recv: 0) on xyz.redis.cache.windows.net:6379/Interactive, Idle/MarkProcessed, last: INFO, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 44s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, in-pipe: 0, out-pipe: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.88.56325
cc @philon-msft since "several times a day" is odd for failover situations
@DarthSonic there are a few factors that can cause errors like this for Redis client apps running in Kubernetes. Typically, it's noisy neighbor pods, service mesh sidecars like Istio intercepting traffic, or updates being applied to cluster nodes. The unique requirements in Redis clients for persistent connections and unusual ports can make them more sensitive to the complex environment of a cluster.
Reviewing recent changes in app deployments or environment configuration might provide an explanation. Failing that, a packet capture during a connection loss may help.
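One low-effort way to line up client-side drops with pod restarts, node updates, or sidecar activity is to log the multiplexer's connection events with timestamps; a small sketch (connection string shortened/redacted):

using System;
using StackExchange.Redis;

var muxer = ConnectionMultiplexer.Connect("xyz.redis.cache.windows.net,abortConnect=false,ssl=true,password=XXX");

muxer.ConnectionFailed += (s, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} ConnectionFailed {e.ConnectionType} {e.EndPoint} {e.FailureType}: {e.Exception?.Message}");

muxer.ConnectionRestored += (s, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} ConnectionRestored {e.ConnectionType} {e.EndPoint}");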
For the past few days, we have noticed this SocketClosed exception: Idle/MarkProcessed, last: PING, origin: ReadFromPipe, outstanding: 0, last-read: 0s ago, last-write: 22s ago, keep-alive: 60s, state: ConnectedEstablished, mgr: 9 of 10 available, in: 0, last-heartbeat: 0s ago, last-mbeat: 0s ago, global: 0s ago, v: 2.2.79.4591
@AngelaCalboreanVisma if you're using an Azure Cache for Redis, your cache was probably undergoing routine planned maintenance. For more info see: https://docs.microsoft.com/en-us/azure/azure-cache-for-redis/cache-failover
See the "Build in resiliency" section in that document for guidance on how to make your application resilient to the connection failovers that occur during maintenance in cloud-hosted Redis.
Best advice is above - if anyone else is hitting this please upgrade to a newer version where we handle disconnects more and more gracefully/automatically. Closing out to tidy up.