Deadlock in WsSession/WebSocket

Open · stackrainbow opened this issue 1 year ago · 1 comment

Hello,

I'm experiencing a deadlock in WsSession/WebSocket.

It seems to require a rare, unlucky interleaving of events, but we do see it in production:

My thread (Thread 2 in the stacks below) sends data to the connected session through SendTextAsync. The send fails, and ProcessSend() calls Disconnect(). As far as I can tell, this path holds WebSocket.WsSendLock. It eventually reaches ClearWsBuffers(), but cannot continue because WsReceiveLock is held by another thread.

At the same time, on another thread (Thread 1 below), NetCoreServer has received a close frame from the client. Processing that frame holds WebSocket.WsReceiveLock. While still holding it, the thread calls SendCloseAsync (I'm not familiar with the protocol details, but perhaps this is an acknowledgement). However, it then blocks waiting for WsSendLock.
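On the protocol question: RFC 6455 (section 5.5.1) requires an endpoint that receives a Close frame, and has not already sent one, to send a Close frame in response, typically echoing the status code it received. So the SendCloseAsync call is the close acknowledgement. A minimal Python sketch of that reply frame, assuming a server-to-client frame (which RFC 6455 says must be unmasked) and a received payload that has already been unmasked; the function names are hypothetical and this is not NetCoreServer's implementation:

```python
FIN = 0x80           # final-fragment bit in the first header byte
OPCODE_CLOSE = 0x8   # Close control frame


def build_close_frame(payload: bytes) -> bytes:
    # Control frames may carry at most 125 bytes of payload (RFC 6455 section 5.5).
    if len(payload) > 125:
        raise ValueError("control frame payload too long")
    # Byte 0: FIN + opcode. Byte 1: MASK bit clear (server frames are unmasked)
    # plus the 7-bit payload length.
    return bytes([FIN | OPCODE_CLOSE, len(payload)]) + payload


def close_reply(received_payload: bytes) -> bytes:
    # Echo the peer's 2-byte big-endian status code if it sent one;
    # otherwise reply with an empty Close body.
    status = received_payload[:2] if len(received_payload) >= 2 else b""
    return build_close_frame(status)
```

For example, a client Close carrying status 1000 (bytes `03 E8`) and a reason string produces the four-byte reply `88 02 03 E8`.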

Both threads are now stuck, each waiting for the lock the other holds, and the session is left in a broken state.
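What the two paragraphs above describe is a classic lock-ordering deadlock: the send path takes the send lock and then needs the receive lock, while the receive path takes the receive lock and then needs the send lock. A minimal Python model of that interleaving, assuming the lock order described above (this is not NetCoreServer code; the two locks are stand-ins for WsSendLock and WsReceiveLock, the barrier forces the rare timing on every run, and acquire timeouts replace a blocking C# `lock` so the demo terminates):

```python
import threading


def run_opposite_order_scenario(timeout: float = 0.5) -> dict:
    send_lock = threading.Lock()     # stands in for WebSocket.WsSendLock
    receive_lock = threading.Lock()  # stands in for WebSocket.WsReceiveLock
    both_hold_outer = threading.Barrier(2)  # both threads now hold their first lock
    both_gave_up = threading.Barrier(2)     # keep first locks held until both attempts end
    results = {}

    def send_path():
        # Models Thread 2: SendTextAsync -> ProcessSend -> Disconnect -> ClearWsBuffers
        with send_lock:
            both_hold_outer.wait()
            # A blocking lock would wait forever here; the timeout lets us observe it.
            got = receive_lock.acquire(timeout=timeout)
            results["send_path"] = got
            if got:
                receive_lock.release()
            both_gave_up.wait()

    def receive_path():
        # Models Thread 1: ProcessReceive -> OnWsClose -> Close -> SendCloseAsync
        with receive_lock:
            both_hold_outer.wait()
            got = send_lock.acquire(timeout=timeout)
            results["receive_path"] = got
            if got:
                send_lock.release()
            both_gave_up.wait()

    threads = [threading.Thread(target=send_path), threading.Thread(target=receive_path)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Both entries in the returned dict come back `False`: neither thread can get its second lock, because the other keeps holding it for as long as it is waiting.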

Here are the relevant stacks from the dotnet-trace capture.

Thread 1 (receive path, blocked in SendCloseAsync waiting for WsSendLock):

OS Thread Id: 0x5a4ad
        Child SP               IP Call Site
00007FB920FF8650 00007fbce6e002ea [HelperMethodFrame_1OBJ: 00007fb920ff8650] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007FB920FF87A0 00007FBC6F5838CA NetCoreServer.WsSession.SendCloseAsync(Int32, Byte[], Int64, Int64)
00007FB920FF8800 00007FBC6F58384A NetCoreServer.WsSession.Close(Int32)
00007FB920FF8820 00007FBC6F583813 NetCoreServer.WsSession.OnWsClose(Byte[], Int64, Int64)
00007FB920FF8830 00007FBC6E03F880 NetCoreServer.WebSocket.PrepareReceiveFrame(Byte[], Int64, Int64)
00007FB920FF88F0 00007FBC6F40CE0F NetCoreServer.TcpSession.ProcessReceive(System.Net.Sockets.SocketAsyncEventArgs)
00007FB920FF8920 00007FBC6F40FE40 NetCoreServer.TcpSession.OnAsyncCompleted(System.Object, System.Net.Sockets.SocketAsyncEventArgs)
00007FB920FF8940 00007FBC6F40CBAB System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
00007FB920FF89A0 00007FBC6F404BAC System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
00007FB920FF89E0 00007FBC6F3FFAFB System.Threading.ThreadPoolWorkQueue.Dispatch()
00007FB920FF8A50 00007FBC6F584DAA System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()
00007FB920FF8C50 00007fbce68fd353 [DebuggerU2MCatchHandlerFrame: 00007fb920ff8c50] 

Thread 2 (send path, blocked in ClearWsBuffers() waiting for WsReceiveLock):

OS Thread Id: 0x5344c
        Child SP               IP Call Site
00007FBC27FFE780 00007fbce6e002ea [HelperMethodFrame_1OBJ: 00007fbc27ffe780] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00007FBC27FFE8D0 00007FBC6F423643 NetCoreServer.WebSocket.ClearWsBuffers()
00007FBC27FFE900 00007FBC6F5835B3 NetCoreServer.WsSession.OnDisconnected()
00007FBC27FFE920 00007FBC6F58214A NetCoreServer.TcpSession.Disconnect()
00007FBC27FFE980 00007FBC6F40E0F7 NetCoreServer.TcpSession.ProcessSend(System.Net.Sockets.SocketAsyncEventArgs)
00007FBC27FFE9A0 00007FBC6E027367 NetCoreServer.TcpSession.TrySend()
00007FBC27FFE9E0 00007FBC6F40D68C NetCoreServer.TcpSession.SendAsync(Byte[], Int64, Int64)
00007FBC27FFEA30 00007FBC6F4161A0 NetCoreServer.WsSession.SendTextAsync(System.String)
00007FBC27FFEA70 00007FBC6E014331 RealtimeServer.IPSHelpers.OutboundQueueWorker()
00007FBC27FFEAC0 00007FBC6E013E98 RealtimeServer.IPSHelpers+<>c.<CreateQueueWorker>b__2_0()
00007FBC27FFEAE0 00007FBC6CC49F88 System.Threading.Thread.StartCallback()
00007FBC27FFEC50 00007fbce68fd353 [DebuggerU2MCatchHandlerFrame: 00007fbc27ffec50] 

For now, I have worked around this by retrieving WsReceiveLock via reflection and locking it before any Send* call, because the send path may otherwise need WsReceiveLock further down the call stack (in ClearWsBuffers()) while another thread already holds it. This has resolved the issue for my use case, but it is probably not the right fix.
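The workaround amounts to imposing one lock order (receive lock, then send lock) on every path. A minimal Python model of that idea, assuming the lock order described in this issue; `threading.RLock` stands in for .NET's `Monitor`, which lets the owning thread re-enter a lock it already holds, and all function names are hypothetical stand-ins rather than NetCoreServer APIs:

```python
import threading

# RLock mirrors .NET Monitor: the owning thread may re-enter a lock it holds.
send_lock = threading.RLock()     # stands in for WebSocket.WsSendLock
receive_lock = threading.RLock()  # stands in for WebSocket.WsReceiveLock
finished = []                     # records which paths ran to completion


def library_send_failure_path():
    # Hypothetical model of a failed send: Disconnect() -> ClearWsBuffers()
    with send_lock:
        with receive_lock:
            finished.append("send path cleared buffers")


def send_text_with_workaround():
    # The reflection workaround: hold WsReceiveLock before any Send* call,
    # so this path also takes receive_lock before send_lock.
    with receive_lock:
        library_send_failure_path()  # re-enters receive_lock without waiting


def library_close_path():
    # Hypothetical model of ProcessReceive -> OnWsClose -> SendCloseAsync
    with receive_lock:
        with send_lock:
            finished.append("receive path sent close reply")


def run_both(iterations: int = 100) -> bool:
    # Race the two paths repeatedly; a thread still alive after the join
    # timeout would indicate a deadlock.
    for _ in range(iterations):
        threads = [
            threading.Thread(target=send_text_with_workaround, daemon=True),
            threading.Thread(target=library_close_path, daemon=True),
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join(timeout=5.0)
        if any(t.is_alive() for t in threads):
            return False
    return True
```

The trade-off in this model is that every send now also serializes with incoming frame processing on the same session.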

stackrainbow, Jul 25 '22 10:07

@stackrainbow please check the fix in 6.3.0

chronoxor, Jul 25 '22 14:07