Deadlock calling SCTPTransport Stop
Your environment.
- Version: v3.1.7
What did you do?
SCTPTransport's Stop method was called in ion-sfu relay peer's Close method.
What did you expect?
The underlying SCTPTransport to be stopped in a timely manner.
What happened?
The call to Stop seems to have caused a deadlock; the method never returns. Here is the relevant block trace from pprof:
108214 2 @ 0x7cb6f0 0x7cb6df 0x481102 0x7cb645 0x7cb5f5 0x7f7dc2 0x7f30c5 0x80b128 0x80aaf8 0x8ecc78 0x90cc88 0x18edbca 0x476a61
# 0x7cb6ef sync.(*Mutex).Lock+0x6f /usr/local/go/src/sync/mutex.go:90
# 0x7cb6de github.com/pion/transport/connctx.(*connCtx).Close.func1+0x5e /go/pkg/mod/github.com/pion/[email protected]/connctx/connctx.go:154
# 0x481101 sync.(*Once).doSlow+0xc1 /usr/local/go/src/sync/once.go:74
# 0x7cb644 sync.(*Once).Do+0x64 /usr/local/go/src/sync/once.go:65
# 0x7cb5f4 github.com/pion/transport/connctx.(*connCtx).Close+0x14 /go/pkg/mod/github.com/pion/[email protected]/connctx/connctx.go:152
# 0x7f7dc1 github.com/pion/dtls/v2.(*Conn).close+0x1a1 /go/pkg/mod/github.com/pion/dtls/[email protected]/conn.go:921
# 0x7f30c4 github.com/pion/dtls/v2.(*Conn).Close+0x24 /go/pkg/mod/github.com/pion/dtls/[email protected]/conn.go:341
# 0x80b127 github.com/pion/sctp.(*Association).close+0xc7 /go/pkg/mod/github.com/pion/[email protected]/association.go:463
# 0x80aaf7 github.com/pion/sctp.(*Association).Close+0xd7 /go/pkg/mod/github.com/pion/[email protected]/association.go:443
# 0x8ecc77 github.com/pion/webrtc/v3.(*SCTPTransport).Stop+0x77 /go/pkg/mod/github.com/pion/webrtc/[email protected]/sctptransport.go:130
# 0x90cc87 github.com/pion/ion-sfu/pkg/relay.(*Peer).Close+0x287 /go/pkg/mod/github.com/playback-sports/[email protected]/pkg/relay/relay.go:336
# 0x18edbc9 github.com/playback-sports/relayer/pkg.(*Relayer).Start+0x9e9 /build/pkg/relayer.go:256
This happens very rarely but figured I'd file a report anyway.
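Until the root cause is known, a caller can at least avoid hanging forever on a blocked Stop. The sketch below is only a caller-side guard, not a fix for the deadlock itself. `stopWithTimeout`, `errStopTimeout`, and `shortTimeout` are hypothetical names used for illustration, and the `stop` argument stands in for a method value such as the transport's `Stop`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStopTimeout is a hypothetical sentinel error for this sketch.
var errStopTimeout = errors.New("stop did not return before the timeout")

// shortTimeout is an arbitrary illustrative value, not a recommendation.
const shortTimeout = 50 * time.Millisecond

// stopWithTimeout runs stop in its own goroutine and gives up waiting after
// timeout. If stop never returns, that goroutine is leaked: the guard only
// keeps the caller from blocking, it does not release the stuck resources.
func stopWithTimeout(stop func() error, timeout time.Duration) error {
	done := make(chan error, 1) // buffered so a late result does not block the goroutine
	go func() { done <- stop() }()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errStopTimeout
	}
}

func main() {
	// Stand-in for a Stop call that never returns.
	stuck := func() error { select {} }
	fmt.Println("result:", stopWithTimeout(stuck, shortTimeout))
}
```

In a real caller the first argument would be the actual stop function, and a timeout result is a good place to log a goroutine dump for the report.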
I am experiencing a similar problem that might be related to this. Although the issue existed before (very rarely), it has been happening more often after recent changes. I tracked it down to tcp-mux and this commit. There was also an earlier commit that attempted to mitigate a known problem with a blocking channel, though it did not fix the behavior.
@tab1293 I was able to reproduce my issue in tcp-mux by turning on airplane mode, so the connection is left open and the Write function blocks until it's available. I was able to fix this by adding WriteBufferSize to `NewTCPMuxDefault`. Could you try it with your issue (if those steps apply, and if you even use TCP mux in the first place) to see whether it's the same problem?
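To illustrate why a write buffer helps here, the following is a minimal stdlib-only sketch of the general idea: writes are queued and flushed by a background goroutine, so a stalled connection blocks only that goroutine and not the caller. This is not the tcp-mux implementation. `bufferedWriter`, `stalledWriter`, and `errBufferFull` are hypothetical names, and rejecting writes when the queue is full is an assumption made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// errBufferFull is a hypothetical error returned when the queue has no room.
var errBufferFull = errors.New("write buffer full")

// bufferedWriter queues writes and flushes them from a background goroutine.
type bufferedWriter struct {
	queue chan []byte
}

// newBufferedWriter starts the flusher. slots is the number of queued writes
// the buffer can hold in addition to the one currently being flushed.
func newBufferedWriter(dst io.Writer, slots int) *bufferedWriter {
	w := &bufferedWriter{queue: make(chan []byte, slots)}
	go func() {
		for p := range w.queue {
			if _, err := dst.Write(p); err != nil {
				return // stop flushing after the first write error
			}
		}
	}()
	return w
}

// Write never blocks: it either queues a copy of p or reports a full buffer.
func (w *bufferedWriter) Write(p []byte) (int, error) {
	buf := append([]byte(nil), p...) // copy, since the caller may reuse p
	select {
	case w.queue <- buf:
		return len(p), nil
	default:
		return 0, errBufferFull
	}
}

// stalledWriter simulates a connection that is open but never drains,
// like the airplane-mode case: Write never returns.
type stalledWriter struct{}

func (stalledWriter) Write(p []byte) (int, error) {
	select {} // block forever
}

func main() {
	w := newBufferedWriter(stalledWriter{}, 2)
	for i := 0; i < 5; i++ {
		_, err := w.Write([]byte("ping"))
		fmt.Printf("write %d: err=%v\n", i, err)
	}
}
```

Every call returns immediately even though the underlying writer is stuck, so code that takes a lock around Write can no longer hold it indefinitely. The trade-off is that data is rejected once the buffer fills.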
Nothing actionable on this one unfortunately. Sorry this didn't get addressed when you filed it originally @tab1293
I assume that LiveKit picked this up in all their TCP Mux work and that it hasn't been an issue for a while. If I see any patterns/reports I will come back to this though!