
Endless buffering of outbound traffic when network is loaded

Open daonb opened this issue 3 years ago • 14 comments

Your environment.

  • Version: 1.7.11
  • Browser: iPadOS 14.3

What did you do?

I developed a terminal server over pion/webrtc, using data channels to pipe input & output. Things work well on a clean network and on a mobile one. I even used netem to emulate a high-jitter scenario and all works well. Only when I use my home network - a small pipe with two Zoom sessions going on in parallel - does the terminal stop receiving data.

I'm attaching a log where I have three data channels opened on the same peer. From the logs it looks like the buffer keeps growing and growing and growing

frozen2.log

daonb avatar Feb 01 '21 19:02 daonb

@daonb That's what happens when you don't check bufferedAmount. See: https://github.com/pion/webrtc/tree/master/examples/data-channels-flow-control#when-do-we-need-it
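
For reference, the sender-side pattern from that linked flow-control example looks roughly like this; the thresholds and the writeLoop helper are illustrative, not part of pion's API:

// Sketch of bufferedAmount-based flow control on the sending side.
package flowcontrol

import "github.com/pion/webrtc/v3"

const (
	maxBufferedAmount          = 1024 * 1024 // stop sending above this (arbitrary)
	bufferedAmountLowThreshold = 512 * 1024  // resume sending below this (arbitrary)
)

func writeLoop(dc *webrtc.DataChannel, packets <-chan []byte) {
	sendMore := make(chan struct{}, 1)

	// Ask pion to notify us when the send buffer drains below the threshold.
	dc.SetBufferedAmountLowThreshold(bufferedAmountLowThreshold)
	dc.OnBufferedAmountLow(func() {
		select {
		case sendMore <- struct{}{}:
		default:
		}
	})

	for p := range packets {
		if err := dc.Send(p); err != nil {
			return
		}
		// Back off instead of letting pion/sctp queue outbound data without bound.
		if dc.BufferedAmount() > maxBufferedAmount {
			<-sendMore
		}
	}
}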

enobufs avatar Feb 14 '21 20:02 enobufs

Related topic: #77

enobufs avatar Feb 14 '21 21:02 enobufs

@enobufs thanks for the reply. I don't think that's the issue I'm seeing for two reasons:

  • I have a very low data sending rate - < 1Kbps
  • The channel completely stops sending data and doesn't resume even after the congestion clears

I believe there's a bug in congestion recovery but I have no tests to prove it :-( To test it I need a client that can be in "buffering mode" where outgoing datagrams are buffered for XXX ms rather than sent. This way we can write tests that simulate congestion and better test the congestion control code.
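
A rough sketch of such a "buffering mode" helper, written as a net.Conn wrapper that holds every outgoing packet for a configurable delay before writing it (the delayedConn type and its names are made up for illustration):

// Hypothetical test helper (all names are made up for this sketch): a
// net.Conn wrapper that holds every outgoing packet for a fixed delay
// before writing it, to emulate a congested uplink in tests.
package sctptest

import (
	"net"
	"time"
)

type delayedConn struct {
	net.Conn
	delay time.Duration
	out   chan []byte
}

func NewDelayedConn(inner net.Conn, delay time.Duration) net.Conn {
	d := &delayedConn{Conn: inner, delay: delay, out: make(chan []byte, 1024)}
	go d.flushLoop()
	return d
}

// Write queues the packet and reports success immediately; a single
// goroutine releases queued packets in order after the configured delay.
func (d *delayedConn) Write(p []byte) (int, error) {
	buf := make([]byte, len(p))
	copy(buf, p)
	d.out <- buf
	return len(p), nil
}

func (d *delayedConn) flushLoop() {
	for p := range d.out {
		time.Sleep(d.delay)
		if _, err := d.Conn.Write(p); err != nil {
			return
		}
	}
}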

daonb avatar Feb 18 '21 10:02 daonb

@daonb It's hard to debug without a reproducible case... What I saw:

Many data channels are used

% cat frozen2.log|grep 'data channel'
info	Deleting data channel 22
info	Ading data channel 27
info	Deleting data channel 24
info	Deleting data channel 27
info	Deleting data channel 25
info	Deleting data channel 26

SACK stops coming in from the peer after 23:23:33.054958

The last TSN SACK'ed was 903304684, but DATA chunks are still coming in. The log covers the last 23 seconds; during that time, a total of 37 DATA chunks were received:

% cat frozen2.log |grep '] DATA' |wc -l
      37

One T3-rtx timeout is in the log, but it was already the 5th RTO:

% cat frozen2.log |grep '] T3-rtx'
sctp TRACE: 23:23:33.054718 association.go:1523: [0xc000091380] T3-rtx timer start (pt3)
sctp TRACE: 23:23:33.054935 association.go:1523: [0xc000091380] T3-rtx timer start (pt3)
sctp TRACE: 23:23:33.054979 association.go:1523: [0xc000091380] T3-rtx timer start (pt3)
sctp DEBUG: 23:23:45.856419 association.go:2161: [0xc000091380] T3-rtx timed out: nRtos=5 cwnd=1228 ssthresh=4912

Two retransmissions occur right after the T3-rtx timeout, but there's no SACK for them:

  • TSN=903304685
  • TSN=903304692

Two errors

% cat frozen2.log |grep 'ERROR'
sctp ERROR: 2021/01/30 23:23:45 [0xc000091380] stream 5 not found)
sctp ERROR: 2021/01/30 23:23:45 [0xc000091380] stream 5 not found)

My Thoughts

  • DATA chunks are coming in, so inbound traffic appears to be working
  • The retransmissions should cause SACKs to come back, but that is not happening. Is outbound traffic not reaching the peer?

@daonb I have no clue other than that the peer stopped sending SACKs (Selective ACKs). Can you describe your environment in more detail?

  • How do you use data channels?
  • Do you use any "partial reliability" options? (maxRetransmits, maxPacketLifetime, etc)
  • What is the network configuration (any routers, NATs in between the two endpoints, etc)?
  • How often would this happen?
  • Only with iPadOS? How about with other browsers or pion-to-pion?

Recently, I came across a bug in usrsctp (which most browsers use) that occurred when the partial reliability option was used over a lossy connection - see #138 - possibly related. I am not sure if iPadOS uses it, and I don't know if the fix has already been applied to the browsers out there.
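
For anyone reading along, the partial-reliability options being asked about are set per data channel when it is created; a minimal sketch with arbitrary values:

// Sketch of the partial-reliability options in pion/webrtc; the label and
// values are arbitrary examples.
package example

import "github.com/pion/webrtc/v3"

func createLossyChannel(pc *webrtc.PeerConnection) (*webrtc.DataChannel, error) {
	ordered := false
	maxRetransmits := uint16(3) // give up after 3 retransmissions

	return pc.CreateDataChannel("lossy", &webrtc.DataChannelInit{
		Ordered:        &ordered,
		MaxRetransmits: &maxRetransmits,
		// MaxPacketLifeTime (in ms) is the other, mutually exclusive knob.
	})
}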

enobufs avatar Feb 19 '21 10:02 enobufs

Thanks for diving into this, I appreciate the help. Here are the answers to your questions, followed by one question I have.

  • How do you use data channels?

I'm using pion for a backend - webexec - used by terminal7 - a hybrid tablet app. Each terminal pane has its own data channel used to pipe data in & out, and in addition there's one data channel used for control. The 37 incoming messages are 37 keystrokes. Every time my server gets such a message it passes it to the shell, which causes the shell to print it on its pseudo terminal, and webexec sends that output back to the client. I know my server calls sctp to send the data, but it doesn't get to the client.

  • Do you use any "partial reliability" options? (maxRetransmits, maxPacketLifetime, etc)

I only set the ICETimeouts and the LoggerFactory

  • What is the network configuration (any routers, NATs in between the two endpoints, etc)

I use a home network to connect to a low-latency host server (latest speedtest: 24.42 Mbps download, 2.8 Mbps upload).

  • How often would this happen?

It happens only when the network is overloaded, which is quite often: I share the network with my two daughters and sometimes they are both on Zoom at the same time. I'm sure my uplink is badly congested, so the SACKs suffer huge delays and are sometimes dropped.

  • Only with iPadOS? How about with other browsers or pion-to-pion?

I should have it running on chrome-android soon, so I'll test it once I do.

Two retransmission occur right after the T3-rtx timeout, but there's no SACK

Why only two retransmissions? I suspect the client sends the SACKs but they don't reach the server, and the same happens with the acknowledgments of both retransmits. Shouldn't sctp keep retransmitting until it gets an ACK?

daonb avatar Feb 19 '21 13:02 daonb

Why only two retransmissions? I suspect the client sends the SACKs but they don't reach the server, and the same happens with the acknowledgments of both retransmits. Shouldn't sctp keep retransmitting until it gets an ACK?

Looking at this line:

sctp DEBUG: 23:23:45.856419 association.go:2161: [0xc000091380] T3-rtx timed out: nRtos=5 cwnd=1228 ssthresh=4912

The retransmission in the log was made after the 5th timeout (nRtos is the number of retransmission timeouts), and the log you gave me only covers the last 23 seconds. Pion/sctp implements the exponential backoff algorithm the RFC specifies, doubling the interval every time T3-rtx expires, so at that point the interval was at least 32 seconds (depending on the measured RTT).
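
A quick illustration of that doubling, assuming the timer started at SCTP's RTO.Min of 1 second (the real starting value depends on the measured RTT):

// Rough illustration of the T3-rtx backoff described above, assuming the
// timer started at SCTP's RTO.Min of 1 second.
package main

import (
	"fmt"
	"time"
)

func main() {
	rto := time.Second // assumed starting RTO
	for nRtos := 1; nRtos <= 5; nRtos++ {
		rto *= 2 // the RTO is doubled on every T3-rtx expiry
		fmt.Printf("after T3-rtx timeout #%d the next interval is %v\n", nRtos, rto)
	}
	// Prints 2s, 4s, 8s, 16s and finally 32s - matching "nRtos=5" in the log above.
}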

enobufs avatar Feb 19 '21 18:02 enobufs

Great! We've solved this.

IMO, the doubling of the retransmission timeout is wrong for real-time communication. I'm not saying you should change it; it's in the RFC:

The computation and management of RTO in SCTP follow closely how TCP manages its retransmission timer.

I think the authors were too lazy here. TCP was made for file transfer, not real-time communication. Luckily, in C7 the RFC defines RTO.max, which lets me limit the retransmission timeout.

Reading the code I learned we do have an rtoMax but it's a constant with a value of one minute. For webexec, I want to set it to 3 seconds and I believe it will solve my problem.

I guess I need to add RTOMax to webrtc.DataChannelParameters and make sure it gets all the way down to rtxTimer. Right?

daonb avatar Feb 20 '21 19:02 daonb

@daonb we don't control those; I don't know if we can add stuff. Here are all the values. Since we also compile to WASM, it will cause portability issues.

I think the best thing we could do is create an issue against webrtc-pc and see if there is any interest or ideas. I would call out that webrtc-nv matches up with your use case.

If they completely push back, then I would say adding a SettingEngine option is the best thing we can do. Then it is explicit that it is non-portable. I just don't want to add proprietary extensions and fragment the WebRTC space :(
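
To make that concrete, here is a hypothetical sketch of how such a non-portable knob would be wired in; the RTO-max setter shown in the comment did not exist at the time of writing, and its name is an assumption, not pion/webrtc API:

// Sketch of the SettingEngine pattern being proposed.
package example

import (
	"time"

	"github.com/pion/webrtc/v3"
)

func newAPIWithCustomSettings() *webrtc.API {
	se := webrtc.SettingEngine{}

	// Existing, real options look like this:
	se.SetICETimeouts(5*time.Second, 25*time.Second, 2*time.Second)

	// Hypothetical option discussed in this issue (the name is an assumption):
	// se.SetSCTPRTOMax(3 * time.Second)

	return webrtc.NewAPI(webrtc.WithSettingEngine(se))
}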

Sean-Der avatar Feb 20 '21 20:02 Sean-Der

It was the fifth retransmission that timed out, and the resulting retransmission also failed (I don't see a SACK for it). Could you tell me why you think reducing the max RTO would help? I think something else is going on.

We could make the max RTO configurable, but once a "pipe" is clogged, it usually takes a long time for the congestion to clear, and I cannot think of any better approach than exponential backoff, particularly when you cannot rule out that the timeout was caused by congestion. I think you should find out why the browser is not responding to all these retransmissions in the first place.

Maybe you could modify the value by hand (using replace in go.mod, etc) and see what happens first?
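
For reference, the by-hand route could be a go.mod replace pointing at a local checkout of pion/sctp (the ../sctp path is an example), with the rtoMax constant edited in that copy:

// In the application's go.mod, point pion/sctp at a local checkout
// (the ../sctp path is an example); the rtoMax constant can then be
// edited there.
replace github.com/pion/sctp => ../sctp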

enobufs avatar Feb 20 '21 20:02 enobufs

In the old TCP days everyone was transferring files, so congestion at the edge router was common and the backoff made sense: it was a way to wait for other transfers to finish. Usually, large files were transferred in the background, and the backoff timer worked great. We didn't really care when the transfer ended, as long as the magic happened and the files appeared on the target server.

Today's congestion is completely different. It's a result of rate-controlled live video streaming, and waiting for it to end does not make any sense. What's more, my application takes very little bandwidth compared to video streaming. With an RTOMax of 2-3 seconds I have a better chance of getting some packets through before the user gets frustrated.

I know that 60 seconds for RTOMax comes from the standard, but it's wrong for real-time applications. No one will wait 60 seconds to see the key they hit appear on the screen.

I wanted to set RTOMax per channel because I want to add file transfer, and in that case I need the exponential backoff. No matter, I'll cross that bridge when I get there. For now, I think I should open an issue in pion/webrtc to add RTOMax to the SettingEngine, as Sean suggested, and close this one.

daonb avatar Feb 24 '21 16:02 daonb

@daonb you can skip right to the PR even (and link to this issue in the commit). I am happy to merge this right away so you can get back to building :)

I would love to bring it up in the W3C. I bet lots of other devs are hitting it; they just don't understand this stuff at the level you do. Even in the WebRTC working group, only a small subset of people understand SCTP.

Sean-Der avatar Feb 24 '21 17:02 Sean-Der

I would love to take it up with the W3C. It's fun to be reminded of products I developed back in the '90s and what the net was like back in the last millennium.

daonb avatar Feb 24 '21 17:02 daonb

@enobufs I've found a case where this seems to consistently fail.

See https://github.com/cdr/coder-cli/pull/368/commits/5ff49026d3118f8647ceaa4c895ba1980a2ed233

If you copy the DataChannel to io.Discard on the server-side, the issue never happens.

https://github.com/cdr/coder-cli/pull/368/files#diff-62f8c8d3fb7e16200851008a80e52e4c000482ebe616712a6018431924c67866R333
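
For context, "copy the DataChannel to io.Discard" amounts to something like the following with a detached pion data channel (a sketch, not the coder-cli code; it assumes SettingEngine.DetachDataChannels() was enabled when the API was built):

// Sketch of draining a detached data channel into io.Discard.
package example

import (
	"io"

	"github.com/pion/webrtc/v3"
)

func drain(dc *webrtc.DataChannel) {
	dc.OnOpen(func() {
		raw, err := dc.Detach()
		if err != nil {
			return
		}
		// Read and discard everything the peer sends on this channel.
		go io.Copy(io.Discard, raw)
	})
}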

Running tests with

go test -v -bench=. cdr.dev/coder-cli/wsnet

Should result in output similar to...

...
sctp DEBUG: 05:42:02.761286 association.go:756: [0xc00019e700] sending SACK: SACK cumTsnAck=3461170072 arwnd=1026976 dupTsn=[]
sctp DEBUG: 05:42:02.761330 association.go:756: [0xc00019e700] sending SACK: SACK cumTsnAck=3461170075 arwnd=1023376 dupTsn=[]
sctp DEBUG: 05:42:02.761404 association.go:756: [0xc00019e700] sending SACK: SACK cumTsnAck=3461170077 arwnd=1020976 dupTsn=[]
sctp DEBUG: 05:42:03.761639 association.go:2399: [0xc0003faa80] T3-rtx timed out: nRtos=1 cwnd=1228 ssthresh=531434
sctp DEBUG: 05:42:05.762754 association.go:2399: [0xc0003faa80] T3-rtx timed out: nRtos=2 cwnd=1228 ssthresh=4912
ice INFO: 2021/06/15 05:42:08 Setting new connection state: Disconnected
pc INFO: 2021/06/15 05:42:08 ICE connection state changed: disconnected
pc INFO: 2021/06/15 05:42:08 peer connection state changed: disconnected

kylecarbs avatar Jun 15 '21 05:06 kylecarbs

I've isolated it further in this commit:

https://github.com/cdr/coder-cli/pull/368/commits/0dc0e56cfef10b2b1cb85603c2da7cd7e6a42a47

Uncomment the following line, and it works as expected:

https://github.com/cdr/coder-cli/pull/368/commits/0dc0e56cfef10b2b1cb85603c2da7cd7e6a42a47#diff-34c91981f059bb5d86d78bb53990a009247009fcfc946b00884b03f5c1a63d55R95

My assumption is there's a race condition somewhere when we write a lot of individual messages at the same time... but I'm not exactly sure where to look.

kylecarbs avatar Jun 15 '21 18:06 kylecarbs