sctp icon indicating copy to clipboard operation
sctp copied to clipboard

Add RACK to SCTP

Open philipch07 opened this issue 4 months ago • 19 comments

Description

Adds RACK to SCTP (see the paper that this is based on here: https://icnp20.cs.ucr.edu/proceedings/nipaa/RACK%20for%20SCTP.pdf)

I would also like to add the variable burst mitigation as described in the above paper (section 5a), but that will be done in a separate PR as it doesn't require RACK to be implemented.

Reference issue

Closes #389

Upd: expanded version of the paper above: https://duepublico2.uni-due.de/servlets/MCRFileNodeServlet/duepublico_derivate_00073893/Diss_Weinrank.pdf#page=97

philipch07 avatar Oct 04 '25 18:10 philipch07

Codecov Report

:x: Patch coverage is 93.82353% with 21 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 84.23%. Comparing base (1305e2a) to head (11f3274). :warning: Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
association.go 93.20% 12 Missing and 9 partials :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #390      +/-   ##
==========================================
+ Coverage   83.64%   84.23%   +0.58%     
==========================================
  Files          51       52       +1     
  Lines        3460     3767     +307     
==========================================
+ Hits         2894     3173     +279     
- Misses        428      444      +16     
- Partials      138      150      +12     
Flag Coverage Δ
go 84.23% <93.82%> (+0.58%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Oct 04 '25 18:10 codecov[bot]

@enobufs Hi, I think you've previously tried to implement RACK for SCTP, would you happen to have any pointers for whether my current work looks like it's on the right track, or any notes/comments on things that you had in mind? If you also have a branch for me to use as reference if you've worked on it before then that would also be very appreciated!

philipch07 avatar Oct 05 '25 00:10 philipch07

To test this I used this branch's SCTP version in the following examples:

  • Data-channel example from webrtc
    • v4.1.6 (https://github.com/pion/webrtc/pull/3251)
    • v4.1.0
    • v4.0.0
    • v3.3.6
    • v3.2.51
  • The simple data-channel example from https://github.com/pion/webrtc/pull/3252
  • The mDNS example (https://github.com/pion/mdns/pull/248)

Each of the examples (and the listed versions) worked as the behavior matched the expected behavior, so I think it's looking good so far :)

philipch07 avatar Oct 22 '25 18:10 philipch07

@philipch07 this is beyond amazing. You are doing something phenomenal and making it look easy :)

Anything I can do to help? In the back of your head think about writing a blogpost.

  • You get the credit you deserve
  • Its gets people excited/teaches the next generation about why this stuff is fun :)

Sean-Der avatar Oct 22 '25 18:10 Sean-Der

Thank you @Sean-Der! I think the only thing I need right now regarding this PR is more reviews and testing (ideally in real applications with some latency). I'm excited but also cautious about this since it seems to be a pretty desired feature!

The next thing I have in mind at the moment for RACK would be #394 (adaptive burst mitigation) which only further improves RACK's performance. I hope that the two features would make for a compelling upgrade for users :)

I'm also brainstorming some ideas for the blog post as you've mentioned but it may be tied closer to open source contributing as a whole instead of just about RACK (especially since this type of feature is a bit less flashy). With some time and luck, I think I can create an interesting narrative about my motives, interests, and journey through the contributions that I've made so far!

philipch07 avatar Oct 22 '25 21:10 philipch07

Also I performed a short test with Linux's traffic control in my WSL setup to simulate 500ms latency for outbound traffic (so anything going from server -> client would be delayed) using the data-channels example from webrtc (latest version, so v4.1.6) and it didn't drop any messages in either direction, even when I sent a burst of messages to the server (all of which were instant), and with intermittent messages to the server.

philipch07 avatar Oct 22 '25 21:10 philipch07

@philipch07 Could you test this with https://github.com/pion/example-webrtc-applications/tree/master/ebiten-game

ValorZard avatar Oct 23 '25 00:10 ValorZard

also this is off topic, but @philipch07 are you in the discord server? I'd like to talk you more casually about SCTP/datachannel stuff.

ValorZard avatar Oct 23 '25 02:10 ValorZard

this type of feature is a bit less flashy

This is the flashiest thing that could happen. If you can put up a graph that shows 'Under these network conditions file transfers happen X faster' and call out stuff that use DataChannels + Go (WebTorrent and Tor Snowflake) that is a huge thing.

I would love to introduce you to companies with a flashy intro of 'This is the person that made WebRTC Datachannels n% faster'. This and ICE Renomination that @kevmo314 working on I think have biggest change of just instantly making peoples WebRTC sessions better with no effort on their side.

Sean-Der avatar Oct 23 '25 02:10 Sean-Der

@philipch07 Could you test this with https://github.com/pion/example-webrtc-applications/tree/master/ebiten-game

Just finished testing this and confirmed that this works! I didn't test it with added latency though.

This is the flashiest thing that could happen. If you can put up a graph that shows 'Under these network conditions file transfers happen X faster' and call out stuff that use DataChannels + Go (WebTorrent and Tor Snowflake) that is a huge thing.

That's a fantastic idea, that makes me think of creating some sort of visual simulation to display the impact of RACK vs non-RACK implementation in practice!

I would love to introduce you to companies with a flashy intro of 'This is the person that made WebRTC Datachannels n% faster'. This and ICE Renomination that @kevmo314 working on I think have biggest change of just instantly making peoples WebRTC sessions better with no effort on their side.

That makes sense to me, and I agree that "free" performance with 0 changes/settings toggled is always a win for a user :)

philipch07 avatar Oct 23 '25 03:10 philipch07

Whoops forgot to @Sean-Der in the above comment^

philipch07 avatar Oct 23 '25 03:10 philipch07

Chiming in to say that this is really cool! The paper is really interesting, so cool to see this in pion.

kevmo314 avatar Oct 23 '25 03:10 kevmo314

I just found a slightly expanded version of the paper that we've all been looking at 🤦. Here is the slightly expanded version: https://duepublico2.uni-due.de/servlets/MCRFileNodeServlet/duepublico_derivate_00073893/Diss_Weinrank.pdf#page=97

From what I can tell, the core information is mostly the same but it has a much expanded section 5.

I just updated the PR description to include it as well.

philipch07 avatar Oct 24 '25 22:10 philipch07

@philipch07 nice work! I was busy earlier this week and did not have time to look on it yet. I am reading paper now.

sirzooro avatar Oct 25 '25 07:10 sirzooro

@philipch07 Hello I added a test for the condition you asked me to test in discord dms (High RTT -> low RTT), after you rebase you can un-comment this part https://github.com/pion/sctp/blob/012d2b44dbc178427e0c25b8f17a527f04e2032c/vnet_test.go#L944-L957 Your Implementation handles it very well:

  1. No missing data.
  2. No excessive fast retransmits.

I'll also test for the other way around.

JoTurk avatar Oct 25 '25 09:10 JoTurk

Going from low RTT to high RTT causes the packets to drop, but that's fixed after commit #244b97f

It needed to go to extreme values tho, and it was more of a manual test, I'll try to get it as a CI test, but It's very hard to make in a way that's not flaky and under ~20s

before:

--- FAIL: TestRACK_RTTSwitch_Regression (41.00s)
    vnet_test.go:887: 
                Error Trace:    /home/joe/work/pion/sctp/vnet_test.go:887
                Error:          regression
                Test:           TestRACK_RTTSwitch_Regression
                Messages:       missing echoes: got=61 want=200 missing=[61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199]
FAIL
FAIL    github.com/pion/sctp    41.006s
FAIL

After:

[joe@nixos:~/work/pion/sctp]$ go test -run ^TestRACK_RTTSwitch_Regression$ github.com/pion/sctp -count 1
ok  	github.com/pion/sctp	36.082s

Thank you.

JoTurk avatar Oct 25 '25 19:10 JoTurk

Also there are many network conditions that makes the current sctp implementation drop packets but this implementation handles it so well. really amazing work.

JoTurk avatar Oct 25 '25 19:10 JoTurk

@sirzooro Let me know if you're interested in reviewing this PR before I merge it and if you have any eta if so!

philipch07 avatar Oct 30 '25 03:10 philipch07

During some SCP testing, @JoeTurki found that the CPU utilization was a little higher than expected. Luckily there's in RFC 8985 sec 6.2 there's a section that I originally overlooked which addresses the concern:

As an optimization, an implementation can choose to check only segments that have been sent before RACK.xmit_ts. This can be more efficient than scanning the entire SACK scoreboard, especially when there are many segments in flight. The implementation can use a separate doubly linked list ordered by Segment.xmit_ts, insert a segment at the tail of the list when it is (re)transmitted, and remove a segment from the list when it is delivered or marked as lost. In Linux TCP, this optimization improved CPU usage by orders of magnitude during some fast recovery episodes on high-speed WAN networks.

The new commits address the issue and will be tested to see whether the more involved fix is worth it over the simple fix regarding CPU usage among other stats.

philipch07 avatar Nov 23 '25 23:11 philipch07