
NACK reporter is ineffective in wireless network environments

Open lherman-cs opened this issue 1 month ago • 11 comments

From my understanding, the current NACK reporter generates a NACK only after at least 33ms (NACK_MIN_INTERVAL) have passed since the last NACK. Each missing packet can be reported up to 5 times (MAX_NACKS). The NACK window length is determined by MAX_MISORDER. When a new highest sequence number arrives, the sliding window advances and drops any incoming packets with sequence_number < new_highest_sequence_number - MAX_MISORDER.
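The window-advance rule can be sketched like this (the constant value and function name are illustrative assumptions for the sake of discussion, not str0m's actual code):

```rust
/// Illustrative value only; str0m's real MAX_MISORDER may differ.
const MAX_MISORDER: u64 = 100;

/// Hypothetical sketch of the drop rule described above: once a new
/// highest sequence number advances the window, any packet older than
/// `highest_seq - MAX_MISORDER` is dropped.
fn is_dropped(seq: u64, highest_seq: u64) -> bool {
    seq < highest_seq.saturating_sub(MAX_MISORDER)
}
```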

Unfortunately, the current NACK reporter becomes ineffective under several common wireless network conditions such as LTE and 5G.

Low Jitter and Low Packet Loss

Assume:

  • jitter ≈ 30 ms
  • packet_loss ≈ 1%

In this scenario, the receiver’s jitter buffer is typically small (around 30–50 ms). Because the NACK reporter waits at least 33 ms before sending a NACK, there is already a minimum delay of 33 ms just to notify the sender about a missing packet. In practice, the retransmission will take 33 ms + 1 RTT before it reaches the SFU.

Given the small jitter buffer on the client side, the likelihood that the retransmitted packet becomes unusable by the time it arrives is high.

Proposal: The NACK reporter should send the first NACK as soon as a sequence gap is detected, or at least use the observed jitter as a baseline before applying throttling.

High RTT and Low Packet Loss

Assume:

  • RTT ≈ 200 ms
  • packet_loss ≈ 1%
  • received sequence: P1, P2, _, P4

When the NACK reporter sees a gap after receiving P4, it generates a NACK for P3 in the next report. It then continues sending NACKs for P3 until the packet is received or until the MAX_NACKS limit is reached. Once it hits MAX_NACKS, the window advances and the packet is treated as permanently lost—even if the retransmission eventually arrives.

This issue occurs whenever:

NACK_MIN_INTERVAL * MAX_NACKS < RTT

In these cases, the reporter exhausts all retry attempts before the retransmitted packet has any chance of arriving, since the retransmission itself requires at least one RTT. With the current constants, 33 ms × 5 = 165 ms, so any RTT above roughly 165 ms can trigger this.

Proposal: The NACK reporter should be RTT-aware. Instead of using a static NACK_MIN_INTERVAL, it could use a dynamic value such as RTT / 2, or otherwise base the interval on the estimated RTT.
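As a minimal sketch of the idea (the function name is hypothetical and the 33 ms floor is reused from above; this is an illustration, not a proposed patch):

```rust
use std::time::Duration;

/// Hypothetical RTT-aware NACK interval: half the estimated RTT,
/// clamped below by the current static value so behavior is unchanged
/// when the RTT is unknown or very small.
fn nack_interval(rtt: Option<Duration>) -> Duration {
    const NACK_MIN_INTERVAL: Duration = Duration::from_millis(33);
    match rtt {
        Some(rtt) => (rtt / 2).max(NACK_MIN_INTERVAL),
        None => NACK_MIN_INTERVAL,
    }
}
```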


Expected Outcome: Less bandwidth used for retransmission, and a smoother stream. Each NACK will have a higher chance of recovering its packet, and the reporter will be more responsive to legitimate packet loss.


Context: I've been observing these unrecoverable packet loss issues, typically while I'm on LTE. I have an SFU implementation running in RTP mode with no jitter buffer; the SFU just acts as a dumb pipe. This is an actual, obfuscated ping result (definitely not a friendly network, but great for an experiment 🙂):

64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=1 ttl=47 time=132 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=2 ttl=47 time=226 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=3 ttl=47 time=246 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=4 ttl=47 time=206 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=5 ttl=47 time=99.3 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=6 ttl=47 time=97.9 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=7 ttl=47 time=90.0 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=8 ttl=47 time=103 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=9 ttl=47 time=152 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=10 ttl=47 time=282 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=11 ttl=47 time=178 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=12 ttl=47 time=250 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=13 ttl=47 time=98.4 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=14 ttl=47 time=192 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=15 ttl=47 time=96.2 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=16 ttl=47 time=237 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=17 ttl=47 time=363 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=18 ttl=47 time=502 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=19 ttl=47 time=322 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=20 ttl=47 time=228 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=21 ttl=47 time=251 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=22 ttl=47 time=171 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=23 ttl=47 time=193 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=24 ttl=47 time=155 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=25 ttl=47 time=238 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=26 ttl=47 time=157 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=27 ttl=47 time=181 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=28 ttl=47 time=204 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=29 ttl=47 time=227 ms

--- example-host ping statistics ---
29 packets transmitted, 29 received, 0% packet loss, time 28035ms
rtt min/avg/max/mdev = 90.028/202.622/502.172/88.494 ms

Please let me know what you guys think or correct my understanding. I'm happy to make the contribution too!

lherman-cs avatar Nov 18 '25 01:11 lherman-cs

This is a good writeup. Whatever change is made, I think we should allow it to be configurable, ideally at runtime. This will allow users to implement their own tuning.

xnorpx avatar Nov 18 '25 01:11 xnorpx

This is a good writeup

AI generated?

Please let me know what you guys think or correct my understanding. I'm happy to make the contribution too!

It sounds like we need to improve the algorithm here, but rather than continuing to roll our own, I think we should study prior art in libWebRTC.

algesten avatar Nov 18 '25 08:11 algesten

@lherman-cs in the "High RTT and Low Packet Loss" scenario, a late packet is still accepted as long as its sequence number stays inside the current NACK window, i.e. seq >= active.start and not older than end - MAX_MISORDER. Hitting MAX_NACKS only stops further NACKs being generated; it does not by itself mark the packet as “lost” or rejected.
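That rule can be restated as a small sketch (a paraphrase of the description above, not str0m's code; names and the constant value are illustrative):

```rust
/// Illustrative value only; str0m's real MAX_MISORDER may differ.
const MAX_MISORDER: u64 = 100;

/// A late packet is still accepted while it remains inside the active
/// NACK window: at or after `start`, and not older than
/// `end - MAX_MISORDER`.
fn is_accepted(seq: u64, start: u64, end: u64) -> bool {
    seq >= start && seq >= end.saturating_sub(MAX_MISORDER)
}
```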

Regarding the first scenario ("Low Jitter and Low Packet Loss"): there is at most (not a minimum of) a 33 ms delay before sending the NACK (it's a fixed schedule); the average will be 15 ms. A receiving jitter buffer of only 30 ms seems overly aggressive, especially over the open internet, but in any case it would adapt (increase) based on the str0m retransmission rate. Are you reproducing this issue in a real use case?

The NACK reporter should send the first NACK as soon as a sequence gap is detected

I think we need some allowance for out-of-order, no ?

davibe avatar Nov 18 '25 09:11 davibe

This is a good writeup

AI generated?

Please let me know what you guys think or correct my understanding. I'm happy to make the contribution too!

It sounds like we need to improve the algorithm here, but rather than continuing to roll our own, I think we should study prior art in libWebRTC.

This is not AI generated. I wrote this myself. It's probably my bad habit of being too formal when joining a new community 🙂.

lherman-cs avatar Nov 18 '25 12:11 lherman-cs

I think we need some allowance for out-of-order, no ?

Reordering is very rare (unless FEC or RTX is involved, based on our data). I would say one should send a NACK as soon as you see a gap.


xnorpx avatar Nov 18 '25 14:11 xnorpx

@lherman-cs in the "High RTT and Low Packet Loss" scenario, a late packet is still accepted as long as its sequence number stays inside the current NACK window, i.e. seq >= active.start and not older than end - MAX_MISORDER. Hitting MAX_NACKS only stops further NACKs being generated; it does not by itself mark the packet as “lost” or rejected.

That's a good point. I suppose it only gets triggered if the network also has high jitter. In my case, I was streaming ~1.5Mbps or ~200-300pps (according to Chrome WebRTC internal). The NACK reporter should be resistant up to ~330ms of jitter. Looking at my ping result, it is technically possible for the jitter to jump to ~330ms, but it is certainly questionable whether this packet should be recoverable at this point. I'm going to keep testing this further with my LTE, and see if I can simulate the network locally to reproduce.

64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=16 ttl=47 time=237 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=17 ttl=47 time=363 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=18 ttl=47 time=502 ms
64 bytes from 123-77-192-44.example.net (123.77.192.44): icmp_seq=19 ttl=47 time=322 ms

Regarding the first scenario ("Low Jitter and Low Packet Loss"): there is at most (not a minimum of) a 33 ms delay before sending the NACK (it's a fixed schedule); the average will be 15 ms. A receiving jitter buffer of only 30 ms seems overly aggressive, especially over the open internet, but in any case it would adapt (increase) based on the str0m retransmission rate. Are you reproducing this issue in a real use case?

I might've just misread this. But isn't this saying to generate a new NACK report only after NACK_MIN_INTERVAL?

https://github.com/algesten/str0m/blob/2014cbb54358f45cc0a52adfa590f0020a972a9b/src/session.rs#L798-L804

https://github.com/algesten/str0m/blob/2014cbb54358f45cc0a52adfa590f0020a972a9b/src/session.rs#L211-L245

Right, the NACK report gets polled at an interval. I didn't mention this, but the actual NACK report can be scheduled even later than 33 ms. Technically, the delay can be up to 33 ms (NACK_MIN_INTERVAL) + 15 ms (poll interval) = ~48 ms.

This is the subscriber's WebRTC stats. The connection went through the public network, and the server was ~600 miles away. The jitter buffer (jitterBufferDelay/jitterBufferEmittedCount, in ms) hovers around ~30 ms. I usually see a low jitter buffer with low-bitrate streams, which makes sense since each frame has fewer packets.

[screenshot: subscriber WebRTC stats]

Yes, I'm seeing this on my LTE network (see above for the ping result; I just obfuscated the IP address). The video froze once in a while with small packet loss on the publisher side, and the str0m debug log showed that NACKs had been retransmitted MAX_NACKS times. I didn't check whether the sliding window marked the packet as permanently lost, though. I need to do more tests here.

The NACK reporter should send the first NACK as soon as a sequence gap is detected

I think we need some allowance for out-of-order, no ?

I think it highly depends on the application; it's a trade-off between goodput and latency. To favor lower latency, reporting a NACK as soon as we see a gap lets the receiver render the frame sooner, but it will increase bandwidth usage and can reduce the actual stream quality.

lherman-cs avatar Nov 18 '25 15:11 lherman-cs

@davibe @algesten A bit of a tangent, but there's an off-by-one fix for a bug that triggers on NACK report boundaries: https://github.com/algesten/str0m/pull/748. I don't think it's the main contributor to the packet loss I'm seeing here, but it should help a bit.

lherman-cs avatar Nov 18 '25 15:11 lherman-cs

This is not AI generated. I wrote this myself.. This is probably my bad habit to be too formal in joining a new community 🙂.

No worries! Welcome! :)

I compared str0m with libWebRTC:

What WebRTC does that str0m doesn't:

Timing:

  • Waits 1 RTT between NACK retries (str0m retries immediately)
  • 20ms periodic NACK processor for batching
  • RTT-based retransmission throttling

Reordering:

  • Reordering histogram tracking (128 packets)
  • Probability-based delay before first NACK
  • Learns network patterns to avoid false NACKs

Audio-specific:

  • Time-to-play estimation (don't NACK late packets)
  • Packet loss rate tracking (exponential filter)
  • Stops NACKing when loss rate too high

Send-side:

  • Prevents duplicate retransmissions (pending flag)
  • Culls acknowledged packets from history
  • Retransmission counter per packet

Scale:

  • 9600 packet history (str0m: ~100)
  • 100 max retries (str0m: 5)
  • Keyframe fallback when NACK list overflows

Core difference: WebRTC waits intelligently based on RTT/reordering/loss-rate; str0m NACKs immediately on any gap.
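The RTT-gated retry behavior in the first timing bullet could be sketched roughly like this (the struct and method are hypothetical, loosely modeled on the listed libWebRTC behavior, not either codebase's actual implementation):

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-packet NACK retry state, gated on RTT:
/// send the first NACK immediately, then wait one RTT between retries,
/// and stop entirely after `max_retries` attempts.
struct NackEntry {
    retries: u32,
    last_sent: Option<Instant>,
}

impl NackEntry {
    fn should_resend(&self, now: Instant, rtt: Duration, max_retries: u32) -> bool {
        if self.retries >= max_retries {
            return false; // retry budget exhausted
        }
        match self.last_sent {
            None => true,                            // first NACK: send right away
            Some(t) => now.duration_since(t) >= rtt, // retries: wait one RTT
        }
    }
}
```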

algesten avatar Nov 18 '25 19:11 algesten

Keyframe fallback when NACK list overflows

I consider this another core difference. I could implement this feature, as it was already included in my previous implementation.

Razzwan avatar Nov 20 '25 14:11 Razzwan

@Razzwan that would be great! I haven't done any legwork yet. I'm deprioritizing this issue for now.

@algesten thanks for sharing the differences with libWebRTC! I'm surprised at how big the packet history size and max retry count are; it seems excessive? I need to study libWebRTC independently.

lherman-cs avatar Nov 20 '25 23:11 lherman-cs

@lherman-cs would it be possible for you to test #754 ?

It's an attempt to relate NACK frequency to RTT.

algesten avatar Nov 21 '25 21:11 algesten

@lherman-cs would it be possible for you to test #754 ?

It's an attempt to relate NACK frequency to RTT.

Nice! I'll test it on my network over the next few days and report back here.

lherman-cs avatar Nov 23 '25 16:11 lherman-cs