Unexpected freezes with VP9 SVC and packetloss

Open geirbakke opened this issue 2 months ago • 18 comments

Bug Report

Your environment

  • Chrome 140
  • macOS 15.6.1
  • Mediasoup online demo

Issue description

We see lots of freezes when introducing packet loss on reception of VP9 SVC. It seems to happen only when using SVC, not K-SVC.

Reproducible with the mediasoup demo when forcing VP9 and overriding the default scalability mode: https://v3demo.mediasoup.org/?roomId=vp9svcpacketloss&forceVP9=true&webcamScalabilityMode=L3T3

I think this might be a browser/lib issue, so I have reported it there (https://issues.webrtc.org/issues/449408585). But if I'm wrong and it is a mediasoup issue, hopefully you can use this info to track it down.

I narrowed it down: the freezes happen when a single packet is dropped and the first retransmission of that packet is dropped too. Somehow we only receive a new NACK (a re-NACK for the same sequence number) if the lost packet is the first packet in a picture (the first packet sent after a packet with the marker bit set). Otherwise the re-NACK never arrives, so the receiver never gets what it needs, gives up after a while and sends a PLI, which unfreezes the image.

I can reproduce by:

  • running latest release and testing on localhost (no network lag or packetloss)
  • dropping every Nth (e.g. 200th) packet in WebRtcTransport (by not calling this->iceServer->GetSelectedTuple()->Send(data, len, cb);)
  • dropping the first RTX packet after this
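
The drop logic in the steps above can be sketched as a small decision helper. This is a minimal sketch, not mediasoup code: `LossInjector`, `ShouldDrop` and the RTX payload type passed to the constructor are hypothetical names and values, and it assumes the caller simply skips the `Send()` call whenever `ShouldDrop` returns true. Counting only non-RTX packets toward N is also an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical test-only helper (not part of mediasoup).
// Drops every Nth media packet, then the first RTX packet that follows.
class LossInjector
{
public:
	LossInjector(uint64_t dropEveryNth, uint8_t rtxPayloadType)
	  : dropEveryNth(dropEveryNth), rtxPayloadType(rtxPayloadType)
	{
		assert(dropEveryNth > 0);
	}

	// Returns true if the packet with this payload type should be dropped
	// (i.e. not handed to the transport) instead of sent.
	bool ShouldDrop(uint8_t payloadType)
	{
		if (payloadType == this->rtxPayloadType)
		{
			// Drop only the first retransmission after an injected loss.
			if (this->dropNextRtx)
			{
				this->dropNextRtx = false;
				return true;
			}
			return false;
		}

		// Media packet: drop every Nth one and arm the RTX drop.
		if (++this->mediaCount % this->dropEveryNth == 0)
		{
			this->dropNextRtx = true;
			return true;
		}
		return false;
	}

private:
	uint64_t dropEveryNth;
	uint8_t rtxPayloadType;
	uint64_t mediaCount{ 0 };
	bool dropNextRtx{ false };
};
```

With N = 200, the 200th media packet and the first RTX packet after it are dropped, while any later retransmission is let through. The RTX payload type has to match whatever was negotiated in the session.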

Logging packets, I see it only freezes when the browser does not re-NACK the packet whose retransmission I dropped. And that only happens when the packet isn't the first packet of a picture.

Maybe related to: https://github.com/versatica/mediasoup/issues/1536

geirbakke avatar Oct 06 '25 09:10 geirbakke

Thanks for the report. We will take a look when we get some time. But I just wonder how this can be an issue in mediasoup if the browser doesn't retransmit a NACK if it didn't receive a packet retransmission...

ibc avatar Oct 06 '25 09:10 ibc

Very well detailed @geirbakke, I'm following the chrome issue :+1:

jmillan avatar Oct 06 '25 09:10 jmillan

I'm not 100% sure the browser doesn't retransmit the NACK. I didn't use Wireshark, I only added custom logging in mediasoup and didn't see any NACK there. I'm not familiar with the mediasoup code, but I thought that in theory some VP9 or RTP rewriting could make the browser not NACK. I don't know. Either way, browser or mediasoup, it doesn't make sense to me why I don't see that NACK when there is a gap in the sequence and the image freezes, only to recover after a PLI that comes long after.

geirbakke avatar Oct 06 '25 09:10 geirbakke

@geirbakke, it would be really useful if you could sniff and check with wireshark that the NACK indeed is not sent by the browser.

jmillan avatar Oct 06 '25 10:10 jmillan

I'll investigate this.

jmillan avatar Oct 07 '25 07:10 jmillan

I have confirmed it with Wireshark. It never sends the second NACK. If I tweak the code to only drop when the previous packet had the marker bit set, it never freezes and always re-NACKs. If the previous packet's marker bit isn't set, it only NACKs once and freezes.

geirbakke avatar Oct 07 '25 08:10 geirbakke

Thanks @geirbakke, could you please share this info in the chrome issue, sharing the details and the .pcap too? This will likely bring their attention.

jmillan avatar Oct 07 '25 08:10 jmillan

@geirbakke, PR done #1620. It's working nicely in my tests. Can you please give it a try?

jmillan avatar Oct 10 '25 10:10 jmillan

Thx @jmillan, but I'm still afraid it might not be a mediasoup issue. As far as I understand it, the marker bit was already set correctly. Now, if there is a bug in the browser that prevents repeated NACKs in the middle of a picture, then of course it will happen less often when a lot of those marker bits are added. But I fear this workaround can have side effects. Anyway, I will test it next week, and if it was indeed a bug in how mediasoup sets the marker bit, this PR should reduce freezes to 0 in my targeted test. I also commented in the webrtc.org thread.

geirbakke avatar Oct 10 '25 13:10 geirbakke

The spec says that the marker must be set:

This bit MUST be set to one for the final packet of the highest spatial-layer frame (the final packet of the picture); otherwise, it is zero. Unless spatial scalability is in use for this picture, this bit will have the same value as the E bit described in [Section 4.2](https://www.rfc-editor.org/rfc/rfc9628.html#VP9payloadDescriptor). Note this bit MUST be set to one for the target spatial-layer frame if a stream is being rewritten to remove higher spatial layers.

This: " Note this bit MUST be set to one for the target spatial-layer frame if a stream is being rewritten to remove higher spatial layers."

We are not rewriting anything in VP9. So we are now being compliant, and it's actually working perfectly. Since we are not signaling by any means (in-band or out-of-band) to the decoder which is the target spatial layer, it's impossible for it to know, and hence we were getting the freezes before.

jmillan avatar Oct 10 '25 13:10 jmillan

When you subscribe to a lower layer, you are rewriting, as you are dropping, and at least altering the marker bit, afaiu.

I think the spec says

This bit MUST be set to one for the target spatial-layer frame if a stream is being rewritten

because unless you do that, you might never get the marker bit (if filtering out the highest spatial layer). And you did that properly before, at least since you last fixed this (before that there were problems subscribing to a lower layer of an SVC stream in some cases afaiu, probably because of a missing marker bit).

We already have the E bit, so why would we need a marker bit that is just mirroring this? The marker bit should signal the end of the picture, not the end of the frame. The E bit is end of frame. The marker bit is end of picture.

I can imagine it is valuable for the decoder to know that there are more frames coming that expand on the details of the current picture, by looking at the (missing) marker bit.

"(the final packet of the picture); otherwise, it is zero": this means afaiu that there should be only 1 packet with the marker bit in a picture, and a picture can have many frames. With the merged PR every frame gets the marker bit, right? And it is just a mirror of the E bit.
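
The two readings being contrasted here can be written down side by side. This is only a sketch of one way to read the quoted RFC text, not mediasoup's actual implementation: the function names are hypothetical, `endOfFrame` stands for the E bit of the VP9 payload descriptor, and it assumes every forwarded picture contains a frame at the target spatial layer.

```cpp
#include <cstdint>

// Per-picture reading: the marker is set only on the last packet of the
// highest spatial layer that is actually forwarded (the target layer when
// higher layers are removed).
constexpr bool MarkerPerPicture(bool endOfFrame, uint8_t spatialId, uint8_t targetSpatialId)
{
	return endOfFrame && spatialId == targetSpatialId;
}

// Per-frame reading: the marker mirrors the E bit, so every spatial layer
// frame of a picture ends with a marker.
constexpr bool MarkerPerFrame(bool endOfFrame)
{
	return endOfFrame;
}

// L3T3 picture forwarded in full (target spatial layer 2):
// only the end of the layer 2 frame closes the picture.
static_assert(!MarkerPerPicture(true, 0, 2), "layer 0 end of frame is not end of picture");
static_assert(MarkerPerPicture(true, 2, 2), "layer 2 end of frame closes the picture");
// Under the per-frame reading the same picture carries three markers.
static_assert(MarkerPerFrame(true), "every end of frame carries the marker");
```

Under the per-picture reading a receiver sees one marker per picture regardless of how many spatial layers are forwarded; under the per-frame reading it sees one per spatial layer frame.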

geirbakke avatar Oct 10 '25 13:10 geirbakke

As said before, we do not signal the decoder what our Target Spatial Layer (or Current Spatial Layer) is, so setting the marker bit to 1 only for such a spatial layer is something that the decoder cannot handle in any way.

jmillan avatar Oct 10 '25 13:10 jmillan

Where does it say that this depends on the receiver's knowledge of the sending side configuration? how about if bandwidth is too low to send the highest layer/frame and the sender needs to drop it - does that have to be signalled? and should it from this moment start setting marker bit on every frame?

And none of this explains why it doesn't re-nack when in the middle of a picture. and only on svc, not k-svc. Adding lots of marker bits (1 for every frame) will explain why you get less freezes if the when-to-nack logic depends on the marker bit being set. but why should it?

geirbakke avatar Oct 10 '25 14:10 geirbakke

how about if bandwidth is too low to send the highest layer/frame and the sender needs to drop it - does that have to be signalled

In that case the receiver will never receive a marker bit. Does the decoder not need it, also considering packet loss? I don't know.

And none of this explains why it doesn't re-nack when in the middle of a picture. and only on svc, not k-svc.

That's true. Honestly, I don't know. I didn't test this myself, but please do insist on the chrome issue if you made the tests and have the proof about the retransmissions. It may very well be a bug in libwebrtc.

Adding lots of marker bits (1 for every frame) will explain why you get less freezes if the when-to-nack logic depends on the marker bit being set. but why should it?

The spec says that unless spatial scalability is in use for this picture, the marker bit will have the same value as the E bit. In our case we are doing spatial scalability and should do it as we did before, that's true. But it was definitely not working, and the numbers are there. So I'll add an inline comment about it.

Please keep pushing on the chrome issue.

NOTE: Commented on the chrome issue.

jmillan avatar Oct 10 '25 14:10 jmillan

thx @jmillan 🤞

geirbakke avatar Oct 10 '25 14:10 geirbakke

Reopening, as setting the RTP marker to true on all packets whose descriptor contains end of frame does not fix the issue. Freezes go away but only the lowest resolution is rendered.

jmillan avatar Oct 14 '25 08:10 jmillan

@geirbakke, if I'm not mistaken you claim that when a packet is lost and its first RTX is also lost, and there are no more NACKs for it, then freezes occur.

Can you please take some time to share here and/or in the libwebrtc issue you opened the data that supports it? That is, a .pcap file that shows how certain packets do NOT arrive at Chrome (due to packet loss) and that it only NACKs once even if the RTX does not arrive?

How to do it?

Use the Chrome logs to dump the RTP/RTCP and later generate a .pcap with the text2pcap tool as explained here. No need to use video_replay. In summary:

  1. Start Chrome with logs enabled (it will also add RTP/RTCP data to the logs).
  2. Reproduce the issue.
  3. Retrieve the .pcap from those logs as indicated in the provided link.

NOTE: RTP will be plain (no srtp) so the content will be readable within wireshark, etc.

I'm afraid that without that info, the libwebrtc issue will go stale forever as there is no actionable item. Showing how certain packets are NOT NACKed again vs how others are, and how this generates freezes, will IMO get their attention, but for that, real data is needed.

jmillan avatar Oct 14 '25 15:10 jmillan

@jmillan correct. That is my claim, which I verified with Wireshark. I don't have time to work more on this currently, but I will post a pcap, or just learn and work directly on libwebrtc, when I get a bit more time, if no one has managed to reproduce it by then. IMO others reproducing it will be a much clearer indication that it is indeed the case than a pcap. Also, the libwebrtc thread might be a bit confusing when reading the post saying it should be closed and the follow-ups from there.

The simplest way for you to reproduce might be modifying the mediasoup code as mentioned in the first post:

dropping every Nth (i.e 200th) packet in WebRtcTransport (by not calling this->iceServer->GetSelectedTuple()->Send(data, len, cb);)

Then add some logic so that you drop the first RTX packet after that. I just checked packet->GetPayloadType() against what is usually the payload type for RTX in our setup.

If you then log when you drop, plus the marker (packet->HasMarker()) on every packet sent, you will hopefully see the 100% correlation (at least in my test) between the previous packet's marker bit and freezes.

If you log the NACKs, or check them in Wireshark, you will see that it only re-NACKs as described.

It will help to make sure you only send one video, no audio, in only one direction, to reduce noise.
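
The logging described above could be collected in a small bookkeeping helper like the one below. It is a rough sketch with hypothetical names (`DropTracker`, `DropRecord`), not mediasoup API: it assumes the caller reports every outgoing media packet together with its marker bit, plus every sequence number listed in a received NACK, and it ignores sequence number wraparound for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical logging helper; names are illustrative, not mediasoup API.
struct DropRecord
{
	uint16_t seq;       // RTP sequence number of the dropped packet
	bool prevHadMarker; // marker bit of the packet that preceded it
	unsigned nackCount; // how many NACKs were received for this seq
};

class DropTracker
{
public:
	// Call for every outgoing media packet, whether it is sent or dropped.
	void OnPacket(uint16_t seq, bool marker, bool dropped)
	{
		if (dropped)
		{
			this->records.push_back({ seq, this->prevMarker, 0 });
		}
		this->prevMarker = marker;
	}

	// Call once per sequence number listed in a received NACK.
	void OnNack(uint16_t seq)
	{
		for (auto& record : this->records)
		{
			if (record.seq == seq)
			{
				++record.nackCount;
			}
		}
	}

	size_t Size() const
	{
		return this->records.size();
	}

	const DropRecord& At(size_t idx) const
	{
		assert(idx < this->records.size());
		return this->records[idx];
	}

private:
	std::vector<DropRecord> records;
	// Assumption: treat the start of the stream as a picture boundary.
	bool prevMarker{ true };
};
```

If the behaviour reported in this thread holds, records with `prevHadMarker == false` would end up with `nackCount == 1`, while records with `prevHadMarker == true` would show two or more NACKs.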

geirbakke avatar Oct 14 '25 16:10 geirbakke