B-frame support
Hi there! Great library! Thank you for providing it! I'm using it to mux together "raw" video and audio packets for use with the MediaSource API to do a low-latency preview for a live streaming application. It works great!
While we discourage the use of B-frames (to minimize decode latency), one of our testers had B-frames enabled in his encoder, and received the "Timestamps must be monotonically increasing" error. I am using the PTS as the timestamp, but with B-frames these can definitely be out of order (the decode times, aka DTS values, are monotonically increasing, however).
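To make the symptom concrete, here is a hypothetical example (values invented for illustration) of the timestamps a small GOP with B-frames produces:

```ts
// Hypothetical timestamps (in ms) for a small GOP encoded with B-frames.
// Frames arrive from the encoder in decode order, so DTS increases
// monotonically, but the P-frame's PTS jumps ahead of the two B-frames
// that are presented before it.
const framesInDecodeOrder = [
  { type: 'I', dts: 0,  pts: 0  },
  { type: 'P', dts: 33, pts: 99 }, // decoded early because the B-frames reference it
  { type: 'B', dts: 66, pts: 33 },
  { type: 'B', dts: 99, pts: 66 },
];

// Using `pts` as the chunk timestamp yields 0, 99, 33, 66, which is not
// monotonically increasing and is what triggers the muxer error.
```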
I saw you referenced this issue in this comment:
Regarding b-frames and PTS and DTS, there's actually no logic in here from my side - I simply pipe all of the encoded frame data in and have PTS=DTS everywhere. Is this an error on my part? I had to read into this topic while writing the muxer, so it's possible I missed something! All of the files I muxed with high profiles worked great, though.
I don't think it's an error, per se, but if B-frame support is required, I think that we need to write a ctts box to provide the composition time offset (PTS - DTS) for each sample.
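For reference, here is a minimal sketch of how ctts entries could be derived from per-sample timestamps (the `Sample` shape here is hypothetical, not this library's internal type):

```ts
interface Sample {
  pts: number; // presentation timestamp, in timescale units
  dts: number; // decode timestamp, in timescale units
}

// The ctts box stores (sample_count, sample_offset) runs, where
// sample_offset = PTS - DTS. Consecutive samples with the same offset
// are collapsed into a single run-length-encoded entry.
const buildCttsEntries = (samples: Sample[]) => {
  const entries: { sampleCount: number; sampleOffset: number }[] = [];
  for (const sample of samples) {
    const offset = sample.pts - sample.dts;
    const last = entries[entries.length - 1];
    if (last && last.sampleOffset === offset) {
      last.sampleCount++;
    } else {
      entries.push({ sampleCount: 1, sampleOffset: offset });
    }
  }
  return entries;
};
```

Note that ctts version 0 only allows unsigned offsets, so either the DTS values have to be shifted so that PTS - DTS is never negative, or version 1 (which allows signed offsets) has to be used.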
I started to make this change, but I noticed that for fastStart === 'fragmented', we don't even keep sample times (timeToSampleTable, to write into the stts box), so then I got confused. Shouldn't we be tracking sample times and writing the stts box for fragments as well? Or is that not supported for some reason? I'm happy to work on this issue, but I need some guidance on the design choices so far.
Thank you!
Thank you for the kind words! So, you're using this library in fragmented mode?
It's funny that you ask about B-frames as I'm thinking more about demuxing lately where you can't get around them. It wasn't quite clear from your question: How exactly is your tester encoding in a way that results in B-frames? Are they using a custom encoder, and not WebCodecs directly? I don't recall there being any way to control the presence of B-frames within the Web Codecs API itself. But please correct me in this regard!
You are right, however; as soon as you introduce the possibility of PTS != DTS, you need the ctts box. For fragmented MP4, you typically need different boxes that fulfill the same functionality, which is also why there's no need to care about the stts box in fragmented mode. ChatGPT tells me the sgpd and sbgp boxes are relevant for ctts-like behavior in fMP4, but I'd need to read the spec on this for more detail.
A thing I'm wondering about is how one would design the API of the muxer to support B-frames. You would basically require the user to pass encoded chunks into the muxer in decode order, but allow the possibility that they may have out-of-order presentation times. If out-of-order presentation times are detected, the muxer would need to automatically infer the ctts box from this, but this seems difficult since the muxer can't know in advance if it'll still receive a frame "in the past" or not. This is relevant because things like chunk finalization rely on checking the length of the current chunk and flushing it out once it reaches a certain duration. If the muxer then received a frame with an earlier presentation time, there's nothing it could do, since the chunk is already flushed.

I guess what would work is if the muxer had a sort of "lookahead buffer", meaning it knows that |PTS - DTS| can be at most 100 milliseconds. Then, it would know when it can be confident about finalizing a chunk. None of this seems trivial, however, and I would always prefer PTS matching DTS for the simplicity it brings with it.
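To make the lookahead idea a bit more concrete, here is a rough sketch under the assumption that the user declares an upper bound on |PTS - DTS| up front; all names are made up and this is not the muxer's actual API:

```ts
// If the user promises that |PTS - DTS| never exceeds maxReorderDelay, then
// any sample still to arrive (with a decode timestamp greater than the newest
// one seen so far) must satisfy
//   pts >= dts - maxReorderDelay > newestDts - maxReorderDelay.
// So a chunk whose presentation range ends at chunkEndPts can be finalized
// safely once newestDts - maxReorderDelay has moved past chunkEndPts: no frame
// "from the past" can show up anymore.
const canFinalizeChunk = (
  chunkEndPts: number,    // presentation time at which the current chunk ends
  newestDts: number,      // largest decode timestamp received so far
  maxReorderDelay: number // promised upper bound for |PTS - DTS|
): boolean => {
  return newestDts - maxReorderDelay >= chunkEndPts;
};
```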
These are my thoughts on the matter so far. Definitely very interesting! Feel free to play around with it a bit, it just needs to be implemented very carefully.
Thank you for the kind words!
You're welcome! Thank you for replying!
So, you're using this library in fragmented mode?
That's right.
How exactly is your tester encoding in a way that results in B-frames?
Ah, good question, sorry for not being clearer. We have a live streaming platform, and we have a sort of preview service where the user connects over a WebSocket and receives already-encoded video and audio frames (plus other necessary metadata, like a track indication, PTS/DTS, ADTS frame headers for audio, etc.). So this user was streaming using OBS and specified bframes=3 (or something similar) in the advanced options for the x264 encoder, which results in a sort of pyramidal GOP with I, P, and B frames.
I don't recall there being any way to control the presence of B-frames within the Web Codecs API itself. But please correct me in this regard!
As far as I know you're absolutely right!
Thanks for your detailed thoughts... I'm going to read them again, think about it some more, experiment a bit, and get back to you!
Ah, so apparently the trun box lets you specify a per-sample sample_composition_time_offset if you set the sample-composition-time-offsets-present flag. That looks promising for the fragmented use case!
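For reference, these are the relevant tr_flags bits from ISO/IEC 14496-12; the per-sample record below is only a sketch of what the data could look like, not this library's actual structure:

```ts
// tr_flags bits of the trun box (ISO/IEC 14496-12). The last one is the
// sample-composition-time-offsets-present flag mentioned above.
const SAMPLE_DURATION_PRESENT = 0x000100;
const SAMPLE_SIZE_PRESENT = 0x000200;
const SAMPLE_COMPOSITION_TIME_OFFSETS_PRESENT = 0x000800;

// With that last flag set, each sample record in the trun box gains a
// sample_composition_time_offset field, i.e. PTS - DTS for that sample
// (signed when trun version 1 is used). A hypothetical per-sample record:
interface TrunSampleEntry {
  sampleDuration: number;              // present if SAMPLE_DURATION_PRESENT is set
  sampleSize: number;                  // present if SAMPLE_SIZE_PRESENT is set
  sampleCompositionTimeOffset: number; // present if the offsets flag is set
}
```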
Awesome! Keep me updated. And I'm happy this library helped with your use case :)
I opened a PR: https://github.com/Vanilagy/mp4-muxer/pull/47. I didn't find a contribution guide, so please feel free to tell me to go away and do some homework if I'm not adhering to some standards or something. :-)
Super dope, thank you! I'll check it when I have time. This library has been coded 99% by me, so there hasn't been a need for a contribution guide, but making a change really isn't rocket science here. You did everything correctly.
Regarding testing: There is no test suite, but I run test.html and both demos after a change to get a quick grasp if things are still functioning. Been playing around with Playwright recently for another project, so I might eventually add automated testing at some point!
Back in the day I remember being concerned about the handling of B-frames. But on Chromium (and perhaps other implementations) it turns out that B-frames are not enabled by the MF encoder on Windows (CODECAPI_AVEncMPVDefaultBPictureCount defaults to 0). And on Mac I think kVTCompressionPropertyKey_AllowFrameReordering is set to false, which disables B-frames.
That explains why not so many people have gotten into trouble so far ;). But since there are lots of other uses, like re-muxing or muxing non-WebCodecs-encoded streams as implied in this ticket, it's really great to get B-frame muxing support (and who knows, maybe WebCodecs encoders may one day also start to use B-frames).
Disabling B-frames makes sense from the WebCodecs API perspective, as objects like EncodedVideoChunk have no way to express a difference between decode time and composition time. But yes, I agree that if B-frames are a real possibility, this library should support them :)
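For context, the WebCodecs EncodedVideoChunk constructor only takes a single timestamp (plus an optional duration), so there is no field that could carry a separate composition time; the values below are just placeholders:

```ts
// A single `timestamp` field has to serve as both decode and presentation time.
const chunk = new EncodedVideoChunk({
  type: 'key',
  timestamp: 0,      // microseconds
  duration: 33_333,  // microseconds, optional
  data: new Uint8Array([/* encoded frame bytes */]),
});
```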
Addressed in https://github.com/Vanilagy/mp4-muxer/pull/47. Thanks, @JHartman5!