
Fix audio

Open kixelated opened this issue 2 years ago • 11 comments

I disabled audio because I was getting tired of working on the player. It needs to be synchronized with video, which is actually kind of annoying thanks to WebAudio.

kixelated avatar Jun 08 '23 21:06 kixelated

i recommend adopting the Flash Player "Tin-Can" timing model:

  • when there is no (or no longer an) audio track, use the wall-clock to schedule video and timed events
  • when there is an audio track, the audio sample clock drives the system clock. when the first sample of a decoded audio frame is played, the system clock snaps to the timestamp of the coded media frame from which that sample was decoded. the system clock is then linearly projected forward from that instant, as needed, until the next timestamped audio frame comes along to snap the system clock to the next timestamp. at any particular time, the most recently decoded video frame having a timestamp not greater than the system clock should be on display.
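a minimal sketch of that timing model (all names here are mine for illustration, not taken from the RTWebSocket repo): the audio callback "snaps" the clock to each played frame's coded timestamp, and everything else reads the projected clock.

```typescript
// Tin-Can-style system clock: the audio sample clock drives it.
// Each played audio frame snaps the clock to that frame's coded
// timestamp; between snaps, the clock is projected linearly.
class TinCanClock {
  private baseMediaTime = 0; // timestamp of last played audio frame (seconds)
  private baseWallTime = 0;  // wall-clock instant it started playing

  // Call when the first sample of a decoded audio frame hits the speaker.
  snap(mediaTimestamp: number, wallNow: number): void {
    this.baseMediaTime = mediaTimestamp;
    this.baseWallTime = wallNow;
  }

  // Current system-clock time, projected forward from the last snap.
  now(wallNow: number): number {
    return this.baseMediaTime + (wallNow - this.baseWallTime);
  }
}

// The most recently decoded video frame whose timestamp is not greater
// than the system clock should be on display.
function frameToDisplay(
  frames: { timestamp: number }[],
  clockNow: number,
): { timestamp: number } | undefined {
  let best: { timestamp: number } | undefined;
  for (const f of frames) {
    if (f.timestamp <= clockNow && (!best || f.timestamp > best.timestamp)) {
      best = f;
    }
  }
  return best;
}
```

with no audio track, you'd simply never call `snap()` after an initial snap to the start time, which degenerates to the wall-clock case in the first bullet.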

as an example, tcaudio.js + tcaudioprocessor.js in my RTWebSocket repo implement a Tin-Can-like interface* for WebAudio playback (with a volume control even), and tcmedia.js uses WebCodecs to decode video & audio and play them back synchronized (including if audio frames are dropped), with adjustable buffering and jitter adaptation, using tcaudio.

the only tricky part is actually caused by WebCodecs: at least the last time i checked in the Chrome WebCodecs implementation, even though the documentation says that the timestamp is supposed to be carried through unmodified between feeding in a coded frame to an audio decoder and getting the decoded raw samples out, it actually isn't; instead the timestamps on the decoded frames are just incremented by the duration of each decoded frame starting from the last decoder init/flush, so if any frames are missing, the timestamps of the decoded samples will be out of sync and wrong. this requires some janky heuristics to work around. you can't flush the decoder after every frame because (for stateful codecs like AAC) that causes an audio glitch.
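one workaround sketch (my naming, hedged on the assumption that the decoder emits one output per input in steady state, which holds for AAC and Opus once primed): keep your own FIFO of the coded timestamps you submitted, and overwrite the decoder's synthesized output timestamps with them.

```typescript
// Workaround for decoders that don't carry input timestamps through:
// remember coded timestamps in submission order and re-apply them to
// outputs. Assumes one output per input (true for steady-state AAC/Opus).
class TimestampRebaser {
  private pending: number[] = []; // coded timestamps (microseconds), in order

  // Call just before decoder.decode(chunk).
  onSubmit(codedTimestamp: number): void {
    this.pending.push(codedTimestamp);
  }

  // Call in the decoder's output callback; returns the corrected timestamp.
  // Falls back to the decoder's own timestamp if the FIFO is empty.
  onOutput(decoderTimestamp: number): number {
    const coded = this.pending.shift();
    return coded !== undefined ? coded : decoderTimestamp;
  }
}
```

note this only fixes bookkeeping; it does not fix the glitch from a genuinely missing frame, which still needs the flush-and-resync treatment described below in this thread.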

* with some of your favorites like bufferTime, bufferLength, currentTime (for NetStream.time), and status events like NetStream.Buffer.Full, NetStream.Buffer.Empty, and NetStream.Buffer.Flush.

zenomt avatar Oct 07 '23 18:10 zenomt

@zenomt How would you deal with clock drift without something like NetEQ? Playing audio at wall clock seems reasonable, but it can't compensate for the fact that 20 ms on my machine is not 20 ms on your machine. You have to stretch and shrink audio using something like NetEQ. Or am I missing something?

chrisprobst avatar Nov 02 '23 07:11 chrisprobst

my implementation handles jitter and clock drift the same way:

  • if playout is too fast, you'll eventually underrun. underrun causes a rebuffering. my implementation has a tunable rebuffer-restart threshold, and for "as live as possible" it can be as low as one new sample.
  • if playout is too slow, the playout buffer will grow. if the minimum buffer level over a sliding sampling window (example: 16s for "very live", 2 minutes for "buffered playback") exceeds a configurable threshold, then discard some% of audio frames until the minimum buffer level falls below the threshold.
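the second bullet can be sketched like this (names and structure are mine, not from the implementation described above): track the minimum buffer level over a sliding window, and only drop when even the minimum stays above the threshold, i.e. when the buffer is persistently growing rather than just jittering.

```typescript
// Sliding-window buffer monitor: signals that an audio frame may be
// discarded when the *minimum* buffered duration over the window
// exceeds a threshold (persistent growth, not momentary jitter).
class BufferDriftMonitor {
  private window: number[] = [];

  constructor(
    private windowSize: number,    // number of level observations to keep
    private dropThreshold: number, // buffered seconds above which we drop
  ) {}

  // Record the current buffered duration (e.g. once per enqueued frame).
  // Returns true if a frame should be dropped.
  observe(bufferedSeconds: number): boolean {
    this.window.push(bufferedSeconds);
    if (this.window.length > this.windowSize) this.window.shift();
    if (this.window.length < this.windowSize) return false; // window not full yet
    return Math.min(...this.window) > this.dropThreshold;
  }
}
```

for the "very live" vs "buffered playback" settings mentioned above, the window size would correspond to roughly 16 seconds or 2 minutes of observations respectively.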

if you wanted to be extra fancy (my implementation doesn't do this), then to stave off an underrun, you could double some% of audio frames if the minimum buffered amount falls below a safety threshold.

for codecs like AAC and Opus, dropping or doubling audio frames isn't tremendously disruptive as long as the "some" percent is small (like less than 1%).

unfortunately, the Chrome implementation of the WebCodecs decoder doesn't carry through the timestamp on every coded audio frame, so dropping or doubling a frame will disrupt the timestamps on the output of the decoder. the workaround is to flush the decoder whenever you know you're dropping (or doubling) a coded frame, which will cause a little "pop" but reset/resync the timestamps on the decoded frames.

zenomt avatar Nov 02 '23 17:11 zenomt

Thanks for going into details.

chrisprobst avatar Nov 02 '23 17:11 chrisprobst

@zenomt I have trouble understanding how buffering solves the underrun issue effectively. With clock drift, every audio packet would play slightly too fast, and because audio is inherently linked to video, essentially every audio packet would end in an underrun, or A/V sync would force a short silence. I believe for smooth playout you really need something like NetEQ.

chrisprobst avatar Nov 03 '23 10:11 chrisprobst

most computer audio hardware is reasonably accurate, and pretty stable (if it were unstable you'd hear that easily, and if the sample rate were way off the pitch would be noticeably wrong). i've observed sample rates around 48008/s for a nominal rate of 48000/s (about 1.7 parts in 10000). for AAC and a nominal sample rate of 48000/s, each AAC frame (1024 samples) is 21.3 ms long. at the quantum of "AAC frame", and assuming a "very live" setting with an instantaneous resume (rather than accumulating a longer buffer), you'd underrun one frame time every 125 seconds (2 minutes), and that underrun would only be 21.3 ms of silence. if you accumulated a longer buffer when rebuffering, the interval between underruns requiring a rebuffer would be proportionally longer. for a half-second buffer and the above drift, you'd need to rebuffer (for half a second) every 49 minutes.
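working that arithmetic through with the round figure of 8 extra samples per second (the quoted 125 s and 49 min come from using the 1.7-in-10000 ratio instead; the small difference is just which drift figure you plug in):

```typescript
// Drift arithmetic from the paragraph above, using 8 samples/s of drift.
const nominal = 48000;                 // nominal sample rate (samples/s)
const actual = 48008;                  // observed playout rate
const driftPerSecond = actual - nominal; // 8 samples/s deficit

const aacFrame = 1024;                           // samples per AAC frame
const frameMs = (aacFrame / nominal) * 1000;     // ~21.3 ms per frame

// Time to accumulate one frame's worth of deficit ("very live" underrun):
const secondsPerUnderrun = aacFrame / driftPerSecond; // 128 s, ~2 minutes

// With a half-second buffer, time between rebuffers:
const bufferSamples = 0.5 * nominal;                      // 24000 samples
const secondsPerRebuffer = bufferSamples / driftPerSecond; // 3000 s = 50 min
```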

zenomt avatar Nov 03 '23 16:11 zenomt

Hey @kixelated,

I'd love to enable audio in this library and implement a decent form of A/V sync to make it more usable. What's the first thing I should look at, or where should I start? Any pointers would be greatly appreciated. :)

kelvinkirima014 avatar Jul 05 '24 03:07 kelvinkirima014

I think the audio worklet needs a rewrite, but I don't know the precise problem. The current approach of using ordered Streams kinda sucks and results in blindly queuing data with no expectation of when it will (and should) be played.

kixelated avatar Jul 06 '24 03:07 kixelated

I have been down a rabbit hole; turns out web audio is hard. First goal is to get any kind of audio out, even if it's noise, as all we get right now is silence. Looking at the code here, it seems we only send the video canvas and never forward the audio channels?

I also encounter this warning after I run the code locally and try watching a published stream:

The AudioContext was not allowed to start. It must be resumed (or created) after a user gesture on the page.

Apparently this could be fixed by making sure getAudioContext().resume() is called somewhere, similar to what you did with watch here, but I'm not sure where exactly that fits in the publisher.
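A sketch of one common way to handle this warning (the helper name and its wiring are my own, not from moq-js): register one-shot gesture listeners and call `resume()` on the suspended context from inside them, which satisfies the autoplay policy.

```typescript
// Resume a suspended AudioContext on the first user gesture, as the
// autoplay policy requires. The ctx/target shapes are structural so
// the helper can be exercised outside a browser.
type ResumableContext = { state: string; resume(): Promise<void> };
type GestureTarget = {
  addEventListener(
    type: string,
    fn: () => void,
    opts?: { once: boolean },
  ): void;
};

function resumeOnFirstGesture(ctx: ResumableContext, target: GestureTarget): void {
  const tryResume = () => {
    if (ctx.state === "suspended") void ctx.resume();
  };
  // Any of these counts as a user gesture for autoplay purposes.
  for (const ev of ["click", "touchend", "keydown"]) {
    target.addEventListener(ev, tryResume, { once: true });
  }
}
```

in a real page you'd call `resumeOnFirstGesture(audioContext, document)` once, right after constructing the context.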

kelvinkirima014 avatar Jul 08 '24 15:07 kelvinkirima014

gentle reminder that i posted links to running code using WebAudio, as well as a suggestion on a timing model for A/V sync, above in this Issue back in october 2023.

zenomt avatar Jul 08 '24 16:07 zenomt