mediacapture-transform
Is MediaStreamTrackProcessor for audio necessary?
Extracting this discussion from https://github.com/w3c/mediacapture-transform/issues/4, since it was not fully discussed there. The use cases for MediaStreamTrackProcessor for audio are unclear, given that its functionality largely overlaps with what WebAudio can do, and WebAudio is already widely deployed in all major browsers.
CC @padenot
The fact that there is overlap does not mean that we should not support it. After all, for video there is overlap with existing features as well. Also, while there is overlap, the MediaStreamTrackProcessor model is quite different from the AudioWorklet model.
The question is if the MediaStreamTrackProcessor model is a better fit in some cases. I'll reach out to audio developers to get more feedback, but some things that have been mentioned are:
- access to the original timestamps of the audio source
- better WebCodecs integration
- there are use cases that do not fit naturally with the clock-based synchronous processing model of AudioWorklet (e.g., applications with high CPU requirements but without strong latency requirements). The MediaStreamTrackProcessor model might be a better match in these cases.
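For concreteness, the pull-based MediaStreamTrackProcessor model boils down to a read loop over the track's `readable` stream. A minimal sketch, assuming a hypothetical `processChunk` callback (`MediaStreamTrackProcessor` itself is browser-only, but the loop works over any ReadableStream of frame-like objects):

```js
// Pull-based consumption, as with
//   const { readable } = new MediaStreamTrackProcessor({ track });
// The application decides when to read and how much to buffer,
// rather than being driven by a fixed audio render quantum.
async function consumeFrames(readable, processChunk) {
  const reader = readable.getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    await processChunk(value); // value: AudioData or VideoFrame in a browser
    if (typeof value.close === "function") value.close(); // release the frame
  }
}
```

Because the loop is plain async code, the application can fall behind on purpose (batching, heavy ML processing) without being tied to the real-time audio rendering clock.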
@guidou Quite a few of the WebCodecs Origin Trial participants are using it primarily for audio. Among game developers using WebCodecs for both audio and video, symmetry is an important aspect (e.g. using WebCodecs decode as an MSE substitute).
I fully agree that symmetry is an important benefit too for developers.
One use case to consider is https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html
@youennf are you referring to https://developer.mozilla.org/en-US/docs/Web/API/AudioWorklet (not yet in Safari) or to https://developer.mozilla.org/en-US/docs/Web/API/ScriptProcessorNode (available in all browsers, but deprecated)?
> are you referring to https://developer.mozilla.org/en-US/docs/Web/API/AudioWorklet (not yet in Safari) or to
I am referring to AudioWorklet, which is available in Safari.
The PR to adjust the compatibility data has just been merged and it appears MDN is slightly out of date: https://github.com/mdn/browser-compat-data/pull/10129/files#r621975812
> One use case to consider is https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html
This is indeed a good use case. It seems covered AFAIK by getUserMedia+MediaStreamAudioSourceNode+AudioWorklet.
> Quite a few of the WebCodecs Origin Trial participants are using it primarily for audio.
Can you clarify which WebCodecs Origin Trial API they are primarily using for audio? Is it MediaStreamTrackProcessor?
> I'll reach out to audio developers to get more feedback, but some things that have been mentioned are:
Thanks @guidou, this is helpful to identify the shortcomings of AudioWorklet. Based on that, we should indeed either improve WebAudio support (including API) or envision alternatives.
What was asked in the past is a pros-and-cons comparison of AudioWorklet vs. audio MediaStreamTrackProcessor. So far, it seems that MediaStreamTrackProcessor could be shimmed with AudioWorklet.
> I fully agree that symmetry is an important benefit too for developers.
The WebAudio API is very different from rendering APIs like Canvas/OffscreenCanvas, and for good reasons: it was designed to solve a specific problem in the best possible way.
By trying to build a single API for both audio and video, we miss the opportunity to build the best API dedicated for video. Symmetry is not always a good friend.
There are some known advantages of using AudioWorklet over MediaStreamTrackProcessor. With AudioWorklet, an application can implement its own buffering strategy and choose how best to present data for processing.
For instance, an application might start by processing 10 ms chunks and buffer 5 of them. At some point, to cope with network conditions, the application might switch to 50 ms chunks and increase buffering to 5 chunks of 50 ms.
This is not easily doable with MediaStreamTrackProcessor: maxBufferSize is fixed at construction time and the audio frame size is not under the application's control.
> One use case to consider is https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html
>
> This is indeed a good use case. It seems covered AFAIK by getUserMedia+MediaStreamAudioSourceNode+AudioWorklet.
Apologies if I'm missing something obvious, but it doesn't seem possible to process both the audio and video inputs in an AudioWorklet. Nor does it seem possible for the audio data to be obtained outside of the AudioWorklet so that the audio and video can be processed together in a regular worker.
We can share the audio data to a regular worker through SharedArrayBuffer if possible, postMessage otherwise. I was referring to the audio part of the use case, I agree the video part deserves a better API than canvas.
I think the question of whether something is necessary is the wrong one to ask, since arguably, nothing is necessary. For example, using getUserMedia+MediaStreamAudioSourceNode+AudioWorklet plus some video processing API (such as MediaStreamTrackProcessor/MediaStreamTrackGenerator) in this context would be a lot more difficult than having a symmetric API for audio and video. For starters, SharedArrayBuffer requires cross-origin isolation. Setting up MediaStreamAudioSourceNode+AudioWorklet on one hand and video processing somewhere else, using completely different APIs with different programming models, adds even more friction. Moreover, the unique advantages offered by AudioWorklet (e.g., a real-time thread) do not apply to this specific use case.
I think this shows that there is real value in adding an audio version of the same API used for video. Keeping the bug open to continue the discussion.
It's possible to add controls for sample size and buffer size to MediaStreamTrackProcessor if that's a requested feature. It isn't part of the minimal surface, but where to put the controls is obvious; raw audio data is easy to re-chunk.
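To make that concrete, here is a sketch of such re-chunking for raw audio samples, assuming mono Float32Array chunks (the class name and API are illustrative, not part of any spec):

```js
// Accumulates incoming audio chunks of arbitrary size and re-emits
// them at a target chunk size (in samples), which the application
// can change at any time — e.g. switching from 10 ms to 50 ms chunks.
class AudioRechunker {
  constructor(targetSize) {
    this.targetSize = targetSize;
    this.pending = [];      // queued Float32Arrays
    this.pendingLength = 0; // total samples queued
  }
  setTargetSize(size) { this.targetSize = size; }
  push(samples) {
    this.pending.push(samples);
    this.pendingLength += samples.length;
    const out = [];
    while (this.pendingLength >= this.targetSize) {
      const chunk = new Float32Array(this.targetSize);
      let filled = 0;
      while (filled < this.targetSize) {
        const head = this.pending[0];
        const take = Math.min(head.length, this.targetSize - filled);
        chunk.set(head.subarray(0, take), filled);
        filled += take;
        if (take === head.length) this.pending.shift();
        else this.pending[0] = head.subarray(take);
        this.pendingLength -= take;
      }
      out.push(chunk);
    }
    return out;
  }
}
```

The same approach works whether the producer is an AudioWorklet's 128-frame render quanta or AudioData copied out of a track processor.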
It was mentioned in this thread that a MediaStreamTrackProcessor for audio is necessary to synchronize audio and video when using WebCodecs. But unless I missed something, it is probably still hard to accurately encode a MediaStream with WebCodecs even with a MediaStreamTrackProcessor for audio.
I tried to record the MediaStream coming from the user's mic and camera in Chrome v105, obtained in the simplest way:

```js
const mediaStream = await navigator.mediaDevices.getUserMedia({
  audio: true,
  video: true
})
```
I then used a MediaStreamTrackProcessor for each MediaStreamTrack to get the AudioData and VideoFrame objects respectively. However, the timestamp of the video seems to start at 0, whereas the timestamp of the audio starts at some seemingly arbitrary value.

I think this is all fine according to the spec, but it doesn't really help to synchronize the audio with the video. If I want to start the recording at a given point in time, which VideoFrame and which AudioData are the first ones I should pass on to the encoder?
It would be nice to have a way of knowing the offset between the two timestamps. Some API telling me that AudioData.timestamp === 62169.819898 and VideoFrame.timestamp === 0.566633 represent the same point in time would be really helpful.
Also I guess this all becomes very tricky when the recording is long enough for the two streams to drift apart.
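For the drift-free part of the problem, the application could capture one pair of timestamps known to represent the same instant and map between the two clocks from there. A sketch, assuming such a pair is available from some future API or heuristic (the function name is illustrative):

```js
// Maps timestamps between the two per-track clocks, given one pair
// of timestamps (audioTs, videoTs) known to represent the same
// instant. WebCodecs timestamps are in microseconds; the unit only
// has to be consistent between the two arguments.
function makeAligner(audioTs, videoTs) {
  const offset = videoTs - audioTs; // audio ts + offset ≈ video ts
  return {
    offset,
    audioToVideoClock: (ts) => ts + offset,
    videoToAudioClock: (ts) => ts - offset,
  };
}
```

This only establishes the initial alignment; handling long recordings would still require periodic re-alignment against fresh timestamp pairs, which is exactly the drift problem noted above.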
FWIW, in Chrome, MSTP for audio is used 3X more than MSTP for video nowadays.
At Zoom we're currently using MediaStreamTrackProcessor for video, and WebAudio for audio (very similar to this pattern: https://developer.chrome.com/blog/audio-worklet-design-pattern#webaudio_powerhouse_audio_worklet_and_sharedarraybuffer).
It works, but there's a lot of complexity that comes with WebAudio and SharedArrayBuffers, and handling the case when SharedArrayBuffer is not available. Having MediaStreamTrackProcessor for audio would certainly simplify things.