mediacapture-transform
Is MediaStreamTrackProcessor for audio necessary?
Extracting this discussion from https://github.com/w3c/mediacapture-transform/issues/4, since it was not fully discussed there. The use cases for MediaStreamTrackProcessor for audio are unclear, given that its functionality largely overlaps with what WebAudio can do, and WebAudio is already widely deployed in all major browsers.
CC @padenot
The fact that there is overlap does not mean that we should not support it. After all, for video there is overlap with existing features as well. Also, while there is overlap, the MediaStreamTrackProcessor model is quite different from the AudioWorklet model.
The question is if the MediaStreamTrackProcessor model is a better fit in some cases. I'll reach out to audio developers to get more feedback, but some things that have been mentioned are:
- access to the original timestamps of the audio source
- better WebCodecs integration
- there are use cases that do not fit naturally with the clock-based synchronous processing model of AudioWorklet (e.g., applications with high CPU requirements but without strong latency requirements). The MediaStreamTrackProcessor model might be a better match in these cases.
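For concreteness, the pull-based MediaStreamTrackProcessor model boils down to a read loop over the track's `readable` stream. A minimal sketch, assuming a hypothetical `processChunk` callback (`MediaStreamTrackProcessor` itself is browser-only, but the loop works over any ReadableStream of frame-like objects):

```js
// Pull-based consumption, as with
//   const { readable } = new MediaStreamTrackProcessor({ track });
// The application decides when to read and how much to buffer,
// rather than being driven by a fixed audio render quantum.
async function consumeFrames(readable, processChunk) {
  const reader = readable.getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    await processChunk(value); // value: AudioData or VideoFrame in a browser
    if (typeof value.close === "function") value.close(); // release the frame
  }
}
```

Because the loop is plain async code, the application can fall behind on purpose (batching, heavy ML processing) without being tied to the real-time audio rendering clock.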
@guidou Quite a few of the WebCodecs Origin Trial participants are using it primarily for audio. Among game developers using WebCodecs for both audio and video, symmetry is an important aspect (e.g. using WebCodecs decode as an MSE substitute).
I fully agree that symmetry is an important benefit too for developers.
One use case to consider is https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html
@youennf are you referring to https://developer.mozilla.org/en-US/docs/Web/API/AudioWorklet (not yet in Safari) or to https://developer.mozilla.org/en-US/docs/Web/API/ScriptProcessorNode (available in all browsers, but deprecated)?
> are you referring to https://developer.mozilla.org/en-US/docs/Web/API/AudioWorklet (not yet in Safari) or to
I am referring to AudioWorklet, which is available in Safari.
The PR to adjust the compatibility data has just been merged and it appears MDN is slightly out of date: https://github.com/mdn/browser-compat-data/pull/10129/files#r621975812
> One use case to consider is https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html
This is indeed a good use case. It seems covered AFAIK by getUserMedia+MediaStreamAudioSourceNode+AudioWorklet.
> Quite a few of the WebCodecs Origin Trial participants are using it primarily for audio.
Can you clarify which WebCodecs Origin Trial API they are primarily using for audio? Is it MediaStreamTrackProcessor?
> I'll reach out to audio developers to get more feedback, but some things that have been mentioned are:
Thanks @guidou, this is helpful to identify the shortcomings of AudioWorklet. Based on that, we should indeed either improve WebAudio support (including API) or envision alternatives.
What was asked in the past is a pros-and-cons comparison of AudioWorklet vs. audio MediaStreamTrackProcessor. So far, it seems that MediaStreamTrackProcessor could be shimmed with AudioWorklet.
> I fully agree that symmetry is an important benefit too for developers.
The WebAudio API is very different from rendering APIs like Canvas/OffscreenCanvas, and for good reasons: it was designed to solve a specific problem in the best possible way.
By trying to build a single API for both audio and video, we miss the opportunity to build the best API dedicated for video. Symmetry is not always a good friend.
There are some known advantages of using AudioWorklet over MediaStreamTrackProcessor. With AudioWorklet, an application can implement its own buffering strategy and choose how best to present data for processing.
For instance, an application might start by processing 10 ms chunks and buffer 5 of them. At some point, to cope with network conditions, the application might switch to 50 ms chunks and increase buffering to 5 chunks of 50 ms.
This is not easily doable with MediaStreamTrackProcessor: maxBufferSize is fixed at construction time and the audio frame size is not under the application's control.
> One use case to consider is https://ai.googleblog.com/2018/04/looking-to-listen-audio-visual-speech.html
>
> This is indeed a good use case. It seems covered AFAIK by getUserMedia+MediaStreamAudioSourceNode+AudioWorklet.
Apologies if I'm missing something obvious, but it doesn't seem possible to process both the audio and video inputs in an AudioWorklet. Nor does it seem possible for the audio data to be obtained outside of the AudioWorklet so that the audio and video can be processed together in a regular worker.
We can share the audio data to a regular worker through SharedArrayBuffer if possible, postMessage otherwise. I was referring to the audio part of the use case, I agree the video part deserves a better API than canvas.
I think the question of whether something is necessary is the wrong one to ask, since arguably, nothing is necessary. For example, using getUserMedia+MediaStreamAudioSourceNode+AudioWorklet plus some video processing API (such as MediaStreamTrackProcessor/MediaStreamTrackGenerator) in this context would be a lot more difficult than having a symmetric API for audio and video. For starters, SharedArrayBuffer requires cross-origin isolation. Setting up MediaStreamAudioSourceNode+AudioWorklet on one hand and video processing somewhere else, using completely different APIs with different programming models, adds even more friction. Moreover, the unique advantages offered by AudioWorklet (e.g., a real-time thread) do not apply to this specific use case.
I think this shows that there is real value in adding an audio version of the same API used for video. Keeping the bug open to continue the discussion.
It's possible to add controls for sample size and buffer size to MediaStreamTrackProcessor if that's a requested feature. It isn't part of the minimal surface, but where to put the controls is obvious; raw audio data is easy to re-chunk.
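To make that concrete, here is a sketch of such re-chunking for raw audio samples, assuming mono Float32Array chunks (the class name and API are illustrative, not part of any spec):

```js
// Accumulates incoming audio chunks of arbitrary size and re-emits
// them at a target chunk size (in samples), which the application
// can change at any time — e.g. switching from 10 ms to 50 ms chunks.
class AudioRechunker {
  constructor(targetSize) {
    this.targetSize = targetSize;
    this.pending = [];      // queued Float32Arrays
    this.pendingLength = 0; // total samples queued
  }
  setTargetSize(size) { this.targetSize = size; }
  push(samples) {
    this.pending.push(samples);
    this.pendingLength += samples.length;
    const out = [];
    while (this.pendingLength >= this.targetSize) {
      const chunk = new Float32Array(this.targetSize);
      let filled = 0;
      while (filled < this.targetSize) {
        const head = this.pending[0];
        const take = Math.min(head.length, this.targetSize - filled);
        chunk.set(head.subarray(0, take), filled);
        filled += take;
        if (take === head.length) this.pending.shift();
        else this.pending[0] = head.subarray(take);
        this.pendingLength -= take;
      }
      out.push(chunk);
    }
    return out;
  }
}
```

The same approach works whether the producer is an AudioWorklet's 128-frame render quanta or AudioData copied out of a track processor.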
It was mentioned in this thread that a MediaStreamTrackProcessor for audio is necessary to synchronize audio and video when using WebCodecs. But unless I missed something, it is probably still hard to accurately encode a MediaStream with WebCodecs even with a MediaStreamTrackProcessor for audio.
I tried to record the MediaStream coming from the user's mic and camera in Chrome v105, obtained in the simplest way:

```js
const mediaStream = await navigator.mediaDevices.getUserMedia({
  audio: true,
  video: true
})
```
I then used a MediaStreamTrackProcessor for each MediaStreamTrack to get the AudioData and VideoFrame objects respectively. However, the timestamp of the video seems to start at 0, whereas the timestamp of the audio starts at some seemingly arbitrary value.

I think this is all fine according to the spec, but it doesn't really help to synchronize the audio with the video. If I want to start the recording at a given point in time, which VideoFrame and which AudioData are the first ones I should pass on to the encoder?
It would be nice to have a way of knowing the offset between the two timestamps. Some API telling me that AudioData.timestamp === 62169.819898 and VideoFrame.timestamp === 0.566633 represent the same point in time would be really helpful.
Also I guess this all becomes very tricky when the recording is long enough for the two streams to drift apart.
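For the drift-free part of the problem, the application could capture one pair of timestamps known to represent the same instant and map between the two clocks from there. A sketch, assuming such a pair is available from some future API or heuristic (the function name is illustrative):

```js
// Maps timestamps between the two per-track clocks, given one pair
// of timestamps (audioTs, videoTs) known to represent the same
// instant. WebCodecs timestamps are in microseconds; the unit only
// has to be consistent between the two arguments.
function makeAligner(audioTs, videoTs) {
  const offset = videoTs - audioTs; // audio ts + offset ≈ video ts
  return {
    offset,
    audioToVideoClock: (ts) => ts + offset,
    videoToAudioClock: (ts) => ts - offset,
  };
}
```

This only establishes the initial alignment; handling long recordings would still require periodic re-alignment against fresh timestamp pairs, which is exactly the drift problem noted above.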
FWIW, in Chrome, MSTP for audio is used 3X more than MSTP for video nowadays.
At Zoom we're currently using MediaStreamTrackProcessor for video, and WebAudio for audio (very similar to this pattern: https://developer.chrome.com/blog/audio-worklet-design-pattern#webaudio_powerhouse_audio_worklet_and_sharedarraybuffer).
It works, but there's a lot of complexity that comes with WebAudio and SharedArrayBuffers, and handling the case when SharedArrayBuffer is not available. Having MediaStreamTrackProcessor for audio would certainly simplify things.