Streaming Audio is choppy
Describe the bug
When streaming audio, the reconstructed signal of the streamed chunks sounds choppy.
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
To reproduce, run:
import gradio as gr
import numpy as np


def run(audio, state):
    sr, data = audio
    if state is None:
        state = data
    else:
        state = np.concatenate([state, data])
    audio = sr, state
    return audio, state


gr.Interface(
    fn=run,
    inputs=[
        gr.Audio(source="microphone", type="numpy", streaming=True),
        "state"
    ],
    outputs=[
        "audio",
        "state"
    ],
    live=True,
).launch()
Run this on your local machine and listen to the recorded audio. The problem also persists in the audio streaming demo: https://github.com/gradio-app/gradio/tree/main/demo/stream_audio
Screenshot
No response
Logs
-
System Info
Gradio 3.0.24
Ubuntu 20.04.4 LTS
Firefox 101.0.1 (64-bit)
Severity
blocking upgrade to latest gradio version
Thanks for reporting @yannickfunk, it seems like a bug. @aliabid94, could you take a look at what's going on?
Similar issue: #1332
Is there any progress here? A working version of this feature would be much appreciated!
Hi @yannickfunk, thanks for reporting the issue. We haven't had a chance to look into it yet, but we'll take a closer look next week.
If you need some assistance here, I am willing to help!
Hi @abidlabs @FarukOzderim and @aliabid94. I did some research and found out a lot; here are my conclusions:
The choppy audio comes from the start() and stop() calls on the MediaRecorder, because repeatedly stopping and restarting the recorder introduces artifacts at the chunk boundaries. It is the most straightforward way to implement streaming audio, but for ML showcases it is suboptimal, because the resulting audio sounds choppy.
An implementation aligned with the MediaRecorder API would be to make use of the timeslice argument when calling recorder.start(timeslice). With a timeslice of 500, the MediaRecorder fires "dataavailable" every 500 ms and provides the current chunk of captured audio. (See https://developer.mozilla.org/en-US/docs/Web/API/MediaRecorder)
> Fires periodically each time timeslice milliseconds of media have been recorded (or when the entire media has been recorded, if timeslice wasn't specified). The event, of type BlobEvent, contains the recorded media in its data property.
The problem with this approach is that only the first yielded chunk contains the codec (or audio format) information, so subsequent chunks are useless without the information from the first chunk (i.e. they cannot be saved as a valid wav file on their own). Prepending the codec information to every chunk would be the obvious solution, but this is not possible for codecs like Opus, since they have variable header lengths etc. (See https://stackoverflow.com/questions/48891897/send-chunks-from-mediarecorder-to-server-and-play-it-back-in-the-browser)
> You can't just needle-drop into the WebM stream. WebM/Matroska require some setup to initialize the track info and what not. After that, you'll have Clusters, and you have to start on a Cluster. Additionally, Chrome is going to require that each Cluster start on a keyframe, which you're not going to be able to guarantee with the data from MediaRecorder. Therefore, server-side transcoding (or at least, some nasty hacking on the VP8 stream) is needed.
The cleanest solution here would be to stream the chunks to the server and do the transcoding on the server side (without saving every audio chunk as a wav file).
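A minimal sketch of what that could look like, assuming the server keeps appending the raw WebM chunks to one growing buffer and ffmpeg is available on the machine (the function here is hypothetical and not part of Gradio):

```python
# Hypothetical sketch: accumulate the WebM chunks coming from MediaRecorder
# and let ffmpeg transcode the stream to WAV on the server side. Assumes the
# first chunk contains the container headers and that ffmpeg is installed.
import subprocess
import tempfile

webm_buffer = bytearray()  # grows as chunks arrive from the browser


def add_chunk_and_transcode(chunk: bytes) -> bytes:
    """Append one WebM chunk and return the whole recording so far as WAV bytes."""
    webm_buffer.extend(chunk)
    with tempfile.NamedTemporaryFile(suffix=".webm") as src:
        src.write(webm_buffer)
        src.flush()
        # -i: input file, -f wav: output format, pipe:1: write the WAV to stdout
        result = subprocess.run(
            ["ffmpeg", "-y", "-i", src.name, "-f", "wav", "pipe:1"],
            capture_output=True,
            check=True,
        )
    return result.stdout
```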
I got a quick and dirty solution to work using the extendable-media-recorder package (see https://github.com/chrisguttandin/extendable-media-recorder). With this package you can ensure that the chunks yielded by the media recorder are PCM (WAV format). The WAV format has a fixed 44-byte header, which can then be manually prepended to every yielded chunk. Every chunk can then be converted to base64 and saved as a wav file.
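To make the header trick concrete, here is a minimal Python sketch of the same idea on the receiving side (the function name is hypothetical; it assumes the incoming chunks are PCM/WAV as produced via extendable-media-recorder):

```python
# Hypothetical sketch of the 44-byte WAV header trick on the receiving side.
WAV_HEADER_SIZE = 44
header = None  # filled in from the first chunk


def chunk_to_wav(chunk: bytes) -> bytes:
    """Return bytes that can be written out as a standalone .wav file."""
    global header
    if header is None:
        # The first chunk starts with the WAV header; remember it.
        header = chunk[:WAV_HEADER_SIZE]
        return chunk
    # Later chunks are raw PCM samples; prepend the stored header.
    # (Strictly speaking, the header's chunk-size fields would also need to be
    # patched to match this chunk's length; omitted here for brevity.)
    return header + chunk
```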
TLDR: There are a few caveats and no obvious solution.
What do you think?
This is so helpful @yannickfunk! Trying it out now
Hey @yannickfunk, I've been trying to implement your suggestion and got a bit stuck; could you help me out? You said you got a quick and dirty version working, could you share that code? I've posted my quick and dirty version, which isn't producing parseable base64, below. (The prepare_audio method is the only relevant part of the code, I believe.)
Also another question: why can't I use recorder = new MediaRecorder(stream, {mimeType: "audio/webm;codecs=pcm"}); instead of this extendable-media-recorder library?
Thanks for the help!
<html>
  <body>
    <h1>test audio streamer</h1>
    <button onclick="record()">record</button>
    <button onclick="stop()">stop</button>
    <hr />
    <audio controls></audio>
    <script>
      let recorder;
      let audio_chunks = [];
      let player;
      let audio_blob;
      let inited = false;
      let recording = false;
      let pending = false;
      let last_chunk_index = 0;
      let header_chunk;

      // Read a Blob and resolve with a base64 data URL.
      function blob_to_data_url(blob) {
        return new Promise((fulfill, reject) => {
          let reader = new FileReader();
          reader.onerror = reject;
          reader.onload = () => fulfill(reader.result);
          reader.readAsDataURL(blob);
        });
      }

      // POST the base64 data URL to the server and play back its response.
      async function post(data) {
        pending = true;
        await fetch("http://localhost:4000/stream", {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({ data: data })
        })
          .then((r) => r.json())
          .then((r) => {
            document.querySelector("audio").src = r.value;
          });
        pending = false;
      }

      async function prepare_audio() {
        const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
        recorder = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=pcm" });
        recorder.addEventListener("dataavailable", async (event) => {
          audio_chunks.push(event.data);
          if (!pending) {
            let chunk_set;
            if (last_chunk_index === 0) {
              // First chunk: remember the 44-byte header and send everything so far.
              let first_chunk = await audio_chunks[0].arrayBuffer();
              header_chunk = first_chunk.slice(0, 44);
              chunk_set = audio_chunks;
            } else {
              // Later chunks: prepend the stored header to the new chunks.
              chunk_set = [header_chunk].concat(audio_chunks.slice(last_chunk_index));
            }
            audio_blob = new Blob(chunk_set, { type: "audio/wav" });
            last_chunk_index = audio_chunks.length;
            const value = await blob_to_data_url(audio_blob);
            post(value);
          }
        });
        inited = true;
      }

      async function record() {
        recording = true;
        audio_chunks = [];
        if (!inited) await prepare_audio();
        recorder.start(500);
      }

      const stop = () => {
        recorder.stop();
        recording = false;
      };
    </script>
  </body>
</html>
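The /stream server on localhost:4000 is not shown in the thread; purely as an illustration, a minimal Flask sketch matching the fetch() call above could look like this (the field names "data" and "value" are taken from the client code, everything else is an assumption):

```python
# Hypothetical Flask counterpart to the fetch() call above.
# It receives {"data": "<base64 data URL>"}, decodes the wav bytes,
# and echoes a data URL back so the <audio> element can play it.
import base64

from flask import Flask, jsonify, request
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # the test page is served from a different origin than port 4000


@app.route("/stream", methods=["POST"])
def stream():
    data_url = request.get_json()["data"]  # "data:audio/wav;base64,..."
    header, b64 = data_url.split(",", 1)
    wav_bytes = base64.b64decode(b64)      # the chunk as WAV bytes
    # ... process / concatenate wav_bytes here ...
    return jsonify({"value": data_url})    # echoed back for playback


if __name__ == "__main__":
    app.run(port=4000)
```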
Oh, I realize I'm still using webm with the PCM codec; I imagine that's causing the issue. Will try the library you recommended. (Still do send your code, please!)
Please find my code here: https://gist.github.com/yannickfunk/f5724e9af72f2a87b07f04df015e6d66
> Oh, I realize I'm still using webm with the PCM codec; I imagine that's causing the issue. Will try the library you recommended. (Still do send your code, please!)
Yes, I assume webm containers form clusters, and you can only use a chunk as valid data if it is the beginning of a cluster.
@aliabid94 did you manage to get it to work?
Taking a look now, thanks @yannickfunk!
Opened PR https://github.com/gradio-app/gradio/pull/2351, thanks so much @yannickfunk! I just had to tweak your code to support "pending", i.e. the case where the backend function isn't complete yet and we can't dispatch right away.