transformers.js
Using whisper-large-v3 to transcribe an audio buffer from a video fails
System Info
[email protected] @xenova/[email protected]
Environment/Platform
- [X] Website/web-app
- [ ] Browser extension
- [ ] Server-side (e.g., Node.js, Deno, Bun)
- [ ] Desktop app (e.g., Electron)
- [ ] Other (e.g., VSCode extension)
Description
I'm building a quick POC to see if I can transcribe audio to text from a video file. The process is:
- User selects video
- Get the audio channel from the video via AudioContext
- Convert to Float32Array
- Invoke transcriber pipeline
I get this error:
It doesn't seem to be specific to one video: any .mp4 file triggers the error.
I'm also not using a WebWorker yet; I'm just trying to get this working to validate that I can do this in the browser.
Reproduction
Here's my code:
<Input
  id="dropzone-file"
  type="file"
  accept="video/mp4"
  className="hidden"
  onChange={async (e) => {
    const fileList = e.target.files;
    const file = fileList?.[0];
    if (!file) {
      return;
    }
    const url = URL.createObjectURL(file);
    const video: HTMLVideoElement = document.createElement('video');
    video.onloadedmetadata = async () => {
      setProjectState({
        file,
        height: video.videoHeight,
        width: video.videoWidth,
        duration: video.duration,
      });
      const audioCtx = new AudioContext();
      const audioElement = document.createElement('audio');
      // Set the source of the audio element to the same source as the video
      audioElement.src = video.src;
      // Wait for the audio element to load
      await audioElement.load();
      // // Create a media element source node
      // const sourceNode = audioCtx.createMediaElementSource(audioElement);
      // Decode the audio data into a buffer array
      const audioBuffer = await audioCtx.decodeAudioData(
        await (await fetch(audioElement.src)).arrayBuffer(),
      );
      console.log(audioBuffer);
      // Convert the audio buffer into a Float32Array
      const numberOfChannels = audioBuffer.numberOfChannels;
      const length = audioBuffer.length;
      const float32Array = new Float32Array(numberOfChannels * length);
      // Concatenate the audio data of all channels
      for (let channel = 0; channel < numberOfChannels; channel++) {
        const channelData = audioBuffer.getChannelData(channel);
        float32Array.set(channelData, channel * length);
      }
      debugger;
      const transcriber = await pipeline(
        'automatic-speech-recognition',
        'Xenova/whisper-large-v3',
      );
      const output = await transcriber(float32Array, {
        return_timestamps: 'word',
      });
      console.log(output);
    };
    video.src = url;
    video.load();
  }}
/>
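One thing that might be worth ruling out: the code above concatenates the channels end to end, so a stereo file produces an array twice as long as the actual audio. Below is a minimal sketch of averaging the channels into a single mono track instead, on the assumption that the pipeline expects mono samples at 16 kHz (the rate Whisper works with). `mixToMono` and `ChannelSource` are hypothetical names for illustration, not part of transformers.js, and I can't say whether this is related to the error.

```typescript
// Minimal structural type covering the parts of AudioBuffer used here,
// so the helper can also be exercised outside the browser.
// (Hypothetical name, introduced only for this sketch.)
interface ChannelSource {
  numberOfChannels: number;
  length: number;
  getChannelData(channel: number): Float32Array;
}

// Hypothetical helper: average all channels sample by sample into one
// mono Float32Array whose length equals the per-channel sample count.
function mixToMono(source: ChannelSource): Float32Array {
  const mono = new Float32Array(source.length);
  for (let channel = 0; channel < source.numberOfChannels; channel++) {
    const data = source.getChannelData(channel);
    for (let i = 0; i < source.length; i++) {
      // Each channel contributes an equal share of the final sample.
      mono[i] += data[i] / source.numberOfChannels;
    }
  }
  return mono;
}
```

In the handler above this would stand in for the concatenation loop, e.g. `const mono = mixToMono(audioBuffer)`. Creating the context as `new AudioContext({ sampleRate: 16000 })` should also make `decodeAudioData` resample the decoded audio to 16 kHz, rather than the device default.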