
Using whisper-large-v3 to transcribe an audio buffer from a video fails

Open · pthieu opened this issue 1 year ago · 0 comments

System Info

[email protected] @xenova/[email protected]

Environment/Platform

  • [X] Website/web-app
  • [ ] Browser extension
  • [ ] Server-side (e.g., Node.js, Deno, Bun)
  • [ ] Desktop app (e.g., Electron)
  • [ ] Other (e.g., VSCode extension)

Description

I'm building a quick POC to see if I can transcribe audio to text from a video file. The process is:

  1. User selects video
  2. Get the audio channel from the video via AudioContext
  3. Convert to Float32Array
  4. Invoke transcriber pipeline

I get this error (screenshot attached: CleanShot 2024-02-09 at 11 39 51@2x)

This doesn't seem to be specific to my video; the error happens with any .mp4 file.

I'm also not using a Web Worker; I'm just running everything on the main thread to validate that this works in the browser at all.
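From the examples in the transformers.js docs, the pipeline appears to expect a mono Float32Array sampled at 16 kHz, so I suspect concatenating the channels back-to-back (step 3) is part of the problem. A sketch of a downmix I could try instead (the helper name is mine, not from the library):

```typescript
// Downmix an AudioBuffer's channels into a single mono Float32Array by
// averaging the samples across channels (helper name is my own).
function downmixToMono(channels: Float32Array[]): Float32Array {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}
```

The channel data would come from `audioBuffer.getChannelData(channel)` for each channel, and the resulting mono array is what gets passed to the transcriber.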

Reproduction

Here's my code:

// Note: `pipeline` is imported from '@xenova/transformers' at the top of the file
<Input
        id="dropzone-file"
        type="file"
        accept="video/mp4"
        className="hidden"
        onChange={async (e) => {
          const fileList = e.target.files;
          const file = fileList?.[0];

          if (!file) {
            return;
          }

          const url = URL.createObjectURL(file);
          const video: HTMLVideoElement = document.createElement('video');
          video.onloadedmetadata = async () => {
            setProjectState({
              file,
              height: video.videoHeight,
              width: video.videoWidth,
              duration: video.duration,
            });

            const audioCtx = new AudioContext();
            const audioElement = document.createElement('audio');

            // Set the source of the audio element to the same source as the video
            audioElement.src = video.src;

            // HTMLMediaElement.load() is synchronous and returns void, so
            // there is nothing to await here
            audioElement.load();


            // Fetch the blob URL and decode the audio data into an AudioBuffer
            const audioBuffer = await audioCtx.decodeAudioData(
              await (await fetch(audioElement.src)).arrayBuffer(),
            );
            console.log(audioBuffer);

            // Convert the audio buffer into a Float32Array
            const numberOfChannels = audioBuffer.numberOfChannels;
            const length = audioBuffer.length;
            const float32Array = new Float32Array(numberOfChannels * length);
            // Concatenate the audio data of all channels
            for (let channel = 0; channel < numberOfChannels; channel++) {
              const channelData = audioBuffer.getChannelData(channel);
              float32Array.set(channelData, channel * length);
            }

            const transcriber = await pipeline(
              'automatic-speech-recognition',
              'Xenova/whisper-large-v3',
            );
            const output = await transcriber(float32Array, {
              return_timestamps: 'word',
            });

            console.log(output);
          };

          video.src = url;
          video.load();
        }}
      />
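One thing I might try next: constructing the context as `new AudioContext({ sampleRate: 16000 })`, since `decodeAudioData` resamples to the context's rate during decode. Failing that, a naive linear-interpolation resampler would look roughly like this (a sketch with my own names, not library code):

```typescript
// Naive linear-interpolation resample from one sample rate to another
// (sketch; in the browser, new AudioContext({ sampleRate: 16000 }) can
// do the resampling during decodeAudioData instead).
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Interpolate between the two nearest input samples
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

With the decoded buffer downmixed to mono and resampled to 16 kHz, the Float32Array should match what the transformers.js examples feed the transcriber.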

pthieu avatar Feb 09 '24 16:02 pthieu