transformers.js

v3: Add RawAudio class

Open Th3G33k opened this issue 10 months ago • 5 comments

Following messages from #680

The 'save to wav' logic is my own simple implementation, based on the WAV file specs and on inspecting a generated wav file in a hex viewer.

The changes are listed below:

  • added RawAudio class, with .save(path) (supports browser, web worker and Node.js)
  • modified some audio pipelines to return a RawAudio object
  • added properties isBrowserEnv and isWebworkerEnv to env
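For readers unfamiliar with this kind of check, here is a minimal sketch of how such environment flags are commonly derived. It is not the code from this PR: detectEnvironment is a hypothetical helper, and the heuristics (a window with a document for browsers, importScripts without a window for workers) are assumptions that may not cover every runtime.

```javascript
// Hypothetical helper, not the PR's implementation.
// `scope` defaults to the real global object; passing a fake object makes
// the heuristics easy to exercise outside a browser.
function detectEnvironment(scope = globalThis) {
  // Assumption: a browser main thread exposes `window` with a `document`.
  const isBrowserEnv =
    typeof scope.window !== 'undefined' &&
    typeof scope.window.document !== 'undefined';

  // Assumption: a web worker exposes `importScripts` but has no `window`.
  const isWebworkerEnv =
    typeof scope.window === 'undefined' &&
    typeof scope.importScripts === 'function';

  return { isBrowserEnv, isWebworkerEnv };
}

console.log(detectEnvironment());
```

A save(path) method can then branch on these flags, for example triggering a download in the browser and writing a file in Node.js.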

Example usage:

const synthesizer = await pipeline('text-to-speech', 'Xenova/mms-tts-eng');
const output = await synthesizer('Hello, my dog is cute');
output.save("audio.wav");

Th3G33k avatar Apr 05 '24 09:04 Th3G33k

Thank you @xenova for the review.

Here are the changes I have made:

  • split the logic into two functions: toBlob() and save(path)
  • added type checking in constructor()
  • in save(), check the running environment first before proceeding
  • reduced the memory footprint by using new Blob([wav_header, audio]) instead of allocating an additional TypedArray with new Uint8Array(buf_size + wav_header.length)
  • added saveBlob(path, blob) in utils/core.js, used by both RawAudio and RawImage, to save a blob directly in the browser
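To illustrate the new Blob([wav_header, audio]) point, here is a rough sketch of the idea, not the PR's actual code. buildWavHeader and toWavBlob are hypothetical names, and the sketch assumes 32-bit float samples (WAV format code 3), a plain 44-byte header, and a little-endian platform, since the Float32Array bytes are handed to the Blob as-is.

```javascript
// Hypothetical helper: 44-byte RIFF/WAVE header for 32-bit float audio.
// `numFrames` is the number of samples per channel.
function buildWavHeader(numFrames, sampleRate, numChannels = 1) {
  const bytesPerSample = 4; // Float32
  const blockAlign = numChannels * bytesPerSample;
  const dataSize = numFrames * blockAlign;

  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const writeString = (offset, text) => {
    for (let i = 0; i < text.length; ++i) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 3, true); // format code 3 = IEEE float
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 8 * bytesPerSample, true); // bits per sample
  writeString(36, 'data');
  view.setUint32(40, dataSize, true);
  return header;
}

// The Blob references both parts, so no combined Uint8Array is allocated here.
function toWavBlob(samples, sampleRate) {
  const header = buildWavHeader(samples.length, sampleRate);
  return new Blob([header, samples], { type: 'audio/wav' });
}

const blob = toWavBlob(new Float32Array([0, 0.5, -0.5]), 16000);
console.log(blob.size); // 44 header bytes + 3 samples * 4 bytes
```

Only the small header is built by hand; the sample data is never copied into a second typed array by this code.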

Th3G33k avatar Apr 11 '24 09:04 Th3G33k

Thanks! 🤗 Would you mind benchmarking/comparing your code with https://www.npmjs.com/package/audiobuffer-to-wav, which I used in a demo a few months ago? Also, at the moment, we only support 1-channel audio, but their code supports 2-channel + interleaving (see here), which might be good to include.

Other than that, I like the abstractions you introduced for the RawImage and RawAudio classes, and this will be perfect to merge into the v3 branch for a musicgen demo I'm working on 🔥

xenova avatar Apr 12 '24 00:04 xenova

I have added support for 2-channel audio + interleaving.

interleave(keepOriginalValues) either uses a new buffer of length * 2 (keeping the original audio data), or a new buffer of length * 1 (overwriting the original audio data).

Below is a quick benchmark comparing it with encodeWAV(samples), which is used in the demo.
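As a reference for what interleaving means here, the following is a minimal sketch of the simple variant that allocates one new buffer of length * 2. interleaveChannels is a hypothetical function, not the interleave() method from this PR, and it does not reproduce the in-place option that overwrites the original data.

```javascript
// Hypothetical helper: merges two equal-length channels into one buffer
// laid out as L0, R0, L1, R1, ... (the order WAV expects for stereo data).
function interleaveChannels(left, right) {
  if (left.length !== right.length) {
    throw new Error('Both channels must have the same length');
  }
  const out = new Float32Array(left.length * 2);
  for (let i = 0, j = 0; i < left.length; ++i) {
    out[j++] = left[i];
    out[j++] = right[i];
  }
  return out;
}

const stereo = interleaveChannels(
  new Float32Array([1, 2, 3]),
  new Float32Array([4, 5, 6]),
);
console.log(stereo.toString()); // 1,4,2,5,3,6
```

This variant leaves both input channels untouched, at the cost of the extra allocation.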

function benchmark(){
    let i, input, output

    console.time('encodeWAV')
    for(i=0; i<20000; i++){
        input = new Float32Array(i).fill(i)
        output = encodeWAV(input)
        output = new Blob([output])
    }
    console.timeEnd('encodeWAV')

    console.time('RawAudio')
    for(i=0; i<20000; i++){
        input = new Float32Array(i).fill(i)
        output = new RawAudio(input, 16000)
        output = output.toBlob()
    }
    console.timeEnd('RawAudio')
}

/*
encodeWAV: 3216.6669921875 ms
RawAudio: 2702.23291015625 ms
---
encodeWAV: 3296.2138671875 ms
RawAudio: 2768.235107421875 ms
*/

encodeWAV is slower because it copies every audio value, one by one, into a new buffer:

    for (let i = 0; i < samples.length; ++i, offset += 4) {
        view.setFloat32(offset, samples[i], true)
    }

Unit test for interleave:

let audio = new RawAudio([new Float32Array([1,2,3,4,5]), new Float32Array([1,2,3,4,5])], 16000)
console.log(audio.interleave(true)[0].toString() == '1,1,2,2,3')

Th3G33k avatar Apr 12 '24 10:04 Th3G33k

Thanks again! Just letting you know this PR is marked for the next release :)

xenova avatar Apr 22 '24 10:04 xenova

I have merged branch v3 #545 into this PR

Th3G33k avatar May 08 '24 08:05 Th3G33k