
Processing batches of audio files through Essentia-Tensorflow pre-trained models

Open burstMembrane opened this issue 1 year ago • 8 comments

First of all thanks to the contributors of this library!

I'm currently trying to batch-create embeddings with the AudioSet-VGGish pre-trained model.

I'm able to follow the docs to download the pretrained model and generate embeddings:

from essentia.standard import MonoLoader, TensorflowPredictVGGish

audio = MonoLoader(filename="audio.wav", sampleRate=16000, resampleQuality=4)()
model = TensorflowPredictVGGish(graphFilename="audioset-vggish-3.pb", output="model/vggish/embeddings")
embeddings = model(audio)

The problem is that the examples don't show how to batch-process multiple audio files. When I put the code above in a for loop, it reinitializes TensorFlow on every iteration and runs really slowly, e.g.:

from essentia.standard import MonoLoader, TensorflowPredictVGGish

audio_paths = ["file1.wav", "file2.wav"]

for audio_path in audio_paths:
    audio = MonoLoader(filename=audio_path, sampleRate=16000, resampleQuality=4)()
    model = TensorflowPredictVGGish(graphFilename="audioset-vggish-3.pb", output="model/vggish/embeddings")
    embeddings = model(audio)

I've tried it like this and it does the same thing. Is there a way to process audio in batches, or to stop TensorFlow from reinitializing on each run?

burstMembrane avatar Jul 10 '23 06:07 burstMembrane

Yes, you can initialize MonoLoader and TensorflowPredictVGGish outside the inference loop:

from essentia.standard import MonoLoader, TensorflowPredictVGGish
audio_paths = ["file1.wav", "file2.wav"]

loader = MonoLoader()
model = TensorflowPredictVGGish(graphFilename="audioset-vggish-3.pb", output="model/vggish/embeddings")

for audio_path in audio_paths:
    loader.configure(filename=audio_path, sampleRate=16000, resampleQuality=4)
    audio = loader()
    embeddings = model(audio)

palonso avatar Jul 10 '23 10:07 palonso

Yes, you can initialize MonoLoader and TensorflowPredictVGGish outside the inference loop:

from essentia.standard import MonoLoader, TensorflowPredictVGGish
audio_paths = ["file1.wav", "file2.wav"]

loader = MonoLoader()
model = TensorflowPredictVGGish(graphFilename="audioset-vggish-3.pb", output="model/vggish/embeddings")

for audio in audio_paths:
    audio = loader.configure(filename=audio, sampleRate=16000, resampleQuality=4)
    embeddings = model(audio)
Running:

loader = MonoLoader()
print(loader)

returns TypeError: __str__ returned non-string (type NoneType).

It seems loader.configure() is not behaving as expected: it always returns None, also in your code above.

Galvo87 avatar Aug 04 '23 08:08 Galvo87

That's the expected return value for configure().

palonso avatar Aug 07 '23 14:08 palonso

OK, got it, but I still don't understand how this is supposed to work...

Galvo87 avatar Aug 08 '23 10:08 Galvo87

sorry @Galvo87! It was a mistake in my example script. I've updated the script and double-checked that it works.

The loader had to be configured first and then called.

palonso avatar Aug 09 '23 06:08 palonso

@burstMembrane, did you find a good solution for batch processing? I have 8 GPUs and want to extract a bunch of embeddings as quickly as possible.

I noticed the "batch_size" argument, but it seems to control how many "patches" are processed from a single input audio file, rather than being an option to batch-process multiple audio files.

Any tips appreciated.

jbm-composer avatar May 09 '24 15:05 jbm-composer

The simplest approach would be to modify the script above to receive a list of files to process, using something like argparse:

import argparse
from essentia.standard import MonoLoader, TensorflowPredictVGGish

def main(audio_paths):
    loader = MonoLoader()
    model = TensorflowPredictVGGish(graphFilename="audioset-vggish-3.pb", output="model/vggish/embeddings")

    for audio_path in audio_paths:
        loader.configure(filename=audio_path, sampleRate=16000, resampleQuality=4)
        audio = loader()
        embeddings = model(audio)

        # save the embeddings ...

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process audio files using VGGish model")
    parser.add_argument("audio_files", nargs="+", help="List of audio files to process")
    args = parser.parse_args()
    main(args.audio_files)
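In case it's useful, here is one way to fill in the "save the embeddings" step. This is only a sketch: it assumes the embeddings come back as a 2-D array (patches × 128 for VGGish), and the save_embeddings helper and the .npy-per-file naming scheme are my own choices, not part of Essentia.

```python
import os

import numpy as np


def save_embeddings(embeddings, audio_path, out_dir="embeddings"):
    """Write the embeddings for one audio file to a .npy file named after it.

    Assumes `embeddings` is array-like (e.g., patches x 128 for VGGish).
    Returns the path of the written file.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(audio_path))[0]
    out_path = os.path.join(out_dir, base + ".npy")
    np.save(out_path, np.asarray(embeddings))
    return out_path
```

It would be called once per iteration of the loop, with the current embeddings and the current filename, so each input produces one embeddings file.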

Then you can divide the filelist you want to process into 8 chunks (e.g., split -n l/8 -d filelist filelist_part).
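If you'd rather do the chunking in Python instead of split, the logic is only a few lines; this is my own self-contained sketch, independent of Essentia:

```python
def chunk(items, n):
    """Split a list into n nearly equal contiguous chunks.

    The first len(items) % n chunks receive one extra element, so the
    chunk sizes never differ by more than one.
    """
    k, r = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

For example, chunk(filelist, 8) gives one sublist per GPU, which you can write out to per-GPU filelists or pass directly to 8 worker processes.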

Finally, you can launch one script per GPU:

CUDA_VISIBLE_DEVICES=0 python extract_embeddings.py $(< filelist_part00)
...
CUDA_VISIBLE_DEVICES=7 python extract_embeddings.py $(< filelist_part07)

palonso avatar May 09 '24 17:05 palonso

Thanks, yes. I actually realized I could do something similar: chunk my data into one chunk per GPU (8) and run a separate serial process for each GPU. Works well. (I also used batchSize=-1, which I think helps optimize a bit, though I'm not totally sure about that one.)

jbm-composer avatar May 09 '24 18:05 jbm-composer