
Slow performance on a relatively small dataset (numpy arrays)

Open straygar opened this issue 6 years ago • 6 comments

When applying 10 reverb presets to 30 two-second waveform samples (represented as numpy arrays), the 300 sox calls take an unreasonable amount of time (close to an hour).

Is there any way to make batch effect application faster, other than by calling AudioEffectsChain() on each numpy array separately? Ideally, it would be great if this could scale to significantly more data than this.

straygar avatar Mar 20 '18 11:03 straygar

I'd love it if we could make this happen! Could you share which reverb settings you're using? sox is pretty fast for me, so I have a hard time figuring out why it's so terribly slow in your pipeline.

import multiprocessing

from librosa.util import example_audio_file
from pysndfx.dsp import AudioEffectsChain

# Build one effects chain and reuse it for every file.
apply_audio_effects = AudioEffectsChain()\
    .highshelf()\
    .reverb()\
    .phaser()\
    .delay()\
    .lowshelf()

%%timeit
# Each call spawns its own sox process, so a pool parallelizes them.
with multiprocessing.Pool() as pool:
    pool.map(apply_audio_effects, [example_audio_file()] * 300)

21.3 s +- 266 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

If I got this right, this is roughly 5 hours of 44.1 kHz / 16-bit stereo audio processed in roughly 20 seconds. I don't think sox is smart enough to cache and reuse the processed audio when it's started in the multiprocessing pool but we should double check that.

Finally, sox is single threaded by default. I tried setting --multi-threaded and running the tests but that made it a lot slower (weirdly?).

carlthome avatar Mar 21 '18 16:03 carlthome

I am also having problems with speed. I am training a speech recognition model (using pairs of [sound, text]). Since your code works well in Python (and in particular on numpy arrays), I started using it for data augmentation. So every time a sound is about to be fed to my network, I pass it through a function like this one:

def perform_aug(tsound, sr):
    aug_fx = AudioEffectsChain()

    if random.random() < 0.5:
        aug_fx.speed(random.uniform(0.9, 1.1))
    if random.random() < 0.9:
        aug_fx.tempo(random.uniform(0.8, 1.2))
    if random.random() < 0.9:
        aug_fx.pitch(random.uniform(-200, 200))
    if random.random() < 0.2:
        aug_fx.highshelf()
    if random.random() < 0.2:
        aug_fx.lowshelf()
    if random.random() < 0.2:
        aug_fx.highpass(random.uniform(200, 400))
    if random.random() < 0.2:
        aug_fx.lowpass(random.uniform(200, 400))
    if random.random() < 0.5:
        aug_fx.reverb(random.uniform(10, 50))

    out = aug_fx(tsound)
    return out, aug_fx

This works pretty well, except that training runs about 10x slower. How do you think I could make it faster?

bernardohenz avatar Apr 03 '18 19:04 bernardohenz

The key to good performance is to augment the next batch while training on the current one.

You could do this with either multiprocessing.Queue or async/await. If you're working in TensorFlow, I'd look into tf.data:

from pysndfx import AudioEffectsChain
import tensorflow as tf

batch_size = 32  # example value; tune to taste


def load(x):
    # x is a filename tensor; decode it and apply a random pitch shift.
    f = lambda x, i: AudioEffectsChain().pitch(i[0])(x.decode())
    i = tf.random_uniform([1], -50.0, 50.0)
    x = tf.py_func(f, [x, i], tf.float32)
    return x


dataset = (tf.data.Dataset
    .list_files('*.mp3')
    .map(load, num_parallel_calls=batch_size)
    .padded_batch(batch_size, [None, None])
    .prefetch(1))

Without knowing more about your particular program it's hard to give good advice though!

carlthome avatar Apr 05 '18 13:04 carlthome

I am actually using DeepSpeech from Mozilla (https://github.com/mozilla/DeepSpeech), which is implemented over TensorFlow.

Right now, I am just performing data augmentation right before DeepSpeech extracts the features from the audio; to be specific, on this line (https://github.com/mozilla/DeepSpeech/blob/master/util/audio.py#L67).

Considering how it is implemented (loading the audio when constructing the batch), I don't know if I am going to be able to apply your suggestion.

PS: I do believe DeepSpeech uses parallelized units to overlap loading with training.

bernardohenz avatar Apr 05 '18 14:04 bernardohenz

Yes, it looks like they're doing asynchronous prefetching up to the next minibatch. Have you tried raising this?

carlthome avatar Apr 05 '18 14:04 carlthome

I've tried with threads_per_queue=8, and it improved speed by roughly 20-25%. Nonetheless, it is still much slower than running without augmentation.

I do not believe the augmentations themselves are so computationally expensive; I'd expect them to cost less than reading the audio file from disk. Maybe the random number generator, or the way SoX is invoked, is not optimal; I don't know.
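A quick stdlib-only check suggests the random number generator is not the bottleneck: the draws in perform_aug cost microseconds per call, while forking a sox process costs milliseconds. A minimal timing sketch:

```python
import random
import timeit

# One perform_aug call draws at most 8 coin flips plus 8 uniforms; time that.
def draw_params():
    return ([random.random() for _ in range(8)]
            + [random.uniform(-200, 200) for _ in range(8)])

per_call = timeit.timeit(draw_params, number=10000) / 10000
print(f"~{per_call * 1e6:.2f} microseconds per call")
```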

But thanks for the help! Even if a little slower, the augmentations improved our model's accuracy 😄

bernardohenz avatar Apr 06 '18 13:04 bernardohenz