python-audio-effects
Slow performance on a relatively small dataset (numpy arrays)
When trying to apply 10 reverb presets to 30 2-second waveform samples (represented as numpy arrays), the 300 sox calls take an unreasonable amount of time (close to an hour).
Is there any way to make batch effect application faster, other than by calling AudioEffectsChain() on each numpy array separately? Ideally, it would be great if this could scale to significantly more data than this.
I'd love it if we could make this happen! Could you share what reverb settings you're using? sox is pretty fast for me, so I'm having a hard time figuring out why it's so slow in your pipeline.
import multiprocessing

from librosa.util import example_audio_file
from pysndfx.dsp import AudioEffectsChain

apply_audio_effects = AudioEffectsChain()\
    .highshelf()\
    .reverb()\
    .phaser()\
    .delay()\
    .lowshelf()

%%timeit
with multiprocessing.Pool() as pool:
    pool.map(apply_audio_effects, [example_audio_file()] * 300)
21.3 s ± 266 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If I got this right, that's roughly 5 hours of 44.1 kHz / 16-bit stereo audio processed in roughly 20 seconds. I don't think sox is smart enough to cache and reuse the processed audio when it's started in the multiprocessing pool, but we should double-check that.
Finally, sox is single-threaded by default. I tried setting --multi-threaded and running the tests, but that made it a lot slower (weirdly?).
I am also having problems with speed. I am training a speech recognition model (using pairs of [sound, text]). Since your code works well in Python (and particularly on numpy arrays), I started using it for data augmentation. So, every time a sound is about to be fed to my network, I pass it through a function like this one:
import random

from pysndfx import AudioEffectsChain

def perform_aug(tsound, sr):
    aug_fx = AudioEffectsChain()
    if random.random() < 0.5:
        aug_fx.speed(random.uniform(0.9, 1.1))
    if random.random() < 0.9:
        aug_fx.tempo(random.uniform(0.8, 1.2))
    if random.random() < 0.9:
        aug_fx.pitch(random.uniform(-200, 200))
    if random.random() < 0.2:
        aug_fx.highshelf()
    if random.random() < 0.2:
        aug_fx.lowshelf()
    if random.random() < 0.2:
        aug_fx.highpass(random.uniform(200, 400))
    if random.random() < 0.2:
        aug_fx.lowpass(random.uniform(200, 400))
    if random.random() < 0.5:
        aug_fx.reverb(random.uniform(10, 50))
    out = aug_fx(tsound)
    return out, aug_fx
This works pretty well, except that the training runs like 10x slower. How do you think I could make it faster?
The key to getting good performance is to augment the next batch while training on the current one. You could do this with either multiprocessing.Queue or async/await. If you're working in TensorFlow, I'd look into tf.data:
import tensorflow as tf

from pysndfx import AudioEffectsChain

batch_size = 32  # whatever your training batch size is

def load(x):
    f = lambda x, i: AudioEffectsChain().pitch(i[0])(x.decode())
    i = tf.random_uniform([1], -50.0, 50.0)
    x = tf.py_func(f, [x, i], tf.float32)
    return x

dataset = (tf.data.Dataset
           .list_files('*.mp3')
           .map(load, num_parallel_calls=batch_size)
           .padded_batch(batch_size, [None, None])
           .prefetch(1))
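Outside TensorFlow, the same overlap can be sketched with a background thread and a bounded queue; since sox runs as a separate process, a plain Python thread is enough to hide the augmentation time. This is only a minimal sketch: `augment` here is a hypothetical stand-in for whatever effect chain you actually apply (it just doubles each sample so the example is self-contained).

```python
import queue
import threading

def augment(batch):
    # Stand-in for the real effect chain (e.g. an AudioEffectsChain call);
    # it just doubles each sample so the example is self-contained.
    return [x * 2 for x in batch]

def producer(batches, out_q):
    # Augment batches in the background and hand them to the trainer.
    for batch in batches:
        out_q.put(augment(batch))
    out_q.put(None)  # sentinel: no more batches

batches = [[1, 2], [3, 4], [5, 6]]
q = queue.Queue(maxsize=2)  # bounded, so augmentation stays one step ahead
threading.Thread(target=producer, args=(batches, q), daemon=True).start()

augmented = []
while (item := q.get()) is not None:
    # "Train" on the current batch while the next one is being augmented.
    augmented.append(item)
```

The bounded queue is what keeps memory in check: the producer blocks once it is a couple of batches ahead of the trainer.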
Without knowing more about your particular program it's hard to give good advice though!
I am actually using DeepSpeech from Mozilla (https://github.com/mozilla/DeepSpeech), which is implemented over TensorFlow.
Right now, I am just performing data augmentation right before DeepSpeech extracts the features from the audio, specifically on this line (https://github.com/mozilla/DeepSpeech/blob/master/util/audio.py#L67).
Considering how it is implemented (loading the audio when constructing the batch), I don't know if I am going to be able to apply your suggestion.
PS: I do believe DeepSpeech uses parallelized units to overlap loading with training.
Yes, it looks like they're doing asynchronous prefetching up to the next minibatch. Have you tried raising the thread count?
I've tried with threads_per_queue=8, and it improved speed by about 20-25%. Nonetheless, it is still much slower than running without augmentation.
I do not believe the augmentations themselves are that computationally expensive; I'd expect them to cost less than reading the audio file from disk. Maybe the random number generator, or the way SoX is invoked, is not optimal. I don't know.
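One way to check that suspicion is to measure the fixed per-call cost of spawning a process, since every effect chain launches a fresh sox process. A rough sketch, using the Python interpreter itself as a stand-in child process (an assumption, in case sox isn't on the path):

```python
import subprocess
import sys
import time

N = 10
start = time.perf_counter()
for _ in range(N):
    # Each iteration pays the full process startup cost,
    # just as each sox invocation does.
    subprocess.run([sys.executable, "-c", "pass"], check=True)
per_call = (time.perf_counter() - start) / N
print(f"process startup overhead: {per_call * 1000:.1f} ms per call")
```

If this fixed overhead times the number of augmented utterances accounts for much of the wall time, then amortizing it (doing more work per process) is the fix rather than tuning the effects themselves.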
But thanks for the help; even if a bit slower, the augmentations improved our model's accuracy 😄