DeepSpeech
online mix noise audio data in training step
Mixing noise into the training files before runtime can make the data monotonous, but mixing at runtime can be very slow if we read a noise file from disk for every training row (for example, on an HDD, mixing one audio file takes roughly 100 times longer than freq_time_mask does).
To reduce the online mixing time, I use a separate tf.data.Dataset to cache the decoded noise audio arrays and then mix them into the training data.
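Purely as an illustration of the idea (not the PR's actual code), here is a minimal sketch of keeping decoded noise in a second, cached tf.data.Dataset and zipping it with the speech dataset; decode_wav_fn, the file lists, and the fixed 0.1 gain are placeholders:

```python
import tensorflow as tf

def mix_speech_with_noise(speech_files, noise_files, decode_wav_fn):
    # Decode every noise file once, cache the float32 arrays, and repeat
    # them forever so one noise sample is available per training row.
    noise_ds = (tf.data.Dataset.from_tensor_slices(noise_files)
                .map(decode_wav_fn)
                .cache()        # keep decoded noise in memory (or on disk)
                .shuffle(1024)
                .repeat())

    speech_ds = tf.data.Dataset.from_tensor_slices(speech_files).map(decode_wav_fn)

    def mix(speech, noise):
        # Cut/pad the noise to the speech length, then add it with a small gain.
        speech_len = tf.shape(speech)[0]
        noise = noise[:speech_len]
        noise = tf.pad(noise, [[0, speech_len - tf.shape(noise)[0]]])
        return speech + 0.1 * noise

    return tf.data.Dataset.zip((speech_ds, noise_ds)).map(mix)
```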
usage:
python -u DeepSpeech.py --noshow_progressbar \
--train_files data/ldc93s1/ldc93s1.csv \
--test_files data/ldc93s1/ldc93s1.csv \
--train_batch_size 1 \
--test_batch_size 1 \
--n_hidden 200 \
--epochs 200 \
--checkpoint_dir <checkpoint_dir> \
--audio_aug_mix_noise_walk_dirs <directory1-contains-wav-files>,<directory2-contains-wav-files>
- Just specify the noise file directories; the process automatically walks each directory recursively and collects the .wav files (but it does not check their sample rate).
- This program assumes the volume of every noise file has already been maximized. To save the cost of balancing each speech/noise pair by volume, it simply attenuates the speech audio by a value between 0 and -10 dB and the noise audio by a value between -25 and -50 dB (see the dB-to-amplitude sketch after this list).
- The augmentation time can be as fast as freq_time_mask.
- --audio_aug_mix_noise_walk_dirs accepts multiple directories, separated by commas.
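For context on what such a dB attenuation means numerically, here is a small illustrative helper (not part of the PR; note that the PR's own snippet later in this thread uses 10 ** (dB / 10) as its ratio):

```python
import numpy as np

def apply_gain_db(samples, gain_db):
    """Scale a float waveform by a gain given in dB.

    Uses the common amplitude convention 10 ** (dB / 20); a negative
    gain_db attenuates the signal (e.g. -10 dB is roughly a 0.316x scale).
    """
    return samples * (10.0 ** (gain_db / 20.0))

# e.g. speech attenuated by a random value in [-10, 0] dB,
#      noise attenuated by a random value in [-50, -25] dB
rng = np.random.default_rng()
speech_gain_db = rng.uniform(-10.0, 0.0)
noise_gain_db = rng.uniform(-50.0, -25.0)
```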
To manually adjust the volume suppression:
python -u DeepSpeech.py \
...
--audio_aug_mix_noise_max_noise_db -25 \
--audio_aug_mix_noise_min_noise_db -50 \
--audio_aug_mix_noise_max_audio_db 0 \
--audio_aug_mix_noise_min_audio_db -10 \
...
- If your noise files are pure non-speech noise, my recommended parameters from experience are --audio_aug_mix_noise_max_noise_db -15, --audio_aug_mix_noise_min_noise_db -25.
- If your noise files contain speakers, like a cocktail party, my recommended parameters from experience are --audio_aug_mix_noise_max_noise_db -30, --audio_aug_mix_noise_min_noise_db -50; otherwise the background voices may drown out the main speaker.
- If you want to cache the noise arrays on local disk, set --audio_aug_mix_noise_cache <your cache path>; otherwise they are cached in memory.
I tested it with the Freesound Dataset Kaggle 2019, which has about 103 h of noise data. Everything worked as intended. I just didn't see a great difference in my training results (using the Voxforge DE dataset). Maybe it's too small.
Did you mean the noise dataset is small, or the Voxforge dataset, comparatively? One suggestion: if you feel the noise dataset is small, you can use rnnoise's dataset (https://people.xiph.org/~jm/demo/rnnoise/rnnoise_contributions.tar.gz).
I meant the Voxforge dataset; it has only around 32 h of speech data.
I think the rnnoise dataset is smaller than the Freesound one (6 vs. 22 GB; I did not find the length in hours).
Also, the rnnoise noise files are in .raw format, while Freesound already uses .wav, so you would need to convert them to wav somehow first.
To use the rnnoise dataset, we have to normalize the volume and convert the frame rate to 16000 manually; many rnnoise files are almost inaudible without volume normalization.
This mix-noise process assumes the volume of every noise file has been maximized, so it does not calculate dBFS to balance the speech/noise volume during processing.
@mychiux413 any idea how this can be done? Should it be an online process?
You should prepare the normalized noise files yourself before training starts.
There is no standard way to normalize volume; I can only offer an example. You can optimize the script yourself, and don't forget to listen to the output audio to make sure everything sounds right.
Notes:
- I use pydub in the example; before pip install pydub, you should install ffmpeg via sudo apt-get install ffmpeg.
- The raw data I downloaded from rnnoise is .raw, for which the frame rate, sample width, and channel count must be specified manually.
- Some rnnoise recordings are almost 5 minutes long, which is unnecessary for online mixing, so the example splits them into chunks of roughly 30 seconds.
- The script targets Python 3.7 (type annotations are used).
usage:
python <python_file.py> --from_dir <directory include rnnoise data> --to_dir <directory to output normalized data>
from __future__ import absolute_import, division, print_function

from pydub import AudioSegment
from multiprocessing import Pool
from functools import partial
import math
import argparse
import sys
import os


def detect_silence(sound: AudioSegment, silence_threshold=-50.0,
                   chunk_size=10) -> (int, int):
    """Return (start_ms, end_ms) of the non-silent part of `sound`."""
    start_trim = 0  # ms
    sound_size = len(sound)
    assert chunk_size > 0  # to avoid infinite loop
    while sound[start_trim:(
            start_trim +
            chunk_size)].dBFS < silence_threshold and start_trim < sound_size:
        start_trim += chunk_size

    end_trim = sound_size
    while sound[(end_trim - chunk_size):end_trim].dBFS < silence_threshold \
            and end_trim > 0:
        end_trim -= chunk_size

    start_trim = min(sound_size, start_trim)
    end_trim = max(0, end_trim)
    return min([start_trim, end_trim]), max([start_trim, end_trim])


def trim_silence_audio(sound: AudioSegment,
                       silence_threshold=-50.0,
                       chunk_size=10) -> AudioSegment:
    start_trim, end_trim = detect_silence(sound, silence_threshold, chunk_size)
    return sound[start_trim:end_trim]


def convert(filename: str, src_dir: str, dst_dirpath: str, dirpath: str,
            normalize: bool, trim_silence: bool, min_duration_seconds: float,
            max_duration_seconds: float):
    if not filename.endswith(('.wav', '.raw')):
        return
    filepath = os.path.join(dirpath, filename)
    if filename.endswith('.wav'):
        sound: AudioSegment = AudioSegment.from_file(filepath)
    else:
        # .raw files carry no header, so the frame rate has to be guessed
        try:
            sound: AudioSegment = AudioSegment.from_raw(filepath,
                                                        sample_width=2,
                                                        frame_rate=44100,
                                                        channels=1)
        except Exception as err:
            print('[retry] {}'.format(err))
            try:
                sound: AudioSegment = AudioSegment.from_raw(filepath,
                                                            sample_width=2,
                                                            frame_rate=48000,
                                                            channels=1)
            except Exception as err:
                print('bypass audio {}, got error: {}'.format(filepath, err))
                return

    try:
        sound = sound.set_frame_rate(16000)
    except Exception as err:
        print('[bypass] {}'.format(err))
        return

    # split long recordings into chunks of at most max_duration_seconds
    n_splits: int = max(
        1, math.floor(sound.duration_seconds / max_duration_seconds))
    chunk_duration_ms = math.ceil(len(sound) / n_splits)
    chunks = []
    for i in range(n_splits):
        end_ms = min((i + 1) * chunk_duration_ms, len(sound))
        chunk = sound[(i * chunk_duration_ms):end_ms]
        chunks.append(chunk)

    for i, chunk in enumerate(chunks):
        dst_path = os.path.join(dst_dirpath, str(i) + '_' + filename)
        if dst_path.endswith('.raw'):
            dst_path = dst_path[:-4] + '.wav'
        if os.path.exists(dst_path):
            print('audio exists: {}'.format(dst_path))
            return
        if normalize:
            chunk = chunk.normalize()
            # very quiet recordings need dynamic range compression first
            if chunk.dBFS < -30.0:
                chunk = chunk.compress_dynamic_range().normalize()
            if chunk.dBFS < -30.0:
                chunk = chunk.compress_dynamic_range().normalize()
        if trim_silence:
            chunk = trim_silence_audio(chunk)
        if chunk.duration_seconds < min_duration_seconds:
            return
        chunk.export(dst_path, format='wav')


def main(src_dir: str,
         dst_dir: str,
         min_duration_seconds: float,
         max_duration_seconds: float,
         normalize=True,
         trim_silence=True):
    assert os.path.exists(src_dir)
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir, exist_ok=False)
    src_dir = os.path.abspath(src_dir)
    dst_dir = os.path.abspath(dst_dir)

    for dirpath, _, filenames in os.walk(src_dir):
        dirpath = os.path.abspath(dirpath)
        # mirror the source directory structure under dst_dir
        dst_dirpath = os.path.join(dst_dir,
                                   dirpath.replace(src_dir, '').lstrip('/'))
        print('converting dirpath: {} -> {}'.format(dirpath, dst_dirpath))
        if not os.path.exists(dst_dirpath):
            os.makedirs(dst_dirpath, exist_ok=False)

        convert_func = partial(convert,
                               src_dir=src_dir,
                               dst_dirpath=dst_dirpath,
                               dirpath=dirpath,
                               normalize=normalize,
                               trim_silence=trim_silence,
                               min_duration_seconds=min_duration_seconds,
                               max_duration_seconds=max_duration_seconds)
        p = Pool()
        p.map(convert_func, filenames)


if __name__ == "__main__":
    PARSER = argparse.ArgumentParser(description='Optimize noise files')
    PARSER.add_argument('--from_dir',
                        help='Convert wav from directory',
                        type=str)
    PARSER.add_argument('--to_dir', help='save wav to directory', type=str)
    PARSER.add_argument('--min_sec',
                        help='min duration seconds of saved file',
                        type=float,
                        default=1.0)
    PARSER.add_argument('--max_sec',
                        help='max duration seconds of saved file',
                        type=float,
                        default=30.0)
    PARSER.add_argument('--normalize',
                        action='store_true',
                        help='Normalize volume, default is true',
                        default=True)
    PARSER.add_argument('--trim',
                        action='store_true',
                        help='Trim silence, default is true',
                        default=True)
    PARAMS = PARSER.parse_args()

    main(PARAMS.from_dir, PARAMS.to_dir, PARAMS.min_sec, PARAMS.max_sec,
         PARAMS.normalize, PARAMS.trim)
Could you add this script to your pull request?
I added a progressbar and a summary to it, feel free to copy it back. The updated code is here: https://github.com/DanBmh/deepspeech-german/blob/master/data/normalize_noise_audio.py
I added bin/normalize_noise_audio.py and made some modifications:
- Removed the typing annotations for environment compatibility.
- Fixed a pylint error and added a warning message for ImportError of tqdm and pydub, because they are not standard packages in requirements.txt.
- Replaced seconds_to_hours() with util/feeding.py::secs_to_hours().
Usage:
python bin/normalize_noise_audio.py --from_dir <directory include noise data> --to_dir <directory to output normalized data>
@mychiux413 Is there any way we can dump the mixed files and see how effective the mixing of noise into the speech files is, just to make sure the mixing is proper?
@alokprasad You're right. In fact, all the augmented audio should be reviewable in the pipeline, even augmentations applied on the spectrogram like pitch/tempo/mask; otherwise we have no basis for tuning proper parameters.
But in TensorFlow's pipeline it's not as simple as with offline augmentation: we have to dump the audio data into TensorBoard via tf.summary.audio. I'm still studying this method and trying to figure out how much refactoring it will require.
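For reference, here is a minimal TF1-style sketch of what such a review hook could look like; the function name and the integration point are hypothetical, and the real refactoring may look quite different:

```python
import tensorflow.compat.v1 as tf

def audio_review_summary(mixed_audio, sample_rate=16000):
    # mixed_audio: 1-D float32 waveform in [-1, 1]
    batched = tf.reshape(mixed_audio, [1, -1])   # [batch=1, samples]
    return tf.summary.audio('augmented_audio', batched, sample_rate, max_outputs=10)

# The returned summary op would then be evaluated and written by the
# existing summary writer, e.g. tf.summary.FileWriter(summary_dir).
```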
@mychiux413 I also tried to save the audio using tf.print's output_stream option inside the augment_noise function:
noise_ratio = tf.math.pow(10.0, choosen_noise_db / 10)
mixed_audio = tf.multiply(audio, audio_ratio) + tf.multiply(mixed_noise, noise_ratio)
# save to wav file
final_pcm = contrib_audio.encode_wav(mixed_audio, 16000)
tf.print(final_pcm, output_stream="file:///tmp/test.wav", summarize=-1)
return mixed_audio
# return tf.multiply(audio, audio_ratio) + tf.multiply(mixed_noise, noise_ratio)
but there are two problems I am facing:
1. I am not able to change the output_stream parameter dynamically, so multiple wav files cannot be saved.
2. The file size keeps growing, so we have to stop training with Ctrl+C after a few steps.
Anyway, when I listen to the audio, I don't think the noise is getting mixed into the speech at all.
@alokprasad I tried tf.print and listened to the audio; it really is augmented. Maybe my default parameters are too conservative (some of the noise data is "speech noise", and I'm not sure what effect it has when mixed too loudly). Also, the process does not augment every time step of the audio; it randomly augments one interval per utterance, and many intervals in the noise files are actually silence.
Don't forget to delete test.wav before each execution, or you will always hear the same output.
Try an extreme example, --audio_aug_mix_noise_max_noise_db=5, --audio_aug_mix_noise_min_noise_db=10, to make sure the noise really is there.
Another tip: you can also try --audio_aug_mix_noise_max_audio_db=10, which simulates an over-boosted microphone.
@mychiux413 "The process will not augment every single audio time step, but just randomly augment an interval for each audio": I think this might not produce good results; I think every interval should be mixed with noise (i.e. the complete file should be mixed with noise).
In fact, it would be good to feed the same audio to the network twice:
- mixed with noise
- without noise
I added an extra column, noise_flag, to the transcript CSV file, with value 0 or 1. An example CSV looks like this:
wav_filename,wav_filesize,transcript,noise_flag
test1.wav,3423,"where are you?",1
test1.wav,3423,"where are you?",0
1 means mix noise and 0 means don't mix noise.
The relevant code changes:
if train_phase and noise_iterator:
    audio = tf.cond(noise_flag > 0,
                    lambda: augment_noise(
                        audio,
                        noise_iterator.get_next(),
                        change_audio_db_max=FLAGS.audio_aug_mix_noise_max_audio_db,
                        change_audio_db_min=FLAGS.audio_aug_mix_noise_min_audio_db,
                        change_noise_db_max=FLAGS.audio_aug_mix_noise_max_noise_db,
                        change_noise_db_min=FLAGS.audio_aug_mix_noise_min_noise_db,
                    ),
                    lambda: audio)
@alokprasad But when uniform() picks a low noise ratio such as -35 dB, the noise is approximately inaudible anyway, so why would we need to keep a "clean audio" copy for each epoch?
I also checked Baidu's DeepSpeech add_noise(): they mix the complete file as you said, so I will modify mine accordingly.
@mychiux413 How are you handling the case where the noise file is shorter than the speech file?
@alokprasad To handle this, I simply repeat the noise file until its duration exceeds the speech file. This can introduce discontinuities into continuous environmental noise (like street noise), but most of the time it doesn't matter, because bin/normalize_noise_audio.py can split noise files into chunks of roughly 30 seconds or longer.
Another reason to repeat is that some point-source noise files (like bells) are short, but they are fine to repeat.
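As an illustration of the repeat-and-cut idea (a sketch only, not the PR's exact code):

```python
import tensorflow as tf

def tile_noise_to_speech(noise, speech_len):
    """Repeat a (possibly shorter) 1-D noise waveform until it covers
    speech_len samples, then cut it to exactly that length."""
    noise_len = tf.shape(noise)[0]
    repeats = tf.cast(
        tf.math.ceil(tf.cast(speech_len, tf.float32) / tf.cast(noise_len, tf.float32)),
        tf.int32)
    tiled = tf.tile(noise, [repeats])   # may create seams in continuous noise
    return tiled[:speech_len]
```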
@mychiux413, I figured that out. By the way, what dBFS do you think would be good for training on rnnoise after normalizing?
@mychiux413 One suggestion: I think it would be better to check the SNR during training, and if it's too bad, dynamically adjust the dBFS so the noise gain isn't so high that it hides the speech signal.
@alokprasad
- For the current version, the parameters --audio_aug_mix_noise_max_noise_db -5, --audio_aug_mix_noise_min_noise_db -35, --audio_aug_mix_noise_max_audio_db 5, --audio_aug_mix_noise_min_audio_db -10 give me a good result.
- Yes, using SNR should be more stable, but for performance I would cache every speech/noise file's dBFS at the beginning, so we don't have to recompute dBFS at every training step to estimate the SNR (see the sketch below).
- Furthermore, I want to add options to mix dev/test noise into the dev/test files. That would indicate how well the model resists noise instead of only testing on clean speech, and also make sure the model doesn't overfit a specific noise environment. (Note: the dev/test noise dirs should be different from the train noise dir.)
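A rough sketch of the dBFS-caching idea mentioned above; the helper names are hypothetical and the PR may compute the balance differently:

```python
from pydub import AudioSegment

def cache_dbfs(wav_paths):
    """Measure each file's average dBFS once, so training steps can
    compute SNR-based gains without re-reading the audio."""
    return {path: AudioSegment.from_file(path).dBFS for path in wav_paths}

def noise_gain_db(speech_dbfs, noise_dbfs, target_snr_db):
    # Gain (in dB) to apply to the noise so that
    # speech_dbfs - (noise_dbfs + gain) equals the requested SNR.
    return speech_dbfs - target_snr_db - noise_dbfs
```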
@mychiux413 Is this the change that calculates the SNR and adjusts the gain? https://github.com/mozilla/DeepSpeech/pull/2622/commits/2269514a9ef676100b46f0c99c0e6a7150feb4dd How is the audio generated? Did you get a chance to check it using tf.print's output_stream option?
What do you think about using a csv file (formatted like the training csv files) as input instead of a directory? I think that way you could:
- use the augmentation features from the training pipeline with your noise pipeline as well
- use speech data instead of noise data (cocktail-party background noise)
- maybe use both of the above pipelines for augmentation
@alokprasad I'm still developing it; there are still some issues, so please do not use that commit yet.
@DanBmh I will make the arguments also accept csv files for the cocktail-party use case.
As for the augmentation pipelines, here are the two data pipelines:
**Current Pipeline**

[noise]       filename -> wav ----------------↴
[train]       filename -> wav -> mixed audio -> spectrogram(aug) -> mfcc(aug) -> input
[tensorboard]                    audio review    approximate audio review

**Noise Aug Pipeline**

[noise]       filename -> wav -> spectrogram(aug) -> mfcc(aug) --↴
[train]       filename -> wav -> spectrogram(aug) -> mfcc(aug) -> mixed mfcc -> input
[tensorboard]
This raises several issues:
- With the Noise Aug Pipeline, we would have to prove that superposition in the audio domain and in the mfcc domain are mathematically equivalent; if anyone knows the answer, please tell us.
- After the spectrogram step, the signal's phase term is lost, so any audio reconstructed for review will sound awful (heavy distortion); this is also a major topic in TTS systems like Tacotron. If we want to dump clean audio for review, the Current Pipeline is the best option.
- Continuing from the last point, we could also augment pitch and tempo directly in the wav stage, but the pitch code would be quite different from spectrogram_augmentations.py and might cost a lot of CPU during the prefetch phase due to the FFT conversion.
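To make the first concern concrete, a tiny NumPy check shows that superposition does not carry over exactly to the power-spectrum domain; the cross term only vanishes in expectation for uncorrelated signals:

```python
import numpy as np

rng = np.random.default_rng(0)
speech = rng.standard_normal(512)
noise = rng.standard_normal(512)

# Power spectra (the quantity the spectrogram/MFCC front-end is built on).
p_mix = np.abs(np.fft.rfft(speech + noise)) ** 2
p_sum = np.abs(np.fft.rfft(speech)) ** 2 + np.abs(np.fft.rfft(noise)) ** 2

# |S+N|^2 = |S|^2 + |N|^2 + 2*Re(S*conj(N)); the cross term makes these differ,
# so mixing after the spectrogram is not equivalent to mixing the waveforms.
print(np.max(np.abs(p_mix - p_sum)))   # clearly non-zero
```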
Update: the audio/noise balance is now determined by specifying the target dBFS and SNR, and csv files are supported for the cocktail-party use case.
- Noise files can now be selected by directories or csv files with --audio_aug_mix_noise_train_dirs_or_files, --audio_aug_mix_noise_dev_dirs_or_files, --audio_aug_mix_noise_test_dirs_or_files, to validate how well the model resists noise.
- The final audio volume is chosen at random between --audio_aug_mix_noise_min_audio_dbfs and --audio_aug_mix_noise_max_audio_dbfs.
- The final noise volume is determined relative to the target audio volume by --audio_aug_mix_noise_min_snr_db and --audio_aug_mix_noise_max_snr_db.
- Use --audio_aug_mix_noise_limit_audio_peak_dbfs and --audio_aug_mix_noise_limit_noise_peak_dbfs to protect against drastic volume swings: if the gain is based only on the average dBFS of a file, the peaks of the signal can be drastically over-boosted. (A rough sketch of this gain logic follows the example below.)
- Use --augmentation_review_audio_steps to listen to the augmented audio in TensorBoard. Note that TensorBoard can only show 10 audio clips per panel (I don't know how to change that) and always normalizes the volume of the dumped audio, no matter how quiet it is. If --summary_dir is not specified, the augmented audio can be reviewed in the default directory:
tensorboard --logdir ~/.share/local/deepspeech/summaries/
- Do NOT use --augmentation_review_audio_steps together with spectrogram augmentation in this commit: this branch was based on broken spectrogram augmentation code, so the process will not run correctly.
- An extreme example to make sure your audio is mixed:
python -u DeepSpeech.py --noshow_progressbar \
--train_files data/ldc93s1/ldc93s1.csv \
--dev_files data/ldc93s1/ldc93s1.csv \
--test_files data/ldc93s1/ldc93s1.csv \
--n_hidden 100 \
--audio_aug_mix_noise_train_dirs_or_files <directory-path1>,<csv-path1>,<directory-path2> \
--audio_aug_mix_noise_dev_dirs_or_files <directory-path1>,<csv-path1>,<directory-path2> \
--audio_aug_mix_noise_test_dirs_or_files <directory-path1>,<csv-path1>,<directory-path2> \
--audio_aug_mix_noise_min_snr_db 0.1 \
--audio_aug_mix_noise_max_snr_db 0.2 \
--audio_aug_mix_noise_min_audio_dbfs -0.2 \
--audio_aug_mix_noise_max_audio_dbfs -0.1 \
--audio_aug_mix_noise_limit_audio_peak_dbfs 100 \
--audio_aug_mix_noise_limit_noise_peak_dbfs 100 \
--augmentation_review_audio_steps 10 \
"$@"
- The default parameters are set for a non-speech-noise environment; if you want to train for a cocktail-party environment, try decreasing --audio_aug_mix_noise_min_snr_db and --audio_aug_mix_noise_max_snr_db.
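For readers trying to reason about these flags, here is my reading of the gain logic as a rough Python sketch; the PR's actual formula may differ, and the function and variable names are illustrative only:

```python
def mixing_gains_db(speech_dbfs, speech_peak_dbfs, noise_dbfs, noise_peak_dbfs,
                    target_audio_dbfs, target_snr_db,
                    limit_audio_peak_dbfs, limit_noise_peak_dbfs):
    """Return (speech_gain_db, noise_gain_db).

    The speech is moved toward target_audio_dbfs, the noise toward
    (target_audio_dbfs - target_snr_db), and both gains are clipped so
    the per-file peak never exceeds the given peak limits.
    """
    speech_gain = target_audio_dbfs - speech_dbfs
    speech_gain = min(speech_gain, limit_audio_peak_dbfs - speech_peak_dbfs)

    target_noise_dbfs = target_audio_dbfs - target_snr_db
    noise_gain = target_noise_dbfs - noise_dbfs
    noise_gain = min(noise_gain, limit_noise_peak_dbfs - noise_peak_dbfs)
    return speech_gain, noise_gain
```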
What do you think about two noise pipelines, one for noise and one for cocktail-party speech?
I thought about mixing my files together into one pipeline, but I think it would be better to have separate mixing parameters for noise and speech, mostly because you can mix the noise much louder than the speech while keeping the text understandable.
Yes, that makes sense, I will try it, though there would be twice as many arguments as in the previous version. What about specifying the number of sub-speakers for each speech sample? Would that be helpful for your experiments?
Do you mean augmenting with not just one but multiple background speech or noise files at once? If you don't think it's too complicated, this is an interesting idea; it would make the background noise even more realistic. In this case I would suggest making the number not fixed but random, with an upper bound, to simulate different environments.
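To illustrate the multiple-background idea (purely a sketch, not part of the PR; the gain range and helper name are made up):

```python
import numpy as np

def mix_multiple_noises(speech, noise_pool, max_sources=3, rng=None):
    """Overlay a random number (1..max_sources) of noise clips on `speech`.
    All signals are float NumPy arrays of the same length."""
    rng = rng or np.random.default_rng()
    n_sources = min(int(rng.integers(1, max_sources + 1)), len(noise_pool))
    mixed = speech.copy()
    for idx in rng.choice(len(noise_pool), size=n_sources, replace=False):
        gain_db = rng.uniform(-30.0, -10.0)          # illustrative range
        mixed += noise_pool[idx] * (10.0 ** (gain_db / 20.0))
    return mixed
```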
Here’s a question: is it necessary to run augmentation on every epoch? It seems like augmentation is probably more valuable as the model nears convergence. I wonder if you could balance out the performance hit by not augmenting the first x epochs, when the model still has a high WER.
Here are my recent experiment results (still in progress). I trained every model for 20 epochs with different parameters:
- noise files: rnnoise, pointsources noise
- train dataset: librivox clean-100.csv, clean-300.csv, other-500.csv
- test dataset: test-clean.csv
- the loss records are from the final step (epoch = 19)
- in addition, I also mixed zh-tw speech into librivox and tested the WER
Name | min_audio_dbfs | max_audio_dbfs | min_snr_db | max_snr_db | limit_audio_peak_dbfs | limit_noise_peak_dbfs | train loss | dev loss | test loss | test wer | test loss (mix TW speech) | test wer (mix TW speech) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline (No Augmentation) | | | | | | | 27.685342 | 24.046401 | 23.756416 | 0.137232 | 121.442734 | 0.454246 |
Default mix noise | 0 | -35 | 3 | 30 | 7 | 3 | 69.323678 | 21.669104 | 21.383959 | 0.112958 | 60.703743 | 0.270337 |
speech non over boosted | 0 | -35 | 3 | 30 | 0 | 3 | 64.432057 | 21.491052 | 21.344168 | 0.11471 | 60.352631 | 0.261519 |
noise non over boosted | 0 | -35 | 3 | 30 | 7 | 0 | 66.458655 | 21.09868 | 21.09868 | 0.111596 | 62.270283 | 0.269928 |
Wide speech volume | 0 | -45 | 3 | 30 | 7 | 3 | 67.366901 | 21.060449 | 20.68895 | 0.116559 | 59.696766 | 0.2673 |
The results show:
- Whatever the noise parameters are, the test WER is always better than the "No Aug" model's.
- Robustness to noise (the "test wer (mix TW speech)" column) improves markedly with mixed-noise training.
- Don't be misled by the higher training loss when mixing with noise; the augmented data covers a much larger space than the clean data.
- Inspecting the noise-mix training, the parameters do involve trade-offs: if we want to improve cocktail-party speech, we may lose some accuracy on the clean test. In my opinion, skipping the first x epochs to emphasize the clean environment should be roughly equivalent to increasing the max SNR, so the noisy test would then be worse.
So my conclusions are:
- Tune the noise parameters according to your target application environment; that should be equivalent to tuning "skip first x epochs".
- Of course I will also try your idea if I have free resources later.
@mychiux413 How are you generating the test samples? Is it natural voice recorded with a noisy background, or did you mix clean speech with noise and use that as the test wav?