
online mix noise audio data in training step

Open mychiux413 opened this issue 5 years ago • 42 comments

Mixing noise into the training files before runtime leads to data monotony, but mixing noise at runtime can cause very poor performance if we read each noise file from disk to augment every training row (e.g. on an HDD, mixing in one audio file takes almost 100 times longer than freq_time_mask does).

To reduce the online mixing time, I use a separate tf.Dataset to cache the noise audio arrays and then mix them into the training data.

usage:

python -u DeepSpeech.py --noshow_progressbar \
  --train_files data/ldc93s1/ldc93s1.csv \
  --test_files data/ldc93s1/ldc93s1.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 200 \
  --epochs 200 \
  --checkpoint_dir <checkpoint_dir> \
  --audio_aug_mix_noise_walk_dirs <directory1-contains-wav-files>,<directory2-contains-wav-files>
  • Just specify the noise file directory; the process will automatically walk through the whole directory recursively and collect .wav files (but it does not check the sample rate).
  • This program assumes the volume of every noise file has already been maximized. To save the cost of computing the speech/noise volume balance for each sample, it simply attenuates the speech audio by a value between 0 and -10 dB and the noise audio by a value between -25 and -50 dB (see the sketch after this list).
  • The augmentation time can be as fast as freq_time_mask.
  • --audio_aug_mix_noise_walk_dirs can take multiple directories, comma separated.
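
For illustration, here is a minimal NumPy sketch of the attenuation idea above. It is not the PR's actual TensorFlow code (which works on tensors inside the tf.data pipeline); the function name and the /20 amplitude convention are my own:

import numpy as np

def mix_speech_with_noise(speech, noise, rng=np.random):
    # Rough sketch: attenuate speech by 0..-10 dB and noise by -25..-50 dB, then add.
    speech_db = rng.uniform(-10.0, 0.0)
    noise_db = rng.uniform(-50.0, -25.0)
    speech_gain = 10.0 ** (speech_db / 20.0)   # dB -> linear amplitude ratio
    noise_gain = 10.0 ** (noise_db / 20.0)
    noise = np.resize(noise, len(speech))      # repeat/crop the noise to the speech length
    return speech * speech_gain + noise * noise_gain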

To manually adjust the loudness suppression:

python -u DeepSpeech.py \
...
--audio_aug_mix_noise_max_noise_db -25 \
--audio_aug_mix_noise_min_noise_db -50 \
--audio_aug_mix_noise_max_audio_db 0 \
--audio_aug_mix_noise_min_audio_db -10 \
...
  • If your noise files are pure non-speaker noise, the parameters that worked for me are --audio_aug_mix_noise_max_noise_db -15 and --audio_aug_mix_noise_min_noise_db -25.
  • If your noise files contain speakers, like a cocktail party, the parameters that worked for me are --audio_aug_mix_noise_max_noise_db -30 and --audio_aug_mix_noise_min_noise_db -50; otherwise the background voices can drown out the main speaker.
  • If you want to cache the audio arrays on local disk, set --audio_aug_mix_noise_cache <your cache path>; otherwise they are cached in memory.

mychiux413 avatar Dec 31 '19 09:12 mychiux413


I tested it with the Freesound Dataset Kaggle 2019, which has about 103h of noise data. Everything worked as intended. However, I didn't see a great difference in my training results (using the Voxforge DE dataset). Maybe it's too small.

DanBmh avatar Feb 05 '20 17:02 DanBmh

I tested it with the Freesound Dataset Kaggle 2019, which has about 103h of noise data. Everything worked as intended. However, I didn't see a great difference in my training results (using the Voxforge DE dataset). Maybe it's too small.

Did you mean the noise dataset is small, or the Voxforge dataset, comparatively? One suggestion: if you feel the noise dataset is small, you can use rnnoise's dataset (https://people.xiph.org/~jm/demo/rnnoise/rnnoise_contributions.tar.gz).

alokprasad avatar Feb 06 '20 08:02 alokprasad

I did mean the Voxforge dataset. It has only around 32h of speech data.

I think the rnnoise dataset is smaller than the Freesound one (6 vs. 22 GB; I did not find the length in hours).

Also, the rnnoise noise files are in .raw format while Freesound already provides .wav, so you would need to convert them to wav somehow first.

DanBmh avatar Feb 06 '20 12:02 DanBmh

To use the rnnoise dataset, we have to normalize the volume and convert the frame rate to 16000 manually, and many of the rnnoise files are almost silent without volume normalization. This noise-mixing process assumes the volume of every single noise file has been maximized, so it does not calculate dBFS to balance the speech/noise volume during processing.

mychiux413 avatar Feb 12 '20 09:02 mychiux413

@mychiux413 Any idea how this can be done? Should it be an online process?

alokprasad avatar Feb 12 '20 12:02 alokprasad

@mychiux413 Any idea how this can be done? Should it be an online process?

You should prepare the normalized noise files yourself before training starts.

There is no standard way to normalize volume. I can only offer an example; you can optimize the script yourself, and don't forget to listen to the output audio to make sure everything sounds right.

notice:

  1. I use pydub in the example; before pip install pydub, you should install ffmpeg with sudo apt-get install ffmpeg.
  2. The raw data I downloaded from rnnoise is .raw, for which the frame rate, sample width, and channels must be specified manually.
  3. Some rnnoise files are almost 5 minutes long, which is unnecessary for online mixing, so the example splits them into chunks of around 30 seconds.
  4. The script targets Python 3.7 (typing supported).

usage:

python <python_file.py> --from_dir <directory include rnnoise data> --to_dir <directory to output normalized data>
from __future__ import absolute_import, division, print_function
from pydub import AudioSegment
from multiprocessing import Pool
from functools import partial
import math
import argparse
import sys
import os


def detect_silence(sound: AudioSegment, silence_threshold=-50.0,
                   chunk_size=10) -> (int, int):
    start_trim = 0  # ms
    sound_size = len(sound)
    assert chunk_size > 0  # to avoid infinite loop
    while sound[start_trim:(
            start_trim +
            chunk_size)].dBFS < silence_threshold and start_trim < sound_size:
        start_trim += chunk_size

    end_trim = sound_size
    while sound[(end_trim - chunk_size):end_trim].dBFS < silence_threshold \
            and end_trim > 0:
        end_trim -= chunk_size

    start_trim = min(sound_size, start_trim)
    end_trim = max(0, end_trim)

    return min([start_trim, end_trim]), max([start_trim, end_trim])


def trim_silence_audio(sound: AudioSegment,
                       silence_threshold=-50.0,
                       chunk_size=10) -> AudioSegment:
    start_trim, end_trim = detect_silence(sound, silence_threshold, chunk_size)
    return sound[start_trim:end_trim]


def convert(filename: str, src_dir: str, dst_dirpath: str, dirpath: str,
            normalize: bool, trim_silence: bool, min_duration_seconds: float,
            max_duration_seconds: float):
    if not filename.endswith(('.wav', '.raw')):
        return
    filepath = os.path.join(dirpath, filename)
    if filename.endswith('.wav'):
        sound: AudioSegment = AudioSegment.from_file(filepath)
    else:
        try:
            sound: AudioSegment = AudioSegment.from_raw(filepath,
                                                        sample_width=2,
                                                        frame_rate=44100,
                                                        channels=1)
        except Exception as err:
            print('[retry] {}'.format(err))
            try:
                sound: AudioSegment = AudioSegment.from_raw(filepath,
                                                            sample_width=2,
                                                            frame_rate=48000,
                                                            channels=1)
            except Exception as err:
                print('bypass audio {}, got error: {}'.format(filepath, err))
                return
        try:
            sound = sound.set_frame_rate(16000)
        except Exception as err:
            print('[bypass] {}'.format(err))
            return

    n_splits: int = max(
        1, math.floor(sound.duration_seconds / max_duration_seconds))
    chunk_duration_ms = math.ceil(len(sound) / n_splits)
    chunks = []
    for i in range(n_splits):
        end_ms = min((i + 1) * chunk_duration_ms, len(sound))
        chunk = sound[(i * chunk_duration_ms):end_ms]
        chunks.append(chunk)
    for i, chunk in enumerate(chunks):
        dst_path = os.path.join(dst_dirpath, str(i) + '_' + filename)
        if dst_path.endswith('.raw'):
            dst_path = dst_path[:-4] + '.wav'
        if os.path.exists(dst_path):
            print('audio exists: {}'.format(dst_path))
            return
        if normalize:
            chunk = chunk.normalize()
            if chunk.dBFS < -30.0:
                chunk = chunk.compress_dynamic_range().normalize()
            if chunk.dBFS < -30.0:
                chunk = chunk.compress_dynamic_range().normalize()
        if trim_silence:
            chunk = trim_silence_audio(chunk)

        if chunk.duration_seconds < min_duration_seconds:
            return
        chunk.export(dst_path, format='wav')


def main(src_dir: str,
         dst_dir: str,
         min_duration_seconds: float,
         max_duration_seconds: float,
         normalize=True,
         trim_silence=True):
    assert os.path.exists(src_dir)
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir, exist_ok=False)
    src_dir = os.path.abspath(src_dir)
    dst_dir = os.path.abspath(dst_dir)

    # n_data = 0
    for dirpath, _, filenames in os.walk(src_dir):
        dirpath = os.path.abspath(dirpath)
        dst_dirpath = os.path.join(dst_dir,
                                   dirpath.replace(src_dir, '').lstrip('/'))
        print('converting dirpath: {} -> {}'.format(dirpath, dst_dirpath))
        if not os.path.exists(dst_dirpath):
            os.makedirs(dst_dirpath, exist_ok=False)

        convert_func = partial(convert,
                               src_dir=src_dir,
                               dst_dirpath=dst_dirpath,
                               dirpath=dirpath,
                               normalize=normalize,
                               trim_silence=trim_silence,
                               min_duration_seconds=min_duration_seconds,
                               max_duration_seconds=max_duration_seconds)
        p = Pool()
        p.map(convert_func, filenames)


if __name__ == "__main__":
    PARSER = argparse.ArgumentParser(description='Optimize noise files')
    PARSER.add_argument('--from_dir',
                        help='Convert wav from directory',
                        type=str)
    PARSER.add_argument('--to_dir', help='save wav to directory', type=str)
    PARSER.add_argument('--min_sec',
                        help='min duration seconds of saved file',
                        type=float,
                        default=1.0)
    PARSER.add_argument('--max_sec',
                        help='max duration seconds of saved file',
                        type=float,
                        default=30.0)
    PARSER.add_argument('--normalize',
                        action='store_true',
                        help='Normalize volume, default is true',
                        default=True)
    PARSER.add_argument('--trim',
                        action='store_true',
                        help='Trim silence, default is true',
                        default=True)
    PARAMS = PARSER.parse_args()

    main(PARAMS.from_dir, PARAMS.to_dir, PARAMS.min_sec, PARAMS.max_sec,
         PARAMS.normalize, PARAMS.trim)

mychiux413 avatar Feb 17 '20 07:02 mychiux413

There is no standard way to normalize volume. I can only offer an example; you can optimize the script yourself, and don't forget to listen to the output audio to make sure everything sounds right.

Could you add this script to your pull request?

I added a progressbar and a summary to it, feel free to copy it back. The updated code is here: https://github.com/DanBmh/deepspeech-german/blob/master/data/normalize_noise_audio.py

DanBmh avatar Feb 17 '20 14:02 DanBmh

I added bin/normalize_noise_audio.py, and did some modifications:

  1. Removed typing for environment compatibility
  2. Fixed pylint errors and added a warning message for ImportError of tqdm & pydub, because they are not standard packages in requirements.txt
  3. Replaced seconds_to_hours() with util/feeding.py::secs_to_hours()

Usage:

python bin/normalize_noise_audio.py --from_dir <directory include noise data> --to_dir <directory to output normalized data>

mychiux413 avatar Feb 19 '20 02:02 mychiux413

@mychiux413 Is there any way we can dump the mixed files and check how effective the mixing of noise into the speech files is? Just to make sure the mixing is proper.

alokprasad avatar Feb 20 '20 07:02 alokprasad

@alokprasad You're right. In fact, all the augmented audio should be reviewable in the pipeline, even augmentations done on the spectrogram like pitch/tempo/mask, or we have no basis for tuning the proper parameters. But in TensorFlow's pipeline it is not as simple as with offline augmentation: we have to dump the audio data into TensorBoard via tf.summary.audio. I'm still studying this method and trying to figure out how much refactoring it will require.
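
For reference, here is a minimal self-contained sketch of what dumping mixed audio to TensorBoard could look like (TF 1.x graph mode, which is what DeepSpeech used at the time; the tensor and directory names are only illustrative, not the PR's actual code):

import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Dummy stand-in for one mixed utterance from the augmentation pipeline:
# shape [frames, channels], float32 in [-1, 1], 16 kHz.
mixed_audio = tf.constant(np.random.uniform(-0.1, 0.1, size=(16000, 1)), dtype=tf.float32)

# tf.summary.audio expects [batch, frames, channels].
audio_summary = tf.summary.audio('augmented_audio',
                                 tf.expand_dims(mixed_audio, 0),
                                 sample_rate=16000,
                                 max_outputs=10)

writer = tf.summary.FileWriter('/tmp/ds_audio_review')
with tf.Session() as sess:
    writer.add_summary(sess.run(audio_summary), global_step=0)
writer.close()

The resulting events file can then be opened with tensorboard --logdir /tmp/ds_audio_review and listened to in the Audio tab.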

mychiux413 avatar Feb 20 '20 09:02 mychiux413

@mychiux413 I also tried to save the audio using tf.print's output_stream option in the following function:

"def augment_noise"
    noise_ratio = tf.math.pow(10.0, choosen_noise_db / 10)
    mixed_audio = tf.multiply(audio, audio_ratio) + tf.multiply(mixed_noise, noise_ratio)
    #save to wav file              
    final_pcm = contrib_audio.encode_wav(mixed_audio,16000)
    tf.print(final_pcm,output_stream="file:///tmp/test.wav",summarize=-1)
    return mixed_audio
    #return tf.multiply(audio, audio_ratio) + tf.multiply(mixed_noise, noise_ratio)

but I am facing two problems:

  1. I am not able to change the output_stream parameter dynamically so that multiple wav files are saved.
  2. The file size keeps growing, so we have to stop training with Ctrl+C after a few steps.

Anyway, when I listen to the audio, I don't think noise is getting mixed into the speech at all.

alokprasad avatar Feb 20 '20 12:02 alokprasad

@alokprasad I tried tf.print and listened to the audio; it really is augmented. Maybe my default parameters are too conservative (some of the noise data is "speech noise", and I don't know what it would do to training if it were too loud). Also, the process does not augment every single time step of the audio; it just randomly augments one interval of each audio, and many intervals in the noise files are actually silence. Don't forget to delete test.wav before each run, or you will always hear the same output. Try an extreme example, --audio_aug_mix_noise_max_noise_db=5 and --audio_aug_mix_noise_min_noise_db=10, to make sure the noise really is there.

Here is another tip: you can also try --audio_aug_mix_noise_max_audio_db=10, which can simulate an over-boosted microphone.

mychiux413 avatar Feb 21 '20 02:02 mychiux413

@mychiux413 "process will not augment every single audio time step, but just randomly augment an interval for each audio" I think this might not produce good result , i think each interval should be mixed with noise.( i.e complete file should be mixed with noise)

Infact it would be good that same audio is fed twice to the network

  1. mixed with noise
  2. without noise.

I have added an extra flag "noise_flag" to the transcript csv file, whose value is 0 or 1. For example, the csv file will contain the following:

wav_filename,wav_filesize,transcript,noise_flag
test1.wav,3423,"where are you?",1
test1.wav,3423,"where are you?",0

1 means mix noise and 0 means do not mix noise.

Relevant code changes:

if train_phase and noise_iterator:
    audio = tf.cond(noise_flag > 0,
                    lambda: augment_noise(
                        audio,
                        noise_iterator.get_next(),
                        change_audio_db_max=FLAGS.audio_aug_mix_noise_max_audio_db,
                        change_audio_db_min=FLAGS.audio_aug_mix_noise_min_audio_db,
                        change_noise_db_max=FLAGS.audio_aug_mix_noise_max_noise_db,
                        change_noise_db_min=FLAGS.audio_aug_mix_noise_min_noise_db,
                    ),
                    lambda: audio)

alokprasad avatar Feb 21 '20 06:02 alokprasad

@alokprasad But when uniform() picks a low noise_ratio like -35 dB, I consider the noise to be approximately none, so why would we need to keep a "clean audio" copy for each epoch? I also checked Baidu's DeepSpeech add_noise(): they mix the complete file as you said, so I will modify it.

mychiux413 avatar Feb 21 '20 10:02 mychiux413

@mychiux413 How are you handling the case where the noise file is shorter than the speech file?

alokprasad avatar Feb 25 '20 05:02 alokprasad

@alokprasad To handle this, I just repeat the noise file so that its duration exceeds that of the speech file. This might introduce discontinuities into some continuous environment noise (like street noise), but most of the time that won't happen, because bin/normalize_noise_audio.py can split noise files into chunks of around 30 seconds or even longer.

Another reason for repeating is that some point-source noise files (like a bell) are short, but they are fine to repeat. A small sketch of the repeat-and-crop idea is below.
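
Under the same TF 1.x assumptions as the rest of the pipeline, the repeat-and-crop trick could look roughly like this (hypothetical helper name, not the PR's exact function):

import tensorflow as tf

def repeat_noise_to_cover(noise, audio_len):
    # noise: 1-D float tensor; audio_len: scalar int tensor (speech length in samples)
    noise_len = tf.shape(noise)[0]
    # ceil(audio_len / noise_len) repetitions are always enough to cover the speech
    multiples = tf.cast(
        tf.math.ceil(tf.cast(audio_len, tf.float32) / tf.cast(noise_len, tf.float32)),
        tf.int32)
    repeated = tf.tile(noise, tf.stack([multiples]))
    # crop back to exactly the speech length
    return repeated[:audio_len]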

mychiux413 avatar Feb 25 '20 07:02 mychiux413

@mychiux413, I figured that out. By the way, what dBFS do you think would be good for training on rnnoise after normalizing?

alokprasad avatar Feb 26 '20 03:02 alokprasad

@mychiux413 One suggestion: I think it would be better to check the SNR during training, and if it is too bad, dynamically adjust the dBFS so that the noise gain is not so high that it hides the speech signal.

alokprasad avatar Mar 02 '20 05:03 alokprasad

@alokprasad

  • For the current version, the parameters --audio_aug_mix_noise_max_noise_db -5, --audio_aug_mix_noise_min_noise_db -35, --audio_aug_mix_noise_max_audio_db 5, --audio_aug_mix_noise_min_audio_db -10 give me a good result.
  • Yes, using SNR should be more stable, but for performance I would cache every single speech/noise file's dBFS at the beginning, so we don't have to recalculate every dBFS at each training step to estimate the SNR (see the sketch after this list).
  • Furthermore, I want to add options to mix dev/test noise into the dev/test files. I think this would indicate how well the model withstands noise instead of only testing on clean speech, and also make sure the model does not overfit a specific noise environment. (Note: the dev/test noise dirs should be different from the train noise dir.)
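
For reference, with the dBFS values cached, the gain needed to hit a target SNR is simple arithmetic; a rough sketch with my own helper name (not the PR's code):

def noise_gain_db_for_snr(speech_dbfs, noise_dbfs, target_snr_db):
    # gain (in dB) to apply to the noise so that
    # speech_dbfs - (noise_dbfs + gain) == target_snr_db
    return speech_dbfs - noise_dbfs - target_snr_db

# Example: speech at -20 dBFS, noise at -30 dBFS, target SNR 15 dB
# -> gain = -20 - (-30) - 15 = -5 dB, i.e. attenuate the noise by 5 dB
gain_db = noise_gain_db_for_snr(-20.0, -30.0, 15.0)
linear_gain = 10.0 ** (gain_db / 20.0)  # dB -> amplitude ratio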

mychiux413 avatar Mar 04 '20 07:03 mychiux413

@mychiux413 Is this the change that calculates the SNR and adjusts the gain? https://github.com/mozilla/DeepSpeech/pull/2622/commits/2269514a9ef676100b46f0c99c0e6a7150feb4dd How is the audio generated? Did you get a chance to check it using tf.print's output_stream option?

alokprasad avatar Mar 10 '20 09:03 alokprasad

What do you think about using a csv file (formatted like the training csv files) as input instead of a directory? I think this way you could:

  • Use the augmentation features from the training pipeline also with your noise pipeline
  • Use the speech data instead of the noise data (cocktail party background noise)
  • Maybe use both of the above pipelines for augmentation

DanBmh avatar Mar 10 '20 10:03 DanBmh

@alokprasad I'm still developing it; there are still some issues, so please do not use that commit yet.

@DanBmh I will make the arguments also accept csv files for the cocktail-party use case.

And about the augmentation pipelines, here are the two data pipelines:

** Current Pipeline **
[noise]       filename -> wav ↴
[train]       filename -> wav -> mixed audio ↴ -> spectrogram(aug) ↴ -> mfcc(aug) -> input
[tensorboard]                    audio review        approximate audio review

** Noise Aug Pipeline **
[noise]       filename -> wav -> spectrogram(aug) -> mfcc(aug) ↴
[train]       filename -> wav -> spectrogram(aug) -> mfcc(aug) -> mixed mfcc -> input
[tensorboard]

This will cause several problems:

  • With the Noise Aug Pipeline, we would have to prove that superposition of the audio and superposition of the MFCCs are mathematically equivalent; if anyone knows the answer, please reply (a rough sketch of the relevant algebra is after this list).
  • After the spectrogram step the signal's phase has been lost, so if we want to reconstruct the audio for review, it will sound awful (heavy distortion); this is also a major topic in TTS systems like Tacotron. In any case, if we want to dump clean audio out, the Current Pipeline is the best option.
  • Continuing the issue above, we could also augment pitch and tempo directly on the wav, but the pitch code would be quite different from spectrogram_augmentations.py and might cost a lot of CPU during the prefetch phase due to the FFT conversion.
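
On the first point, a rough sketch of why the two are not equivalent in general (my own algebra, not from the PR): the STFT itself is linear, but the magnitude and the later log/mel steps are not, so for speech $x$ and noise $n$

$$|\mathrm{STFT}(x + n)|^2 = |\mathrm{STFT}(x)|^2 + |\mathrm{STFT}(n)|^2 + 2\,\mathrm{Re}\{\mathrm{STFT}(x)\,\overline{\mathrm{STFT}(n)}\},$$

and the cross term only vanishes in expectation when speech and noise are uncorrelated, so adding spectrograms (or MFCCs) is not the same as adding the waveforms.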

mychiux413 avatar Mar 11 '20 07:03 mychiux413

Update: the dBFS and SNR now determine the audio/noise balance, and csv files are supported for the cocktail-party use case.

  • Now we can select noise files by directory or by csv file with --audio_aug_mix_noise_train_dirs_or_files, --audio_aug_mix_noise_dev_dirs_or_files, --audio_aug_mix_noise_test_dirs_or_files, to validate how well the model withstands noise.
  • The final audio volume is chosen at random between --audio_aug_mix_noise_min_audio_dbfs and --audio_aug_mix_noise_max_audio_dbfs.
  • The final noise volume is determined relative to the target audio volume by --audio_aug_mix_noise_min_snr_db and --audio_aug_mix_noise_max_snr_db.
  • Use --audio_aug_mix_noise_limit_audio_peak_dbfs and --audio_aug_mix_noise_limit_noise_peak_dbfs to protect against drastic volume variation: if the gain depends only on the average dBFS of the audio, the peaks of the signal might be drastically over-boosted.
  • Use --augmentation_review_audio_steps to listen to the augmented audio in TensorBoard. TensorBoard can only show 10 audios in one panel and I don't know how to change that, and it always normalizes the volume of the dumped audio, no matter how low the dumped volume is. If --summary_dir is not specified, the augmented audio can be reviewed in the default directory:
tensorboard --logdir ~/.share/local/deepspeech/summaries/
  • Do NOT use --augmentation_review_audio_steps together with spectrogram augmentation in this commit, because this branch was based on incorrect spectrogram augmentation code and the process will not run correctly.

  • An extreme example to make sure your audio really is mixed:

python -u DeepSpeech.py --noshow_progressbar \
  --train_files data/ldc93s1/ldc93s1.csv \
  --dev_files data/ldc93s1/ldc93s1.csv \
  --test_files data/ldc93s1/ldc93s1.csv \
  --n_hidden 100 \
  --audio_aug_mix_noise_train_dirs_or_files <directory-path1>,<csv-path1>,<directory-path2> \
  --audio_aug_mix_noise_dev_dirs_or_files <directory-path1>,<csv-path1>,<directory-path2> \
  --audio_aug_mix_noise_test_dirs_or_files <directory-path1>,<csv-path1>,<directory-path2> \
  --audio_aug_mix_noise_min_snr_db 0.1 \
  --audio_aug_mix_noise_max_snr_db 0.2 \
  --audio_aug_mix_noise_min_audio_dbfs -0.2 \
  --audio_aug_mix_noise_max_audio_dbfs -0.1 \
  --audio_aug_mix_noise_limit_audio_peak_dbfs 100 \
  --audio_aug_mix_noise_limit_noise_peak_dbfs 100 \
  --augmentation_review_audio_steps 10 \
  "$@"
  • I set the default parameters for a non-speech noise environment; if we want to train for the cocktail-party environment, try decreasing --audio_aug_mix_noise_min_snr_db and --audio_aug_mix_noise_max_snr_db.

mychiux413 avatar Mar 16 '20 09:03 mychiux413

What do you think about two noise pipelines? One for the noise and one for cocktail-party speech.
I thought about mixing my files together into one pipeline, but I think it would be better to have separate mixing parameters for noise and speech, mostly because you can mix the noise much louder than the speech while keeping the text understandable.

DanBmh avatar Mar 23 '20 14:03 DanBmh

What do you think about two noise pipelines? One for the noise and one for cocktail-party speech. I thought about mixing my files together into one pipeline, but I think it would be better to have separate mixing parameters for noise and speech, mostly because you can mix the noise much louder than the speech while keeping the text understandable.

Yes, it makes sense, I will try it, but there would be twice as many arguments as in the previous version. How about also specifying the number of sub-speakers for each speech sample? Would that be helpful for your experiments?

mychiux413 avatar Mar 25 '20 06:03 mychiux413

Yes, it makes sense, I will try it, but there would be twice as many arguments as in the previous version. How about also specifying the number of sub-speakers for each speech sample? Would that be helpful for your experiments?

Do you mean augmenting with not just one but multiple background speech or noise files at once? If you don't think it's too complicated, this is an interesting idea; it would make the background noise even more realistic. In this case I would suggest making the number not fixed, but random with an upper bound, to simulate different environments.

DanBmh avatar Mar 25 '20 10:03 DanBmh

Here’s a question: is it necessary to run augmentation on every epoch? It seems like augmentation is probably more valuable as the model nears convergence. I wonder if you could balance out the performance hit by not augmenting the first x epochs, when the model still has a high WER.

dabinat avatar Mar 28 '20 06:03 dabinat

Here’s a question: is it necessary to run augmentation on every epoch? It seems like augmentation is probably more valuable as the model nears convergence. I wonder if you could balance out the performance hit by not augmenting the first x epochs, when the model still has a high WER.

Here are my recent experiment results (still ongoing). I trained 20 epochs for every model with different parameters.

  • noise files: rnnoise, point-source noise
  • train dataset: librivox clean-100.csv, clean-300.csv, other-500.csv
  • test dataset: test-clean.csv
  • the loss values are from the final step (epoch 19)
  • in addition, I also mixed zh-TW speech into librivox and tested the WER
| Name | min_audio_dbfs | max_audio_dbfs | min_snr_db | max_snr_db | limit_audio_peak_dbfs | limit_noise_peak_dbfs | train loss | dev loss | test loss | test wer | test loss (mix TW speech) | test wer (mix TW speech) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline (No Augmentation) | | | | | | | 27.685342 | 24.046401 | 23.756416 | 0.137232 | 121.442734 | 0.454246 |
| Default mix noise | 0 | -35 | 3 | 30 | 7 | 3 | 69.323678 | 21.669104 | 21.383959 | 0.112958 | 60.703743 | 0.270337 |
| speech non over boosted | 0 | -35 | 3 | 30 | 0 | 3 | 64.432057 | 21.491052 | 21.344168 | 0.11471 | 60.352631 | 0.261519 |
| noise non over boosted | 0 | -35 | 3 | 30 | 7 | 0 | 66.458655 | 21.09868 | 21.09868 | 0.111596 | 62.270283 | 0.269928 |
| Wide speech volume | 0 | -45 | 3 | 30 | 7 | 3 | 67.366901 | 21.060449 | 20.68895 | 0.116559 | 59.696766 | 0.2673 |

The results show:

  1. Whatever the noise parameters are, the test WER is always better than that of the "No Aug" model.
  2. Noise robustness (column "test wer (mix TW speech)") improves a lot with mixed-noise training.
  3. Don't be misled by the training loss when mixing with noise, because the data coverage is larger than without augmentation.
  4. Looking at noise-mix training, the parameters may involve some trade-offs: if we want to improve cocktail-party speech, we might lose some accuracy on the clean test. In my opinion, skipping the first x epochs to emphasize the clean environment should be equivalent to increasing the max SNR, so the noise test would then be worse.

So my conclusion is:

  • Tune the noise parameters according to your target application environment; this should be equivalent to tuning how many initial epochs to skip.
  • Of course I will also try your idea if I have free resources later.

mychiux413 avatar Mar 31 '20 04:03 mychiux413

@mychiux413 How are you generating the test samples? Is it natural voice with a noisy background, or have you mixed clean speech with noise and used that as the test audio?

alokprasad avatar Mar 31 '20 05:03 alokprasad