DeepSpeech
online mix noise audio data in training step
Mixing noise into the training files before runtime can make the data monotonous, but mixing at runtime can be very slow if we read a noise file from disk for every training row (for example, on an HDD, mixing one audio file takes roughly 100 times longer than freq_time_mask does).
To reduce the online mixing time, I use a separate tf.data.Dataset to cache the decoded noise audio arrays and then mix them into the training data.
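Purely as an illustration of the idea (not the PR's actual code), here is a minimal sketch of keeping decoded noise in a second, cached tf.data.Dataset and zipping it with the speech dataset; decode_wav_fn, the file lists, and the fixed 0.1 gain are placeholders:

```python
import tensorflow as tf

def mix_speech_with_noise(speech_files, noise_files, decode_wav_fn):
    # Decode every noise file once, cache the float32 arrays, and repeat
    # them forever so one noise sample is available per training row.
    noise_ds = (tf.data.Dataset.from_tensor_slices(noise_files)
                .map(decode_wav_fn)
                .cache()        # keep decoded noise in memory (or on disk)
                .shuffle(1024)
                .repeat())

    speech_ds = tf.data.Dataset.from_tensor_slices(speech_files).map(decode_wav_fn)

    def mix(speech, noise):
        # Cut/pad the noise to the speech length, then add it with a small gain.
        speech_len = tf.shape(speech)[0]
        noise = noise[:speech_len]
        noise = tf.pad(noise, [[0, speech_len - tf.shape(noise)[0]]])
        return speech + 0.1 * noise

    return tf.data.Dataset.zip((speech_ds, noise_ds)).map(mix)
```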
usage:
python -u DeepSpeech.py --noshow_progressbar \
--train_files data/ldc93s1/ldc93s1.csv \
--test_files data/ldc93s1/ldc93s1.csv \
--train_batch_size 1 \
--test_batch_size 1 \
--n_hidden 200 \
--epochs 200 \
--checkpoint_dir <checkpoint_dir> \
--audio_aug_mix_noise_walk_dirs <directory1-contains-wav-files>,<directory2-contains-wav-files>
- Just specify the noise file directories; the process automatically walks each directory recursively and collects the .wav files (but it does not check their sample rate).
- This program assumes the volume of every noise file has already been maximized. To save the cost of balancing each speech/noise pair by volume, it simply attenuates the speech audio by a value between 0 and -10 dB and the noise audio by a value between -25 and -50 dB (see the dB-to-amplitude sketch after this list).
- The augmentation time can be as fast as freq_time_mask.
- --audio_aug_mix_noise_walk_dirs accepts multiple directories, separated by commas.
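For context on what such a dB attenuation means numerically, here is a small illustrative helper (not part of the PR; note that the PR's own snippet later in this thread uses 10 ** (dB / 10) as its ratio):

```python
import numpy as np

def apply_gain_db(samples, gain_db):
    """Scale a float waveform by a gain given in dB.

    Uses the common amplitude convention 10 ** (dB / 20); a negative
    gain_db attenuates the signal (e.g. -10 dB is roughly a 0.316x scale).
    """
    return samples * (10.0 ** (gain_db / 20.0))

# e.g. speech attenuated by a random value in [-10, 0] dB,
#      noise attenuated by a random value in [-50, -25] dB
rng = np.random.default_rng()
speech_gain_db = rng.uniform(-10.0, 0.0)
noise_gain_db = rng.uniform(-50.0, -25.0)
```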
To manually adjust the volume suppression:
python -u DeepSpeech.py \
...
--audio_aug_mix_noise_max_noise_db -25 \
--audio_aug_mix_noise_min_noise_db -50 \
--audio_aug_mix_noise_max_audio_db 0 \
--audio_aug_mix_noise_min_audio_db -10 \
...
- If your noise files are pure non-speech noise, my recommended parameters from experience are --audio_aug_mix_noise_max_noise_db -15, --audio_aug_mix_noise_min_noise_db -25.
- If your noise files contain speakers, like a cocktail party, my recommended parameters from experience are --audio_aug_mix_noise_max_noise_db -30, --audio_aug_mix_noise_min_noise_db -50; otherwise the background voices may drown out the main speaker.
- If you want to cache the noise arrays on local disk, set --audio_aug_mix_noise_cache <your cache path>; otherwise they are cached in memory.
I tested it with the Freesound Dataset Kaggle 2019, which has about 103 h of noise data. Everything worked as intended. I just didn't see a great difference in my training results (using the Voxforge DE dataset). Maybe it's too small.
Did you mean the noise dataset is small, or the Voxforge dataset, comparatively? One suggestion: if you feel the noise dataset is small, you can use rnnoise's dataset (https://people.xiph.org/~jm/demo/rnnoise/rnnoise_contributions.tar.gz).
I meant the Voxforge dataset; it has only around 32 h of speech data.
I think the rnnoise dataset is smaller than the Freesound one (6 vs. 22 GB; I did not find the length in hours).
Also, the rnnoise noise files are in .raw format, while Freesound already uses .wav, so you would need to convert them to wav somehow first.
To use the rnnoise dataset, we have to normalize the volume and convert the frame rate to 16000 manually; many rnnoise files are almost inaudible without volume normalization.
This mix-noise process assumes the volume of every noise file has been maximized, so it does not calculate dBFS to balance the speech/noise volume during processing.
@mychiux413 any idea how this can be done? Should it be an online process?
You should prepare the normalized noise files yourself before training starts.
There is no standard way to normalize volume; I can only offer an example. You can optimize the script yourself, and don't forget to listen to the output audio to make sure everything sounds right.
Notes:
- I use pydub in the example; before pip install pydub, you should install ffmpeg via sudo apt-get install ffmpeg.
- The raw data I downloaded from rnnoise is .raw, for which the frame rate, sample width, and channel count must be specified manually.
- Some rnnoise recordings are almost 5 minutes long, which is unnecessary for online mixing, so the example splits them into chunks of roughly 30 seconds.
- The script targets Python 3.7 (type annotations are used).
usage:
python <python_file.py> --from_dir <directory include rnnoise data> --to_dir <directory to output normalized data>
from __future__ import absolute_import, division, print_function

from pydub import AudioSegment
from multiprocessing import Pool
from functools import partial
import math
import argparse
import sys
import os


def detect_silence(sound: AudioSegment, silence_threshold=-50.0,
                   chunk_size=10) -> (int, int):
    """Return (start_ms, end_ms) of the non-silent part of `sound`."""
    start_trim = 0  # ms
    sound_size = len(sound)
    assert chunk_size > 0  # to avoid infinite loop
    while sound[start_trim:(
            start_trim +
            chunk_size)].dBFS < silence_threshold and start_trim < sound_size:
        start_trim += chunk_size

    end_trim = sound_size
    while sound[(end_trim - chunk_size):end_trim].dBFS < silence_threshold \
            and end_trim > 0:
        end_trim -= chunk_size

    start_trim = min(sound_size, start_trim)
    end_trim = max(0, end_trim)
    return min([start_trim, end_trim]), max([start_trim, end_trim])


def trim_silence_audio(sound: AudioSegment,
                       silence_threshold=-50.0,
                       chunk_size=10) -> AudioSegment:
    start_trim, end_trim = detect_silence(sound, silence_threshold, chunk_size)
    return sound[start_trim:end_trim]


def convert(filename: str, src_dir: str, dst_dirpath: str, dirpath: str,
            normalize: bool, trim_silence: bool, min_duration_seconds: float,
            max_duration_seconds: float):
    if not filename.endswith(('.wav', '.raw')):
        return
    filepath = os.path.join(dirpath, filename)
    if filename.endswith('.wav'):
        sound: AudioSegment = AudioSegment.from_file(filepath)
    else:
        # .raw files carry no header, so the frame rate has to be guessed
        try:
            sound: AudioSegment = AudioSegment.from_raw(filepath,
                                                        sample_width=2,
                                                        frame_rate=44100,
                                                        channels=1)
        except Exception as err:
            print('[retry] {}'.format(err))
            try:
                sound: AudioSegment = AudioSegment.from_raw(filepath,
                                                            sample_width=2,
                                                            frame_rate=48000,
                                                            channels=1)
            except Exception as err:
                print('bypass audio {}, got error: {}'.format(filepath, err))
                return

    try:
        sound = sound.set_frame_rate(16000)
    except Exception as err:
        print('[bypass] {}'.format(err))
        return

    # split long recordings into chunks of at most max_duration_seconds
    n_splits: int = max(
        1, math.floor(sound.duration_seconds / max_duration_seconds))
    chunk_duration_ms = math.ceil(len(sound) / n_splits)
    chunks = []
    for i in range(n_splits):
        end_ms = min((i + 1) * chunk_duration_ms, len(sound))
        chunk = sound[(i * chunk_duration_ms):end_ms]
        chunks.append(chunk)

    for i, chunk in enumerate(chunks):
        dst_path = os.path.join(dst_dirpath, str(i) + '_' + filename)
        if dst_path.endswith('.raw'):
            dst_path = dst_path[:-4] + '.wav'
        if os.path.exists(dst_path):
            print('audio exists: {}'.format(dst_path))
            return
        if normalize:
            chunk = chunk.normalize()
            # very quiet recordings need dynamic range compression first
            if chunk.dBFS < -30.0:
                chunk = chunk.compress_dynamic_range().normalize()
            if chunk.dBFS < -30.0:
                chunk = chunk.compress_dynamic_range().normalize()
        if trim_silence:
            chunk = trim_silence_audio(chunk)
        if chunk.duration_seconds < min_duration_seconds:
            return
        chunk.export(dst_path, format='wav')


def main(src_dir: str,
         dst_dir: str,
         min_duration_seconds: float,
         max_duration_seconds: float,
         normalize=True,
         trim_silence=True):
    assert os.path.exists(src_dir)
    if not os.path.exists(dst_dir):
        os.makedirs(dst_dir, exist_ok=False)
    src_dir = os.path.abspath(src_dir)
    dst_dir = os.path.abspath(dst_dir)

    for dirpath, _, filenames in os.walk(src_dir):
        dirpath = os.path.abspath(dirpath)
        # mirror the source directory structure under dst_dir
        dst_dirpath = os.path.join(dst_dir,
                                   dirpath.replace(src_dir, '').lstrip('/'))
        print('converting dirpath: {} -> {}'.format(dirpath, dst_dirpath))
        if not os.path.exists(dst_dirpath):
            os.makedirs(dst_dirpath, exist_ok=False)

        convert_func = partial(convert,
                               src_dir=src_dir,
                               dst_dirpath=dst_dirpath,
                               dirpath=dirpath,
                               normalize=normalize,
                               trim_silence=trim_silence,
                               min_duration_seconds=min_duration_seconds,
                               max_duration_seconds=max_duration_seconds)
        p = Pool()
        p.map(convert_func, filenames)


if __name__ == "__main__":
    PARSER = argparse.ArgumentParser(description='Optimize noise files')
    PARSER.add_argument('--from_dir',
                        help='Convert wav from directory',
                        type=str)
    PARSER.add_argument('--to_dir', help='save wav to directory', type=str)
    PARSER.add_argument('--min_sec',
                        help='min duration seconds of saved file',
                        type=float,
                        default=1.0)
    PARSER.add_argument('--max_sec',
                        help='max duration seconds of saved file',
                        type=float,
                        default=30.0)
    PARSER.add_argument('--normalize',
                        action='store_true',
                        help='Normalize volume, default is true',
                        default=True)
    PARSER.add_argument('--trim',
                        action='store_true',
                        help='Trim silence, default is true',
                        default=True)
    PARAMS = PARSER.parse_args()

    main(PARAMS.from_dir, PARAMS.to_dir, PARAMS.min_sec, PARAMS.max_sec,
         PARAMS.normalize, PARAMS.trim)
Could you add this script to your pull request?
I added a progressbar and a summary to it, feel free to copy it back. The updated code is here: https://github.com/DanBmh/deepspeech-german/blob/master/data/normalize_noise_audio.py
I added bin/normalize_noise_audio.py and made some modifications:
- Removed the typing annotations for environment compatibility.
- Fixed a pylint error and added a warning message for ImportError of tqdm and pydub, because they are not standard packages in requirements.txt.
- Replaced seconds_to_hours() with util/feeding.py::secs_to_hours().
Usage:
python bin/normalize_noise_audio.py --from_dir <directory include noise data> --to_dir <directory to output normalized data>
@mychiux413 Is there any way we can dump the mixed files and see how effective the mixing of noise into the speech files is, just to make sure the mixing is proper?
@alokprasad You're right. In fact, all the augmented audio should be reviewable in the pipeline, even augmentations applied on the spectrogram like pitch/tempo/mask; otherwise we have no basis for tuning proper parameters.
But in TensorFlow's pipeline it's not as simple as with offline augmentation: we have to dump the audio data into TensorBoard via tf.summary.audio. I'm still studying this method and trying to figure out how much refactoring it will require.
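For reference, here is a minimal TF1-style sketch of what such a review hook could look like; the function name and the integration point are hypothetical, and the real refactoring may look quite different:

```python
import tensorflow.compat.v1 as tf

def audio_review_summary(mixed_audio, sample_rate=16000):
    # mixed_audio: 1-D float32 waveform in [-1, 1]
    batched = tf.reshape(mixed_audio, [1, -1])   # [batch=1, samples]
    return tf.summary.audio('augmented_audio', batched, sample_rate, max_outputs=10)

# The returned summary op would then be evaluated and written by the
# existing summary writer, e.g. tf.summary.FileWriter(summary_dir).
```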
@mychiux413 I also tried to save the audio using tf.print's output_stream option inside the augment_noise function:
noise_ratio = tf.math.pow(10.0, choosen_noise_db / 10)
mixed_audio = tf.multiply(audio, audio_ratio) + tf.multiply(mixed_noise, noise_ratio)
# save to wav file
final_pcm = contrib_audio.encode_wav(mixed_audio, 16000)
tf.print(final_pcm, output_stream="file:///tmp/test.wav", summarize=-1)
return mixed_audio
# return tf.multiply(audio, audio_ratio) + tf.multiply(mixed_noise, noise_ratio)
but there are two problems I am facing:
1. I am not able to change the output_stream parameter dynamically, so multiple wav files cannot be saved.
2. The file size keeps growing, so we have to stop training with Ctrl+C after a few steps.
Anyway, when I listen to the audio, I don't think the noise is getting mixed into the speech at all.
@alokprasad I tried tf.print and listened to the audio; it really is augmented. Maybe my default parameters are too conservative (some of the noise data is "speech noise", and I'm not sure what effect it has when mixed too loudly). Also, the process does not augment every time step of the audio; it randomly augments one interval per utterance, and many intervals in the noise files are actually silence.
Don't forget to delete test.wav before each execution, or you will always hear the same output.
Try an extreme example, --audio_aug_mix_noise_max_noise_db=5, --audio_aug_mix_noise_min_noise_db=10, to make sure the noise really is there.
Another tip: you can also try --audio_aug_mix_noise_max_audio_db=10, which simulates an over-boosted microphone.
@mychiux413 "The process will not augment every single audio time step, but just randomly augment an interval for each audio": I think this might not produce good results; I think every interval should be mixed with noise (i.e. the complete file should be mixed with noise).
In fact, it would be good to feed the same audio to the network twice:
- mixed with noise
- without noise
I added an extra column, noise_flag, to the transcript CSV file, with value 0 or 1. An example CSV looks like this:
wav_filename,wav_filesize,transcript,noise_flag
test1.wav,3423,"where are you?",1
test1.wav,3423,"where are you?",0
1 means mix noise and 0 means don't mix noise.
The relevant code changes:
if train_phase and noise_iterator:
    audio = tf.cond(noise_flag > 0,
                    lambda: augment_noise(
                        audio,
                        noise_iterator.get_next(),
                        change_audio_db_max=FLAGS.audio_aug_mix_noise_max_audio_db,
                        change_audio_db_min=FLAGS.audio_aug_mix_noise_min_audio_db,
                        change_noise_db_max=FLAGS.audio_aug_mix_noise_max_noise_db,
                        change_noise_db_min=FLAGS.audio_aug_mix_noise_min_noise_db,
                    ),
                    lambda: audio)
@alokprasad But when uniform() picks a low noise ratio such as -35 dB, the noise is approximately inaudible anyway, so why would we need to keep a "clean audio" copy for each epoch?
I also checked Baidu's DeepSpeech add_noise(): they mix the complete file as you said, so I will modify mine accordingly.
@mychiux413 How are you handling the case where the noise file is shorter than the speech file?
@alokprasad To handle this, I simply repeat the noise file until its duration exceeds the speech file. This can introduce discontinuities into continuous environmental noise (like street noise), but most of the time it doesn't matter, because bin/normalize_noise_audio.py can split noise files into chunks of roughly 30 seconds or longer.
Another reason to repeat is that some point-source noise files (like bells) are short, but they are fine to repeat.
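As an illustration of the repeat-and-cut idea (a sketch only, not the PR's exact code):

```python
import tensorflow as tf

def tile_noise_to_speech(noise, speech_len):
    """Repeat a (possibly shorter) 1-D noise waveform until it covers
    speech_len samples, then cut it to exactly that length."""
    noise_len = tf.shape(noise)[0]
    repeats = tf.cast(
        tf.math.ceil(tf.cast(speech_len, tf.float32) / tf.cast(noise_len, tf.float32)),
        tf.int32)
    tiled = tf.tile(noise, [repeats])   # may create seams in continuous noise
    return tiled[:speech_len]
```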
@mychiux413, I figured that out. By the way, what dBFS do you think would be good for training on rnnoise after normalizing?
@mychiux413 One suggestion: I think it would be better to check the SNR during training, and if it's too bad, dynamically adjust the dBFS so the noise gain isn't so high that it hides the speech signal.
@alokprasad
- For the current version, the parameters --audio_aug_mix_noise_max_noise_db -5, --audio_aug_mix_noise_min_noise_db -35, --audio_aug_mix_noise_max_audio_db 5, --audio_aug_mix_noise_min_audio_db -10 give me a good result.
- Yes, using SNR should be more stable, but for performance I would cache every speech/noise file's dBFS at the beginning, so we don't have to recompute dBFS at every training step to estimate the SNR (see the sketch below).
- Furthermore, I want to add options to mix dev/test noise into the dev/test files. That would indicate how well the model resists noise instead of only testing on clean speech, and also make sure the model doesn't overfit a specific noise environment. (Note: the dev/test noise dirs should be different from the train noise dir.)
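A rough sketch of the dBFS-caching idea mentioned above; the helper names are hypothetical and the PR may compute the balance differently:

```python
from pydub import AudioSegment

def cache_dbfs(wav_paths):
    """Measure each file's average dBFS once, so training steps can
    compute SNR-based gains without re-reading the audio."""
    return {path: AudioSegment.from_file(path).dBFS for path in wav_paths}

def noise_gain_db(speech_dbfs, noise_dbfs, target_snr_db):
    # Gain (in dB) to apply to the noise so that
    # speech_dbfs - (noise_dbfs + gain) equals the requested SNR.
    return speech_dbfs - target_snr_db - noise_dbfs
```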
@mychiux413 Is this the change that calculates the SNR and adjusts the gain? https://github.com/mozilla/DeepSpeech/pull/2622/commits/2269514a9ef676100b46f0c99c0e6a7150feb4dd How is the audio generated? Did you get a chance to check it using tf.print's output_stream option?
What do you think about using a csv file (formatted like the training csv files) as input instead of a directory? I think that way you could:
- use the augmentation features from the training pipeline with your noise pipeline as well
- use speech data instead of noise data (cocktail-party background noise)
- maybe use both of the above pipelines for augmentation
@alokprasad I'm still developing it; there are still some issues, so please do not use that commit yet.
@DanBmh I will make the arguments also accept csv files for the cocktail-party use case.
As for the augmentation pipelines, here are the two data pipelines:
**Current Pipeline**

[noise]       filename -> wav ----------------↴
[train]       filename -> wav -> mixed audio -> spectrogram(aug) -> mfcc(aug) -> input
[tensorboard]                    audio review    approximate audio review

**Noise Aug Pipeline**

[noise]       filename -> wav -> spectrogram(aug) -> mfcc(aug) --↴
[train]       filename -> wav -> spectrogram(aug) -> mfcc(aug) -> mixed mfcc -> input
[tensorboard]
This raises several issues:
- With the Noise Aug Pipeline, we would have to prove that superposition in the audio domain and in the mfcc domain are mathematically equivalent; if anyone knows the answer, please tell us.
- After the spectrogram step, the signal's phase term is lost, so any audio reconstructed for review will sound awful (heavy distortion); this is also a major topic in TTS systems like Tacotron. If we want to dump clean audio for review, the Current Pipeline is the best option.
- Continuing from the last point, we could also augment pitch and tempo directly in the wav stage, but the pitch code would be quite different from spectrogram_augmentations.py and might cost a lot of CPU during the prefetch phase due to the FFT conversion.
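To make the first concern concrete, a tiny NumPy check shows that superposition does not carry over exactly to the power-spectrum domain; the cross term only vanishes in expectation for uncorrelated signals:

```python
import numpy as np

rng = np.random.default_rng(0)
speech = rng.standard_normal(512)
noise = rng.standard_normal(512)

# Power spectra (the quantity the spectrogram/MFCC front-end is built on).
p_mix = np.abs(np.fft.rfft(speech + noise)) ** 2
p_sum = np.abs(np.fft.rfft(speech)) ** 2 + np.abs(np.fft.rfft(noise)) ** 2

# |S+N|^2 = |S|^2 + |N|^2 + 2*Re(S*conj(N)); the cross term makes these differ,
# so mixing after the spectrogram is not equivalent to mixing the waveforms.
print(np.max(np.abs(p_mix - p_sum)))   # clearly non-zero
```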
Update: the audio/noise balance is now determined by specifying the target dBFS and SNR, and csv files are supported for the cocktail-party use case.
- Noise files can now be selected by directories or csv files with --audio_aug_mix_noise_train_dirs_or_files, --audio_aug_mix_noise_dev_dirs_or_files, --audio_aug_mix_noise_test_dirs_or_files, to validate how well the model resists noise.
- The final audio volume is chosen at random between --audio_aug_mix_noise_min_audio_dbfs and --audio_aug_mix_noise_max_audio_dbfs.
- The final noise volume is determined relative to the target audio volume by --audio_aug_mix_noise_min_snr_db and --audio_aug_mix_noise_max_snr_db.
- Use --audio_aug_mix_noise_limit_audio_peak_dbfs and --audio_aug_mix_noise_limit_noise_peak_dbfs to protect against drastic volume swings: if the gain is based only on the average dBFS of a file, the peaks of the signal can be drastically over-boosted. (A rough sketch of this gain logic follows the example below.)
- Use --augmentation_review_audio_steps to listen to the augmented audio in TensorBoard. Note that TensorBoard can only show 10 audio clips per panel (I don't know how to change that) and always normalizes the volume of the dumped audio, no matter how quiet it is. If --summary_dir is not specified, the augmented audio can be reviewed in the default directory:
tensorboard --logdir ~/.share/local/deepspeech/summaries/
- Do NOT use --augmentation_review_audio_steps together with spectrogram augmentation in this commit: this branch was based on broken spectrogram augmentation code, so the process will not run correctly.
- An extreme example to make sure your audio is mixed:
python -u DeepSpeech.py --noshow_progressbar \
--train_files data/ldc93s1/ldc93s1.csv \
--dev_files data/ldc93s1/ldc93s1.csv \
--test_files data/ldc93s1/ldc93s1.csv \
--n_hidden 100 \
--audio_aug_mix_noise_train_dirs_or_files <directory-path1>,<csv-path1>,<directory-path2> \
--audio_aug_mix_noise_dev_dirs_or_files <directory-path1>,<csv-path1>,<directory-path2> \
--audio_aug_mix_noise_test_dirs_or_files <directory-path1>,<csv-path1>,<directory-path2> \
--audio_aug_mix_noise_min_snr_db 0.1 \
--audio_aug_mix_noise_max_snr_db 0.2 \
--audio_aug_mix_noise_min_audio_dbfs -0.2 \
--audio_aug_mix_noise_max_audio_dbfs -0.1 \
--audio_aug_mix_noise_limit_audio_peak_dbfs 100 \
--audio_aug_mix_noise_limit_noise_peak_dbfs 100 \
--augmentation_review_audio_steps 10 \
"$@"
- The default parameters are set for a non-speech-noise environment; if you want to train for a cocktail-party environment, try decreasing --audio_aug_mix_noise_min_snr_db and --audio_aug_mix_noise_max_snr_db.
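For readers trying to reason about these flags, here is my reading of the gain logic as a rough Python sketch; the PR's actual formula may differ, and the function and variable names are illustrative only:

```python
def mixing_gains_db(speech_dbfs, speech_peak_dbfs, noise_dbfs, noise_peak_dbfs,
                    target_audio_dbfs, target_snr_db,
                    limit_audio_peak_dbfs, limit_noise_peak_dbfs):
    """Return (speech_gain_db, noise_gain_db).

    The speech is moved toward target_audio_dbfs, the noise toward
    (target_audio_dbfs - target_snr_db), and both gains are clipped so
    the per-file peak never exceeds the given peak limits.
    """
    speech_gain = target_audio_dbfs - speech_dbfs
    speech_gain = min(speech_gain, limit_audio_peak_dbfs - speech_peak_dbfs)

    target_noise_dbfs = target_audio_dbfs - target_snr_db
    noise_gain = target_noise_dbfs - noise_dbfs
    noise_gain = min(noise_gain, limit_noise_peak_dbfs - noise_peak_dbfs)
    return speech_gain, noise_gain
```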
What do you think about two noise pipelines, one for noise and one for cocktail-party speech?
I thought about mixing my files together into one pipeline, but I think it would be better to have separate mixing parameters for noise and speech, mostly because you can mix the noise much louder than the speech while keeping the text understandable.
Yes, that makes sense, I will try it, though there would be twice as many arguments as in the previous version. What about specifying the number of sub-speakers for each speech sample? Would that be helpful for your experiments?
Do you mean augmenting with not just one but multiple background speech or noise files at once? If you don't think it's too complicated, this is an interesting idea; it would make the background noise even more realistic. In this case I would suggest making the number not fixed but random, with an upper bound, to simulate different environments.
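To illustrate the multiple-background idea (purely a sketch, not part of the PR; the gain range and helper name are made up):

```python
import numpy as np

def mix_multiple_noises(speech, noise_pool, max_sources=3, rng=None):
    """Overlay a random number (1..max_sources) of noise clips on `speech`.
    All signals are float NumPy arrays of the same length."""
    rng = rng or np.random.default_rng()
    n_sources = min(int(rng.integers(1, max_sources + 1)), len(noise_pool))
    mixed = speech.copy()
    for idx in rng.choice(len(noise_pool), size=n_sources, replace=False):
        gain_db = rng.uniform(-30.0, -10.0)          # illustrative range
        mixed += noise_pool[idx] * (10.0 ** (gain_db / 20.0))
    return mixed
```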
Here’s a question: is it necessary to run augmentation on every epoch? It seems like augmentation is probably more valuable as the model nears convergence. I wonder if you could balance out the performance hit by not augmenting the first x epochs, when the model still has a high WER.
Here are my recent experiment results (still in progress). I trained every model for 20 epochs with different parameters:
- noise files: rnnoise, pointsources noise
- train dataset: librivox clean-100.csv, clean-300.csv, other-500.csv
- test dataset: test-clean.csv
- the loss records are from the final step (epoch = 19)
- in addition, I also mixed zh-tw speech into librivox and tested the WER
Name | min_audio_dbfs | max_audio_dbfs | min_snr_db | max_snr_db | limit_audio_peak_dbfs | limit_noise_peak_dbfs | train loss | dev loss | test loss | test wer | test loss (mix TW speech) | test wer (mix TW speech) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Baseline (No Augmentation) | | | | | | | 27.685342 | 24.046401 | 23.756416 | 0.137232 | 121.442734 | 0.454246 |
Default mix noise | 0 | -35 | 3 | 30 | 7 | 3 | 69.323678 | 21.669104 | 21.383959 | 0.112958 | 60.703743 | 0.270337 |
speech non over boosted | 0 | -35 | 3 | 30 | 0 | 3 | 64.432057 | 21.491052 | 21.344168 | 0.11471 | 60.352631 | 0.261519 |
noise non over boosted | 0 | -35 | 3 | 30 | 7 | 0 | 66.458655 | 21.09868 | 21.09868 | 0.111596 | 62.270283 | 0.269928 |
Wide speech volume | 0 | -45 | 3 | 30 | 7 | 3 | 67.366901 | 21.060449 | 20.68895 | 0.116559 | 59.696766 | 0.2673 |
The results show:
- Whatever the noise parameters are, the test WER is always better than the "No Aug" model's.
- Robustness to noise (the "test wer (mix TW speech)" column) improves markedly with mixed-noise training.
- Don't be misled by the higher training loss when mixing with noise; the augmented data covers a much larger space than the clean data.
- Inspecting the noise-mix training, the parameters do involve trade-offs: if we want to improve cocktail-party speech, we may lose some accuracy on the clean test. In my opinion, skipping the first x epochs to emphasize the clean environment should be roughly equivalent to increasing the max SNR, so the noisy test would then be worse.
So my conclusions are:
- Tune the noise parameters according to your target application environment; that should be equivalent to tuning "skip first x epochs".
- Of course I will also try your idea if I have free resources later.
@mychiux413 How are you generating the test samples? Is it natural voice recorded with a noisy background, or did you mix clean speech with noise and use that as the test wav?