DeepSpeech
Issue: Missing initial frames cause DeepSpeech to skip the first word; adding about 5 ms of silence makes it work most of the time.
For support and discussions, please use our Discourse forums.
If you've found a bug, or have a feature request, then please create an issue with the following information:
- Have I written custom code (as opposed to running examples on an unmodified clone of the repository):
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04
- TensorFlow installed from (our builds, or upstream TensorFlow): Upstream
- TensorFlow version (use command below): TF 1.13
- Python version: 3.7
- Bazel version (if compiling from source): N/A
- GCC/Compiler version (if compiling from source): N/A
- CUDA/cuDNN version: N/A
- GPU model and memory: N/A
- Exact command to reproduce:
./deepspeech --model ds-051-model/output_graph.tflite --alphabet ds-051-model/alphabet.txt --lm ds-051-model/lm.binary --trie ds-051-model/trie --audio t3_4507-16021.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=67610
should an hold on the way
./deepspeech --model ds-051-model/output_graph.tflite --alphabet ds-051-model/alphabet.txt --lm ds-051-model/lm.binary --trie ds-051-model/trie --audio st3_4507-16021.wav
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=69210
what should one hold on the way
Description: I downloaded a sample wav from the release folder of the DeepSpeech client and stripped some audio from the beginning, so that to a human ear it is still recognizable, but when fed to the DeepSpeech client, recognition does not work for the first word, e.g. "should an hold on the way". If I add some extra silence (about 800 samples, 5 ms) at the front of this trimmed audio, recognition works for (or close to) the first word, e.g. after adding silence: "what should one hold on the way".
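(For reference, a minimal sketch of the padding described above: prepending a short block of silence to a 16 kHz, mono, 16-bit wav before passing it to the deepspeech client. The file names and the 800-sample pad are only placeholders, not part of DeepSpeech.)
# prepend_silence.py - hypothetical helper, not part of DeepSpeech
import wave
import numpy as np

PAD_SAMPLES = 800  # roughly the amount of leading silence discussed above

with wave.open("trimmed.wav", "rb") as w:
    params = w.getparams()
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

with wave.open("trimmed_padded.wav", "wb") as w:
    w.setparams(params)  # keep the mono / 16-bit / 16 kHz settings
    w.writeframes(np.concatenate([np.zeros(PAD_SAMPLES, dtype=np.int16), audio]).tobytes())
The padded file can then be passed to ./deepspeech with the same flags as in the commands above.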
@alokprasad For the sake of reproducibility, could you share your trimmed and trimmed+fixed audio samples?
@alokprasad Ping?
@lissyx Yesterday I tested Mozilla DeepSpeech in both offline (https://github.com/mozilla/DeepSpeech/blob/v0.5.1/native_client/python/client.py) and streaming (https://github.com/mozilla/DeepSpeech/blob/v0.5.1/examples/mic_vad_streaming/mic_vad_streaming.py) modes.
In my experiments I intentionally did not use the LM / trie.
The offline mode (reading audio from a wav file) works quite well in terms of speech recognition accuracy. However, I was not able to achieve the same quality via my laptop mic: recognition is very bad if I feed audio through it. My initial thought was that my laptop mic drastically changes the frequency response and/or SNR. I also suspected potential issues with the VAD in mic_vad_streaming.py.
Then I performed the following experiment:
- I took a wav file; let's name it ORIGINAL_AUDIO.
- When I fed it to client.py, the recognition was quite accurate.
- Then I played ORIGINAL_AUDIO through my laptop speaker and recorded the sound via my laptop mic. Let's name the new wav file RECORDED_AUDIO.
- Then I fed the RECORDED_AUDIO wav file to client.py again, and the recognition was still quite accurate.
Thus I've inferred that the issue is not with my laptop mic.
Then I performed the next experiment: I ran mic_vad_streaming.py and played ORIGINAL_AUDIO. The recognition results were very bad.
Then I added the "--savewav" option, ran mic_vad_streaming.py again, and played ORIGINAL_AUDIO. Let's name the saved file SAVED_AUDIO. I listened to SAVED_AUDIO: it sounded fine, and I could hear every word. Then I fed SAVED_AUDIO to client.py (offline mode), and the recognition results were as bad as those I got with mic_vad_streaming.py.
After that I found this current issue #2443. Unfortunately, I cannot provide the particular ORIGINAL_AUDIO and SAVED_AUDIO files from the experiments I described. However, I've just prepared new files that reproduce the issue close to what @alokprasad explained. I did the following:
- took 4507-16021-0012.wav (extracted from https://github.com/mozilla/DeepSpeech/releases/download/v0.5.1/audio-0.5.1.tar.gz)
- cut the "on the way" audio segment in Audacity and saved it as "4507-16021-0012_on_the_way.wav"
- generated 0.1 s of silence at the beginning of "4507-16021-0012_on_the_way.wav" in Audacity and saved it as "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav"
- fed "4507-16021-0012_on_the_way.wav" to the deepspeech executable:
$ deepspeech --model deepspeech-0.5.1-models/output_graph.pbmm --alphabet deepspeech-0.5.1-models/alphabet.txt --audio audio/4507-16021-0012_on_the_way.wav
Loading model from file deepspeech-0.5.1-models/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
2019-10-28 22:40:22.269232: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-10-28 22:40:22.269287: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-10-28 22:40:22.269302: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-10-28 22:40:22.269431: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.00903s.
Running inference.
n the way
As you can see, "o" letter is missing in the output (should be "on the way").
- fed "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav" file to deepspeech executable:
$ deepspeech --model deepspeech-0.5.1-models/output_graph.pbmm --alphabet deepspeech-0.5.1-models/alphabet.txt --audio audio/4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav
Loading model from file deepspeech-0.5.1-models/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
2019-10-28 22:43:09.572111: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-10-28 22:43:09.572172: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-10-28 22:43:09.572196: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-10-28 22:43:09.572361: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.00938s.
Running inference.
on the way
As you can see, this time the recognition is 100% correct. Thus, adding silence at the beginning helped.
I suspect the issue is somewhere in the feature extraction stage (MFCC in particular?), or that the whole system (feature extraction + NN) requires some initial (dummy) set of input samples to start working (filling buffers or something similar).
I attached test_2443.zip with "4507-16021-0012_on_the_way.wav" and "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav" files.
I hope this is enough to reproduce the issue and find the cause.
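(A rough way to see what the extra 0.1 s changes at the feature level is to compare frame counts for the two attached files. The sketch below uses python_speech_features with its default 25 ms / 10 ms windows, which is only an approximation of DeepSpeech's in-graph featurizer and not the actual parameters the model uses.)
# compare_frames.py - rough illustration only, assumes the two attached wav files
from scipy.io import wavfile
from python_speech_features import mfcc

for name in ("4507-16021-0012_on_the_way.wav",
             "4507-16021-0012_on_the_way_with_silence_in_the_beginning.wav"):
    rate, signal = wavfile.read(name)
    feats = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01, numcep=26)
    print(name, feats.shape)  # (number_of_frames, number_of_cepstra)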
Honestly, listening before reading your comment, in the cut version my ears don't get "on" either, just "n". Once I had read your comment, I could only hear "on". I'm unsure how much we are just biased by that sample, but at least that's actionable.
I've just listened to the original (unmodified) 4507-16021-0012.wav and compared it with my modified wav files. To me, "on the way" sounds exactly the same. When I cut the beginning of the sentence, I tried to keep the "on the way" segment untouched. You can compare all 3 wav files in e.g. Audacity for details (waveform or spectrogram side by side).
To me, "on the way" sounds exactly the same.
Ask someone blindly, I'm not sure you will get the same results.
@a-lunev The question here is mostly: is there really something that needs to be addressed at the code level, i.e., adding some magic constant, or could it just be a side effect of the datasets we are using, which may mostly have longer silences than what you are exercising here?
If it's the latter, then the proper solution would not be to work around it in the code, but rather to improve the training dataset, which might even be easier now that we have data augmentation landed.
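(If it does turn out to be a dataset effect, one conceivable augmentation is to randomize the amount of leading silence in training clips. The sketch below is only an illustration under that assumption, not an existing DeepSpeech option.)
# augment_leading_silence.py - hypothetical augmentation sketch
import numpy as np

def randomize_leading_silence(audio, sample_rate=16000, max_trim_ms=50, max_pad_ms=200):
    # audio: int16 numpy array holding one training clip
    trim = np.random.randint(0, int(sample_rate * max_trim_ms / 1000) + 1)
    pad = np.random.randint(0, int(sample_rate * max_pad_ms / 1000) + 1)
    # drop a random slice of the start, then prepend a random amount of silence,
    # so the model sees speech beginning at varying offsets from the first frame
    return np.concatenate([np.zeros(pad, dtype=audio.dtype), audio[trim:]])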
@lissyx Sorry for the late response. Here are the samples:
- trimmed
- silence added to the above trimmed file
https://soundcloud.com/alok-prasad-213091558/sets/deepspeech-test-files
The actual utterance in the speech file is "why should one hold on the way", but when the original wav is fed to the deepspeech native client it gives the output "what should one hold on the way" (maybe an LM issue).
1. Trimmed (trimmed_4507-16021-0012.wav):
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=67714
should an hold on the way
2. With silence appended at the start (silence_added_at_start_and_trimmed_4507-16021-0012.wav):
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
audio_format=1
num_channels=1
sample_rate=16000
bits_per_sample=16
res.buffer_size=69210
what should one hold on the way
I think this has to be addressed at the training level, especially now that we have augmentation in place. Most of the time this ASR will be used together with some sort of VAD (WebRTC or RNNoise); by the time speech is detected, some frames of silence or speech are already lost, and even when we buffer previous frames we have seen issues with ASR recognition (especially when speaking very fast).
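(For reference, one common way to buffer previous frames is a small pre-roll kept while the VAD reports silence and flushed when speech starts. The sketch below is an illustration only; the frame size and the feed_to_asr callback are assumptions, not DeepSpeech API.)
# preroll_sketch.py - hypothetical VAD pre-roll buffer
from collections import deque

PREROLL_FRAMES = 10  # e.g. 10 x 20 ms frames = 200 ms of pre-roll
preroll = deque(maxlen=PREROLL_FRAMES)
in_speech = False

def on_frame(frame, is_speech, feed_to_asr):
    # frame: raw PCM bytes for one VAD frame; is_speech: VAD decision for this frame;
    # feed_to_asr: callback that pushes audio into the streaming decoder
    global in_speech
    if is_speech:
        if not in_speech:
            for old in preroll:   # speech just started: flush the buffered pre-roll first
                feed_to_asr(old)
            preroll.clear()
            in_speech = True
        feed_to_asr(frame)
    else:
        in_speech = False
        preroll.append(frame)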
I suppose some debugging / investigation is required to determine the real cause of the issue. As soon as the cause is determined, the appropriate decision can be made.
Yep, that was my point 😊
My use case is wake word + speech: my system feeds (streaming) audio to DeepSpeech to detect the wake word, and as soon as it is detected, it feeds the audio from the next frame onwards to another instance of DeepSpeech. But if the gap between the wake word and the speech is very small, the initial words are missed.
E.g. "Lucifer, why should one hold on the way" => DeepSpeech will recognize it as "should one hold on the way". What is the suggestion: should I change the feeder code, or, for augmentation, remove the silence or in fact trim some audio (a few frames) and do the training?
@alokprasad I guess in your case it might be better to change your feeding code, yep.
librosa has some silence trimming functionality that could be useful for cleaning up a dataset that has too much silence, if that's what's affecting model performance: https://librosa.github.io/librosa/generated/librosa.effects.trim.html
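(A small sketch of that suggestion, assuming 16 kHz wav files; the 30 dB threshold and the file names are placeholders, not recommended values.)
# trim_silence.py - sketch of trimming leading/trailing silence with librosa
import librosa
import soundfile as sf

y, sr = librosa.load("sample.wav", sr=16000)
y_trimmed, _ = librosa.effects.trim(y, top_db=30)  # keep audio within 30 dB of the peak
sf.write("sample_trimmed.wav", y_trimmed, sr)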
@reuben The amount of silence here is very small. I'm not sure the above issue would be resolved even after removing silence. Probably with augmentation we would have to chop the first few frames of some samples during training, making the ASR more robust.
@reuben If the silence is all zeros, DeepSpeech does not work. E.g. for the utterance "Go back", DeepSpeech gives "back"; if 100 ms of all-zero silence is added, it again gives "back", but if we add random values from 0 to 255 for the 100 ms of silence, the result comes out perfectly as "go back".
Audacity does a similar thing: it adds some sort of dithering, and with that, surprisingly, DeepSpeech works better.
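(A sketch of that observation: prepending 100 ms of low-amplitude random samples ("dither") instead of pure zeros before inference. The 0-255 range mirrors the values mentioned above, and the file names are placeholders.)
# prepend_dither.py - hypothetical sketch of padding with dither instead of zeros
import wave
import numpy as np

with wave.open("go_back.wav", "rb") as w:
    params = w.getparams()
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

dither = np.random.randint(0, 256, size=1600).astype(np.int16)  # 100 ms at 16 kHz
with wave.open("go_back_dither.wav", "wb") as w:
    w.setparams(params)
    w.writeframes(np.concatenate([dither, audio]).tobytes())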
Hello. When I downloaded DeepSpeech and ran it on Windows, it unfortunately converted speech to text badly.
For example, when I say: "hi" -> "i", "you are" -> "you are",
"hello" -> "halow".
How can I increase the accuracy or efficiency of the speech-to-text conversion?
I just want to use the DeepSpeech model as-is, and I do not want to train on any datasets. Is there a way or not?
When I speak a word through a microphone, do I already have to make certain settings in the Windows environment?
Please stop your spam on existing Github issues and use Discourse for support after reading the documentation.
How are you adding silence to the mic stream?