spoken-command-recognition: Can the synthesized command dataset work, or not?

Has this project finished successfully? Is there any conclusion about using a synthesized dataset to train a model? I am thinking about doing a similar experiment and hope somebody can give me some suggestions. Thanks!

Aug 26 '19 09:08 awoniu

Some projects have started using this data set for preliminary work, and you are more than welcome to do so as well (it is on Kaggle too). I myself do not have the expertise to develop elaborate RNNs etc., and am now focusing on other projects.

Aug 26 '19 09:08 JohannesBuchner

Some projects have started using this data set for preliminary work, and you are more than welcome to do so as well (it is on Kaggle too). I myself do not have the expertise to develop elaborate RNNs etc., and am now focusing on other projects.

OK. I have tried using a synthesized dataset (made with an open-source toolkit, SoundTouch: http://www.surina.net/soundtouch/) to train an RNN (GRU+DNN) model. Here are some preliminary results. I started from two recordings of a command word (one male speaker, one female), changed the pitch, speed, and tempo, and added noise at different SNR levels, which finally gave me 3,000 command-word audio samples. After training, the model (GRU+DNN) seems to recognize the synthesized command words easily, but it does not do well on real-world audio.
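For illustration, the noise-at-a-given-SNR step described above could look roughly like the sketch below. It is not the code used in the experiment: the function names, the white Gaussian noise model, and the sine tone standing in for a recorded command word are all assumptions, and the pitch, speed, and tempo changes done with SoundTouch are not shown.

```python
import math
import random

def power(samples):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in samples) / len(samples)

def add_noise_at_snr(signal, snr_db, seed=None):
    """Return the signal plus white Gaussian noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # SNR(dB) = 10*log10(P_signal / P_noise)  =>  P_noise = P_signal / 10**(snr_db / 10)
    target_noise_power = power(signal) / (10.0 ** (snr_db / 10.0))
    # Rescale the drawn noise so that its measured power hits the target
    scale = math.sqrt(target_noise_power / power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]

def measured_snr_db(clean, noisy):
    """SNR in dB recovered by subtracting the clean signal from the noisy one."""
    noise = [y - x for x, y in zip(clean, noisy)]
    return 10.0 * math.log10(power(clean) / power(noise))

# Stand-in for one recorded command word: 1 s of a 440 Hz tone at 16 kHz (assumption)
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 440.0 * i / sample_rate) for i in range(sample_rate)]

# One noisy copy per SNR level; the levels themselves are arbitrary examples
noisy_by_snr = {snr: add_noise_at_snr(clean, snr, seed=0) for snr in (20, 10, 0)}
```

Each entry of `noisy_by_snr` is one augmented copy of the same utterance, which is also why a set built this way still contains only the original speakers.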

Aug 26 '19 09:08 awoniu

That is not overly surprising. Probably you want to use these synthetic data sets to extend real datasets. You can also try to increase the number of speakers, pronunciations and emphasis, as this project does.

Aug 26 '19 10:08 JohannesBuchner

@JohannesBuchner,

Dear sir, is there any progress on this project? I understand that you are busy with X-ray scanning of extraterrestrial planets, but this GitHub project is also very important for scanning the voices of living creatures on this planet.

I found that k2-fsa/sherpa-ncnn (with the model "sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23") is very good at recognizing two-syllable Mandarin words, but there is currently no single-syllable recognition model for the 1300 Mandarin syllables (pinyin).

I think your project is promising and very useful in the LLM era. I very much agree with your opinion in this project:

I do not need to have my computer "translate" sounds into text, or "understand" a meaning.
I just want to tell my computer a command and it does something. So I only need: soundwaves -> label

So I think that in the LLM era, ASR engines should focus more on recognizing syllables, and the analysis of vocabulary and sentences should be left to large language models (ChatGPT, Claude, etc.).

ref: https://github.com/k2-fsa/sherpa-ncnn/issues/177

Sep 16 '23 02:09 diyism

I think you can also extract the recordings of simple words from here: https://commonvoice.mozilla.org/en/datasets and take an architecture like https://github.com/mozilla/DeepSpeech and build a classifier of audio -> label={1,2,3,4,5,other}
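A rough sketch of the first step, picking single-word recordings out of a Common Voice metadata file and mapping them to the six labels, might look like this. The column names `path` and `sentence`, the tab-separated layout, and the choice of the spoken words "one" to "five" as the commands are assumptions made for illustration; check them against the release you download. The classifier itself is not shown.

```python
import csv
import io

# Hypothetical command vocabulary: the labels {1,2,3,4,5,other} are from the
# suggestion above, the spoken words they map from are an assumption.
COMMANDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def label_for(sentence):
    """Map a transcript to a command label, or to 'other' if it is not a single command word."""
    word = sentence.strip().lower().strip(".,!?\"'")
    return COMMANDS.get(word, "other")

def labelled_clips(tsv_file):
    """Read (clip path, label) pairs from a tab-separated metadata file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["path"], label_for(row["sentence"])) for row in reader]

# Tiny in-memory stand-in for a metadata file (made-up rows, not real data)
example_tsv = io.StringIO(
    "path\tsentence\n"
    "clip_0001.mp3\tThree\n"
    "clip_0002.mp3\tThe weather is nice today.\n"
    "clip_0003.mp3\tfive.\n"
)
pairs = labelled_clips(example_tsv)
```

In practice the "other" class would probably need subsampling, since most recordings are full sentences rather than single words.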

Sep 16 '23 21:09 JohannesBuchner
