sherpa-onnx [Need help] How to realize Syllable-level Voice Recognition with sherpa-onnx Open Vocabulary Keyword Spotting

Open diyism opened this issue 8 months ago • 14 comments

I have long been trying to use sherpa to implement syllable-level speech recognition, either (1) by using a few pinyin syllables to detect a hotword directly, or (2) by sending a long pinyin sequence to an LLM (GPT or Claude) to convert it into the most plausible Chinese sentence (https://github.com/k2-fsa/sherpa-ncnn/issues/177).

I saw that you released sherpa-onnx Open Vocabulary Keyword Spotting in 2024-02 (https://k2-fsa.github.io/sherpa/onnx/kws/pretrained_models/index.html).

So I thought I could use it for syllable-level speech recognition. I changed sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/keywords.txt to:

j iǎng @jiang3
y ǒu @you3
b ó @bo2

h uí @hui2
d á  @da2
q ǐng @qing3
g ài @gai4
g ē  @ge1
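Writing every pinyin syllable by hand would be tedious, so a small script could generate such lines. The following is only a sketch: it assumes each syllable is tokenized as "initial, space, final with tone mark", as in the hand-written entries above, and that `v` stands for `ü` in the numbered spelling. The helper names `mark_final` and `keyword_line` are hypothetical, and the output should be checked against the model's own token list before use.

```python
# Sketch: turn numbered pinyin ("jiang3") into a keywords.txt-style line
# ("j iǎng @jiang3"). The token layout is an assumption copied from the
# hand-written entries in this post, not taken from the model's documentation.

# Two-letter initials come first so "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def mark_final(final: str, tone: int) -> str:
    """Put the tone mark on the final using the usual pinyin placement rule."""
    if tone not in (1, 2, 3, 4):
        return final  # neutral tone: leave unmarked
    if "a" in final:
        idx = final.index("a")
    elif "e" in final:
        idx = final.index("e")
    elif "ou" in final:
        idx = final.index("o")
    else:
        vowels = [i for i, ch in enumerate(final) if ch in TONE_MARKS]
        if not vowels:
            raise ValueError(f"no vowel to mark in final {final!r}")
        idx = vowels[-1]  # otherwise the last vowel carries the mark
    marked = TONE_MARKS[final[idx]][tone - 1]
    return final[:idx] + marked + final[idx + 1:]


def keyword_line(syllable: str) -> str:
    """Build one line of the form '<initial> <final> @<label>'."""
    base, tone = syllable[:-1], int(syllable[-1])
    base = base.replace("v", "ü")
    initial = next((i for i in INITIALS if base.startswith(i)), "")
    final = mark_final(base[len(initial):], tone)
    tokens = f"{initial} {final}" if initial else final
    return f"{tokens} @{syllable}"


for s in ["jiang3", "you3", "bo2", "hui2"]:
    print(keyword_line(s))
```

Syllables without an initial (for example `er2`) come out as a single token here, which is also an assumption about how the model expects them.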

I tested it with AHPUymhd's code (https://github.com/k2-fsa/sherpa-onnx/issues/760), setting sound_files = ["./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav"]. The audio says "jiang3 you3 bo2 bei4 pai1 dao4 ...", but the output was: jiang3/bo2/bo2

I understand that "bo2 bo2" (伯伯, uncle) is a more frequently used word than "you3 bo2", but syllable-level recognition needs to be future-proof and able to transcribe any new word into pinyin. So I carefully split ./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav so that each WAV file contains only one syllable:

$ sox 4.wav jiang3.wav trim 0.4 0.33
$ sox 4.wav you3.wav trim 0.77 0.2
$ sox 4.wav bo2.wav trim 1.05 0.25
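For anyone without sox, roughly the same cut can be made with Python's standard `wave` module. This is a minimal sketch, assuming an uncompressed PCM WAV file; `trim_wav` is a hypothetical helper name, and its arguments mirror sox's `trim <start> <duration>`.

```python
import wave


def trim_wav(src: str, dst: str, start_s: float, duration_s: float) -> None:
    """Copy duration_s seconds of audio starting at start_s from src to dst."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        reader.setpos(round(start_s * rate))            # seek to the start frame
        frames = reader.readframes(round(duration_s * rate))
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)   # same channels, sample width and rate
        writer.writeframes(frames) # the frame count in the header is fixed on close


# Equivalent of: sox 4.wav jiang3.wav trim 0.4 0.33
# trim_wav("4.wav", "jiang3.wav", 0.4, 0.33)
```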

Now, if I run the Python code with sound_files = ["./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/jiang3.wav"], it correctly outputs "jiang3"; with "you3.wav" it outputs "you3"; and with "bo2.wav" it outputs "bo2". This works perfectly, even though keywords.txt also contains other, interfering pinyin entries (h uí, d á, ...). I am dreaming of adding all 1300 pinyin syllables to it.

So I guess sherpa-onnx Open Vocabulary Keyword Spotting is fully capable of recognizing isolated Chinese syllables, but a method is needed to segment the audio into individual syllables first. Maybe something like silero-vad could do it.
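To make the segmentation idea concrete, here is a rough sketch of a plain energy gate. It is not silero-vad and not anything from sherpa-onnx; it simply assumes syllables are separated by short low-energy gaps, which may not hold in fluent speech. The function name `segment_by_energy` and all thresholds are hypothetical and would need tuning.

```python
import math


def segment_by_energy(samples, sample_rate, frame_ms=10,
                      threshold=0.02, min_gap_frames=3):
    """Return (start_s, end_s) spans whose frame RMS stays above threshold.

    samples: sequence of floats in [-1, 1].
    A span is closed once min_gap_frames consecutive quiet frames are seen.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    n_frames = len(samples) // frame_len
    segments = []
    start = None  # frame index where the current span began
    quiet = 0     # consecutive quiet frames seen inside the current span

    def close(end_frame):
        segments.append((start * frame_len / sample_rate,
                         end_frame * frame_len / sample_rate))

    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold:
            if start is None:
                start = f
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap_frames:
                close(f - quiet + 1)  # end just after the last loud frame
                start, quiet = None, 0

    if start is not None:             # audio ended while still inside a span
        close(n_frames - quiet)
    return segments
```

Each returned span gives a start time and an end time; the start and the difference between the two could be passed to `sox ... trim <start> <duration>` to cut one file per candidate syllable.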

Any ideas?

@danpovey @csukuangfj @pkufool @marcoyang1998

May 25 '24 20:05 diyism
