sherpa-onnx
[Need help] How to realize Syllable-level Voice Recognition with sherpa-onnx Open Vocabulary Keyword Spotting
I have long been trying to use sherpa for syllable-level speech recognition, with two goals: (1) detect hotwords directly from a few pinyin syllables; or (2) send a long pinyin sequence to an LLM (GPT or Claude) to convert it into the most plausible Chinese sentence. See https://github.com/k2-fsa/sherpa-ncnn/issues/177.
I saw that you released sherpa-onnx Open Vocabulary Keyword Spotting in 2024-02 (https://k2-fsa.github.io/sherpa/onnx/kws/pretrained_models/index.html).
So I thought I could use it for syllable-level speech recognition. I changed sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/keywords.txt to:
j iǎng @jiang3
y ǒu @you3
b ó @bo2
h uí @hui2
d á @da2
q ǐng @qing3
g ài @gai4
g ē @ge1
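Lines like the ones above could be generated from numbered pinyin rather than typed by hand. The following is a minimal sketch, not part of sherpa-onnx: the helper names (`split_syllable`, `add_tone_mark`, `keyword_line`) are hypothetical, and it assumes the model's tokens are an initial plus a tone-marked final, as in the entries above. Each generated token should still be checked against the model's tokens.txt, since syllables involving ü or a bare final may be tokenized differently.

```python
# Hypothetical helpers: build keywords.txt lines from numbered pinyin.
# Assumption: tokens are "<initial> <tone-marked final>", as in the entries above.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def add_tone_mark(final, tone):
    """Place the tone mark using the usual pinyin rule: a, then e, then the o of ou, else the last vowel."""
    if tone not in (1, 2, 3, 4):
        return final  # neutral tone: no mark
    if "a" in final:
        idx = final.index("a")
    elif "e" in final:
        idx = final.index("e")
    elif "ou" in final:
        idx = final.index("o")
    else:
        vowels = [i for i, ch in enumerate(final) if ch in TONE_MARKS]
        if not vowels:
            return final  # nothing to mark
        idx = vowels[-1]
    return final[:idx] + TONE_MARKS[final[idx]][tone - 1] + final[idx + 1:]


def split_syllable(numbered):
    """Split e.g. 'jiang3' into ('j', 'iǎng')."""
    base, tone = numbered[:-1], int(numbered[-1])
    base = base.replace("v", "ü")  # common ASCII spelling of ü
    initial = next((i for i in INITIALS if base.startswith(i)), "")
    return initial, add_tone_mark(base[len(initial):], tone)


def keyword_line(numbered):
    """Format one keywords.txt line: tokens, then '@' and the label to print on detection."""
    initial, final = split_syllable(numbered)
    tokens = " ".join(t for t in (initial, final) if t)
    return f"{tokens} @{numbered}"


for syllable in ["jiang3", "you3", "bo2", "hui2"]:
    print(keyword_line(syllable))
```

For the eight syllables used in this post, the helper reproduces the lines listed above.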
I tested it with AHPUymhd's code (https://github.com/k2-fsa/sherpa-onnx/issues/760),
setting sound_files = ["./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav"]
(the audio says: jiang3 you3 bo2 bei4 pai1 dao4 ...)
The output was:
jiang3/bo2/bo2
I understand that "bo2 bo2" (伯伯, uncle) is a more frequently used word than "you3 bo2", but syllable-level recognition needs to be future-proof: it must be able to transcribe any new word into pinyin. So I very carefully split ./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav so that each WAV file contains only one syllable:
$ sox 4.wav jiang3.wav trim 0.4 0.33
$ sox 4.wav you3.wav trim 0.77 0.2
$ sox 4.wav bo2.wav trim 1.05 0.25
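The same cuts can be made from Python with the standard-library `wave` module, which may be handy when the start times come from a script. This is a minimal sketch for uncompressed PCM WAV files; `trim_wav` is a hypothetical helper name, and its two numeric arguments mirror sox's `trim <start> <duration>`.

```python
import wave


def trim_wav(src_path, dst_path, start_s, duration_s):
    """Copy duration_s seconds starting at start_s from src_path into dst_path."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        # Clamp the start so that seeking past the end yields an empty clip, not an error.
        src.setpos(min(round(start_s * rate), src.getnframes()))
        frames = src.readframes(round(duration_s * rate))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width and sample rate
        dst.writeframes(frames)  # the frame count in the header is fixed up on close


# Usage, equivalent to: sox 4.wav jiang3.wav trim 0.4 0.33
# trim_wav("4.wav", "jiang3.wav", 0.4, 0.33)
```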
Now, if I run the Python code with sound_files = ["./sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/jiang3.wav"], it correctly outputs "jiang3"; with "you3.wav" it outputs "you3", and with "bo2.wav" it outputs "bo2". This works even with other, interfering pinyin entries (h uí, d á, ...) in the keywords.txt file. (I am dreaming of adding all 1300 pinyin syllables to it.)
So, judging from these three clips, my guess is that sherpa-onnx Open Vocabulary Keyword Spotting can recognize isolated Chinese syllables reliably, but a method is needed to segment the audio into syllables first. Maybe something like silero-vad could do it.
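To make the segmentation idea concrete, here is a minimal energy-based sketch in pure Python. It is a crude stand-in for a real VAD, not silero-vad and not a sherpa-onnx API; `segment_by_energy` and its threshold values are assumptions chosen for illustration. Note that syllables in fluent speech often run together with no silent gap, so a splitter like this (or any VAD) may well return "you3 bo2" as a single segment.

```python
import math


def segment_by_energy(samples, sample_rate, frame_ms=10, threshold=0.02, min_gap_ms=30):
    """Return (start_s, end_s) spans whose per-frame RMS stays above threshold.

    samples: sequence of floats in [-1, 1]. A span is closed only after at
    least min_gap_ms of consecutive quiet frames.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_gap = max(1, min_gap_ms // frame_ms)
    n_frames = len(samples) // frame_len
    segments = []
    start = None  # index of the first loud frame of the open span
    quiet = 0     # consecutive quiet frames seen inside the open span
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_gap:
                end = i - quiet + 1  # first quiet frame, used as exclusive end
                segments.append((start * frame_len / sample_rate,
                                 end * frame_len / sample_rate))
                start, quiet = None, 0
    if start is not None:  # audio ended while a span was still open
        end = n_frames - quiet
        segments.append((start * frame_len / sample_rate,
                         end * frame_len / sample_rate))
    return segments
```

Each returned span could then be cut out (for example with sox trim) and passed to the keyword spotter on its own, as in the single-syllable test above.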
Any ideas?
@danpovey @csukuangfj @pkufool @marcoyang1998