Request to re-test sherpa-ncnn

Open csukuangfj opened this issue 1 year ago • 4 comments

The model small-2023-01-09 is not our best-performing model.

Please have a look at our latest streaming zipformer models at https://k2-fsa.github.io/sherpa/ncnn/pretrained_models/zipformer-transucer-models.html

They achieve a reasonable WER even without an LM and are quite fast.

csukuangfj • Apr 24 '23 04:04

Here is the command for testing

./build/bin/sherpa-ncnn \
  ./sherpa-ncnn-streaming-zipformer-en-2023-02-13/tokens.txt \
  ./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.bin \
  ./test-files_en_speech_jfk_11s.wav \
  1 \
  greedy_search

And here is the result

Disable fp16 for Zipformer encoder
Don't Use GPU. has_gpu: 0, config.use_vulkan_compute: 1
RecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80), model_config=ModelConfig(encoder_param="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.param", encoder_bin="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/encoder_jit_trace-pnnx.ncnn.bin", decoder_param="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.param", decoder_bin="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/decoder_jit_trace-pnnx.ncnn.bin", joiner_param="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.param", joiner_bin="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/joiner_jit_trace-pnnx.ncnn.bin", tokens="./sherpa-ncnn-streaming-zipformer-en-2023-02-13/tokens.txt", encoder num_threads=4, decoder num_threads=4, joiner num_threads=4), decoder_config=DecoderConfig(method="greedy_search", num_active_paths=4), endpoint_config=EndpointConfig(rule1=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=2.4, min_utterance_length=0), rule2=EndpointRule(must_contain_nonsilence=True, min_trailing_silence=1.4, min_utterance_length=0), rule3=EndpointRule(must_contain_nonsilence=False, min_trailing_silence=0, min_utterance_length=20)), enable_endpoint=False)
wav filename: ./test-files_en_speech_jfk_11s.wav
wav duration (s): 11
Started!
Done!
Recognition result for ./test-files_en_speech_jfk_11s.wav
text:  AND SAW MY FELLOW AMERICANS ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY
timestamps: 0.8 1.28 1.44 1.68 1.8 1.92 2 2.12 2.2 2.36 2.52 2.8 4 4.2 4.44 5.76 6.08 6.32 6.6 6.84 7.08 7.36 7.64 8.64 8.8 9.04 9.32 9.6 9.8 10 10.16 10.44 10.76
Elapsed seconds: 1.150 s
Real time factor (RTF): 1.150 / 11.000 = 0.105
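
The last line of the log divides the processing time by the audio duration. As a minimal sketch of that arithmetic (the function name `real_time_factor` is made up for illustration and is not part of sherpa-ncnn), the same number can be reproduced like this:

```python
def real_time_factor(elapsed_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration.

    Values below 1.0 mean the audio was decoded faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return elapsed_seconds / audio_seconds


# Numbers taken from the log above: 1.150 s of compute for 11 s of audio.
rtf = real_time_factor(1.150, 11.0)
print(f"RTF: {rtf:.3f}")  # prints "RTF: 0.105"
```

An RTF of about 0.105 means roughly one tenth of a second of compute per second of audio on this machine; the value will differ on other hardware.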

csukuangfj • Apr 24 '23 04:04

Note: the above test was run on macOS, but it can also be run on a Raspberry Pi.

csukuangfj • Apr 24 '23 04:04

I will test the new models soon, thanks for mentioning them 👍

fquirin • Apr 24 '23 08:04

Did a quick test-run, results are definitely much better! 😎👍

Some examples:

Old: PLAY HARD WIFE HERSELF DESTRUCTS BY THE TALLICA
New: PLAY HARD WIRE TO SELF DISTRACTS BY METELICA (pretty close)

Old: WHOM HE WAY WHO THE TRAIN
New: SHOW ME THE WAY FROM NEW YORK TO CHICAGO WITH THE TRAIN (nailed it)

Old: SAID WHEN HE WILL DECREASE
New: SAID THE TWO TO TWENTY ONE DEGREES
Original: "Set the heater to 21 degrees" 😑
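
To put a rough number on the third example, here is a small word error rate sketch using a plain word-level edit distance. It is only an illustration, not the scoring used in the actual benchmark. It assumes the intended sentence is normalized to upper case with "21" spelled out as "TWENTY ONE", and the function name `word_error_rate` is made up for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.upper().split()
    hyp = hypothesis.upper().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Assumed normalization of: "Set the heater to 21 degrees"
reference = "SET THE HEATER TO TWENTY ONE DEGREES"

old_wer = word_error_rate(reference, "SAID WHEN HE WILL DECREASE")
new_wer = word_error_rate(reference, "SAID THE TWO TO TWENTY ONE DEGREES")
print(f"old model WER: {old_wer:.2f}")  # prints "old model WER: 1.00"
print(f"new model WER: {new_wer:.2f}")  # prints "new model WER: 0.29"
```

Under these assumptions the new model gets 2 of 7 words wrong on this sentence, while the old output shares no words with the reference at all.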

Do you have instructions on how to include language models, or maybe a way to add or emphasize custom vocabulary (dynamic graph etc.)?

fquirin • Apr 25 '23 19:04