wenet
Streaming inference results are much worse than non-streaming inference results
(1) If I want to use streaming speech recognition at inference time, do I have to set use_dynamic_chunk and use_dynamic_left_chunk to True when training the model?
(2) I have tried setting use_dynamic_chunk and use_dynamic_left_chunk to True for training, but the streaming inference results are much worse than the non-streaming ones. I use decoding_chunk_size=16. What is the usual cause of this?
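For reference, the two flags in question live under encoder_conf in the training YAML. A minimal sketch (other fields omitted; check them against your own train.yaml):

```yaml
# encoder section of train.yaml (sketch, not a complete config)
encoder: conformer
encoder_conf:
    use_dynamic_chunk: true        # randomly sample the chunk size during training
    use_dynamic_left_chunk: true   # also randomize the number of left (history) chunks
```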
I have the same issue. I set use_dynamic_chunk=True, and when I train the model, the cv_loss is much larger than in non-streaming mode.
I trained with use_dynamic_chunk and use_dynamic_left_chunk set to True and the loss decreases normally. At inference, the non-streaming recognize method gives the correct result, but after setting simulate_streaming to True the result is much worse than the correct one. I wonder if there is a problem with the function forward_chunk_by_chunk.
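If it helps with debugging: the core of chunk-based streaming is the attention mask that limits each frame's context. Below is a self-contained pure-Python sketch of the logic in wenet's subsequent_chunk_mask (wenet/utils/mask.py); the real implementation operates on torch tensors, so treat this only as an illustration of what forward_chunk_by_chunk is constrained by.

```python
# Sketch of WeNet-style chunk attention masking (pure Python for illustration).
# mask[i][j] is True if encoder frame i may attend to frame j.

def subsequent_chunk_mask(size, chunk_size, num_left_chunks=-1):
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        # A frame can see up to the end of its own chunk...
        ending = min(size, (i // chunk_size + 1) * chunk_size)
        # ...and back through num_left_chunks history chunks (-1 = unlimited).
        if num_left_chunks < 0:
            start = 0
        else:
            start = max(0, (i // chunk_size - num_left_chunks) * chunk_size)
        for j in range(start, ending):
            mask[i][j] = True
    return mask

# With chunk_size=4, frames 0-3 see only frames 0-3; frames 4-7 see 0-7.
m = subsequent_chunk_mask(8, 4)
print([sum(row) for row in m])  # [4, 4, 4, 4, 8, 8, 8, 8]
```

If streaming results are much worse than non-streaming, one sanity check is that the mask used at decoding matches what the model saw during dynamic-chunk training.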
@robin1001 Could you give us some hints?
Sorry, I have no idea.
@HW140701 I wonder whether we should set use_dynamic_left_chunk=True to train a streaming model?
I have done this, but the result is not better.
@HW140701 Or maybe you should set use_dynamic_left_chunk=False and give it another try.
I just tried increasing decoding_chunk_size, which gives better results than my earlier decoding_chunk_size=16.
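That matches the usual accuracy/latency trade-off: a larger chunk gives the encoder more right context per chunk but delays output. A quick back-of-the-envelope conversion from chunk size to audio duration, assuming the common 10 ms frame shift and 4x convolutional subsampling (adjust the constants for your frontend):

```python
# Rough audio duration covered by one decoding chunk.
# Assumptions (not taken from this thread): 10 ms frame shift, 4x subsampling.
FRAME_SHIFT_MS = 10
SUBSAMPLING = 4

def chunk_context_ms(decoding_chunk_size):
    """Milliseconds of audio consumed per decoding chunk."""
    return decoding_chunk_size * SUBSAMPLING * FRAME_SHIFT_MS

for c in (8, 16, 32, 64):
    print(c, chunk_context_ms(c))  # 16 -> 640 ms per chunk
```

So going from chunk 16 to 32 roughly doubles the per-chunk context (640 ms to 1280 ms), which plausibly explains the accuracy improvement at the cost of latency.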
Same here. I trained a pinyin version on the WenetSpeech dataset and got WER 7.23 on the test set, but when using dynamic chunk in training I got WER 8.3 at best. Yet in the paper, dynamic chunk training even gives better results.