Streaming inference results are much worse than non-streaming inference results

Open HW140701 opened this issue 1 year ago • 8 comments

(1) If I want to use streaming speech recognition at inference time, do I have to set use_dynamic_chunk and use_dynamic_left_chunk to True when training the model?

(2) I tried training with use_dynamic_chunk and use_dynamic_left_chunk set to True, but the streaming inference results are much worse than the non-streaming ones. I use decoding_chunk_size=16. What is generally the reason for this?

HW140701, Aug 26 '22 07:08

I have the same issue. I set use_dynamic_chunk=True, and when I train the model, the cv_loss is much larger than in non-streaming mode.

OswaldoBornemann, Aug 26 '22 08:08

I trained with use_dynamic_chunk and use_dynamic_left_chunk both set to True, and the loss decreases normally. At inference, the non-streaming recognize method produces the correct result, but after setting simulate_streaming to True, the result is much worse than the correct one. I wonder if there is a problem with the forward_chunk_by_chunk function.

HW140701, Aug 26 '22 08:08

@robin1001 Could you give us some hints?

OswaldoBornemann, Aug 27 '22 07:08

Sorry, I have no idea.

robin1001, Aug 27 '22 08:08

@HW140701 I wonder whether we should set use_dynamic_left_chunk=True to train a streaming model?

OswaldoBornemann, Aug 29 '22 02:08

I have tried this, but the results did not improve.

HW140701, Aug 29 '22 03:08

@HW140701 Or maybe we should set use_dynamic_left_chunk=False and give it another try.

OswaldoBornemann, Aug 29 '22 07:08

I just tried increasing decoding_chunk_size, and the results are better than with the earlier decoding_chunk_size=16.

HW140701, Aug 29 '22 08:08

Same here. I trained a pinyin version of the wenetspeech dataset and got a WER of 7.23 on the test set. With dynamic chunk training, the best WER I got was 8.3. But according to the paper, dynamic chunk training can even give better results.

yileld, Oct 09 '22 06:10
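
For anyone following this thread, here is a rough sketch of why decoding_chunk_size can matter so much. It assumes the usual chunk-based attention scheme, where each frame can attend only to frames in its own chunk and in earlier chunks. The helper names chunk_attention_mask and visible_fraction are hypothetical and written for illustration only. They are not WeNet's API, and this does not reproduce forward_chunk_by_chunk.

```python
def chunk_attention_mask(num_frames, chunk_size, num_left_chunks=-1):
    """Return mask[i][j] == True if frame i may attend to frame j.

    Hypothetical helper for illustration, not part of WeNet.
    num_left_chunks < 0 means unlimited left context.
    """
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # A frame sees up to the end of its own chunk (no further lookahead).
        end = min((chunk_idx + 1) * chunk_size, num_frames)
        if num_left_chunks < 0:
            start = 0
        else:
            start = max((chunk_idx - num_left_chunks) * chunk_size, 0)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


def visible_fraction(mask):
    """Share of (query, key) pairs that are visible, relative to full attention."""
    total = len(mask) * len(mask[0])
    return sum(sum(row) for row in mask) / total


if __name__ == "__main__":
    # Compare how much context is visible for different chunk sizes.
    for chunk_size in (4, 16, 64):
        m = chunk_attention_mask(64, chunk_size)
        print(chunk_size, round(visible_fraction(m), 3))
```

Under these assumptions, a smaller chunk size gives each frame less lookahead, and a limited num_left_chunks also trims the history. That would be consistent with a larger decoding_chunk_size helping, but it does not by itself explain a large gap between streaming and non-streaming results.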