wenet Streaming inference results are much worse than non-streaming inference results

Open HW140701 opened this issue 1 year ago • 8 comments

(1) If I want to use streaming speech recognition during inference, do I have to set use_dynamic_chunk and use_dynamic_left_chunk to True when training the model?

(2) I tried training with use_dynamic_chunk and use_dynamic_left_chunk set to True, but the streaming inference results are much worse than the non-streaming ones. I use decoding_chunk_size=16. What is the usual reason for this?

Aug 26 '22 07:08 HW140701
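For context on what decoding_chunk_size restricts, here is a minimal sketch of a chunk-based attention mask in plain Python. It is an illustration of the general idea, not WeNet's actual code; the function name `chunk_attention_mask` and its parameters are hypothetical. Each frame may attend only up to the end of its own chunk, and optionally only to a limited number of past chunks.

```python
def chunk_attention_mask(num_frames, chunk_size, num_left_chunks=-1):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Hypothetical helper for illustration only (not a WeNet API).
    num_left_chunks < 0 means "use all history".
    """
    mask = []
    for i in range(num_frames):
        chunk_idx = i // chunk_size
        # Right edge: end of the chunk that contains frame i (no further lookahead).
        end = min((chunk_idx + 1) * chunk_size, num_frames)
        # Left edge: full history, or only the last num_left_chunks chunks.
        if num_left_chunks < 0:
            start = 0
        else:
            start = max((chunk_idx - num_left_chunks) * chunk_size, 0)
        mask.append([start <= j < end for j in range(num_frames)])
    return mask


# Non-streaming: one chunk covers the whole utterance, so frame 0 sees everything.
full = chunk_attention_mask(num_frames=8, chunk_size=8)
# Streaming with a small chunk: frame 0 sees only its own chunk.
streaming = chunk_attention_mask(num_frames=8, chunk_size=2)
print(sum(full[0]), sum(streaming[0]))
```

The smaller the chunk, the less future context each frame sees, which is one plausible reason streaming results lag behind full-context results.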

I have the same issue. I set use_dynamic_chunk=True, and when I train the model, the cv_loss is much larger than in non-streaming mode.

Aug 26 '22 08:08 OswaldoBornemann

I have the same issue. I set use_dynamic_chunk=True, and when I train the model, the cv_loss is much larger than in non-streaming mode.

I trained with use_dynamic_chunk and use_dynamic_left_chunk both set to True, and the loss decreases normally. At inference, the non-streaming recognize method gives the correct result, but after setting simulate_streaming to True, the result is much worse than the correct one. I wonder if there is a problem with the function forward_chunk_by_chunk.

Aug 26 '22 08:08 HW140701
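To make the two training flags concrete, the sketch below shows one way dynamic chunk training can sample a chunk size and a left-chunk limit per batch. It follows the scheme as commonly described for dynamic chunk training, but the function name `sample_chunk_config`, the half/half split between full context and small chunks, and the constant 25 are assumptions here and should be checked against the WeNet version in use.

```python
import random


def sample_chunk_config(max_len, use_dynamic_left_chunk=False, rng=random):
    """Sample (chunk_size, num_left_chunks) for one training batch.

    Hypothetical sketch, not WeNet's API. Assumes max_len >= 2.
    num_left_chunks == -1 means "attend to all past chunks".
    """
    chunk_size = rng.randint(1, max_len - 1)
    num_left_chunks = -1
    if chunk_size > max_len // 2:
        # Roughly half of the batches train with full context.
        chunk_size = max_len
    else:
        # The rest use a small chunk in [1, 25] (assumed constant).
        chunk_size = chunk_size % 25 + 1
        if use_dynamic_left_chunk:
            # Also limit how many past chunks are visible.
            max_left_chunks = (max_len - 1) // chunk_size
            num_left_chunks = rng.randrange(max(max_left_chunks, 1))
    return chunk_size, num_left_chunks


rng = random.Random(0)  # seeded only to make the demo repeatable
for _ in range(3):
    print(sample_chunk_config(200, use_dynamic_left_chunk=True, rng=rng))
```

Under this scheme the model sees many chunk sizes during training, so a fixed decoding_chunk_size such as 16 is only one of the conditions it was trained on.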

@robin1001 Could you give us some hints?

Aug 27 '22 07:08 OswaldoBornemann

Sorry, I have no idea.

Aug 27 '22 08:08 robin1001

@HW140701 I wonder whether we should set use_dynamic_left_chunk=True to train a streaming model?

Aug 29 '22 02:08 OswaldoBornemann

@HW140701 I wonder whether we should set use_dynamic_left_chunk=True to train a streaming model?

I have done this, but the result is not better.

Aug 29 '22 03:08 HW140701

@HW140701 Or maybe you should set use_dynamic_left_chunk=False and give it another try.

Aug 29 '22 07:08 OswaldoBornemann

@HW140701 Or maybe you should set use_dynamic_left_chunk=False and give it another try.

I just tried increasing decoding_chunk_size, and the results are better than with decoding_chunk_size=16.

Aug 29 '22 08:08 HW140701
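The trade-off behind a larger decoding_chunk_size can be put in numbers: a bigger chunk gives the encoder more context per step but also means more audio must arrive before each step can run. The arithmetic below assumes a 4x subsampling front end and a 10 ms frame shift; both values are assumptions for illustration and depend on the actual model config, and `chunk_duration_ms` is a hypothetical helper.

```python
def chunk_duration_ms(decoding_chunk_size, subsampling_rate=4, frame_shift_ms=10):
    """Audio covered by one decoding chunk, in milliseconds.

    Hypothetical helper. The defaults (4x subsampling, 10 ms frame shift)
    are assumed values, not read from any particular config.
    """
    return decoding_chunk_size * subsampling_rate * frame_shift_ms


for size in (4, 16, 32):
    print(f"decoding_chunk_size={size}: about {chunk_duration_ms(size)} ms per chunk")
```

Under these assumptions, decoding_chunk_size=16 corresponds to about 640 ms of audio per chunk, and doubling it doubles both the context and the chunk latency.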

Same here. I trained on the pinyin version of the WenetSpeech dataset and got a WER of 7.23 on the test set. But with dynamic chunk training, the best WER I got was 8.3. In the paper, however, dynamic chunk training can even give better results.

Oct 09 '22 06:10 yileld
