
Adding a language model makes the error rate increase sharply

Open kaiAksenov opened this issue 3 years ago • 4 comments

I used about 30 hours of data to train a model, and without a language model its error rate was normal, about 30% WER. When I added the language model and decoded, the error rate jumped to 127%. I expected the language model to reduce the error rate, not increase it this much.

The error rate stays as high as roughly 120% whether I build a 3-gram LM from the training transcripts or train a larger language model with additional data. What could be the possible cause of this? Thank you.

kaiAksenov avatar Jan 03 '22 16:01 kaiAksenov

That's weird. Please check:

  1. words.txt: for decoding with the LM, you should use the words.txt generated by the LM tools, not the words.txt used for training. A rough consistency check is sketched after this list.
  2. Check your lexicon.
  3. Check your LM.
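
If you want to rule out a mismatch quickly, here is a rough consistency check between the lexicon and the decoding words.txt (the file paths below are placeholders for your setup, not fixed WeNet paths):

```python
# Sanity check: every word in the lexicon should appear in the words.txt
# that the LM tools generated. Paths below are placeholders.
def load_symbols(path):
    """Read a Kaldi-style symbol table: one '<word> <id>' pair per line."""
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

words = load_symbols("data/lang_test/words.txt")        # from the LM tools
lexicon_words = set()
with open("data/local/dict/lexicon.txt", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            lexicon_words.add(line.split()[0])          # first field is the word

missing = lexicon_words - words
print(f"{len(missing)} lexicon words missing from words.txt")
for w in sorted(missing)[:20]:                          # show a few examples
    print(w)
```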

robin1001 avatar Jan 04 '22 01:01 robin1001

> That's weird. Please check:
>
>   1. words.txt: for decoding with the LM, you should use the words.txt generated by the LM tools, not the words.txt used for training.
>   2. Check your lexicon.
>   3. Check your LM.

I used the words.txt generated by the LM tools; it contains about 50,000 words. The language model was also trained on the training transcripts. My dataset is from OpenSLR, so the LM and lexicon should be normal. Have you tested this on a small amount of data?
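
For what it's worth, a quick way to check vocabulary coverage against that words.txt is a script like this (the paths are placeholders for my setup):

```python
# Rough OOV check: how many words in the test transcripts are missing from
# words.txt? A high OOV rate alone cannot push WER above 100%, but it is
# cheap to rule out. Paths are placeholders.
def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

vocab = load_vocab("data/lang_test/words.txt")
total = oov = 0
with open("data/test/text", encoding="utf-8") as f:     # Kaldi-style "utt w1 w2 ..."
    for line in f:
        for word in line.split()[1:]:                   # skip the utterance id
            total += 1
            oov += word not in vocab
print(f"OOV rate: {oov}/{total} = {oov / total:.2%}")
```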

kaiAksenov avatar Jan 04 '22 06:01 kaiAksenov

We have only tested it on a 200+ hour dataset. However, there must be something wrong in your pipeline: a WER above 100% means the hypotheses contain more errors than the references have words, which usually points to massive insertions.
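
To see why, here is a minimal edit-distance WER (just an illustration, not WeNet's own scoring tool):

```python
# WER = (substitutions + deletions + insertions) / reference length.
# Insertions are counted, so a hypothesis with many spurious words can
# score above 100%.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# 4 reference words, 5 inserted words -> WER = 5/4 = 125%
print(wer("the cat sat down", "uh the big cat just sat right down now"))
```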

robin1001 avatar Jan 04 '22 11:01 robin1001

You can refer to this: https://github.com/wenet-e2e/wenet/issues/1673

xingchensong avatar Feb 21 '23 05:02 xingchensong

and this: https://github.com/wenet-e2e/wenet/issues/1545

xingchensong avatar Feb 21 '23 05:02 xingchensong