
Adding a language model makes the error rate increase sharply

Open kaiAksenov opened this issue 3 years ago • 4 comments

I used about 30 hours of data to train a model, and without a language model its error rate was normal, about 30% WER. When I added the language model and decoded, the error rate jumped to 127%. I expected the language model to reduce the error rate, not increase it this much.

The error rate stays as high as roughly 120% whether I build a 3-gram LM from the training transcripts or train a larger language model with additional data. What could be the possible cause of this? Thank you.

kaiAksenov avatar Jan 03 '22 16:01 kaiAksenov

That's weird. Please check:

  1. words.txt: for decoding with the LM, you should use the words.txt generated by the LM tools, not the words.txt used for training. A rough consistency check is sketched after this list.
  2. Check your lexicon.
  3. Check your LM.
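
If you want to rule out a mismatch quickly, here is a rough consistency check between the lexicon and the decoding words.txt (the file paths below are placeholders for your setup, not fixed WeNet paths):

```python
# Sanity check: every word in the lexicon should appear in the words.txt
# that the LM tools generated. Paths below are placeholders.
def load_symbols(path):
    """Read a Kaldi-style symbol table: one '<word> <id>' pair per line."""
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

words = load_symbols("data/lang_test/words.txt")        # from the LM tools
lexicon_words = set()
with open("data/local/dict/lexicon.txt", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            lexicon_words.add(line.split()[0])          # first field is the word

missing = lexicon_words - words
print(f"{len(missing)} lexicon words missing from words.txt")
for w in sorted(missing)[:20]:                          # show a few examples
    print(w)
```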

robin1001 avatar Jan 04 '22 01:01 robin1001

> That's weird. Please check:
>
>   1. words.txt: for decoding with the LM, you should use the words.txt generated by the LM tools, not the words.txt used for training.
>   2. Check your lexicon.
>   3. Check your LM.

I used the words.txt generated by the LM tools; it contains about 50,000 words. The language model was also trained on the training transcripts. My dataset is from OpenSLR, so the LM and lexicon should be normal. Have you tested this on a small amount of data?
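
For what it's worth, a quick way to check vocabulary coverage against that words.txt is a script like this (the paths are placeholders for my setup):

```python
# Rough OOV check: how many words in the test transcripts are missing from
# words.txt? A high OOV rate alone cannot push WER above 100%, but it is
# cheap to rule out. Paths are placeholders.
def load_vocab(path):
    with open(path, encoding="utf-8") as f:
        return {line.split()[0] for line in f if line.strip()}

vocab = load_vocab("data/lang_test/words.txt")
total = oov = 0
with open("data/test/text", encoding="utf-8") as f:     # Kaldi-style "utt w1 w2 ..."
    for line in f:
        for word in line.split()[1:]:                   # skip the utterance id
            total += 1
            oov += word not in vocab
print(f"OOV rate: {oov}/{total} = {oov / total:.2%}")
```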

kaiAksenov avatar Jan 04 '22 06:01 kaiAksenov

We have only tested it on a 200+ hour dataset. However, there must be something wrong in your pipeline: a WER above 100% means the hypotheses contain more errors than the references have words, which usually points to massive insertions.
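
To see why, here is a minimal edit-distance WER (just an illustration, not WeNet's own scoring tool):

```python
# WER = (substitutions + deletions + insertions) / reference length.
# Insertions are counted, so a hypothesis with many spurious words can
# score above 100%.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# 4 reference words, 5 inserted words -> WER = 5/4 = 125%
print(wer("the cat sat down", "uh the big cat just sat right down now"))
```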

robin1001 avatar Jan 04 '22 11:01 robin1001

You can refer to this: https://github.com/wenet-e2e/wenet/issues/1673

xingchensong avatar Feb 21 '23 05:02 xingchensong

and this: https://github.com/wenet-e2e/wenet/issues/1545

xingchensong avatar Feb 21 '23 05:02 xingchensong