Error rate increases sharply after adding a language model
I trained a model on about 30 hours of data. Without a language model, its error rate was normal, about 30% (WER). When I added the language model and decoded, the error rate rose to 127%. I expected the language model to reduce the error rate, not increase it, and certainly not by this much.
The error rate stays above 120% whether I generate a 3-gram model from the training text or train a larger language model on additional data. What could be the cause of this? Thank you.
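For anyone wondering how a WER above 100% is even possible: WER is (substitutions + deletions + insertions) divided by the reference length, so a decoder that emits long runs of spurious words (a typical symptom of a broken decoding graph or a mismatched symbol table) can score far above 100%. A minimal sketch of the standard computation, with a made-up example:

```python
# Minimal WER computation via Levenshtein alignment.
# WER = (substitutions + deletions + insertions) / len(reference),
# so insertions alone can push it above 100%.

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# A hypothesis with many inserted words yields WER > 1.0 (i.e. > 100%):
print(wer("good morning", "good bad ugly morning noise noise"))  # 2.0
```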
That's weird. Please check:
- words.txt: for decoding with an LM, you should use the words.txt generated by the LM tools, not the words.txt used for training (see the check sketch below).
- your lexicon
- your LM
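If it helps, here is a rough sanity check for the first two bullets, assuming Kaldi-style files (words.txt lines of the form `word id`, lexicon lines of the form `word phone1 phone2 ...`); the paths below are placeholders:

```python
# Rough consistency check between a Kaldi-style words.txt symbol table
# and a lexicon. The paths and exact file formats are assumptions:
# words.txt lines look like "word id", lexicon lines like "word p1 p2 ...".

def load_symbol_table(path):
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue
            word, idx = parts
            if word in table:
                print(f"duplicate word in symbol table: {word}")
            table[word] = int(idx)
    return table

def check(words_txt, lexicon_txt):
    table = load_symbol_table(words_txt)
    # Ids must be unique, otherwise decoded word sequences come out garbled.
    if len(set(table.values())) != len(table):
        print("non-unique ids in words.txt")
    missing = []
    with open(lexicon_txt, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] not in table:
                missing.append(parts[0])
    print(f"{len(missing)} lexicon words missing from words.txt")

check("data/lang_test/words.txt", "data/local/dict/lexicon.txt")
```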
I do use the words.txt generated by the LM tools; it contains about 50,000 words. The language model is also built from the training text. My dataset is from OpenSLR, so the LM and lexicon should be fine. Have you tested this on a small amount of data?
We have only tested it on 200+ hour datasets. Still, something must be wrong in your pipeline: a WER above 100% means the decoder is inserting far more words than the references contain.
You can refer to these issues:
https://github.com/wenet-e2e/wenet/issues/1673
https://github.com/wenet-e2e/wenet/issues/1545