icefall
Domain-specific Language Model
We are building an ASR system on open-source Hindi/English data using the k2-fsa/icefall librispeech recipe. We need to build a language model for our domain-specific data. Please let us know how to go about this. Thank you.
We are using the pruned_transducer_stateless7 recipe.
Hi Sruthi, it was nice to meet you at ICASSP!
I have discussed this with the guys, we are doing 2 things about it:
- One of them (Xiaoyu, I think) will create a how-to or documentation on shallow fusion with LODR (low-order density ratio, which means: divide the LM by a bigram LM estimated on the training transcripts).
- For cases where you have a list of biasing words/phrases: Kangwei/@pkufool will at some point soon make documentation for his implementation of Aho-Corasick and how to use it.
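To make the LODR idea in the first item concrete, here is a minimal sketch of the score combination, not icefall's actual implementation. "Dividing" by the bigram LM becomes a subtraction in the log domain, so the bigram log-probability gets a negative weight. The function names, the scale values, and the example hypotheses and scores are all hypothetical and chosen only for illustration.

```python
def lodr_score(am_logp, lm_logp, bigram_logp, lm_scale=0.4, lodr_scale=-0.2):
    """Combine log-probabilities for one hypothesis.

    am_logp:     log-prob from the transducer (acoustic/internal model)
    lm_logp:     log-prob from the external, domain-specific LM
    bigram_logp: log-prob from a bigram LM trained on the ASR training
                 transcripts (approximates the model's internal LM)
    lm_scale and lodr_scale are illustrative values, not tuned ones;
    lodr_scale is negative because the bigram score is subtracted.
    """
    return am_logp + lm_scale * lm_logp + lodr_scale * bigram_logp


def pick_best(hyps, lm_scale=0.4, lodr_scale=-0.2):
    """Return the (text, score) pair with the highest combined score.

    hyps is a list of (text, am_logp, lm_logp, bigram_logp) tuples.
    """
    scored = [
        (text, lodr_score(am, lm, bg, lm_scale, lodr_scale))
        for text, am, lm, bg in hyps
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical n-best list: the second entry has the better acoustic score,
# but the domain LM prefers the first one.
EXAMPLE_HYPS = [
    ("play the song", -4.0, -9.0, -6.0),
    ("play this on", -3.8, -12.0, -5.0),
]

best_text, best_score = pick_best(EXAMPLE_HYPS)
print(best_text, best_score)
```

In practice both scales would be tuned on a development set from the target domain, and the combination would be applied inside beam search rather than on a finished n-best list as done here.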
In future, we will have to investigate whether a CTC system interacts better with this kind of LM biasing than RNN-T systems. The guys have already been doing experiments with adding an auxiliary CTC head to the RNN-T system. The RNN-T helps the CTC head learn better (but not vice versa), and I think the CTC WER is nearly as good as the RNN-T one.
Hi @bsshruthi22, I'm writing documentation for decoding with language models, which should be available very soon. I will update here once I have made the PR.
Thank you very much, Dan and @marcoyang1998.
Any updates on this, @danpovey @marcoyang1998?
@bsshruthi22 https://k2-fsa.github.io/icefall/decoding-with-langugage-models/index.html