sherpa-onnx

How to create an LM for a NeMo offline model

Open rohithkodali opened this issue 10 months ago • 1 comments

How do I create an LM to use with a NeMo offline model in the Node.js or Python version, and how do I create the HLG graphs for a NeMo model?

rohithkodali, Apr 10 '24 15:04

There is nothing special for NeMo models.

Any model trained with the CTC loss, whether it comes from NeMo, icefall, or another framework, can follow what we have done in icefall to build HLG.fst.

Please see https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/prepare_lm.sh#L97

    if [ ! -f $lang_dir/HL.fst ]; then
      ./local/prepare_lang_fst.py  \
        --lang-dir $lang_dir \
        --ngram-G ./data/lm/G_3_gram.fst.txt
    fi

You need to prepare

  • tokens.txt
  • lexicon.txt
  • words.txt
  • an n-gram arpa file

tokens.txt is already contained in the NeMo model, I think. If you want, you can reuse words.txt and the 3-gram ARPA file from LibriSpeech.
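Before building the graph, it can help to confirm that the three text files agree with each other. The following is a minimal sketch, not part of icefall or sherpa-onnx: `check_lexicon` and `read_first_column` are hypothetical helper names, and the sketch assumes the usual layout where tokens.txt and words.txt hold one `symbol id` pair per line and lexicon.txt holds `word token1 token2 ...` per line. It splits on whitespace, so it would not handle a tokens.txt whose symbol is a literal space.

```python
from pathlib import Path


def read_first_column(path):
    """Return the set of symbols found in the first column of a symbol table."""
    symbols = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if fields:
            symbols.add(fields[0])
    return symbols


def check_lexicon(tokens_path, words_path, lexicon_path):
    """Return a list of human-readable problems found in lexicon.txt.

    Assumed formats (not verified against any particular NeMo model):
      tokens.txt / words.txt: "<symbol> <id>" per line
      lexicon.txt:            "<word> <token> <token> ..." per line
    """
    tokens = read_first_column(tokens_path)
    words = read_first_column(words_path)
    problems = []
    lines = Path(lexicon_path).read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word, pieces = fields[0], fields[1:]
        if word not in words:
            problems.append(f"line {lineno}: word {word!r} is not in words.txt")
        if not pieces:
            problems.append(f"line {lineno}: word {word!r} has no tokens")
        for piece in pieces:
            if piece not in tokens:
                problems.append(
                    f"line {lineno}: token {piece!r} is not in tokens.txt"
                )
    return problems
```

Calling `check_lexicon("tokens.txt", "words.txt", "lexicon.txt")` returns an empty list when every lexicon entry uses only known words and tokens. An out-of-vocabulary token usually means lexicon.txt was generated with a different tokenizer than the one the model was trained with.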

csukuangfj, Apr 11 '24 01:04