Preserve Flashlight and Pyctcdecode beamsearch with Ngram LM

Support Flashlight and Pyctcdecode decoding with pure KenLM and NeMo KenLM. Standardize the API of the CLI inference scripts.

Collection: ASR

Changelog

Fix install script install_beamsearch_decoders.sh
Create the flashlight_lexicon file when running scripts/asr_language_modeling/ngram_lm/train_kenlm.py and tar it with kenlm.bin
Unify parameters for eval_beamsearch_ngram_ctc.py, speech_to_text_eval.py and training:
-- Get logprobs from Hypothesis
-- Use the "pyctcdecode" strategy as the default beam search algorithm, denoted as "beam"
-- Remove the default seq2seq strategy
-- Check decoding_type and search_type combinations
-- Support an empty string in nemo_kenlm_path and word_kenlm_path for beam search without an LM (ZeroLM)
Fix bug with EncDecHybridRNNTCTCModel in examples/asr/transcribe_speech.py
Support AggregateTokenizer in scripts/asr_language_modeling/ngram_lm/create_lexicon_from_arpa.py
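
Two of the items above (checking decoding_type and search_type combinations, and treating an empty LM path as decoding without an LM) could be sketched roughly as follows. This is an illustration only, not the code in this PR: `resolve_decoding`, `SUPPORTED`, and the listed pairs are hypothetical names and values chosen for the example.

```python
from typing import Optional, Tuple

# Hypothetical table of allowed (decoding_type, search_type) pairs.
# The combinations actually checked by the PR are not listed in this description.
SUPPORTED = {
    ("ctc", "greedy"),
    ("ctc", "beam"),
    ("ctc", "pyctcdecode"),
    ("ctc", "flashlight"),
    ("rnnt", "greedy"),
    ("rnnt", "beam"),
}


def resolve_decoding(
    decoding_type: str, search_type: str, kenlm_path: str = ""
) -> Tuple[str, str, Optional[str]]:
    """Validate a combination and normalize the LM path.

    An empty (or whitespace-only) kenlm_path is mapped to None, meaning
    beam search runs without a language model ("ZeroLM").
    """
    if (decoding_type, search_type) not in SUPPORTED:
        raise ValueError(
            f"Unsupported combination: decoding_type={decoding_type!r}, "
            f"search_type={search_type!r}"
        )
    lm_path = kenlm_path.strip() or None
    return decoding_type, search_type, lm_path
```

Rejecting an unsupported pair up front keeps the failure close to the CLI arguments rather than deep inside a decoder.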

python3 scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py \
model_path=am_model.nemo \
dataset_manifest=manifest.json \
preds_output_folder=/tmp \
ctc_decoding.strategy=flashlight \
ctc_decoding.beam.kenlm_path=am_model.kenlm \
ctc_decoding.beam.beam_size=[4] \
ctc_decoding.beam.beam_alpha=[0.5] \
ctc_decoding.beam.beam_beta=[0.5] \
batch_size=32 \
beam_batch_size=1 \
cuda=1
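
The bracketed values in the command above suggest that beam_size, beam_alpha and beam_beta accept lists, so that several settings can be evaluated in one run. Assuming the script evaluates every combination of the listed values (an assumption, not something this description confirms), the expansion could look like the sketch below; `expand_grid` is a hypothetical helper.

```python
from itertools import product
from typing import Dict, Iterator, List


def expand_grid(
    beam_size: List[int], beam_alpha: List[float], beam_beta: List[float]
) -> Iterator[Dict[str, float]]:
    """Yield one setting per combination of the three hyperparameter lists."""
    for size, alpha, beta in product(beam_size, beam_alpha, beam_beta):
        yield {"beam_size": size, "beam_alpha": alpha, "beam_beta": beta}


# Two beam sizes x two alphas x one beta -> four settings to evaluate.
settings = list(expand_grid([4, 8], [0.5, 1.0], [0.5]))
```

With single-element lists, as in the command above, the grid collapses to exactly one setting.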

python3 examples/asr/speech_to_text_eval.py \
model_path=am_model.nemo \
dataset_manifest=manifest.json \
decoder_type=ctc \
ctc_decoding.strategy=flashlight \
ctc_decoding.beam.nemo_kenlm_path=kenlm_model.bin \
ctc_decoding.beam.beam_size=4 \
ctc_decoding.beam.beam_alpha=0.5 \
ctc_decoding.beam.beam_beta=0.5 \
ctc_decoding.beam.return_best_hypothesis=true \
batch_size=32 \
output_filename=/tmp/manifest_out.json \
cuda=1 \
ctc_decoding.beam.flashlight_cfg.lexicon_path=am_model.flashlight_lexicon  # DEFAULT_TOKEN_OFFSET
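
The lexicon passed through flashlight_cfg.lexicon_path maps each word to its token sequence. As a rough illustration of the idea behind the DEFAULT_TOKEN_OFFSET remark, the sketch below builds lexicon lines in which subword token IDs are encoded as single Unicode characters shifted by a fixed offset. The offset value of 100, the tab-separated line layout, and the helper names are assumptions made for this example; the file produced by train_kenlm.py is the reference for the actual format.

```python
from typing import Dict, List

TOKEN_OFFSET = 100  # assumed value; stands in for DEFAULT_TOKEN_OFFSET


def encode_ids(token_ids: List[int], offset: int = TOKEN_OFFSET) -> List[str]:
    """Map each token ID to the Unicode character at code point ID + offset."""
    return [chr(token_id + offset) for token_id in token_ids]


def lexicon_lines(word_to_ids: Dict[str, List[int]]) -> List[str]:
    """Build one 'word<TAB>tok tok ...' line per word, sorted by word."""
    lines = []
    for word, ids in sorted(word_to_ids.items()):
        spelling = " ".join(encode_ids(ids))
        lines.append(f"{word}\t{spelling}")
    return lines


# Hypothetical token IDs for two words.
lines = lexicon_lines({"hello": [5, 12], "hi": [5]})
```

Shifting IDs by an offset keeps the encoded tokens away from low code points such as control characters and whitespace, which would otherwise be ambiguous in a text file.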

PR Type:

[x] New Feature
[ ] Bugfix
[ ] Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Additional Information

Related to #9067

Feb 15 '24 07:02 karpnv

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

Mar 01 '24 01:03 github-actions[bot]

This PR was closed because it has been inactive for 7 days since being marked as stale.

Mar 09 '24 01:03 github-actions[bot]

@karpnv is this being worked on?

Mar 09 '24 02:03 titu1994

@karpnv I'll provide a review later this week (bandwidth limited)

Mar 13 '24 16:03 tbartley94

@titu1994 I covered half of the table and the PR is already huge. Let's review it first, then I will continue with AggregateTokenizer.

Apr 05 '24 06:04 karpnv

Jenkins

Apr 05 '24 06:04 titu1994

Note: I guess eval_beamsearch_ngram_ctc.py and transducer.py also require changes for Hypothesis, which was recently updated for log probs.

May 17 '24 12:05 nithinraok

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

Jul 08 '24 01:07 github-actions[bot]

This PR was closed because it has been inactive for 7 days since being marked as stale.

Jul 15 '24 01:07 github-actions[bot]
