
Adding ILM beam search and decoding

AmirHussein96 opened this pull request · 22 comments

This PR adds a LibriSpeech Zipformer recipe that uses the HAT loss from https://github.com/k2-fsa/k2/pull/1244. The recipe includes HAT training, greedy decoding, modified beam search decoding, and RNN-LM shallow fusion with ILM subtraction.

So far, @desh2608 and I have tested this on Librispeech, and the results are similar to regular RNN-LM shallow fusion. However, the intended use of this is adaptation to a new domain with an external RNN-LM trained on that domain.

WER (%) on LibriSpeech:

| Model | Train | Decode | LM scale | ILM scale | test-clean | test-other |
|---|---|---|---|---|---|---|
| Zipformer-HAT | train-960 | greedy_search | - | - | 2.22 | 5.01 |
| | | modified_beam_search | 0 | 0 | 2.18 | 4.96 |
| | | + RNNLM shallow fusion | 0.29 | 0 | 1.96 | 4.55 |
| | | - ILME | 0.29 | 0.1 | 1.95 | 4.55 |
| | | - ILME | 0.29 | 0.3 | 1.97 | 4.5 |
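For context, the "- ILME" rows subtract an estimate of the transducer's internal LM from the shallow-fusion score. Per token, the combination looks roughly like the sketch below (illustrative names, not the actual recipe code; in HAT the internal LM is typically estimated by running the joiner on the prediction-network output alone, with the encoder contribution removed):

```python
def ilme_token_score(
    asr_log_prob: float,  # log-prob of the token from the full HAT model
    ilm_log_prob: float,  # estimated internal-LM log-prob (decoder-only joiner)
    lm_log_prob: float,   # external RNN-LM log-prob of the token
    lm_scale: float = 0.29,
    ilm_scale: float = 0.1,
) -> float:
    """Shallow fusion with ILM subtraction (a sketch, not the recipe code)."""
    return asr_log_prob + lm_scale * lm_log_prob - ilm_scale * ilm_log_prob
```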

AmirHussein96 avatar Oct 05 '23 01:10 AmirHussein96

@AmirHussein96 if you have some time, you can try out the experiment suggested by @marcoyang1998: https://github.com/k2-fsa/icefall/issues/1271#issuecomment-1737530810.

@marcoyang1998 do you have a RNNLM trained on GigaSpeech?

desh2608 avatar Oct 05 '23 02:10 desh2608

I believe @yfyeung has an RNNLM trained on GigaSpeech. @yfyeung, would you mind sharing it? Maybe you can upload it to huggingface.

marcoyang1998 avatar Oct 05 '23 15:10 marcoyang1998

Yeah, I have an RNNLM trained on GigaSpeech, but not in the icefall style.

https://huggingface.co/yfyeung/icefall-asr-gigaspeech-rnn_lm-2023-10-08

yfyeung avatar Oct 08 '23 07:10 yfyeung

@AmirHussein96 I note that you modified k2.rnnt_loss_pruned in k2. Would you mind sharing your branch?

yfyeung avatar Oct 09 '23 08:10 yfyeung

> @AmirHussein96 I note that you modified k2.rnnt_loss_pruned in k2. Would you mind sharing your branch?

check this: https://github.com/k2-fsa/k2/pull/1244

desh2608 avatar Oct 09 '23 15:10 desh2608

I benchmarked the following scenario: a Zipformer initially trained on LibriSpeech and then adapted to GigaSpeech using text only. For the adaptation, I trained the RNN-LM on the GigaSpeech transcripts of the 1000h "M" subset. Below is a comparison of several methods: RNN-LM shallow fusion (SF), RNN-LM with LODR (bigram), and RNN-LM shallow fusion combined with our ILME implementation.

WER (%) on GigaSpeech:

| Method | LM scale | ILM / LODR scale | giga dev | giga test |
|---|---|---|---|---|
| modified_beam_search (baseline) | 0 | 0 | 20.81 | 19.95 |
| + RNNLM SF | 0.1 | 0 | 20.3 | 19.55 |
| + RNNLM SF | 0.29 | 0 | 19.88 | 19.21 |
| + RNNLM SF | 0.45 | 0 | 20.1 | 19.46 |
| + RNNLM SF LODR (bigram) | 0.45 | 0.16 | 20.42 | 19.6 |
| + RNNLM SF - ILME | 0.29 | 0.1 | 19.7 | 18.96 |
| + RNNLM SF - ILME | 0.45 | 0.1 | 19.54 | 18.89 |
| + RNNLM SF - ILME | 0.29 | 0.2 | 19.84 | 18.99 |

Choice of ILM/LODR and RNNLM weights (swept on a grid; a sweep sketch follows the list):

- ILM: [0.05, 0.2] with a step of 0.05
- LODR: [0.02, 0.45] with a step of 0.05
- RNNLM: [0.05, 0.45] with a step of 0.05
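A sweep like this is easy to script. The sketch below is illustrative only: it prints the combinations to try, and the actual decode.py invocation (shown in full later in this thread) would go in place of the print.

```python
from itertools import product

# Illustrative grid sweep over the external-LM and ILM scale ranges above.
lm_scales = [round(0.05 * i, 2) for i in range(1, 10)]  # 0.05 .. 0.45
ilm_scales = [round(0.05 * i, 2) for i in range(1, 5)]  # 0.05 .. 0.20

for lm_scale, ilm_scale in product(lm_scales, ilm_scales):
    # Replace this print with the ./zipformer_hat/decode.py command.
    print(f"decode with lm_scale={lm_scale}, ilm_scale={ilm_scale}")
```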

The configuration of the RNN-LM and the training command are as follows:

```bash
./rnn_lm/train.py \
    --world-size 4 \
    --exp-dir ./rnn_lm/exp \
    --start-epoch 0 \
    --num-epochs 30 \
    --use-fp16 0 \
    --tie-weights 1 \
    --embedding-dim 512 \
    --hidden-dim 512 \
    --num-layers 2 \
    --batch-size 300 \
    --lr 0.0001 \
    --lm-data data/lm_training_bpe_500/sorted_lm_data.pt \
    --lm-data-valid data/lm_training_bpe_500/sorted_lm_data-valid.pt
```

RNNLM results on dev: total nll: 776663.5668945312, num tokens: 261759, num sentences: 5715, ppl: 19.435
RNNLM results on test: total nll: 2401851.5998535156, num tokens: 805072, num sentences: 19930, ppl: 19.755
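(For reference, the perplexity follows as ppl = exp(total nll / num tokens): exp(776663.57 / 261759) ≈ 19.435 on dev and exp(2401851.60 / 805072) ≈ 19.755 on test.)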

AmirHussein96 avatar Oct 10 '23 01:10 AmirHussein96

@AmirHussein96 I noticed that you are using a positive scale for LODR; it should be negative. You can check the code here: https://github.com/k2-fsa/icefall/blob/9af144c26b91065a119d4e67c03004974462d24d/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py#L2629-L2634

Would you mind re-running the decoding experiment with LODR? Thanks!
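The sign convention, roughly (a minimal sketch of the per-token score combination, not the exact code behind the link above):

```python
def lodr_token_score(
    hyp_log_prob: float,     # transducer log-prob of the candidate token
    rnnlm_log_prob: float,   # external RNN-LM log-prob
    bigram_log_prob: float,  # source-domain bigram (LODR) log-prob
    lm_scale: float = 0.45,
    ngram_lm_scale: float = -0.24,  # negative: the bigram term is subtracted
) -> float:
    """Sketch of the LODR score combination; --ngram-lm-scale must be < 0."""
    return (
        hyp_log_prob
        + lm_scale * rnnlm_log_prob
        + ngram_lm_scale * bigram_log_prob
    )
```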

marcoyang1998 avatar Oct 10 '23 06:10 marcoyang1998


@marcoyang1998 I used the implementation of modified_beam_search_lm_rescore_LODR() below, which uses a negative weight for LODR: https://github.com/k2-fsa/icefall/blob/9af144c26b91065a119d4e67c03004974462d24d/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py#L1563

AmirHussein96 avatar Oct 10 '23 12:10 AmirHussein96

@marcoyang1998 I tried modified_beam_search_LODR with LODR_scale=-0.24 from https://k2-fsa.github.io/icefall/decoding-with-langugage-models/LODR.html and also LODR_scale=-0.16 from my best modified_beam_search_lm_rescore_LODR() results.

WER (%) on GigaSpeech:

| Method | beam | LM scale | ILM / LODR scale | giga dev | giga test |
|---|---|---|---|---|---|
| modified_beam_search (baseline) | 4 | 0 | 0 | 20.81 | 19.95 |
| + RNNLM SF | 4 | 0.1 | 0 | 20.3 | 19.55 |
| + RNNLM SF | 4 | 0.29 | 0 | 19.88 | 19.21 |
| + RNNLM SF | 4 | 0.45 | 0 | 20.1 | 19.46 |
| + RNNLM SF | 12 | 0.29 | 0 | 19.77 | 19.01 |
| + RNNLM lm_rescore_LODR (bigram) | 4 | 0.45 | 0.16 | 20.42 | 19.6 |
| + RNNLM LODR (bigram) | 4 | 0.45 | -0.24 | 19.38 | 18.71 |
| + RNNLM LODR (bigram) | 4 | 0.45 | -0.16 | 19.47 | 18.85 |
| + RNNLM LODR (bigram) | 12 | 0.45 | -0.24 | 19.1 | 18.44 |
| + RNNLM SF - ILME | 4 | 0.29 | 0.1 | 19.7 | 18.96 |
| + RNNLM SF - ILME | 4 | 0.45 | 0.1 | 19.54 | 18.89 |
| + RNNLM SF - ILME | 4 | 0.29 | 0.2 | 19.84 | 18.99 |
| + RNNLM SF - ILME | 12 | 0.45 | 0.1 | 19.21 | 18.57 |

The LODR results are now much better, so I think modified_beam_search_lm_rescore_LODR() should be removed from beam_search.py.

The decoding command is below:

```bash
# LODR_scale is assumed to be set beforehand, e.g.:
LODR_scale=-0.24
for method in modified_beam_search_LODR; do
  ./zipformer_hat/decode.py \
    --epoch 40 --avg 16 --use-averaged-model True \
    --beam-size 4 \
    --exp-dir ./zipformer_hat/exp \
    --bpe-model data/lang_bpe_500/bpe.model \
    --max-contexts 4 \
    --max-states 8 \
    --max-duration 800 \
    --decoding-method $method \
    --use-shallow-fusion 1 \
    --lm-type rnn \
    --lm-exp-dir rnn_lm/exp \
    --lm-epoch 25 \
    --lm-scale 0.45 \
    --lm-avg 5 \
    --lm-vocab-size 500 \
    --rnn-lm-embedding-dim 512 \
    --rnn-lm-hidden-dim 512 \
    --rnn-lm-num-layers 2 \
    --tokens-ngram 2 \
    --ngram-lm-scale $LODR_scale
done
```

AmirHussein96 avatar Oct 10 '23 14:10 AmirHussein96

> The LODR results are now much better, so I think modified_beam_search_lm_rescore_LODR() should be removed from beam_search.py.

Please have a look at #1017 and https://icefall.readthedocs.io/en/latest/decoding-with-langugage-models/index.html for a comparison between different decoding methods with language models.

> Another important comment is that the current ILME implementation is shallow fusion, so it can be used in streaming, but LODR is a language-model rescoring.

LODR works in both shallow fusion and rescoring. modified_beam_search_LODR is the shallow fusion type LODR and modified_beam_search_lm_rescore_LODR is the rescoring type. You usually need to set a large --beam-size to achieve good results with rescoring-type methods (see https://icefall.readthedocs.io/en/latest/decoding-with-langugage-models/rescoring.html#id3).
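Schematically, the difference is where the LM terms enter the search (an illustrative Python sketch, not the icefall API; `rnnlm_logp` and `bigram_logp` are hypothetical scoring callables):

```python
def rescore_nbest(nbest, rnnlm_logp, bigram_logp,
                  lm_scale=0.45, ngram_lm_scale=-0.24):
    """Rescoring-type LODR (sketch): re-rank an n-best list after decoding.

    nbest: list of (hypothesis, score) pairs from the first-pass beam
    search. Hypotheses pruned during the search can never be recovered,
    hence the need for a larger --beam-size with rescoring-type methods.
    """
    rescored = [
        # ngram_lm_scale < 0, so the bigram term is effectively subtracted
        (hyp, score + lm_scale * rnnlm_logp(hyp)
              + ngram_lm_scale * bigram_logp(hyp))
        for hyp, score in nbest
    ]
    return max(rescored, key=lambda pair: pair[1])

# Shallow-fusion-type LODR folds the same two terms into the per-token
# score at every decoding step instead, which is why it also works for
# streaming decoding.
```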

marcoyang1998 avatar Oct 10 '23 14:10 marcoyang1998

Hi, sorry to step into this conversation. I have a question regarding the LM: is there a reason why an RNNLM is preferred over a Transformer-based LM for these experiments?

Thanks.

JuanPZuluaga avatar Oct 10 '23 15:10 JuanPZuluaga

> Hi, sorry to step into this conversation. I have a question regarding the LM: is there a reason why an RNNLM is preferred over a Transformer-based LM for these experiments?
>
> Thanks.

The primary reason for choosing an RNN-LM is its computational efficiency and suitability for streaming applications. Additionally, the improvement from a Transformer-LM over an RNN-LM for rescoring is minimal.

AmirHussein96 avatar Oct 11 '23 00:10 AmirHussein96


@marcoyang1998, you can check the updated table above with beam 12. The updated results show very close performance, with slight improvements for LODR over ILME. These results align with the findings of the LODR paper: https://arxiv.org/pdf/2203.16776.pdf. Additionally, I ran an MPSSWE statistical test, which indicates no statistically significant difference between LODR and ILME (pairwise p-values below):

| | baseline | RNNLM SF | LODR | ILME |
|---|---|---|---|---|
| RNNLM SF | <0.001 | - | <0.001 | <0.001 |
| LODR | <0.001 | <0.001 | - | 1 |
| ILME | <0.001 | <0.001 | 1 | - |

AmirHussein96 avatar Oct 12 '23 14:10 AmirHussein96

Great work! Perhaps we can put a note saying that the RNNLM rescoring of paths is not normally recommended, and instead direct people to the appropriate method. Did you see any difference between zipformer with normal RNN-T and zipformer-HAT?


danpovey avatar Oct 12 '23 15:10 danpovey

> Did you see any difference between zipformer with normal RNN-T and zipformer-HAT?

Yes, we compared the Zipformer with the Zipformer-HAT using greedy and modified beam search, and the performance is almost the same.

AmirHussein96 avatar Oct 12 '23 16:10 AmirHussein96

Please let me know if any modifications are needed to finalize the merging of the pull request.

AmirHussein96 avatar Nov 29 '23 20:11 AmirHussein96

> Please let me know if any modifications are needed to finalize the merging of the pull request.

@AmirHussein96 this needs the k2 PR (https://github.com/k2-fsa/k2/pull/1244) to be merged first.

@csukuangfj besides ILM, I am also using HAT for joint speaker diarization (with my SURT model), and Amir is using it for joint language ID in code-switched ASR. We will make PRs for those recipes in the coming months, but it would be great to have these ones checked in first.

desh2608 avatar Nov 30 '23 14:11 desh2608

@marcoyang1998 Could you have a look at this PR?

csukuangfj avatar Dec 01 '23 02:12 csukuangfj

Could you please add a section about HAT (WERs, training command, decoding command, etc.) in RESULTS.md?

marcoyang1998 avatar Dec 01 '23 02:12 marcoyang1998

I had a glance and left a few comments. The rest looked fine, thanks for the work!

Would you mind uploading your HAT model to huggingface so that other people can try it?

marcoyang1998 avatar Dec 01 '23 02:12 marcoyang1998

@AmirHussein96 if you have some time, can we make a final push to get this checked in?

desh2608 avatar Jun 19 '24 13:06 desh2608