Adding ILM beam search and decoding
This is a LibriSpeech zipformer recipe using the HAT loss from https://github.com/k2-fsa/k2/pull/1244. The recipe includes HAT training, greedy decoding, modified beam search decoding, and ILM subtraction combined with RNN-LM shallow fusion.
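Conceptually, HAT's factorization of blank and label probabilities is what enables an internal-LM estimate: the label distribution computed with the acoustic (encoder) input removed approximates the ILM, and shallow fusion then subtracts it. Below is a minimal sketch of the per-token score combination under that formulation; the function name and signature are illustrative, not the recipe's API.

```python
# Minimal sketch (not the recipe's actual code) of the per-token log-score
# combination in modified beam search with ILM subtraction. `lm_scale` and
# `ilm_scale` correspond to the "LM scale" / "ILM scale" columns in the
# tables below.
import torch

def hat_ilme_fusion(
    hat_logp: torch.Tensor,  # label log-probs from the HAT joiner, shape (vocab,)
    lm_logp: torch.Tensor,   # log-probs from the external RNN-LM, shape (vocab,)
    ilm_logp: torch.Tensor,  # ILM estimate: joiner label distribution computed
                             # with the encoder (acoustic) output zeroed out
    lm_scale: float,
    ilm_scale: float,
) -> torch.Tensor:
    return hat_logp + lm_scale * lm_logp - ilm_scale * ilm_logp
```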
So far, @desh2608 and I have tested this on LibriSpeech, and the results are similar to regular RNN-LM shallow fusion. However, the intended use is adaptation to a new domain with an external RNN-LM trained on that domain.
| Model | Train | Decode | LM scale | ILM scale | test-clean | test-other |
|---|---|---|---|---|---|---|
| Zipformer-HAT | train-960 | greedy_search | - | - | 2.22 | 5.01 |
| | | modified_beam_search | 0 | 0 | 2.18 | 4.96 |
| | | + RNNLM shallow fusion | 0.29 | 0 | 1.96 | 4.55 |
| | | - ILME | 0.29 | 0.1 | 1.95 | 4.55 |
| | | - ILME | 0.29 | 0.3 | 1.97 | 4.50 |
@AmirHussein96 if you have some time, you can try out the experiment suggested by @marcoyang1998: https://github.com/k2-fsa/icefall/issues/1271#issuecomment-1737530810.
@marcoyang1998 do you have an RNNLM trained on GigaSpeech?
I believe @yfyeung has an RNNLM trained on GigaSpeech. @yfyeung Would you mind sharing one? Maybe you could upload it to huggingface.
Yeah, I have an RNNLM trained on GigaSpeech, but not in icefall style.
https://huggingface.co/yfyeung/icefall-asr-gigaspeech-rnn_lm-2023-10-08
@AmirHussein96 I note that you modified k2.rnnt_loss_pruned in k2. Would you mind sharing your branch?
Check this: https://github.com/k2-fsa/k2/pull/1244
I benchmarked the following scenario: a Zipformer initially trained on LibriSpeech and then adapted to GigaSpeech using text only. For the adaptation, I trained the RNN-LM on the GigaSpeech transcripts of the 1,000-hour M subset. Below is a comparison of RNN-LM shallow fusion (SF), RNN-LM LODR with a bigram, and RNN-LM shallow fusion integrated with our ILME implementation.
| Method | LM scale | ILM / LODR scale | giga dev | giga test |
|---|---|---|---|---|
| modified_beam_search (baseline) | 0 | 0 | 20.81 | 19.95 |
| +RNNLM SF | 0.1 | 0 | 20.3 | 19.55 |
| + RNNLM SF | 0.29 | 0 | 19.88 | 19.21 |
| + RNNLM SF | 0.45 | 0 | 20.1 | 19.46 |
| + RNNLM SF LODR(bigram) | 0.45 | 0.16 | 20.42 | 19.6 |
| + RNNLM SF - ILME | 0.29 | 0.1 | 19.7 | 18.96 |
| + RNNLM SF - ILME | 0.45 | 0.1 | 19.54 | 18.89 |
| + RNNLM SF - ILME | 0.29 | 0.2 | 19.84 | 18.99 |
Choice of ILM/LODR and RNNLM weights (tuned by grid search over the following ranges; see the sketch after this list):
- ILM: [0.05, 0.2] with a step of 0.05
- LODR: [0.02, 0.45] with a step of 0.05
- RNNLM: [0.05, 0.45] with a step of 0.05
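A hypothetical sketch of that sweep; `decode_dev` is a stand-in for launching `./zipformer_hat/decode.py` with the given scales and parsing the dev-set WER, and is not part of the recipe:

```python
# Hypothetical grid search over the ranges above; decode_dev() is a stand-in
# for running the decoding script and reading back the dev-set WER.
import numpy as np

def decode_dev(lm_scale: float, ilm_scale: float) -> float:
    raise NotImplementedError  # run decode.py with these scales, return WER

candidates = [
    (lm, ilm, decode_dev(lm, ilm))
    for lm in np.arange(0.05, 0.45 + 1e-9, 0.05)   # RNNLM scale
    for ilm in np.arange(0.05, 0.20 + 1e-9, 0.05)  # ILM scale
]
best_lm, best_ilm, best_wer = min(candidates, key=lambda t: t[2])
```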
The RNNLM configuration and training command are as follows:
```bash
./rnn_lm/train.py \
  --world-size 4 \
  --exp-dir ./rnn_lm/exp \
  --start-epoch 0 \
  --num-epochs 30 \
  --use-fp16 0 \
  --tie-weights 1 \
  --embedding-dim 512 \
  --hidden-dim 512 \
  --num-layers 2 \
  --batch-size 300 \
  --lr 0.0001 \
  --lm-data data/lm_training_bpe_500/sorted_lm_data.pt \
  --lm-data-valid data/lm_training_bpe_500/sorted_lm_data-valid.pt
```
RNNLM results on dev: total nll: 776663.5668945312, num tokens: 261759, num sentences: 5715, ppl: 19.435
RNNLM results on test: total nll: 2401851.5998535156, num tokens: 805072, num sentences: 19930, ppl: 19.755
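As a sanity check, the reported perplexities follow from the totals above via ppl = exp(total_nll / num_tokens):

```python
import math

# ppl = exp(total_nll / num_tokens), reproducing the numbers reported above
dev_ppl = math.exp(776663.5668945312 / 261759)    # ~19.435
test_ppl = math.exp(2401851.5998535156 / 805072)  # ~19.755
```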
@AmirHussein96 I noticed that you are using a positive scale for LODR; it should be negative. You can check the code here: https://github.com/k2-fsa/icefall/blob/9af144c26b91065a119d4e67c03004974462d24d/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py#L2629-L2634
Would you mind re-running the decoding experiment with LODR? Thanks!
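To make the sign convention concrete, here is a schematic (illustrative code, not the icefall implementation) of how the bigram score enters the shallow-fusion combination:

```python
# Schematic of the LODR sign convention: the low-order n-gram (bigram) score
# enters with a negative --ngram-lm-scale, so it is effectively subtracted as
# an approximation of the internal LM. Scales match the tables below.
def lodr_token_score(am_logp, rnnlm_logp, bigram_logp,
                     lm_scale=0.45, ngram_lm_scale=-0.24):
    return am_logp + lm_scale * rnnlm_logp + ngram_lm_scale * bigram_logp
```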
@marcoyang1998 I used the implementation of modified_beam_search_lm_rescore_LODR() below, which uses a negative weight for LODR: https://github.com/k2-fsa/icefall/blob/9af144c26b91065a119d4e67c03004974462d24d/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py#L1563
@marcoyang1998 I tried modified_beam_search_LODR with LODR_scale=-0.24 from https://k2-fsa.github.io/icefall/decoding-with-langugage-models/LODR.html and also LODR_scale=-0.16 from my best modified_beam_search_lm_rescore_LODR() results.
| Method | beam | LM scale | ILM / LODR scale | giga dev | giga test |
|---|---|---|---|---|---|
| modified_beam_search (baseline) | 4 | 0 | 0 | 20.81 | 19.95 |
| + RNNLM SF | 4 | 0.1 | 0 | 20.3 | 19.55 |
| + RNNLM SF | 4 | 0.29 | 0 | 19.88 | 19.21 |
| + RNNLM SF | 4 | 0.45 | 0 | 20.1 | 19.46 |
| + RNNLM SF | 12 | 0.29 | 0 | 19.77 | 19.01 |
| + RNNLM lm_rescore_LODR (bigram) | 4 | 0.45 | 0.16 | 20.42 | 19.6 |
| + RNNLM LODR (bigram) | 4 | 0.45 | -0.24 | 19.38 | 18.71 |
| + RNNLM LODR (bigram) | 4 | 0.45 | -0.16 | 19.47 | 18.85 |
| + RNNLM LODR (bigram) | 12 | 0.45 | -0.24 | 19.1 | 18.44 |
| + RNNLM SF - ILME | 4 | 0.29 | 0.1 | 19.7 | 18.96 |
| + RNNLM SF - ILME | 4 | 0.45 | 0.1 | 19.54 | 18.89 |
| + RNNLM SF - ILME | 4 | 0.29 | 0.2 | 19.84 | 18.99 |
| + RNNLM SF - ILME | 12 | 0.45 | 0.1 | 19.21 | 18.57 |
The LODR results are now much better, so I think modified_beam_search_lm_rescore_LODR() should be removed from beam_search.py.
The decoding command is below:
```bash
LODR_scale=-0.24  # set before running; -0.24 and -0.16 were tried above
for method in modified_beam_search_LODR; do
  ./zipformer_hat/decode.py \
    --epoch 40 --avg 16 --use-averaged-model True \
    --beam-size 4 \
    --exp-dir ./zipformer_hat/exp \
    --bpe-model data/lang_bpe_500/bpe.model \
    --max-contexts 4 \
    --max-states 8 \
    --max-duration 800 \
    --decoding-method $method \
    --use-shallow-fusion 1 \
    --lm-type rnn \
    --lm-exp-dir rnn_lm/exp \
    --lm-epoch 25 \
    --lm-scale 0.45 \
    --lm-avg 5 \
    --lm-vocab-size 500 \
    --rnn-lm-embedding-dim 512 \
    --rnn-lm-hidden-dim 512 \
    --rnn-lm-num-layers 2 \
    --tokens-ngram 2 \
    --ngram-lm-scale $LODR_scale
done
```
Please have a look at #1017 and https://icefall.readthedocs.io/en/latest/decoding-with-langugage-models/index.html for a comparison between different decoding methods with language models.
Another important comment: the current ILME implementation is shallow fusion, so it can be used in streaming, but LODR is language-model rescoring.
LODR works in both shallow fusion and rescoring. modified_beam_search_LODR is the shallow fusion type LODR and modified_beam_search_lm_rescore_LODR is the rescoring type. You usually need to set a large --beam-size to achieve good results with rescoring-type methods (see https://icefall.readthedocs.io/en/latest/decoding-with-langugage-models/rescoring.html#id3).
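Schematically (illustrative code, not the icefall implementation), the two flavours differ in where the LM score enters:

```python
# Shallow fusion folds the LM into the score at every beam-search step;
# rescoring re-ranks complete hypotheses afterwards, which is why it needs
# a larger beam to work well.

def shallow_fusion_step(hyp_score, token_logp, lm_logp, lm_scale):
    # called inside the beam-search loop, once per emitted token
    return hyp_score + token_logp + lm_scale * lm_logp

def rescore_nbest(nbest, lm_score_fn, lm_scale):
    # called once, after beam search has produced an n-best list
    return max(nbest, key=lambda h: h.score + lm_scale * lm_score_fn(h.tokens))
```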
Hi, sorry to step into this conversation. I have a question regarding the LM: is there a reason an RNNLM is preferred over a Transformer-based LM for these experiments?
Thanks.
> Hi, sorry to step into this conversation. I have a question regarding the LM: is there a reason an RNNLM is preferred over a Transformer-based LM for these experiments?
The primary reason for choosing an RNN-LM is its computational efficiency and suitability for streaming applications. Additionally, the improvement from a Transformer-LM over an RNN-LM for rescoring is minimal.
@marcoyang1998, you can check the updated table with beam 12. The results in the updated table show very close performance, with slight improvements for LODR over ILME. These results align with the findings presented in the LODR paper: https://arxiv.org/pdf/2203.16776.pdf. Additionally, I conducted an MPSSWE statistical test, which indicates that there is no statistically significant difference between LODR and ILME.
| MPSSWE p-value | baseline | RNNLM SF | LODR | ILME |
|---|---|---|---|---|
| RNNLM SF | <0.001 | - | <0.001 | <0.001 |
| LODR | <0.001 | <0.001 | - | 1 |
| ILME | <0.001 | <0.001 | 1 | - |
Great work! Perhaps we can add a note saying that RNNLM rescoring of paths is not normally recommended, and instead direct people to the appropriate method. Did you see any difference between the zipformer with normal RNN-T and zipformer-HAT?
> Did you see any difference between zipformer with normal RNN-T and zipformer-HAT?
Yes, we compared the zipformer with the zipformer-HAT using greedy and modified beam search, and the performance is almost the same.
Please let me know if any modifications are needed to finalize the merging of the pull request.
> Please let me know if any modifications are needed to finalize the merging of the pull request.
@AmirHussein96 this needs the k2 PR (https://github.com/k2-fsa/k2/pull/1244) to be merged first.
@csukuangfj besides ILM, I am also using HAT for joint speaker diarization (with my SURT model), and Amir is using it for joint language ID in code-switched ASR. We will make PRs for those recipes in the coming months, but it would be great to have this one checked in first.
@marcoyang1998 Could you have a look at this PR?
Could you please add a section about HAT (WERs, training command, decoding command, etc.) to RESULTS.md?
I had a glance and left a few comments. The rest looked fine, thanks for the work!
Would you mind uploading your HAT model to huggingface so that other people can try it?
@AmirHussein96 if you have some time, can we make a final push to get this checked in?